- Roman Kazinnik
Convolutional Neural Networks for NLP
Updated: Dec 2, 2019
The goal of this post is to compare NLP approaches for encoding words:
(1) bag of words, (2) TF-IDF, (3) Google GloVe embeddings
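For illustration, here is a minimal sketch of the three encodings (the toy corpus and the local GloVe file path are assumptions for the example, not the post's data):

# (1) bag of words and (2) TF-IDF with scikit-learn, (3) GloVe lookup from a local file
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np

docs = ["the cat sat on the mat", "the dog sat on the log"]   # toy corpus

bow = CountVectorizer().fit_transform(docs).toarray()     # raw term counts per document
tfidf = TfidfVectorizer().fit_transform(docs).toarray()   # counts re-weighted by inverse document frequency

glove = {}                                                 # word -> pre-trained dense vector
with open("glove.6B.100d.txt", encoding="utf8") as f:      # assumed local GloVe file
    for line in f:
        parts = line.rstrip().split(" ")
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")
cat_vector = glove["cat"]                                  # 100-dimensional embedding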
The training set is a set of tagged sequences. I ran experiments with 5K and 15K training sets. The experiments confirm the basic intuition that training for more unknowns requires more data to improve prediction accuracy. The benchmark experiment represents each word as a one-hot encoding, and word2vec improves classification accuracy over that baseline. Further improvement can be achieved by introducing convolutional layers and by training on larger datasets. Perhaps the most surprising observation is that a classification accuracy of 90%+ (F1) can be achieved, which is high considering the large number of classes.
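To make the benchmark concrete, here is a minimal sketch of the two word representations being compared (toy vocabulary, made-up vector values):

import numpy as np

vocab = {"cat": 0, "dog": 1, "mat": 2}                 # toy vocabulary
one_hot = np.eye(len(vocab))[vocab["cat"]]             # sparse baseline: one dimension per vocabulary word

word2vec = {"cat": np.array([0.21, -0.53, 0.77]),      # dense pre-trained vectors (values invented here)
            "dog": np.array([0.18, -0.49, 0.80])}
dense = word2vec["cat"]                                # low-dimensional, captures word similarity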
More technical details:
- Padding to a fixed sentence length (in words) is used to allow fixed-length convolutions (see the padding sketch below).
- My Keras model optionally includes a convolutional layer (see the model sketch at the end of the post).
- The Google word2vec dictionary is used for word embeddings.
- The most comprehensive word embedding combines word2vec vectors, a dense representation produced by a convolution over characters, and word casing (see the embedding sketch below). Code example:

# create input matrices: each word is a vector
train_set = createMatrices(trainSentences, word2Idx, label2Idx, case2Idx, char2Idx)
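A minimal padding sketch in Keras (the maximum length and the toy index values are illustrative):

from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[4, 7, 2], [9, 1, 5, 3, 8]]   # sentences already converted to word indices (toy values)

# pad or truncate every sentence to the same length so fixed-size convolutions apply
padded = pad_sequences(sequences, maxlen=6, padding="post", value=0)   # shape (2, 6)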
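A minimal sketch of building the word2vec embedding matrix plus a casing feature with gensim (the file path, special tokens, vocabulary, and casing categories are assumptions, not the repository's exact code):

from gensim.models import KeyedVectors
import numpy as np

# assumed local copy of the Google News word2vec binary
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
emb_dim = w2v.vector_size

word2Idx = {"PADDING_TOKEN": 0, "UNKNOWN_TOKEN": 1}
vectors = [np.zeros(emb_dim), np.random.uniform(-0.25, 0.25, emb_dim)]

training_vocabulary = ["John", "lives", "in", "New", "York"]   # hypothetical: collected from the tagged sentences
for word in training_vocabulary:
    if word in w2v and word not in word2Idx:
        word2Idx[word] = len(word2Idx)
        vectors.append(w2v[word])
embedding_matrix = np.vstack(vectors)   # row i is the vector for the word with index i

def casing(word):                       # small categorical feature per word
    if word.isupper():  return "allUpper"
    if word.istitle():  return "initialUpper"
    if word.islower():  return "allLower"
    return "other"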
Training data: tagged NLP sequences
Code: https://github.com/romanonly/romankazinnik_blog/blob/master/NLP/CNN
Training: Keras, bi-directional LSTM
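A minimal sketch of this kind of architecture, an embedding layer, an optional convolutional block, and a bidirectional LSTM with a per-word softmax (layer sizes and the number of tag classes are assumed, not the post's exact configuration):

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Conv1D, Bidirectional, LSTM, TimeDistributed, Dense

MAX_LEN, VOCAB_SIZE, EMB_DIM, N_TAGS = 60, 20000, 300, 10   # assumed sizes
USE_CONV = True                                              # the convolutional layer is optional

words = Input(shape=(MAX_LEN,), dtype="int32")
x = Embedding(VOCAB_SIZE, EMB_DIM)(words)                    # could be initialised with the word2vec matrix
if USE_CONV:
    x = Conv1D(filters=128, kernel_size=3, padding="same", activation="relu")(x)
x = Bidirectional(LSTM(100, return_sequences=True))(x)       # one output per word
tags = TimeDistributed(Dense(N_TAGS, activation="softmax"))(x)

model = Model(words, tags)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])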

