This is a joint article by Kannan Sankaran and Win Suen.
The Problem
Over the past few years, there has been burgeoning interest in neural networks from the data science and engineering communities. The advent of ever-larger datasets, efficient commodity hardware, and powerful open source libraries has opened up the exploration of neural networks. Yet neural networks are only one of several models in the data science toolkit. Should we believe the hype?
We decided the best way to learn more was to try for ourselves. Our motivating example focuses on sentiment analysis, an area teeming with real-world applications from recommender systems to market predictions. We focus on a single task — classifying positive and negative sentiment in the IMDB movie reviews dataset. Are neural networks just a different approach to modeling, or do they actually perform better than an alternative model?
In this post, we:
- Compare the performance of a neural network against support vector machines (SVMs), which have a proven track record for many Natural Language Processing (NLP) tasks.
- Summarize challenges we encountered and ideas for improving model performance.
- Share some best practices for implementing your own models.
The challenger: Neural Networks (NN)
Neural networks are inspired by and modeled after the structure of the human brain. The artificial neuron is the primary unit of a neural network, and consists of the following:
- The input – this could be one or more inputs x1, x2, …, xn, e.g. images or text in vector form. The weights w1, w2, …, wn represent the strength of each input and are learned during training. The bias b shifts the output without depending on the input: it is added to the weighted sum of the inputs, and the result is passed to the activation function.
- The activation function – this determines the output of the neuron, and is usually a non-linear function such as sigmoid, tanh, or the rectified linear unit (ReLU). A minimal sketch of a single neuron follows this list.
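To make this concrete, here is a minimal NumPy sketch of a single artificial neuron; the inputs, weights, and bias are illustrative values, not learned ones:

```python
import numpy as np

def sigmoid(z):
    # Squash the pre-activation value into the (0, 1) range.
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative inputs x1..xn, weights w1..wn, and bias b.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b = 0.25

# The neuron computes activation(weighted sum of inputs + bias).
output = sigmoid(np.dot(w, x) + b)
print(output)
```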
Multilayer Feedforward Neural Network
If the neurons are connected in the manner shown below, where every neuron in one layer connects to every neuron in the next layer, we get a feedforward neural network. The hidden layer and the output layer make up the two layers of this network.
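To illustrate the forward pass through such a network, here is a small NumPy sketch with made-up layer sizes and random weights (not the dimensions we actually trained):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative dimensions: 4 inputs, 3 hidden neurons, 1 output neuron.
W_hidden, b_hidden = rng.normal(size=(3, 4)), np.zeros(3)
W_out, b_out = rng.normal(size=(1, 3)), np.zeros(1)

def forward(x):
    # Every input feeds every hidden neuron, and every hidden activation
    # feeds the output neuron (fully connected layers).
    h = relu(W_hidden @ x + b_hidden)
    return sigmoid(W_out @ h + b_out)

print(forward(np.array([1.0, 0.5, -0.5, 2.0])))
```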
The defending champion: Support Vector Machines (SVMs)
[We assume some familiarity with SVMs, as a full explanation is beyond the scope of this post.]
A support vector machine (SVM) is a discriminative model that was popularized in the 1990s. SVMs classify samples by drawing an optimal hyperplane that divides samples into predicted classes. The soft-margin SVM lets us assign penalties to samples that fall on the wrong side of the optimal hyperplane, via the C parameter, a penalty modifier applied to the error term.
What makes SVMs really powerful are kernel functions, because they allow SVMs to draw the optimal hyperplane across feature spaces that are not linearly separable. Kernel functions let us warp/transform the input feature space in such a way that it becomes separable. There are two parameters to tune here in our SVM:
- kernel function: the selection of the actual function type.
- gamma: kernel coefficient.
While an analysis of the kernel trick and each type of kernel function is out of scope, we suggest some useful resources for learning more at the end of this post.
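As a rough sketch of how these knobs appear in code, here is a scikit-learn SVC with an RBF kernel; X_train and y_train below are random placeholders standing in for real feature vectors and labels:

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data: 100 samples with 300 features each.
X_train = np.random.rand(100, 300)
y_train = np.random.randint(0, 2, size=100)

# kernel selects the kernel function, C penalizes samples on the wrong
# side of the hyperplane (soft margin), and gamma is the kernel coefficient.
clf = SVC(kernel="rbf", C=1.0, gamma=1.0)
clf.fit(X_train, y_train)
```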
The Showdown
Raw Data
We used the IMDb sentiment dataset, which consists of 50,000 movie reviews split 50-50 into training and validation sets. The goal is to classify movie reviews as expressing positive or negative sentiment. Classes are balanced in this dataset: positive reviews make up 50% of the data and negative reviews the other 50%.
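For reference, here is a minimal loading sketch, assuming the raw reviews are laid out in the standard aclImdb directory structure (train and test folders, each with pos and neg subfolders of .txt files):

```python
from pathlib import Path

def load_split(root, split):
    # Read reviews and labels from <root>/<split>/{pos,neg}/*.txt.
    texts, labels = [], []
    for label, name in [(1, "pos"), (0, "neg")]:
        for path in (Path(root) / split / name).glob("*.txt"):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels

train_texts, train_labels = load_split("aclImdb", "train")  # 25,000 reviews
test_texts, test_labels = load_split("aclImdb", "test")     # 25,000 reviews
```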
Feature Engineering
For NLP tasks, feature engineering is non-trivial! To level the playing field, we used the same feature engineering steps for both the NN and the SVM. After removing HTML tags and numbers, we used the word2vec embeddings in Python's spaCy library to encode the text of the reviews. For details on why we selected this feature processing, please see our writeup on GitHub (forthcoming).
One caveat here: word2vec generates a dense vector of 300 features. This is much more compact than representations such as TF-IDF, which produce individual features that are highly predictive of one class at the expense of a very sparse vector. For the IMDb problem, we selected a dense feature encoding; depending on the specific problem, a sparse representation or a sparse-dense hybrid may be more suitable.
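Here is a sketch of this feature step, assuming spaCy's en_core_web_md model (whose document vectors are 300-dimensional averages of pretrained word vectors) and simple regex cleaning; the exact model name and regexes are illustrative rather than our production pipeline:

```python
import re
import numpy as np
import spacy

# en_core_web_md ships 300-dimensional word vectors; doc.vector averages them.
nlp = spacy.load("en_core_web_md")

def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)  # strip HTML tags
    return re.sub(r"\d+", " ", text)      # strip numbers

def to_features(texts):
    # Each review becomes a single dense 300-dimensional vector.
    return np.array([nlp(clean(t)).vector for t in texts])

example_reviews = ["This movie was <b>great</b>!", "Terrible. 0 stars out of 10."]
X = to_features(example_reviews)
print(X.shape)  # (2, 300)
```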
SVM performance and tradeoffs
We used scikit-learn's implementation of the SVM classifier. To select the best model, we used a grid search across 80 different parameter combinations:
- kernel function (RBF, linear, sigmoid, polynomial)
- C (0.001, 0.01, 0.1, 1, 10)
- gamma (0.001, 0.01, 0.1, 1)
One drawback of SVMs is that the parameter tuning process can be lengthy; we tested four different kernels, some of which are computationally expensive. Optimizing for accuracy, the best set of parameters on the training data was C = 1, gamma = 1, with an RBF kernel.
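In scikit-learn, this search looks roughly like the following; X_train and y_train are random placeholders standing in for the word2vec review vectors and labels:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# 4 kernels x 5 values of C x 4 values of gamma = 80 combinations.
param_grid = {
    "kernel": ["rbf", "linear", "sigmoid", "poly"],
    "C": [0.001, 0.01, 0.1, 1, 10],
    "gamma": [0.001, 0.01, 0.1, 1],
}

# Placeholder data standing in for the real feature matrix and labels.
X_train = np.random.rand(200, 300)
y_train = np.random.randint(0, 2, size=200)

search = GridSearchCV(SVC(), param_grid, scoring="accuracy", cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)
```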
Here is the performance of the SVM using the best parameters, trained on the entire training set and evaluated on the holdout set:
The training and holdout set performance are quite comparable (0.82-0.83 precision and recall, with AUC = 0.91). But how does the SVM match up against the neural network?
Neural network performance and tradeoffs
Neural networks can be designed with a variety of architectures, depending on the type of problem we are solving. We considered a multilayer feedforward neural network with 300 neurons in the input layer, 1 hidden layer, and 1 output layer.
To challenge the SVM model and do better in model evaluation, we ran the training several times, adjusting hyperparameters such as batch size, total epochs, dropout, and the number of neurons in the hidden layer for better accuracy. To prevent overfitting, we added a dropout layer as regularization. The network is trained using backpropagation of error.
Here is a graph showing how we chose 16 neurons for the hidden layer based on the best validation accuracy. As you can see, training accuracy increased with the number of neurons, but validation accuracy stayed nearly the same.
Finally, we chose the following architecture:
| Layer / setting | Configuration |
| --- | --- |
| Input layer | 300 neurons, ReLU activation |
| Hidden layer | 16 neurons, ReLU activation, dropout = 0.6 |
| Output layer | 1 neuron, sigmoid activation |
| Loss | binary cross-entropy |
| Batch size / total epochs | 128 / 50 |
| Optimizer | Adam |
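As a rough reconstruction of this architecture in Keras (assuming tf.keras; this is a sketch rather than our exact training code, with random placeholder data):

```python
import numpy as np
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential

# Architecture from the table above: 300-neuron ReLU input layer,
# 16-neuron ReLU hidden layer with dropout 0.6, and a sigmoid output.
model = Sequential([
    Dense(300, activation="relu", input_shape=(300,)),
    Dense(16, activation="relu"),
    Dropout(0.6),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Placeholder data standing in for the word2vec review vectors and labels.
X_train = np.random.rand(512, 300)
y_train = np.random.randint(0, 2, size=512)
model.fit(X_train, y_train, batch_size=128, epochs=50, validation_split=0.2)
```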
Like the SVM, this model was optimized for accuracy, with the best set of parameters chosen based on validation performance during training.
Here is the performance of the neural network using the best parameters, trained on the entire training set and evaluated on the holdout set:
The holdout set performance is nearly the same as the training set performance, with 0.81 precision and recall, and AUC = 0.89.
Here is how the chosen neural network model stacks up against the SVM on the holdout set:
| Metric | SVM | Neural network |
| --- | --- | --- |
| Precision | 0.82 | 0.81 |
| Recall | 0.82 | 0.81 |
| Accuracy | 0.82 | 0.81 |
| AUC | 0.91 | 0.89 |
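For reference, the metrics in this table can be computed with scikit-learn along these lines; the arrays below are random placeholders standing in for holdout labels, predictions, and classifier scores (SVM decision-function values or NN sigmoid outputs):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Placeholder labels, scores, and thresholded predictions.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 100)
y_score = rng.random(100)
y_pred = (y_score >= 0.5).astype(int)

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))
print("AUC:      ", roc_auc_score(y_true, y_score))
```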
It’s a tie…almost!
Learnings
How do you think we did? Some key takeaways from our experiment are as follows:
- This was far from an exhaustive comparison. SVM optimization deserves a series of talks in its own right, and there are many topics we did not cover. We tested only one way of encoding data in order to reduce the problem space, but there are many feature engineering options that may impact performance.
- Do you have big data (be honest here)? NNs tend to outperform traditional ML methods as data size grows. Our dataset was extremely small, which may account for why the models were nearly tied. The data was also shuffled, which may add some variation between runs. Before reaching for a NN, consider whether it is suited to your data and your problem!
- Decide whether training and prediction time are important to you. For instance, if you are designing a low-latency filtration system, you may get more mileage from less computationally complex and faster algorithms.
- Neural network design and tuning is as much of an art as it is a science (at this point). This is an area of (very) active development! Expect to spend time tuning for performance by turning those hyperparameter knobs.
- We did not fully explore other neural network architectures. In the last few years, recurrent neural networks (RNNs) with long short-term memory (LSTM) have become a popular way to learn from word sequences. We also did not explore convolutional neural networks (CNNs), which work well with images but can be applied to text as well.
There are many questions we want to continue exploring. We’ve barely scratched the surface!
Recommended Reading
- Brandon Rohrer demystifying the SVM black box:
https://www.youtube.com/watch?v=-Z4aojJ-pdg
- A Simple Introduction to SVMs:
https://www.cise.ufl.edu/class/cis4930sp11dtm/notes/intro_svm_new.pdf
- Stack Overflow answer on kernel function types:
https://stackoverflow.com/questions/33778297/support-vector-machine-kernel-types
- Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville:
http://www.deeplearningbook.org
- Make Your Own Neural Network by Tariq Rashid
https://www.amazon.com/Make-Your-Own-Neural-Network-ebook/dp/B01EER4Z4G