If you search the internet for how to apply Naive Bayes classification to text, you’ll find a ton of articles that talk about the intuition behind the algorithm, maybe some slides from a lecture about the math and some notation behind it, and a bunch of articles I’m not going to link here that pretty much just paste some code and call it an explanation.
So I’m going to try to do a little more here: by writing and explaining enough, I hope to let you write a working Naive Bayes classifier yourself.
There are three sections here. First is setup, and what format I’m expecting your text to be in for the classification. Second, I’ll talk about how to run Naive Bayes on your own, using slow Python data structures. Finally, we’ll use Python’s NLTK and its classifier so you can see how to use that, since, let’s be honest, it’s gonna be quicker. Note that you wouldn’t want to use either of these in production, so look for a follow-up post about how you might go about doing that.
As always, find me on Twitter, and check out the full code on GitHub.
Setup
The data for this is going to come from this UCSD Amazon review data set. I swear one of the biggest issues with running these algorithms on your own is finding a data set big and varied enough to get interesting results. Otherwise you’ll spend so much of your time scraping and cleaning data that by the time you get to the ML part of the project, you’re sufficiently annoyed. So big thanks that this data already exists.
You’ll notice that this set has millions of reviews for products across 24 different classes. In order to keep the complexity down here (this is a tutorial post after all), I’m sticking with two classes, and ones that are different enough from each other to show that classification works: we’ll be classifying baby reviews against tools and home improvement reviews.
Preprocessing
First thing I want to do now, after unpacking the .gz file, is to get a train and test set that’s smaller than the full 160,792 baby reviews and 134,476 tool reviews. For purposes here, I’m going to use 1000 of each, with 800 used for training, and 200 used for testing. The algorithms are able to support any number of training and test reviews, but for demonstration purposes, we’re making that number lower.
Check the github repo if you want to see the code, but I wrote a script that just takes the full file, picks 1000 random numbers, segments 800 into the training set, and 200 into the test set, and saves them to files with the names “train_CLASSNAME.json” and “test_CLASSNAME.json” where classname is either “baby” or “tool”.
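The real script is in the repo, but a minimal sketch of the idea might look like this. The function name `split_reviews`, its parameters, and the fixed seed are my own assumptions for illustration, not the repo’s code.

```python
import random

def split_reviews(lines, n_total=1000, n_train=800, seed=0):
    """Pick n_total random lines and split them into train and test lists."""
    rng = random.Random(seed)            # fixed seed so the split is repeatable
    sample = rng.sample(lines, n_total)  # sample without replacement
    return sample[:n_train], sample[n_train:]

# Toy stand-in for the lines of one unpacked review file
lines = ["review %d" % i for i in range(5000)]
train, test = split_reviews(lines)
print(len(train), len(test))
```

Sampling without replacement matters here: it guarantees no review ends up in both the training and the test file.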
Also, the files from that dataset are really nice, in that they’re already python objects. So to get them into a script, all you have to do is run “eval” on each line of the file if you want the dict object.
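Here’s a hedged sketch of that reading step. `read_reviews` and `get_review_texts` are the helper names used in the code further down, but the bodies below are my guess at what they do, not the repo’s code. I’ve also swapped `eval` for `ast.literal_eval`, which only accepts Python literals and so won’t run arbitrary code that happens to be sitting in a data file.

```python
import ast
import os
import tempfile

def read_reviews(filename):
    """Parse one Python-literal dict per non-empty line of the file."""
    with open(filename) as f:
        return [ast.literal_eval(line) for line in f if line.strip()]

def get_review_texts(filename):
    """Return just the review text of every review in the file."""
    return [review["reviewText"] for review in read_reviews(filename)]

# Tiny demo file in the same one-dict-per-line shape (made-up reviews)
path = os.path.join(tempfile.mkdtemp(), "train_baby.json")
with open(path, "w") as f:
    f.write("{'reviewText': 'Great stroller', 'overall': 5.0}\n")
    f.write("{'reviewText': 'Bottle leaks', 'overall': 2.0}\n")

texts = get_review_texts(path)
print(texts)
```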
Features
There really wasn’t a good place to talk about this, so I’ll mention it here before getting into either the self-written or the NLTK version of the algorithm. The features we’re going to use are simply the lowercased versions of all the words in the review. This means, in order to get a list of these words from the block of text, we remove punctuation, lowercase every word, split on spaces, and then remove words that are in the NLTK corpus of stopwords (basically boring words that don’t carry any information about class).
    import string

    from nltk.corpus import stopwords

    STOP_WORDS = set(stopwords.words('english'))
    STOP_WORDS.add('')  # splitting on single spaces can leave empty strings

    def clean_review(review):
        # strip punctuation, lowercase, split on spaces, drop stopwords
        exclude = set(string.punctuation)
        review = ''.join(ch for ch in review if ch not in exclude)
        split_sentence = review.lower().split(" ")
        clean = [word for word in split_sentence if word not in STOP_WORDS]
        return clean
Realize here that there are tons of different ways to do this, and ways to get more sophisticated that hopefully can get you better results! One is stemming, which takes words down to their root word (Wikipedia gives the example of “stems”, “stemmer”, “stemming”, “stemmed” as based on “stem”). You might also want to include n-grams, for an n larger than the 1 we’re using in our case.
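To make the n-gram idea concrete, here’s a small sketch using plain `zip`. The `ngrams` helper is hypothetical and written just for this example (NLTK ships its own n-gram utilities; this version simply avoids the dependency).

```python
def ngrams(tokens, n=2):
    """Return the list of n-grams (as tuples) found in a token list."""
    # zip the list against shifted copies of itself: tokens, tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["easy", "clean", "baby", "bottle"]
bigrams = ngrams(tokens, 2)
print(bigrams)
```

Each tuple could then be counted as a feature exactly the way single words are, at the cost of a much bigger vocabulary.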
Basically, there’s tons of processing on the text that you could do here. But since I’m just talking about how Naive Bayes works, I’m sticking with simplicity. Maybe in the future I can get fancy and see how well I can do in classifying these reviews.
Ok, on to the actual algorithm.
Self Naive Bayes
Now that we’ve got training and testing reviews all set up, along with a scheme for tokenizing the text, it’s time to run our custom Naive Bayes algorithm, slow, and in all its glory. Based on all the setup we did, the only input we need for this is the list of class names, ‘baby’ and ‘tool’.
Getting Word Counts
First step in the set up here, if you couldn’t tell from that appropriately named header just above, is getting counts for all the words in the reviews for each class. The code below, in English, is as follows.
For each class (‘baby’ and ‘tool’), we get the training filename that we created above. From that file we want a dictionary where the key is the word, and the value is the number of times we saw that word in the class corpus of reviews. That’s the counters_from_file function. Luckily, Python comes with a nifty Counter class so we don’t have to deal with the logic for a normal dict.
Secondly, we want to get initial probabilities for each class. That’s what the doc_counts variable is for: it holds the number of training documents per class, and the probabilities list is computed from it. Going back to Bayes’ rule, if you remember, one piece of the calculation is the probability of getting a document of a certain class in the first place. In our case I specifically made sure that there are 800 training samples and 200 test samples for each class, so the probability that we get a review of either class is 50%. But let’s say people love buying tools a lot more than baby products on Amazon, and for every five tool reviews, there’s only one baby review. We want to know that and take it into account. So we need to do a little work here to get that probability before running the algorithm.
In real life, getting this number is kind of difficult, because you can’t really know the probability of getting a review of each class since it’s only based on past reviews; those numbers might change. But at least this is a good estimate.
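As a quick worked example of that five-to-one scenario, here is how the class probabilities would come out. The counts are made up purely for illustration.

```python
# Hypothetical document counts: five tool reviews for every baby review
doc_counts = {"baby": 100, "tool": 500}

total = sum(doc_counts.values())
priors = {label: count / float(total) for label, count in doc_counts.items()}
print(priors)
```

With those counts a review is five times as likely to be a tool review before we’ve read a single word of it, and the classifier’s starting score for each class reflects that.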
Finally, the combined_bag variable holds the counter dictionary for the entire corpus.
    from collections import Counter

    def counters_from_file(filename):
        reviews = read_reviews(filename)
        texts = [review["reviewText"] for review in reviews]
        tokens = [clean_review(review_text) for review_text in texts]
        flattened_tokens = [val for sublist in tokens for val in sublist]
        counter = Counter(flattened_tokens)
        return counter

    # gets line count from the file, for initial probabilities
    def line_count_from_file(filename):
        return sum(1 for line in open(filename))

    counters = []
    doc_counts = []
    for label in class_labels:
        filename = "train_%s.json" % label
        doc_counts.append(line_count_from_file(filename))
        counter = counters_from_file(filename)
        counters.append(counter)

    probabilities = [float(doc_count) / sum(doc_counts) for doc_count in doc_counts]

    combined_bag = Counter()
    for counter in counters:
        combined_bag += counter
    combined_vocab_count = len(combined_bag.keys())
Record Keeping
Few things here for record keeping purposes. First are correct and incorrect counters, so at the end we can know what percentage we got correct. Second, a confusion matrix. This serves as a good way of figuring out specifically which classes are getting confused with each other. Very aptly named. I also have a nice function for printing out the confusion matrix in a readable form.
    def print_confusion_matrix(matrix, class_labels):
        # one header line plus one line per class, columns separated by tabs
        lines = ["" for i in range(len(class_labels) + 1)]
        for index, c in enumerate(class_labels):
            lines[0] += "\t"
            lines[0] += c
            lines[index + 1] += c
        for index, result in enumerate(matrix):
            for amount in result:
                lines[index + 1] += "\t"
                lines[index + 1] += str(amount)
        for line in lines:
            print(line)

    def initialize_conversion_matrix(num_labels):
        return [[0 for i in range(num_labels)] for y in range(num_labels)]

    correct = 0
    incorrect = 0
    confusion_matrix = initialize_conversion_matrix(len(class_labels))
Algorithm Time
Remember when I said that most ML is just getting data set up and processed for learning? Yeah, we’re finally about to run the algorithm itself. Like I mentioned above, when I went through other articles online, I found a lot of math, and then some code, but nothing that explained what was going on very well. So I’m going to try to bridge that gap here. So I’ll English the algorithm in case you don’t want to jump right to code.
For each review text that we clean using the cleaning and tokenizing method I mentioned above, we go through each class, and calculate the conditional probability of that class, given the text. In the Naive Bayes world, which assumes the words are independent of each other given the class, that value is proportional to:
P(class) * P(word1 | class) * P(word2 | class) * ... * P(wordn | class)
The key term here, and the one I didn’t find explained very well elsewhere, is P(word | class), which in English is the probability you’d expect to see that word, given the corpus of words within that class.
This number is the number of times you see that word in all of the documents in that class (found using the counter dictionary) divided by the total number of words seen in that class (found by summing the values in that counter dictionary for the class). This makes sense when you think about it. You have X total words seen from that class, and Y of them are the word you’re looking at. So the probability you see that word in that class is Y / X.
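Here’s that Y / X calculation on a tiny made-up corpus, just to show where the two numbers come from in a Counter. The tokens are invented for the example.

```python
from collections import Counter

# Toy "baby" class corpus, already tokenized
baby_tokens = ["baby", "loves", "bottle", "baby", "seat", "baby"]
bag = Counter(baby_tokens)

X = sum(bag.values())  # total words seen in the class
Y = bag["baby"]        # times we saw the word "baby"
p_word_given_class = Y / float(X)
print(X, Y, p_word_given_class)

# A Counter returns 0 for a word it has never seen, instead of raising KeyError
print(bag["drill"])
```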
The issue becomes if you haven’t seen that word before. In that case, Y will be 0, that term will be 0, and quickly the value for P(class | text) will be 0. To fix this, we’re going to do something called additive smoothing, where we make sure that the numerator in that statement will never be 0.
I’m not going to go into the math or reasoning behind this since you can find that elsewhere, but in the end, that statement turns into (Y + 1) / (X + number_of_words_in_vocab), where number_of_words_in_vocab is the number of unique words that we’ve seen across every review, regardless of class. We already have that number in the combined_vocab_count variable from the setup code above. In the code, you can see the key line as:
cond_prob = float((word_count + 1)) / (class_total_word_count + combined_vocab_count)
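To see what the smoothing does to the numbers, here is a small sketch on two made-up bags of words. The helper name `smoothed_prob` and the toy data are assumptions for the example; the formula is the same (Y + 1) / (X + vocab size) as in the key line.

```python
from collections import Counter

baby_bag = Counter(["baby", "bottle", "baby", "seat"])
tool_bag = Counter(["drill", "bit", "drill", "saw", "blade"])
vocab_size = len(baby_bag + tool_bag)  # unique words across both classes

def smoothed_prob(word, bag, vocab_size):
    """Add-one smoothed estimate of P(word | class)."""
    return (bag[word] + 1.0) / (sum(bag.values()) + vocab_size)

p_seen = smoothed_prob("baby", baby_bag, vocab_size)     # (2 + 1) / (4 + 7)
p_unseen = smoothed_prob("drill", baby_bag, vocab_size)  # (0 + 1) / (4 + 7)
print(p_seen, p_unseen)
```

The word the baby class never saw gets a small probability instead of zero, so a single unknown word can no longer wipe out the whole score.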
A note about the math.log function calls you see in the code as well. The math in Naive Bayes calls for multiplying the conditional probabilities together. But if you look at those numbers, they’re less than 1, somewhere in the range of 0.00X. This means when strung together and multiplied, you’re ending up with scores very close to 0, and in some cases I noticed, when there are many words in the text, the product gets too small for a Python float to represent and underflows to 0. We don’t want that obviously because we lose all information! Thankfully logs exist, and since log is an increasing function and we’re only comparing magnitudes, we can turn that above multiplied line into:
log(P(class)) + log(P(word1| class)) + log(P(word2 | class)) + ... + log(P(wordn | class))
and we end up with negative, but still comparable, numbers: the class with the highest (least negative) score wins. So if you ever hear someone complaining about how learning logs is pointless, you can point to this example and shut them up.
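You can watch the underflow happen with a few lines of Python. The 400 words at a probability of 0.001 each are made-up numbers, picked to be in the same ballpark as a long review.

```python
import math

probs = [0.001] * 400  # 400 words, each with P(word | class) = 0.001

product = 1.0
for p in probs:
    product *= p  # heads toward 1e-1200, far smaller than a float can hold

log_score = sum(math.log(p) for p in probs)

print(product)    # the product has underflowed to zero
print(log_score)  # the log version is a large negative number, still usable
```

Logs of probabilities are always zero or negative, so the scores you compare are negative numbers, and the largest one (closest to zero) is the winner.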
That should be everything unique, so here’s the code, complete with all loops for the classes, as well as checks for the guesses and record keeping.
    import math

    for index, class_name in enumerate(class_labels):
        filename = "test_%s.json" % class_name
        texts = get_review_texts(filename)
        for text in texts:
            tokens = clean_review(text)
            scores = []
            for cindex, bag in enumerate(counters):  # for each class
                # start from the log of the prior probability of this class
                score = math.log(probabilities[cindex])
                class_total_word_count = sum(bag.values())
                for word in tokens:
                    # for each word, we need the probability of that word given the class / bag
                    word_count = bag[word]
                    cond_prob = float((word_count + 1)) / (class_total_word_count + combined_vocab_count)
                    score += math.log(cond_prob)
                scores.append(score)
            # the guess is the class with the highest score
            max_index, max_value = max(enumerate(scores), key=lambda p: p[1])
            confusion_matrix[index][max_index] += 1
            if index == max_index:
                correct += 1
            else:
                incorrect += 1
When I run the full code on the 200 test samples for each class, I get the following output!
0.9625
        baby    tool
baby    192     8
tool    7       193
meaning we got about 96% of the documents correct. We thought 8 of the baby reviews were tool reviews, and 7 of the tool reviews were baby reviews.
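If you want to double check numbers like these, the accuracy falls straight out of the confusion matrix. This sketch hard-codes the matrix printed above (rows are the true class, columns are the guessed class, matching how the code fills it in); the per-class breakdown is an extra I’ve added for illustration.

```python
class_labels = ["baby", "tool"]
# Rows are the true class, columns are the guessed class
confusion_matrix = [[192, 8],
                    [7, 193]]

correct = sum(confusion_matrix[i][i] for i in range(len(class_labels)))
total = sum(sum(row) for row in confusion_matrix)
accuracy = correct / float(total)
print(accuracy)

# Per-class recall: of the true reviews in a class, how many we got right
recall = {}
for i, label in enumerate(class_labels):
    recall[label] = confusion_matrix[i][i] / float(sum(confusion_matrix[i]))
print(recall)
```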
If you look at some of the mistakes, you can somewhat see what the reasoning was. For example, this review was guessed as being a tool review, even though it was a baby review:
“These tweezers are a far more useful tool than those disgusting snot suction things! You have to be quick and precise with these but they do the trick. The tips are small enough to fit into a newborn’s nose and blunt enough that if you do touch skin or nose it doesn’t hurt. The tweezer is the length of a normal tweezer so it fits nicely in my hand. It is easy to clean and you don’t need any filters or hoses. The only downside is I could imagine the tip breaking off if I dropped it enough times since it’s plastic. On the other hand, you wouldn’t want metal in the baby’s nose anyway.”
Just looking at the words, naively some might say, you can see why the model got confused.
And there you go, Naive Bayes done by ourselves. Nice and quick if you run it with 1000 examples and two classes, but add more, and you’ll find the classification part of the code slows down big time. And plus, we don’t want to write all the code ourselves!
Naive Bayes using NLTK
We’ve already seen use of the NLTK by using their stopword list for removal. It’s a huge
First part of using the NLTK classifier is training the classifier. Same thing we did
One thing I noticed when looking online for how to use the NLTK here is that basically none of the tutorials talk about the form the data needs to be in when you pass it to the classifier. Check out the comment at the top of the code block to see how we need to modify the training data. For each text, we need a tuple where the first item is a dictionary of word counts for that text, and the second item is the class string.
    import nltk

    # note, training set needs to be in form of
    # train_set = [
    #     ({'I': 3, 'like': 1, 'this': 1, 'product': 2}, 'class_name_1'),
    #     ({'This': 2, 'is': 1, 'really': 1, 'great': 2}, 'class_name_1'),
    #     ...
    #     ({'Big': 1, 'fan': 1, 'of': 1, 'this': 1}, 'class_name_X'),
    # ]
    train_set = []
    for class_name in class_labels:
        filename = "train_%s.json" % class_name
        texts = get_review_texts(filename)
        for text in texts:
            tokens = clean_review(text)
            counter = Counter(tokens)
            train_set.append((dict(counter), class_name))

    classifier = nltk.NaiveBayesClassifier.train(train_set)
Now in order to classify the texts, we again create the counter dictionary mapping words to frequency in the document, and for each of them, pass it into classifier.classify, which returns the string value of the class it guesses.
    for index, class_name in enumerate(class_labels):
        filename = "test_%s.json" % class_name
        reviews = read_reviews(filename)
        texts = [review["reviewText"] for review in reviews]
        for text in texts:
            tokens = clean_review(text)
            counter = dict(Counter(tokens))
            guess = classifier.classify(counter)
            lindex = class_labels.index(guess)
            confusion_matrix[index][lindex] += 1
            if guess == class_name:
                correct += 1
            else:
                incorrect += 1
Sweet! We can run that code and get the results that follow. Note a couple things. First, they use slightly different math than the additive smoothing we used above, so the percentage correct and confusion matrix are slightly different. They talk about the math here, but it looks like there are some text formatting issues so it’s a little confusing. You’ll have to look at the code specifically if you want to know the math they use.
And also, the NLTK classifier has a cool function, show_most_informative_features, that prints the most informative features. Basically, it shows which words are most indicative of a text being from a certain class. So if you see ‘baby’, ‘wash’, ‘seat’, etc. in the text, you’re probably looking at a baby review.
0.9575
Most Informative Features
baby = 1 baby : tool = 49.3 : 1.0
wash = 1 baby : tool = 34.3 : 1.0
seat = 1 baby : tool = 30.3 : 1.0
child = 1 baby : tool = 29.7 : 1.0
likes = 1 baby : tool = 28.3 : 1.0
babies = 1 baby : tool = 28.3 : 1.0
led = 1 tool : baby = 27.0 : 1.0
solid = 1 tool : baby = 26.3 : 1.0
tool = 1 tool : baby = 26.2 : 1.0
toys = 1 baby : tool = 25.7 : 1.0
baby tool
baby 189 11
tool 6 194
And there you have it! Adding more classes to the NLTK classifier is really simple and quick which is the key. Using more training samples or more classes quickly increases the run time of the custom Naive Bayes code. The NLTK implementation runs much quicker, so use that for real world applications.
Final Thoughts
The key takeaway to remember here about how Naive Bayes works is thinking about the P(word | class) term, and the heuristic behind it. All we’re doing here, with some math in between, is taking each word in the test document and counting how frequently it occurs in the training documents of each class. If the word occurs more frequently in the training documents of one class compared to the other, the P(word | class) term will have a greater value, and make the final value for that class greater than for the other classes.
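To tie the pieces together, here’s the whole idea squeezed into one toy example: count words per class, smooth, add up logs, pick the biggest score. Everything in it (the tiny corpora, the `classify` helper, the equal priors) is made up for illustration and is a sketch of the approach, not the code from the repo.

```python
import math
from collections import Counter

# Toy training corpora, already tokenized (hypothetical data)
train = {
    "baby": ["baby", "bottle", "seat", "baby", "wash"],
    "tool": ["drill", "blade", "solid", "tool", "drill"],
}
bags = {label: Counter(tokens) for label, tokens in train.items()}
vocab_size = len(set(word for bag in bags.values() for word in bag))
priors = {label: 1.0 / len(train) for label in train}  # assume equal class sizes

def classify(tokens):
    """Return the class with the highest log score for the tokens."""
    scores = {}
    for label, bag in bags.items():
        total = sum(bag.values())
        score = math.log(priors[label])
        for word in tokens:
            # add-one smoothed P(word | class), accumulated in log space
            score += math.log((bag[word] + 1.0) / (total + vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify(["solid", "drill", "bit"]))
print(classify(["baby", "wash", "cloth"]))
```

Neither test text matches its class perfectly (“bit” and “cloth” were never seen in training), but the words that were seen pull the score toward the right class.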
And it works! With two classes here, we’re getting like 95% accuracy on classifying these reviews. That’s pretty good for a computer. Using different features or adding more training reviews can help with accuracy, but it’s pretty cool what a little code can get you.
Originally posted at bigishdata.com/