A common question in data science interviews is “How would you measure the performance of a classification model when 99% of your data belongs to one class?”
This is a straightforward question, yet many people stumble and don’t know how to respond.
In this article, we will explore the answer to the above question by discussing Recall, Precision, and the harmonic mean of the two – the F1 score.
Issue With Accuracy
Let’s begin by explaining why accuracy may not be a good measure. If 99% of your dataset belongs to one class, then a model that simply predicts that class every time achieves an accuracy score of 99%!
The model will effectively learn to predict only the majority class, hence achieving an accuracy score equal to the proportion of that class in the data.
Even without a machine learning model, I could guess the majority class for every sample and achieve an accuracy of 99%.
Therefore, for this particular problem, accuracy is not a good metric of model performance, and we have to seek alternative measurements.
Note: Sometimes accuracy is a good measure, but it all depends on the context of the work.
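To make this concrete, here is a minimal sketch of my own (not code from the original article) showing that a “model” which always predicts the majority class scores 99% accuracy on a 99:1 dataset:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# A toy dataset where 99% of samples belong to class 0.
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that ignores the input and always predicts the majority class.
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.99 - looks great, yet the model is useless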
Confusion Matrix
Before we continue, it is important to understand the Confusion Matrix. This breaks down your results into correct and incorrect classifications for both classes and is more informative in measuring performance than just accuracy.
Image generated by author.
Each quadrant tells us how well we classify each of the two classes. We may have 99% accuracy, yet always misclassify the 1% class!
This is clearly no good, and the Confusion Matrix makes it visible, so we can ensure we are adequately predicting both classes.
Below is a Confusion Matrix I generated in Python for a basic Logistic Regression classifier using Sci-Kit Learn’s confusion_matrix function:
Image produced by author in Python.
The full code used to generate the above plot can be found on my GitHub here.
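For reference, here is a minimal sketch of how such a Confusion Matrix can be produced with Sci-Kit Learn. This is my own outline of the approach, not the code from the GitHub repository, and it uses synthetic imbalanced data from make_classification in place of the article’s dataset:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the article's dataset.
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are the true classes, columns the predicted classes.
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(confusion_matrix=cm).plot()
plt.show()
```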
Precision
Precision measures: of the samples we predicted to be True, how many of those were actually True.
Mathematically, this is calculated as:
Precision = TP / (TP + FP), where TP is the number of True Positives and FP the number of False Positives.
False Positives are known as Type I errors in statistics.
Most algorithms classify a sample as positive if its predicted probability is ≥ 0.5 and negative if it is < 0.5. This boundary value is known as the threshold. As we increase the threshold, precision generally increases and tends towards 1.
This makes sense, as samples with a probability of 0.9 are far likelier to be truly positive than ones with a probability of 0.5. We are therefore more selective, and more precise, about the samples we put in the positive class.
Precision is the metric to prioritize when we want to be highly selective and correct about which samples we classify as positive, such as in email spam detection, where sending a genuine email to the spam folder is costly.
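To illustrate that threshold behaviour, here is a small sketch of my own (again on synthetic make_classification data, not the article’s dataset): we sweep the threshold ourselves and watch precision climb.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# Raising the threshold keeps only the most confident positive predictions,
# so precision tends to rise towards 1.
for threshold in [0.3, 0.5, 0.7, 0.9]:
    y_pred = (probs >= threshold).astype(int)
    print(f"threshold={threshold:.1f}  "
          f"precision={precision_score(y_test, y_pred, zero_division=0):.3f}")
```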
Recall
Recall measures: out of all the samples that are actually True, how many did we correctly classify as True.
Mathematically, it is calculated as:
Recall = TP / (TP + FN), where FN is the number of False Negatives.
False Negatives are known as Type II errors in statistics.
As the threshold decreases, Recall increases and tends to 1. This is because we start classifying more and more samples as positive and eventually, at a threshold of 0, everything is positive. Therefore, we would have no False Negatives and so Recall will equal 1.
Recall is useful when we want to ensure we capture all the True Positives, even if that means increasing our False Positives (false alarms). This is important, for example, in cancer detection, where it is far better to raise a false alarm than to miss a genuine case.
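The mirror-image sketch of the precision example above (my own illustration, same synthetic data) shows Recall = TP / (TP + FN) climbing towards 1 as the threshold is lowered:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

for threshold in [0.7, 0.5, 0.3, 0.0]:
    y_pred = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    # Lowering the threshold reduces the number of False Negatives,
    # so recall = TP / (TP + FN) climbs towards 1.
    print(f"threshold={threshold:.1f}  recall={tp / (tp + fn):.3f}")
```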
F1 Score
However, we now have a problem. Increasing the threshold increases Precision but decreases Recall. So what is the solution?
The F1 score strikes a compromise between Recall and Precision by taking the harmonic mean of the two:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The reason we use the harmonic mean is that it penalizes extreme values: the result is dominated by the smaller of the two. For example, say we had a Precision of 0.9 and a Recall of 0. The arithmetic mean would be 0.45, which looks respectable, yet the corresponding model is clearly no good as it captures no True Positives at all. The harmonic mean, in contrast, is 0.
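A quick numerical sketch of that point (my own illustration of the formula above):

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.9, 0.0
print((precision + recall) / 2)  # arithmetic mean: 0.45 - misleadingly respectable
print(f1(precision, recall))     # harmonic mean (F1): 0.0 - reflects the useless recall
```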
Recall, Precision and F1 vs Threshold Plot
Below is a plot of Recall, Precision, and F1 score as a function of the threshold for my basic Logistic Regression model, using Sci-Kit Learn’s precision_recall_curve function:
Image generated by author in Python.
The full code that generated the above plot can be found on my GitHub here.
In the above plot, we see that Recall is 1 at low thresholds, whereas Precision reaches 1 at high thresholds. Additionally, notice how sharply the F1 score drops once Recall starts to decrease; this shows how the harmonic mean punishes extreme values.
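Here is a sketch of how such a plot can be put together with precision_recall_curve. As before, this is my own outline on synthetic data, not the article’s exact code:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, probs)

# precision and recall have one more element than thresholds, so drop the last
# value; the small epsilon guards against division by zero.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)

plt.plot(thresholds, precision[:-1], label="Precision")
plt.plot(thresholds, recall[:-1], label="Recall")
plt.plot(thresholds, f1, label="F1")
plt.xlabel("Threshold")
plt.legend()
plt.show()
```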
Conclusion
In this article, we have discussed why accuracy is not always the best performance metric for your model. Instead, you should always determine its Precision and Recall scores, along with a Confusion Matrix, to fully analyze your results. This is particularly important when you have an imbalanced dataset, to ensure your model is performing as expected.
Article originally posted here by Egor Howell. Reposted with permission.