Have you ever seen a machine learning model with 99% accuracy and watched its author brag about it? Well, I’ve been there. Setting overfitting aside, accuracy is just one metric, and on certain data it is a metric you cannot depend on. In this article, I will explain why high accuracy can be misleading, using simple interpretations of a simple classification problem. Let’s try a stupid model, a very stupid one that classifies nothing correctly, i.e. every spam is predicted as nonspam and every nonspam is predicted as spam:
$$y$$ | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
---|---|---|---|---|---|---|---|---|---|---|
$$p$$ | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
where $$y$$ is the actual label and $$p$$ is the predicted label: 1 for spam and 0 for nonspam comments on social media.
This classifier has neither precision nor recall, i.e. precision = 0 and recall = 0 (a stupid one, I told you).
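As a quick check, here is a minimal scikit-learn sketch (assuming scikit-learn is installed) confirming that both metrics are zero for the spam class:

```python
from sklearn.metrics import precision_score, recall_score

y = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # actual labels
p = [1, 0, 0, 1, 1, 1, 1, 1, 1, 1]  # predictions: every label flipped

# pos_label defaults to 1, i.e. metrics are computed for the spam class
print(precision_score(y, p))  # 0.0: no predicted spam is actually spam
print(recall_score(y, p))     # 0.0: no actual spam was predicted as spam
```

Let’s take another classifier: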
$$y$$ | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
---|---|---|---|---|---|---|---|---|---|---|
$$p$$ | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
That one classifies all spams correctly but also classifies every nonspam as spam. The precision of nonspams is the lowest possible here (zero, by convention, since the classifier never predicts nonspam at all), while the recall of spams is the highest possible (equals 1) because each spam is predicted as spam. But what exactly are precision and recall?
The table below compares seven classifiers $$p_1$$ through $$p_7$$ on the same actual labels $$y$$, along with their per-class metrics:
$$y$$ | $$p_1$$ | $$p_2$$ | $$p_3$$ | $$p_4$$ | $$p_5$$ | $$p_6$$ | $$p_7$$ |
---|---|---|---|---|---|---|---|
0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 |
1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |
0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
Precision(0) | 0 | 0 | 0.8 | 1 | 1 | 1 | 0.89 |
Recall(0) | 0 | 0 | 1 | 0.38 | 0.75 | 0.88 | 1 |
F1-score(0) | 0 | 0 | 0.89 | 0.55 | 0.86 | 0.93 | 0.94 |
Precision(1) | 0 | 0.2 | 0 | 0.29 | 0.5 | 0.67 | 1 |
Recall(1) | 0 | 1 | 0 | 1 | 1 | 1 | 0.5 |
F1-score(1) | 0 | 0.33 | 0 | 0.44 | 0.67 | 0.8 | 0.67 |
Accuracy | 0 | 0.2 | 0.8 | 0.5 | 0.8 | 0.9 | 0.9 |
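If you want to verify these numbers yourself, here is a minimal sketch using scikit-learn’s per-class metrics (assuming scikit-learn is installed; `zero_division=0` mirrors the convention above of treating an undefined precision or recall as zero):

```python
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

y = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # actual labels

# Columns p1..p7 of the table above
predictions = {
    "p1": [1, 0, 0, 1, 1, 1, 1, 1, 1, 1],
    "p2": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "p3": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "p4": [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
    "p5": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "p6": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "p7": [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
}

for name, p in predictions.items():
    for label in (0, 1):
        prec = precision_score(y, p, pos_label=label, zero_division=0)
        rec = recall_score(y, p, pos_label=label, zero_division=0)
        f1 = f1_score(y, p, pos_label=label, zero_division=0)
        print(f"{name} class {label}: precision={prec:.2f} recall={rec:.2f} F1={f1:.2f}")
    print(f"{name} accuracy={accuracy_score(y, p):.2f}")
```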
These metrics are built from the counts $$TP$$, $$FP$$, $$TN$$, and $$FN$$, where the first letter is True or False and the second is Positive or Negative. To understand what I’m doing in the next lines, start with whether a prediction is Positive or Negative. Let’s take a closer look at predicted spams (class 1). This means Positive will be 1 and Negative will be 0, while True will be the same as $$y$$ and False will be the opposite of $$y$$.
$$TP$$ is how many predicted 1’s are actually 1’s. $$FP$$ is how many predicted 1’s are 0’s in the actual data. $$TN$$ is how many predicted 0’s are 0’s in the actual data. $$FN$$ is how many predicted 0’s are actually 1’s.
To interpret these counts, replace each 1 by spam and each 0 by nonspam.
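From these counts, precision and recall of class 1 are simple ratios:

$$ \text{precision} = \frac{TP}{TP + FP} \qquad \text{recall} = \frac{TP}{TP + FN} $$

That is, precision asks “of everything predicted spam, how much really is spam?” while recall asks “of all real spams, how many did we catch?”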
The F1-score is the harmonic mean of precision and recall, and accuracy is the number of correctly predicted examples over the total size.
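In formulas:

$$ \text{F1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad \text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} $$

For $$p_7$$ on class 1, for example, F1 = 2 · (1 · 0.5) / (1 + 0.5) ≈ 0.67, matching the table even though the accuracy of $$p_7$$ is 0.9: a reminder that a high accuracy can hide a recall of only 0.5 on the class you actually care about.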