# The Confusion Matrix and Its Two Types of Error

Classification is a supervised learning approach in which the target variable is discrete (or categorical). Evaluating a machine learning model is as important as building it. We build models to perform well on new, previously unseen data, so a thorough and versatile evaluation is required to create a robust model. For classification models, the evaluation process gets somewhat tricky.

# The Motivation Behind the Confusion Matrix

Accuracy shows the ratio of correct predictions to all predictions:

$$
\text{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}
$$

In some cases, accuracy is not enough to evaluate a model. Assume we build a model for a binary classification task and the distribution of the target variable is imbalanced (93% of data points are in class A and 7% in class B).

We have a model that only predicts class A. It is hard to even call it a “model” because it predicts class A without any calculation. However, since 93% of the samples are in class A, the accuracy of our model is 93%.
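The arithmetic behind this example can be checked with a few lines of Python. This is a minimal sketch on a made-up data set of 100 samples (93 in class A, 7 in class B); the variable names are illustrative only.

```python
# Hypothetical data set: 93 samples of class "A" and 7 of class "B".
y_true = ["A"] * 93 + ["B"] * 7

# A "model" that ignores its input and always predicts class A.
y_pred = ["A"] * len(y_true)

# Accuracy: correct predictions divided by all predictions.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Number of class B samples the model actually identified.
detected_b = sum(t == "B" and p == "B" for t, p in zip(y_true, y_pred))

print(f"accuracy = {accuracy:.2f}")                     # accuracy = 0.93
print(f"class B samples detected = {detected_b} of 7")  # 0 of 7
```

The accuracy looks respectable even though not a single class B sample is found.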

What if it is crucial to detect class B correctly and we cannot afford to misclassify any class B samples (e.g. cancer prediction)? This is where the confusion matrix comes into play.

# Confusion Matrix

A confusion matrix is not a metric to evaluate a model, but it provides insight into the predictions. Understanding the confusion matrix is a prerequisite for understanding other classification metrics such as precision and recall.

A confusion matrix goes deeper than classification accuracy by showing the correct and incorrect (i.e. true or false) predictions for each class. For a binary classification task, the confusion matrix is a 2x2 matrix. If there are three different classes, it is a 3x3 matrix, and so on.
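To make the "one row and one column per class" idea concrete, here is a small sketch that builds the matrix by hand for three classes. The helper `confusion_matrix` and the toy labels are made up for this illustration, and the layout (rows are actual classes, columns are predicted classes) is an assumed convention; some references use the transpose.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count predictions per (actual, predicted) pair.

    Assumed layout: rows are actual classes, columns are predicted classes.
    """
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


# Toy data, invented for the example.
labels = ["bird", "cat", "dog"]
y_true = ["cat", "dog", "bird", "cat", "dog", "bird", "cat"]
y_pred = ["cat", "dog", "cat", "cat", "bird", "bird", "dog"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>5}: {row}")

# Correct predictions sit on the diagonal, so accuracy can be read off it.
correct = sum(matrix[i][i] for i in range(len(labels)))
print("accuracy =", correct / len(y_true))
```

Every off-diagonal cell tells you which class was mistaken for which, which is information a single accuracy number hides.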

Let’s assume class A is the positive class and class B is the negative class. The key terms of a confusion matrix are as follows:

- True positive (TP): Predicting positive class as positive (ok)
- False positive (FP): Predicting negative class as positive (not ok)
- False negative (FN): Predicting positive class as negative (not ok)
- True negative (TN): Predicting negative class as negative (ok)
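The four terms can be counted directly from a list of actual and predicted labels. The sketch below uses a hypothetical helper, `binary_counts`, and invented labels, with class A as the positive class as assumed above.

```python
def binary_counts(y_true, y_pred, positive):
    """Return (TP, FP, FN, TN) for the given positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


# Toy labels, invented for the example.
y_true = ["A", "A", "A", "B", "B", "A", "B", "B"]
y_pred = ["A", "B", "A", "A", "B", "A", "B", "A"]

tp, fp, fn, tn = binary_counts(y_true, y_pred, positive="A")
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=3 FP=2 FN=1 TN=2
```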

The desired outcome is that the predicted class and the actual class are the same. The terms may look confusing, but a simple trick helps you remember them. Mine is as follows:

- The second word is what the model predicts.
- The first word indicates whether the prediction is correct.

For example, a false positive means the model predicted positive and that prediction was wrong.

Note: A false positive is also known as a type I error. A false negative is also known as a type II error.

The confusion matrix is used to calculate precision and recall.

# Precision and Recall

Precision and recall metrics take the classification accuracy one step further and allow us to get a more specific understanding of model evaluation. Which one to prefer depends on the task and what we aim to achieve.

Precision measures how good our model is when the prediction is positive. It is the ratio of correct positive predictions to all positive predictions:

$$
\text{Precision} = \frac{TP}{TP + FP}
$$

Recall measures how good our model is at finding the actual positive samples. It is the ratio of correct positive predictions to all actual positive samples:

$$
\text{Recall} = \frac{TP}{TP + FN}
$$

The focus of precision is positive predictions, so it indicates how many of the positive predictions are true. The focus of recall is the actual positive class, so it indicates how many of the actual positive samples the model is able to predict correctly.
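Both ratios follow directly from the confusion matrix counts. A minimal sketch, using made-up counts and hypothetical helper names; returning 0.0 when the denominator is zero is an assumed convention, not the only possible one.

```python
def precision(tp, fp):
    """Correct positive predictions divided by all positive predictions."""
    # Assumed convention: 0.0 when the model made no positive predictions.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Correct positive predictions divided by all actual positives."""
    # Assumed convention: 0.0 when there are no actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up counts for illustration.
tp, fp, fn = 3, 2, 1

p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision = {p:.2f}")  # precision = 0.60
print(f"recall    = {r:.2f}")  # recall    = 0.75
```

Note that false positives only hurt precision and false negatives only hurt recall, which is why the two metrics can move in different directions.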

# Precision or Recall?

In practice, we usually cannot maximize both precision and recall because there is a trade-off between them. For a given model, changing the decision threshold to increase precision typically decreases recall, and vice versa. We can aim to maximize precision or recall depending on the task.
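One way to see the trade-off is to take a model that outputs a score per sample and vary the threshold above which a sample is labeled positive. The scores and labels below are invented for illustration, and `precision_recall_at` is a hypothetical helper, so treat this as a sketch rather than a general proof.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when scores >= threshold are predicted positive."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented model scores and true labels (1 = positive, 0 = negative).
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

results = {t: precision_recall_at(t, scores, labels) for t in (0.25, 0.5, 0.8)}
for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

On this toy data, the strict threshold (0.8) gives perfect precision but finds only half of the positives, while the lenient threshold (0.25) finds all positives at the cost of more false positives.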

Consider an email spam detection model. Here we try to maximize precision because we want to be correct when an email is flagged as spam. We do not want to label a normal email as spam (i.e. a false positive). It is acceptable if the model misses a few spam emails, but if a very important email is marked as spam, the consequences might be severe.

On the other hand, for a cancer cell detection task, we need to maximize recall because we want to detect every positive sample (malignant cell). If the model predicts a malignant cell as benign (i.e. a false negative), it would be a critical mistake.

There is another measure that combines precision and recall into a single number: the F1 score.

# F1 Score

The F1 score is the harmonic mean of precision and recall:

$$
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

The F1 score is a more useful measure than accuracy for problems with an uneven class distribution because it takes into account both false positives and false negatives.

The best value for the F1 score is 1 and the worst is 0.
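The sketch below computes the F1 score as the harmonic mean of precision and recall and compares it with a plain arithmetic mean. The input values are made up, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both are zero
    return 2 * precision * recall / (precision + recall)


# Reasonably balanced precision and recall.
balanced = f1_score(0.6, 0.75)

# Very high precision but very low recall.
lopsided = f1_score(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2

print(f"F1 (0.60, 0.75) = {balanced:.3f}")
print(f"F1 (1.00, 0.10) = {lopsided:.3f}")
print(f"arithmetic mean (1.00, 0.10) = {arithmetic_mean:.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 score by doing well on only one of precision and recall.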

# Sensitivity and Specificity

Sensitivity, also known as the true positive rate (TPR), is the same as recall. Hence, it measures the proportion of the positive class that is correctly predicted as positive:

$$
\text{Sensitivity} = \frac{TP}{TP + FN}
$$

Specificity is similar to sensitivity but focuses on the negative class. It measures the proportion of the negative class that is correctly predicted as negative:

$$
\text{Specificity} = \frac{TN}{TN + FP}
$$
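As a quick illustration, both quantities can be computed from the four confusion matrix counts. The counts and helper names below are made up for the example, and returning 0.0 for an empty denominator is an assumed convention.

```python
def sensitivity(tp, fn):
    """Share of actual positives predicted as positive (same as recall)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Share of actual negatives predicted as negative."""
    return tn / (tn + fp) if (tn + fp) else 0.0


# Made-up counts for illustration.
tp, fp, fn, tn = 3, 2, 1, 2

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity = {sens:.2f}")  # sensitivity = 0.75
print(f"specificity = {spec:.2f}")  # specificity = 0.50
```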

# Conclusion

The no free lunch theorem applies here too: no single metric is the best choice for every task. We need to clearly define the requirements and choose a metric based on them.