FOUNDATIONS OF MACHINE LEARNING

Imbalanced Classes


Introduction

As our world is becoming more and more connected, reliable cyber security is getting increasingly important for our society. One part of this is so-called intrusion detection systems, a type of software that aims to detect malicious behaviour over a network. This is a problem that can be addressed using machine learning. In this exercise we will consider a synthetic example, albeit motivated by the application to cyber security, to make the exercise a bit more concrete.

Assume that a binary classifier has been trained to distinguish malicious network traffic $(y=1)$ from normal traffic $(y=-1)$ based on a set of relevant features $\mathbf{x}$. For the application example, the feature vector could for instance include the number of bytes and packets transmitted, the average response time for a request, and so on (see for instance this page for a number of relevant datasets of this type provided by the Canadian Institute for Cybersecurity).

In our synthetic example, we assume that the trained classifier is evaluated on a test set, resulting in the confusion matrix below.

                   $y=-1$     $y=1$      total
$\hat{y}=-1$      109 112       804    109 916
$\hat{y}=1$        14 212      3805     18 017
total             123 324      4609    127 933

We will use the numbers in this matrix to solve a number of exercises below. For this, Chapter 4.5 in the course book might come in handy.


Question A: Accuracy

Compute the test accuracy of the classifier, based on the confusion matrix given above. Also, compute the test accuracy for a naive classifier that always predicts $\hat y=-1$.
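
As a starting point, the computation could be set up as in the following Python sketch; the variable names for the confusion-matrix entries are our own choice, not part of the exercise.

```python
# Entries of the confusion matrix above (variable names are our own).
TN = 109_112  # normal traffic predicted as normal
FN = 804      # attacks predicted as normal (missed attacks)
FP = 14_212   # normal traffic predicted as attacks (false alarms)
TP = 3_805    # attacks predicted as attacks
n = TN + FN + FP + TP  # 127 933 test data points in total

# Accuracy of the trained classifier: fraction of correct predictions.
accuracy = (TP + TN) / n

# The naive classifier always predicts -1, so it is correct exactly
# for the 123 324 data points that actually belong to the negative class.
accuracy_naive = (TN + FP) / n

print(f"Trained classifier accuracy: {accuracy:.4f}")        # approx. 0.88
print(f"Naive classifier accuracy:   {accuracy_naive:.4f}")  # approx. 0.96
```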

Accuracy isn't enough!

If solved correctly, you should have seen in the previous question that the accuracy of the naive classifier is in fact larger than the accuracy of the trained classifier. Does this mean that the classifier is useless?

Well, not necessarily. The problem is that accuracy does not take into consideration that the classification problem is highly imbalanced, in the sense that the negative class (normal traffic) in this case is much more common than the positive class (malicious traffic). Since accuracy only counts the total number of correct predictions, it is dominated by the performance on the majority (negative) class.

Since the purpose of the intrusion detection system is to detect malicious behaviour, the naive classifier is clearly of no use. The trained classifier, on the other hand, actually manages to catch many of the attacks so that it can warn the user. The price to pay is that it will also raise quite a few false alarms (or false positives, in the language of the confusion matrix). However, assuming that the actual cost of letting an attack slip through is much larger than the cost of raising a false alarm, this is perhaps a price that we are willing to pay. In fact, in this regard, this is not only an imbalanced problem but also an asymmetric one, meaning that the cost of a false negative is larger than the cost of a false positive.


Question B: Precision and Recall

As an alternative to using accuracy as an evaluation metric, we can consider the precision and the recall of the classifier. Complete the code below to compute the precision and recall based on the confusion matrix.
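
As a minimal sketch, assuming the same variable names for the confusion-matrix entries as in Question A, the computation could look as follows:

```python
# Using the same confusion-matrix entries as in Question A.
TP = 3_805    # attacks correctly flagged by the classifier
FP = 14_212   # normal traffic incorrectly flagged (false alarms)
FN = 804      # attacks that slipped through undetected

# Precision: how reliable is a raised alarm?
precision = TP / (TP + FP)

# Recall: how large a proportion of the actual attacks are caught?
recall = TP / (TP + FN)

print(f"Precision: {precision:.3f}")  # approx. 0.21
print(f"Recall:    {recall:.3f}")     # approx. 0.83
```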

Precision and Recall

The precision and recall give a more nuanced picture of the performance of the classifier. The precision tells us how reliable a positive prediction is, or put differently, it is the probability that an alarm raised by the classifier is legitimate. The recall, on the other hand, tells us how large a proportion of the actual attacks are caught by the classifier. Ideally, both of these numbers should be one, but in practice there is usually a trade-off between the two.
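
In terms of the entries of the confusion matrix (true positives $TP$, false positives $FP$, and false negatives $FN$), the two metrics are defined as

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}.
$$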

Note that the naive classifier from above would have a recall of zero, and a precision that is undefined. If we instead consider a (just as naive) classifier that always predicts $\hat y = +1$, we would get a recall of one, but the precision would be only $P/n \approx 0.036$.
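
To see where these numbers come from, plugging the column totals of the confusion matrix into the definitions above gives, for the always-positive classifier,

$$
\text{recall} = \frac{4\,609}{4\,609} = 1, \qquad \text{precision} = \frac{4\,609}{127\,933} \approx 0.036.
$$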


Conclusion

You have now seen that accuracy is not the only evaluation metric to care about. This is particularly important to keep in mind when the problem at hand is imbalanced or asymmetric. When evaluating a classifier, it is always a good idea to draw the confusion matrix (note that the matrix can be generalized to multiple classes, but it becomes a bit harder to interpret). This will give you a more complete picture of the performance of your model than relying only on a single-number evaluation metric.

In the next example, we will continue to look at the model selection problem, and consider choosing between a non-parametric k-NN model and a parametric linear regression model.
