Knowledge Base

Classification Metrics

Classification Threshold

To make a prediction, a binary classification model typically computes the probability of each class. As we have only two classes (zero and one), if the probability of class 1 is greater than 0.5, the observation is labeled positive; otherwise, it's labeled negative.

The value that separates the negative class from the positive class is called the threshold; by default it's 0.5.
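
A minimal sketch of applying the default threshold (the probability values here are made up for illustration):

import numpy as np

# hypothetical class "1" probabilities for five observations
probabilities_one = np.array([0.12, 0.47, 0.50, 0.73, 0.91])

# values greater than the 0.5 threshold become the positive class
predictions = (probabilities_one > 0.5).astype(int)
print(predictions)  # [0 0 0 1 1]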

Threshold Adjustment

In sklearn, the class probabilities can be calculated with the predict_proba() method. It takes the features of the observations and returns the probabilities:

probabilities = model.predict_proba(features)

Rows correspond to observations. The first column contains the negative class probability, and the second contains the positive class probability (the two probabilities sum to one).
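
For example, the positive class probabilities can be selected as the second column (a sketch assuming probabilities came from the predict_proba() call above):

# each row is one observation: [P(class 0), P(class 1)]
probabilities_one = probabilities[:, 1]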

To loop over threshold values in the desired range, we use the arange() function from the numpy library.

for value in np.arange(first, last, step):
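
A minimal sketch of such a loop (the range from 0 to 0.5 in steps of 0.1 and the probability values are made up for illustration):

import numpy as np

# hypothetical class "1" probabilities
probabilities_one = np.array([0.12, 0.47, 0.50, 0.73, 0.91])

# try thresholds 0.0, 0.1, ..., 0.5 and see how the predictions change
for threshold in np.arange(0, 0.6, 0.1):
    predictions = (probabilities_one > threshold).astype(int)
    print(round(threshold, 1), predictions)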

Four Basic Outcomes of Binary Classification

By combining the true answers with the model's predictions, we get the following four outcomes (a small counting sketch follows the list):

True Positive answers (TP): the model labeled an object as "1", and its real value is also "1";

True Negative answers (TN): the model labeled an object as "0", and its real value is also "0";

False Positive answers (FP): the model labeled an object as "1", but its actual value is "0";

False Negative answers (FN): the model labeled an object as "0", but its actual value is "1".
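
As a quick sketch with made-up arrays, the four counts can be computed directly with numpy:

import numpy as np

target = np.array([0, 1, 1, 0, 1, 0, 1, 0])       # true answers
predictions = np.array([0, 1, 0, 0, 1, 1, 1, 1])  # model's labels

tp = ((target == 1) & (predictions == 1)).sum()  # 3
tn = ((target == 0) & (predictions == 0)).sum()  # 2
fp = ((target == 0) & (predictions == 1)).sum()  # 2
fn = ((target == 1) & (predictions == 0)).sum()  # 1
print(tp, tn, fp, fn)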

Confusion Matrix

When TP, FP, TN, FN are collected into a table, it is called a confusion matrix. The matrix is formed as follows:

  • The algorithm's labels (0 and 1) are placed on the horizontal axis ("Predictions").
  • True labels of the class (0 and 1) are placed on the vertical axis ("Answers").

What we get:

  1. The correct predictions are on the main diagonal (from the upper-left corner):
    • TN in the upper-left corner
    • TP in the lower-right corner
  2. Incorrect predictions are outside of the main diagonal:
    • FP in the upper-right corner
    • FN in the lower-left corner

See how this works with a dataset below:

The confusion_matrix() function takes correct answers and predictions and returns a confusion matrix.

from sklearn.metrics import confusion_matrix

print(confusion_matrix(target, predictions))
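
For instance, on the same made-up answers and predictions as in the sketch above, the matrix comes out as follows:

from sklearn.metrics import confusion_matrix

target = [0, 1, 1, 0, 1, 0, 1, 0]
predictions = [0, 1, 0, 0, 1, 1, 1, 1]

print(confusion_matrix(target, predictions))
# [[2 2]
#  [1 3]]   i.e. TN=2, FP=2, FN=1, TP=3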

Recall

Recall reveals what portion of the actually positive observations the model has identified. It is calculated using this formula:

\text{Recall} = \frac{\text{TP}}{\text{TP}+\text{FN}}

Recall is an evaluation metric that measures the share of TP answers among all answers that actually have a label 1. We want the recall value to be close to 1. This would mean that the model is good at identifying true positives. If it is closer to zero, the model needs to be checked and fixed.

The recall_score() function takes correct answers and predictions and returns the recall value.

from sklearn.metrics import recall_score

print(recall_score(target, predictions))
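
Continuing the same toy arrays (made up for illustration), recall_score() matches the formula TP / (TP + FN):

from sklearn.metrics import recall_score

target = [0, 1, 1, 0, 1, 0, 1, 0]
predictions = [0, 1, 0, 0, 1, 1, 1, 1]

# TP = 3 and FN = 1, so recall = 3 / (3 + 1)
print(recall_score(target, predictions))  # 0.75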

Precision

Precision measures what share of the observations the model labeled as positive are actually positive. The more negative observations the model picks up while searching for positive ones, the lower the precision.

\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}

We want the precision value to be close to 1. The precision_score() function takes correct answers and predictions and returns the precision value.

from sklearn.metrics import precision_score

print(precision_score(target, predictions))
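
On the same toy arrays, precision_score() matches TP / (TP + FP):

from sklearn.metrics import precision_score

target = [0, 1, 1, 0, 1, 0, 1, 0]
predictions = [0, 1, 0, 0, 1, 1, 1, 1]

# TP = 3 and FP = 2, so precision = 3 / (3 + 2)
print(precision_score(target, predictions))  # 0.6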

Here you can see how precision and recall are interrelated on a dataset:
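
The sketch below uses made-up class "1" probabilities (for illustration only) to show the typical trade-off: raising the threshold tends to increase precision and decrease recall.

import numpy as np
from sklearn.metrics import precision_score, recall_score

target = np.array([0, 1, 1, 0, 1, 0, 1, 0])
# hypothetical class "1" probabilities for the same observations
probabilities_one = np.array([0.10, 0.80, 0.35, 0.45, 0.90, 0.60, 0.70, 0.20])

for threshold in np.arange(0.3, 0.8, 0.2):
    predictions = (probabilities_one > threshold).astype(int)
    print(round(threshold, 1),
          precision_score(target, predictions),
          recall_score(target, predictions))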

F1 Score

Recall and precision evaluate the quality of predictions of the positive class from different angles. Recall describes how well the model understood the properties of this class and how well it recognized the class. Precision detects whether the model is overdoing it by assigning too many positive labels.

The F1 score helps control both recall and precision simultaneously. The 1 in F1 means that recall and precision are weighted equally (a 1:1 ratio).

F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

When either recall or precision is close to zero, the harmonic mean itself approaches 0.

If the positive class is predicted poorly by either measure, an F1 score close to zero will show that the prediction of class 1 has failed.

Below we can see how changing our threshold will affect the resulting F1 score:

The f1_score() function takes correct answers and predictions and returns the harmonic mean of recall and precision.

from sklearn.metrics import f1_score

print(f1_score(target, predictions))
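
Using the same made-up class "1" probabilities as in the precision/recall sketch above, we can watch how the F1 score reacts as the threshold moves:

import numpy as np
from sklearn.metrics import f1_score

target = np.array([0, 1, 1, 0, 1, 0, 1, 0])
probabilities_one = np.array([0.10, 0.80, 0.35, 0.45, 0.90, 0.60, 0.70, 0.20])

for threshold in np.arange(0.3, 0.8, 0.2):
    predictions = (probabilities_one > threshold).astype(int)
    print(round(threshold, 1), f1_score(target, predictions))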

PR curve

On the graph, recall is plotted horizontally and precision vertically. A curve plotted from the precision and recall values at different thresholds is called a PR curve. The higher the curve, the better the model.
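
A minimal plotting sketch, assuming matplotlib is available and using the same made-up target and class "1" probabilities as above (sklearn's precision_recall_curve() sweeps the thresholds for us):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

target = np.array([0, 1, 1, 0, 1, 0, 1, 0])
probabilities_one = np.array([0.10, 0.80, 0.35, 0.45, 0.90, 0.60, 0.70, 0.20])

# precision and recall at every candidate threshold
precision, recall, thresholds = precision_recall_curve(target, probabilities_one)

plt.plot(recall, precision)  # recall on the horizontal axis, precision on the vertical
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()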

TPR & FPR

You can't calculate precision when the model labels no observations as positive: the denominator TP + FP becomes zero.

Before moving on to the new curve, let's define a few important terms.

The True Positive Rate (TPR) is the number of TP answers divided by the total number of actually positive observations (P).

\text{TPR} = \frac{\text{TP}}{\text{P}}

The False Positive Rate (FPR) is the number of FP answers divided by the total number of actually negative observations (N).

\text{FPR} = \frac{\text{FP}}{\text{N}}

The denominators (P and N) are fixed by the data and don't depend on the model's predictions.
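
On the toy counts from the earlier sketches (TP = 3, FN = 1, FP = 2, TN = 2, so P = 4 and N = 4), the rates work out as:

tp, fn, fp, tn = 3, 1, 2, 2

tpr = tp / (tp + fn)  # TP / P = 3 / 4 = 0.75
fpr = fp / (fp + tn)  # FP / N = 2 / 4 = 0.5
print(tpr, fpr)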

ROC curve

We put the FPR values along the horizontal axis and the TPR values along the vertical axis. Then we iterate over the logistic regression threshold values and plot a curve. This is the ROC curve (Receiver Operating Characteristic).

For a model that always answers randomly, the ROC curve is a diagonal line going from the lower left to the upper right. The higher the curve, the greater the TPR value and the better the model's quality.

The AUC-ROC value (Area Under Curve ROC) is an evaluation metric with values in the range from 0 to 1. The AUC-ROC value for a random model is 0.5.

We can compute the points of a ROC curve with the roc_curve() function from the sklearn.metrics module; it takes the correct answers and the class "1" probabilities:

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(target, probabilities)
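
A minimal plotting sketch, again with made-up answers and class "1" probabilities (matplotlib assumed):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# made-up correct answers and class "1" probabilities, as in the earlier sketches
target = np.array([0, 1, 1, 0, 1, 0, 1, 0])
probabilities_one = np.array([0.10, 0.80, 0.35, 0.45, 0.90, 0.60, 0.70, 0.20])

fpr, tpr, thresholds = roc_curve(target, probabilities_one)

plt.plot(fpr, tpr)                        # the model's ROC curve
plt.plot([0, 1], [0, 1], linestyle='--')  # diagonal of a random model
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()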

To calculate AUC-ROC, use the roc_auc_score() function from the sklearn library:

from sklearn.metrics import roc_auc_score

Unlike other metrics, it takes class "1" probabilities instead of predictions:

auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
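
For instance, on the made-up arrays used throughout these sketches (standing in for validation answers and class "1" probabilities), the score lands between the random model's 0.5 and a perfect 1:

from sklearn.metrics import roc_auc_score

target_valid = [0, 1, 1, 0, 1, 0, 1, 0]
probabilities_one_valid = [0.10, 0.80, 0.35, 0.45, 0.90, 0.60, 0.70, 0.20]

print(roc_auc_score(target_valid, probabilities_one_valid))  # 0.875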