Knowledge Base

Machine Learning Algorithms

Linear Regression and Error Functions

Regression is a technique in supervised learning. Here the model must identify a relationship between the objects' features and the target variable value. With regression, the target variable is a number, rather than a class.

n — the number of observations in the dataset

k — the number of features

\vec{y} — the vector (list) of target variable values for our data

X — an object-feature matrix (n × k) where rows are observations and columns are features

Inside the matrix, each observation has a corresponding vector or set of feature values:

\vec{x} = (x_1, x_2, ..., x_k)

The X matrix is as follows:

X = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nk} \end{pmatrix}

We're looking for a single formula that will make a forecast (estimate) for any object using the values of its k features:

f(\vec{x}) = f(x_1, x_2, ..., x_k) = \hat{y}

If we find a formula and apply it to all the dataset objects, we'll get a forecast vector (set) for all objects:

\vec{\hat{y}} = (\hat{y_1}, \hat{y_2}, ..., \hat{y_n}) — a forecast vector (list)

where \hat{y_i} is the forecast for observation i.

The resulting vector can be compared with the original vector of target variable values in order to figure out how good the formula is. This can be done by calculating the mean error for all observations. First we'll think of how the formula might look, then we'll adjust its parameters. The simplest way is to assume that the target variable is described with a linear function of features:

y(x) = w_0 + w_1*x_1 + w_2*x_2 + ... + w_k*x_k

Here:

(w_0, w_1, w_2, ..., w_k) — numerical coefficients

These coefficients are called weights: they determine the contribution of each feature to the sum that defines our forecast. So we need to find weight values that will enable the function to yield results that are very close to the real values.

Let's assume we have only one feature for each observation. We'll plot a graph where each observation has a dot, with the feature along the X axis and the target variable value along the Y axis:

We need to find a function y = w_0 + wx that will produce a line graph, and this line must describe our observations well. Some lines characterize a set, or data cloud, better than others. For example, you can plot a graph for the function y = 3 + 1*x or y = 2 + 2*x:

The green line describes the observations better than the red one.

The easiest way to judge this is to estimate how wrong our forecasts are compared with the real values of the target variable.

Let's introduce the error function Q(w). It will depend on the weights we've chosen.

The difference between the expectation and the reality can be negative, but the error cannot. If the forecast is perfect, the error is 0. If a forecast value is less or greater than the real value, the error will be positive.

To ensure that the error is a positive value, we square it:

(y_i - \hat{y_i})^2

With certain objects the model will err more, with others less. To estimate overall error, we need to find the mean error for all objects:

Q(w) = \frac{\sum_{i = 1}^{n}(y_i - \hat{y_i})^2}{n}

We've defined the error function. Now we have to find weights (w) that will minimize the value of the error function.
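To make this concrete, here's a minimal sketch in Python that computes Q(w) for the two candidate lines from the example above (the toy data points are made up for illustration):

import numpy as np

# toy data: one feature and the corresponding target variable values (made-up numbers)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 6.8, 9.2, 10.9, 13.1])

def error(w0, w1, x, y):
    # mean square error Q(w) for the line y_hat = w0 + w1 * x
    y_hat = w0 + w1 * x
    return ((y - y_hat) ** 2).mean()

print(error(3, 1, x, y)) # Q(w) for the line y = 3 + 1*x
print(error(2, 2, x, y)) # Q(w) for the line y = 2 + 2*x; the smaller value marks the better line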

Gradient Descent

The error a model makes with a given set of weights is defined by the formula:

Q(w) = \frac{\sum_{i = 1}^{n}(y_i - \hat{y_i})^2}{n}

where y(x) = w_0 + w_1*x_1 + w_2*x_2 + ... + w_k*x_k

Let's return to the case where we have only one feature and two coefficients for which we need to choose values: y = w_0 + w_1 * x. We can calculate the values of the error function for a variety of w_0 and w_1 values. Then the graph of the error function will look like this:

This is a tri-surface plot. Let's assume you've chosen the values of the two coefficients w_{0_a} and w_{1_a} and gotten the value Q_a of the function. In the image, this corresponds to point A. We need to find the point where the value of the function will be the smallest. We'll mark it as B. Point B corresponds to the set of weights for which the mean square error of the observations is the smallest. If we think back to the example from the previous lesson, we can say that point A corresponds to the red line and point B to the green one.

The search for the local minimum is the crux of optimization. In machine learning, a special optimizing method is applied: gradient descent. We choose an initial point (in our case, point A) with starting weight values. It gets shifted step by step toward the error function's minimum.

The important questions to ask now are:

  • In which direction should we step?
  • How large should the step be?

It would make sense to move in the direction where the descent to the minimum is the steepest. This means we need to move in the direction that gives the greatest decrease in the function's value for a step of a given width. If we're dealing with two dimensions, the shift is along the tangent from the current point downward. If we fix the value of the zero coefficient w_0, then the surface slice describing the dependence of the error Q on the w_1 weight given one feature will look like a parabolic curve:

We move from the starting point downwards along the tangent. In this case, we're moving along the error function's derivative with respect to weights (the X axis):

\frac{\partial Q}{\partial w_1}

If we have several dimensions, the tangent to the surface is the vector of the derivatives for each weight:

\nabla Q(w) = \left(\frac{\partial Q}{\partial w_0}, ..., \frac{\partial Q}{\partial w_k}\right)

Such a vector is called a gradient. It points in the direction of the function's steepest increase, but we need to move down toward the function's minimum. This means we'll move in the opposite direction: along the negative gradient.

We've figured out the direction. Now we have to find out how to move quickly: how wide the steps must be to reach the "floor." That's what the learning rate is responsible for. We'll call it \alpha. Then we can formulate the weight correction at each step as follows:

\vec{w}_{t+1} = \vec{w}_t - \alpha\nabla Q(\vec{w}_t)

The step shouldn't be too wide, or we might jump over the minimum. But it shouldn't be tiny, either, or training will take too long. The learning rate is a hyperparameter: the term distinguishes it from the model parameters w, and its value is usually chosen experimentally. The initial value is relatively small, e.g. 0.001 or 0.003.
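Here's a minimal sketch of vanilla gradient descent for the one-feature case (the toy data, the learning rate, and the number of steps are all arbitrary choices for illustration):

import numpy as np

# toy data with one feature (made-up numbers)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 6.8, 9.2, 10.9, 13.1])

w0, w1 = 0.0, 0.0 # starting weights (point A)
alpha = 0.01      # learning rate
n = len(y)

for step in range(10000):
    y_hat = w0 + w1 * x
    # partial derivatives of Q(w) = mean((y - y_hat)^2) with respect to w0 and w1
    grad_w0 = -2 / n * np.sum(y - y_hat)
    grad_w1 = -2 / n * np.sum((y - y_hat) * x)
    # step in the direction opposite to the gradient
    w0 -= alpha * grad_w0
    w1 -= alpha * grad_w1

print(w0, w1) # weights that approximately minimize the error function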

In practice, the basic or "vanilla" gradient descent is rarely used, mainly because of the following disadvantages:

  • If the function surface has too many "cavities," depending on the initial weight selection we might end up in a local minimum, rather than the function's global minimum.
  • Computing the gradient for all samples takes a long time. To speed up convergence and try to prevent overfitting, random (stochastic) gradient descent is applied. Here the gradient is calculated for subsets, or mini-batches, rather than for the entire set.
  • The surface of the error function might be complex, so more advanced methods use not just the direction of the gradient, but also further derivatives of the error function.

Preprocessing: Scaling Features

When you're carrying out a linear regression, keep two things in mind:

  • Features should be standardized
  • The linear regression model is prone to overfitting and is unstable when features are mutually correlated

The ranges of features' values may differ significantly. If you start selecting coefficients, you'll see that the weights for the different features must account for the scale of these features. Smaller-scale features must have bigger weights, and the others must have smaller weights. At the same time, the values of coefficients must make it possible to estimate the contribution of each feature to the overall weighted sum.

To compare weights, we must unify the scale of various features. This can be done through normalization, which involves converting the feature values to the range from 0 to 1. However, scaling is typically carried out through standardization, meaning that we make the distribution have a mean of 0 and a standard deviation of 1 (the same mean and standard deviation as the standard normal distribution). The reason for this is that only once we have such a distribution does it make sense to minimize the mean square error.

Standardization transforms a feature value as follows: for each observation, the initial feature value is decreased by the mean, and the resulting difference is divided by the standard deviation, i.e. x' = (x - \mu) / \sigma. To solve tasks with linear regression methods, the data must be standardized.

The preprocessing module in sklearn has the ready-made MinMaxScaler() and StandardScaler() classes for normalizing and standardizing data, respectively. They resemble models: first they need to be trained, or shown what values a feature takes on train data. Only then can they be applied to new observations. Normalization and standardization have the same syntax:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler() # creating a scaler class object
scaler.fit(X) # training the standardizer
X_sc = scaler.transform(X) # transforming the data set

In line with the standard machine learning pipeline, you split the data into train and validation sets. To standardize features, you have to know their mean and standard deviation, but you can't peek at the statistics of observations the model hasn't seen yet. You should account for this fact when validating.

Since a model sees only the mean and standard deviation while it's training on train data, standardization is done in the following steps:

  • Calculate the mean and standard deviation on train data (by training StandardScaler()) and transform the data.
  • Apply the trained StandardScaler() to validation data (taking the train data's mean and standard deviation into account).
  • Train the model on standardized train data.
  • Make a forecast for standardized data on the validation set.
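A minimal sketch of these steps (it assumes the data has already been split into X_train/X_val and y_train, and uses LinearRegression just as an example model):

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)                    # the mean and standard deviation come from train data only
X_train_sc = scaler.transform(X_train) # transforming the train set
X_val_sc = scaler.transform(X_val)     # the validation set is transformed with the train statistics

model = LinearRegression()
model.fit(X_train_sc, y_train)         # training on standardized train data
y_pred = model.predict(X_val_sc)       # forecast for the standardized validation set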

Regularization

Simple linear models are easy to create and interpret, but there can be problems with them. This could be due to multicollinearity, for instance.

Multicollinearity occurs when we have a group of features that are linearly related. If the correlation coefficient is too high (higher than 0.8, generally), you may have problems with linear regression. But why?

Here's an illustrative example. We have two identical features, the ultimate case of high mutual correlation:

x_1 = x_2

And now look at these three equations:

\begin{aligned} y = w_0 + w_1*x_1 + w_2*x_2 \\ y = w_0 + (w_1+w_2)*x_1 + 0*x_2 \\ y = w_0 + 100000*(w_1+w_2)*x_1 - 99999*(w_1+w_2)*x_2 \end{aligned}

They all have the same result. If the algorithm uses them, it will fail to give an adequate estimate of the impact an individual feature has on the target variable.

We deal with this by keeping only those features whose correlations are below the threshold (0.8). For instance, we can calculate a correlation matrix using the corr() method and, whenever we find a pair of strongly correlated features, remove one of them manually:

cm = df.corr() # calculating the correlation matrix
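If you'd rather not scan the matrix by eye, here's a hedged sketch that drops one feature from every strongly correlated pair (the 0.8 threshold and the choice to drop the later column are assumptions; df is the feature DataFrame):

import numpy as np

cm = df.corr().abs() # absolute correlation matrix
# keep only the upper triangle so that each pair of features is considered once
upper = cm.where(np.triu(np.ones(cm.shape, dtype=bool), k=1))
# columns whose correlation with some earlier column exceeds the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
df_filtered = df.drop(columns=to_drop)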

If you don't want to remove correlated features by hand, you can carry out regularization. Generally speaking, this refers to any additional limitation on a model, or to an action that will decrease the model's complexity and minimize the impact of overfitting. In the case of linear models, regularization involves a limit on weights.

We can set the condition directly as follows:

\sum_{i=1}^{m}{w_i^2} < b

A similar limitation for the two features and b=3 would look like this:

Another way to limit the scale of weights is to limit their absolute value:

\sum_{i=1}^{m}{|w_i|} < a

A similar limitation for the two features and for a = 3 would look like this:

Such limitations are often set directly through the error function. This way we try to minimize the error with the smallest feature weights (in terms of the sum of absolute values or the sum of squares). We forbid the model to choose high feature weights so that it doesn't calibrate itself based on multicollinear features.

We can create limitations in two ways.

L1 regularization, or Lasso regression, uses this formula:

Q_{\text{Lasso}} = \frac{1}{n}*\left(\sum_{i=1}^{n}{(y_i - \hat{y_i})^2} + \lambda * ||w||_1\right)

where Q_{\text{Lasso}} is the error function, y_i is the real target variable value for object i, \hat{y_i} is the forecast for object i, \lambda is the regularization coefficient, and ||w||_1 is the total sum of the weights' absolute values.

(That is, when we minimize the error by choosing optimal weights, we also try to minimize the sum of the absolute values of these weights.)

L2 regularization, or Ridge regression:

Q_{\text{Ridge}} = \frac{1}{n}*\left(\sum_{i=1}^{n}{(y_i - \hat{y_i})^2} + \lambda * ||w||_2^2\right)

where Q_{\text{Ridge}} is the error function, y_i is the real target variable value for object i, \hat{y_i} is the forecast for object i, \lambda is the regularization coefficient, and ||w||_2^2 is the total sum of the squared weights.

(That is, while we're minimizing the error by choosing optimal weights, we also try to minimize the sum of the squares of these weights.)

sklearn has several linear regression algorithms that already include regularization:

  • Lasso regression uses L1 regularization. This algorithm zeroes all the weights of strongly correlated features, except for one. Thus, this algorithm has a built-in mechanism for feature selection. It's used when we need to decrease dimensionality and get rid of duplicate features.
  • Ridge regression uses L2 regularization. In this case, the weights between the correlated features get distributed almost equally.
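To see the difference in behavior, here's a small sketch with a deliberately duplicated feature (the generated data and the alpha regularization strengths are arbitrary assumptions):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(42)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1])                # two identical features: x1 = x2
y = 3 * x1 + rng.normal(scale=0.1, size=200)

print(Lasso(alpha=0.1).fit(X, y).coef_) # one of the two weights is driven to (or very near) zero
print(Ridge(alpha=1.0).fit(X, y).coef_) # the weight is split roughly equally between the duplicates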

Implementing Linear Models

The sklearn library has a linear_model module containing linear regression algorithms.

  • linear_model.LinearRegression(): the model for standard linear regression. Before applying it, you need to normalize the data and filter features by hand to avoid multicollinearity.

  • linear_model.Lasso(): a linear regression model with built-in L1 weight regularization (limiting the total sum of weights' absolute values).

    This also requires normalized data, but here the built-in regularization does all the filtering for you. You don't need to get rid of strongly correlated features yourself.

  • linear_model.Ridge(): a linear regression model with built-in L2 weight regularization (limiting the total sum of squared weights).

    Here, too, you have to normalize the data, as the features' scales must be unified. With Ridge regression, as with Lasso, you don't have to get rid of correlated features. Ridge regression alone won't filter features for you, since it doesn't zero the values of duplicate features. The regularization function is designed to distribute weights evenly between similar features.

You can study the syntax of each of the functions in their official documentation.

With linear regression tasks, we train a model and make predictions using standard fit() and predict() methods. For instance, for Ridge regression, the code looks like this:

from sklearn.linear_model import Ridge

model = Ridge() # creating a model of the Ridge regression class
model.fit(X_train, y_train)
y_pred = model.predict(X_val)

A distinctive advantage of linear regression algorithms is that they let us interpret the weight of each feature. After standardization, the features' coefficient values reflect the impact each of them has on the final prediction: the higher a coefficient's absolute value, the greater the impact. Having trained the model, you'll be able to output all the coefficients of the final, optimal function directly. The feature weights are printed by means of the model's .coef_ attribute:

print(model.coef_)

...and the value of the zero coefficient, or intercept, is printed with the .intercept_ attribute:

print(model.intercept_)
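For example, to see which features carry the most weight, you can pair the coefficients with the feature names. This sketch assumes X_train is a pandas DataFrame with named columns:

import pandas as pd

weights = pd.Series(model.coef_, index=X_train.columns)
print(weights.sort_values(key=abs, ascending=False)) # features with the largest absolute impact come first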

Regression Metrics

Metrics are studied on validation data in order to check whether the model performs well on data it hasn't seen in training. This is what happens in real-life situations: you apply a model when the answers aren't known. All the formulas for calculating metrics given here are for validation data.

The ultimate goal of a regression task is to come as close as possible to the real value.

Let:

\begin{aligned} y_i \text{ be the real target variable value for observation i,} \\ \hat{y_i} \text{ be the forecast for observation i, and} \\ n \text{ be the number of observations in the validation set.} \end{aligned}

Let's look at the most popular metrics in business:

  • Mean absolute error, MAE
  • Mean square error, MSE, and its root, RMSE
  • Determination coefficient, or R-squared (R2)
  • Mean absolute percentage error, MAPE

Mean absolute error (MAE)

The name of this metric reveals how it's calculated. First we estimate how far each forecast deviates from the real value of the target variable. We take the absolute value of the difference, so it doesn't matter whether the deviation is positive or negative. Then we average these absolute errors over all observations in the validation set:

MAE = \frac{1}{n} * \sum_{i=1}^{n}{|y_i - \hat{y_i}|}

Mean square error (MSE) and root mean square error (RMSE)

We can also make the metric reflect the model's sensitivity to outliers. If the model has larger errors for some values, we want to amplify these errors further. Then we'll need to find the mean square error:

MSE = \frac{1}{n} * \sum_{i=1}^{n}{(y_i - \hat{y_i})^2}

... or its root:

RMSE = \sqrt{\frac{1}{n} * \sum_{i=1}^{n}{(y_i - \hat{y_i})^2}}

Determination coefficient, or R-squared (R2)

You are already familiar with this metric. It shows the share of the target variable's variation that the model accounts for. The determination coefficient formula is:

R^2 = 1 - \frac{\sum_{i=1}^{n}{(y_i - \hat{y_i})^2}}{\sum_{i=1}^{n}{(\bar{y} - y_i)^2}}

where:

\bar{y} = \frac{1}{n} * \sum_{i=1}^{n}{y_i}

The determination coefficient usually takes values from 0 to 1 (it can be negative if the model performs worse than simply predicting the mean). The closer it is to 1, the more accurately your forecast matches the real target variable values.

Mean absolute percentage error (MAPE)

Knowing error as a percentage can be more important than knowing it as an absolute value (such as MAE, MSE, and RMSE). Mean absolute percentage error is found with the formula:

MAPE = 100\% * \frac{1}{n} * \sum_{i=1}^{n}{\frac{|y_i - \hat{y_i}|}{|y_i|}}

MAPE is clearer intuitively. A mean absolute percentage error of 2% inspires optimism. In some cases accuracy of 80% is enough, but in others even 95% will be too little.

Metrics and the error function

When searching for the optimal function, all algorithms from standard Python libraries minimize the loss function as they select weights. This function often coincides with the metric you are estimating on the validation set. If the default function and intended metric don't coincide, then the algorithm will "adjust" for the things you don't need.

In almost all algorithm implementations, you can specify the function to be optimized right in the training stage.

Calculating metrics in sklearn

sklearn developers created the metrics module, which allows you to calculate several metrics:

  • MAE — metrics.mean_absolute_error
  • MSE — metrics.mean_squared_error
  • R2 — metrics.r2_score

The RMSE and MAPE metrics are not part of this module, but you can write the functions for them yourself if need be.

from sklearn.metrics import mean_absolute_error

print('MAE:', mean_absolute_error(y_true, y_pred))

We pass the vector of true target variable values (y_true) and the vector of predicted values (y_pred) as parameters. The metrics.mean_squared_error and metrics.r2_score metrics are calculated in an analogous way.
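For RMSE and MAPE, mentioned above, here's a minimal sketch of hand-written helpers (it assumes y_true and y_pred are defined as in the example above):

import numpy as np

def rmse(y_true, y_pred):
    # root mean square error
    return np.sqrt(np.mean((np.array(y_true) - np.array(y_pred)) ** 2))

def mape(y_true, y_pred):
    # mean absolute percentage error, in percent
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return 100 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

print('RMSE:', rmse(y_true, y_pred))
print('MAPE:', mape(y_true, y_pred))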

Real input data given to the model after it's developed and trained is called test data. However, before testing the model in real-life situations, analysts fine-tune it on validation data.

Logistic Regression

This is an algorithm for solving binary classification tasks. The target variable here is called a binary variable.

Models trained with binary classification algorithms can not only predict the final value of a class for a certain object or client, but also estimate the probability of the event in question.

A common algorithm for solving such tasks is logistic regression. It's implemented as the LogisticRegression() class inside the sklearn linear_model module, alongside the linear regression algorithms:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

Your function must return the probability that a certain event will happen. Let's assume it's found with the following formula:

f(z) = \mathbb{P}\{y = 1 \mid x\}

This equation means that the probability that yy equals 1, given that the features take the values xx, is determined by the function ff applied to the variable zz.

Here z is a linear function of feature values (the x vector), as in linear regression:

z(x) = w_0 + w_1*x_1 + w_2*x_2 + ... + w_k*x_k

In logistic regression, we assume that f is a logistic function. It looks as follows:

f(z) = \frac{1}{1 + e^{-z(x)}}

The graph of the function looks like this:

The graph of the logistic function is called a sigmoid curve. The graph clearly shows that the function takes values from 0 to 1. If the weighted sum has high values, for which the probability of event occurrence is, say, 0.95, further increases in these features won't make this high probability significantly higher. As features increase, it approaches 1. For small probabilities, it approaches 0.
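A quick numerical sketch of this saturation behavior:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# large positive weighted sums approach 1, large negative ones approach 0
print(sigmoid(np.array([-10, -2, 0, 2, 10])))
# roughly [0.00005, 0.12, 0.5, 0.88, 0.99995]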

Then an algorithm (e.g. gradient descent) selects the w coefficients for the function z(x) so that f(z) produces answers that are as close as possible to the real target variable values.

In fact, when we work on a binary classification task using logistic regression, we transform the output of a linear regression to obtain the probability that an object belongs to class 1. That's why logistic regression, like linear regression (and its Lasso and Ridge variations), belongs to the family of linear models and can be found in the same linear_model module of the sklearn library.

To make a prediction for a class after training such a model, you'll need the predict() method:

y_pred = model.predict(X_test)

And to obtain the probability that an object enters the first or the second class, use the predict_proba() method:

y_probas = model.predict_proba(X_test)

y_probas is a two-dimensional array where each object of the validation set has two corresponding values: the probability (or rather confidence) of belonging to the 0 class ("the event won't occur") and the probability of belonging to the 1 class ("the event will occur"). The sum of these two values is 1 for each of the objects.

Classification Metrics: Working with Labels

Classification metrics based on predicted class

Let's begin with metrics that take into account only the final predicted value (0 or 1):

  • Confusion matrix
  • Accuracy
  • Precision and recall
  • F1 score

Confusion matrix

The possible class values are 0 and 1, and your model can output a final prediction in the form of one of these values. Thus, the prediction for each object enters one of four groups:

  • Model's prediction = 1, real value = 1. Such predictions are called True Positive, or TP.
  • Model's prediction = 1, real value = 0. Such predictions are called False Positive, or FP.
  • Model's prediction = 0, real value = 1. Such predictions are called False Negative, or FN.
  • Model's prediction = 0, real value = 0. Such predictions are called True Negative, or TN.

If a model is good, the majority of its predictions enter the TP and TN groups.

A confusion matrix displays the number of observations for each group. It looks like this:

The confusion matrix is calculated in sklearn's metrics module. You can define the TN, FP, FN, and TP variables, too:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel() # flattening the matrix to get the four values

You can tell from the confusion matrix which errors your algorithm tends to make. Maybe it has a bias toward the positive class (excessive optimism, too many FPs), or it might be a pessimist and expand the FN group to be on the safe side.

The terms we introduced for the confusion matrix also apply to the remaining class-prediction metrics.

Accuracy

Accuracy can be found as follows:

\text{Accuracy} = \frac{TP+TN}{n} = \frac{TP+TN}{TP+TN+FP+FN}

This is the share of accurate predictions among all predictions. The closer we are to 100% accuracy, the better.

This metric can be calculated with the accuracy_score function in the metrics module. The function takes actual and predicted class values for the validation data as input:

from sklearn.metrics import accuracy_score
acc = accuracy_score(y_true, y_pred)

Accuracy only works when classes are balanced, i.e. when objects are distributed almost evenly between the classes, approximately 50/50.

Precision and recall

To evaluate a model without looking at the balance of classes, you can calculate the metrics:

\text{Precision} = \frac{TP}{TP+FP}
\text{Recall} = \frac{TP}{TP+FN}

Precision tells us what share of predictions in class 1 are true. We look at the share of correct answers only in the target class. In business, this metric is needed only when each alert of the model costs resources. And you obviously don't want the model to alert for no good reason.

The second metric aims at minimizing the opposite risk. Recall shows the share of real class 1 objects you were able to discover with your model.

Each metric takes values from 0 to 1. The closer to 1, the better. Optimizing one metric can often lead to the deterioration of the other. For that reason, you have to balance between these two metrics as you adjust your model. Your final product should be based on the goals of your work.

The precision and recall metrics are generated by the functions precision_score and recall_score in the metrics module. Their syntax is similar to that of other metrics:

from sklearn.metrics import precision_score, recall_score

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

Both functions return numeric values from 0 to 1.

F1 score

Since precision and recall are aimed at avoiding opposing risks, we need a harmonizing metric that takes into account the balance between the metrics. This metric is the F1 score.

\text{F1} = \frac{2 * \text{precision} * \text{recall}}{\text{precision} + \text{recall}}

In sklearn.metrics the F1 score is found via the f1_score function:

from sklearn.metrics import f1_score
f1 = f1_score(y_true, y_pred)

The function also returns a value between 0 and 1. The closer to 1, the better.

Classification Metrics: Studying Probabilities

In addition to the answer for each observation (0 or 1), we can also calculate the probability that an observation belongs to one of the classes (0 or 1). That's why we use the roc_auc score, or the area under the receiver operating characteristic curve, to evaluate the quality of a classifier (a classification model). AUC-ROC stands for Area Under Curve and Receiver Operating Characteristic.

This metric uses the following logic: "If we sort objects by the probability predicted by the model, will we see a clear distinction between the real classes of these objects?" If the probabilities sort our objects badly, other metrics won't improve no matter what threshold we select. But when the objects are sorted well, we can conclude that the model performs adequately, and then go about setting the threshold.

This feature of AUC-ROC makes this metric a great tool, even for working with imbalanced classes.

Let's have a closer look at how it works. We'll introduce the terms True Positive Rate (TPR corresponds to recall) and False Positive Rate (FPR is for those objects that the algorithm incorrectly placed into the 1 class instead of 0):

TPR = \frac{TP}{TP+FN}
FPR = \frac{FP}{FP+TN}

We'll sort the objects in ascending order by predicted probability and iteratively select the threshold for placement into classes. We'll stop at each observation to calculate TPR and FPR for the given threshold. This is how the objects in a perfect model would look:

But in real life, neither data nor models are perfect. We'll never be able to achieve perfect sorting by probability. So in most cases the ROC curve looks like this:

The more the curve bends toward the upper left corner, the better. The closer it is to the TPR = FPR line, the worse: that diagonal corresponds to a "random" model that simply guesses the class.

The roc_auc metric can be calculated in the metrics module and is called roc_auc_score:

from sklearn.metrics import roc_auc_score
roc_auc = roc_auc_score(y_true, probabilities[:,1])

This function's input is the vector of true answers y_true and the vector of probabilities of assigning class 1. The latter can be obtained by taking only the second column (probabilities[:,1]) from the array of paired probabilities of assigning the 0 or 1 classes. The function outputs a single numeric value between 0 (in practice, more often 0.5, the score of a random model) and 1.
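If you also want to look at the curve itself rather than just the area under it, the metrics module has a roc_curve function. A minimal sketch (y_true and probabilities are assumed to be defined as above):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_true, probabilities[:,1])

plt.plot(fpr, tpr, label='model')
plt.plot([0, 1], [0, 1], linestyle='--', label='random model (TPR = FPR)')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()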

Thresholds and Class Balance

As it selects the best possible weights for the z(x) function by means of the sigmoid function f(z), the logistic regression algorithm calculates the probability that an event will occur. It then calculates the final answer, 0 or 1, based on this probability. This is done through comparison with a certain threshold.

For many algorithms, the predict_proba() method outputs not the probability of belonging to a class, but the model's degree of confidence that the object belongs to the class. If the model's certainty in the object's belonging to class 1 is higher than 0.5, then the answer is 1; otherwise, it's 0. Let's see in which cases the probability and "confidence" can be considered the same thing, and where they differ.

Imagine half of the objects in the train set belong to class 0, and the other half to 1. Such sets are called balanced.

Without a model or any knowledge of the new object, you can't say for sure which class it belongs to, since the initial, baseline probability that an object belongs to either class is 0.5. This probability can then be considered the degree of confidence that an object belongs either to 1 or 0. If you can tell that the probability of belonging to class 1 is greater than 0.5 based on features that contain additional information about the object, then the confidence that the object belongs to class 1 is greater than that for 0. This explains the threshold: we assign 1 to all objects for which predict_proba() for that class yields a result higher than 0.5.
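Here's a small sketch of how the default 0.5 threshold relates to predict_proba(), and how you could move the threshold yourself (model and X_val are assumed to be a trained LogisticRegression and a validation feature set):

import numpy as np

probas = model.predict_proba(X_val)[:, 1]    # confidence that each object belongs to class 1

y_pred_default = (probas > 0.5).astype(int)  # reproduces predict() with the default 0.5 threshold
y_pred_strict = (probas > 0.8).astype(int)   # a stricter threshold: fewer objects are assigned class 1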

It's imbalanced classes that reveal problems with metrics based on binary responses and make it impossible to interpret the result of predict_proba() as the probability of belonging to a class. However, many algorithms normalize probabilities and apply other mathematical methods in such a way that predict_proba() reflects the model's confidence in the final binary response.

Decision Tree

Decision trees use scenarios to select an answer. Advantages of decision trees:

  • They're easy to interpret.
  • They work with both classification and regression.
  • They are pretty fast.
  • They don't require considerable data preprocessing.

Disadvantages:

  • They are strongly prone to overfitting.

To avoid this, we prune trees. Pruning involves cutting a branch at a certain point to prevent the tree from overfitting. Models trained on small samples are especially prone to overfitting.

sklearn has a tree module that includes DecisionTreeClassifier and DecisionTreeRegressor algorithms:

from sklearn.tree import DecisionTreeClassifier

tree_model = DecisionTreeClassifier()
tree_model.fit(X_train, y_train)
y_pred = tree_model.predict(X_test)

In some cases you'll need to visualize the resulting decision tree. You can use the plot_tree() function from the same module:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize = (20,15)) # set the figure size to get a larger image
plot_tree(tree_model, filled=True, feature_names = X_train.columns, class_names = ['not fault', 'fault'])
plt.show()

The result will be something like this:

This is the tree of the model tree_model = DecisionTreeClassifier(min_samples_leaf=500), which solves a binary classification task ("fault"/"not fault"). The sample contains about 10,000 objects. Each node has a condition for further branch construction, as well as the most probable result in the sample corresponding to this node.

Decision Tree Ensembles: Random Forest and Gradient Boosting

Ensembles are powerful machine learning models. These algorithms are sometimes called black boxes, since it's hard to tell how a particular feature impacted the prediction for an object.

More trees are better than one

Some models are said to be strong ("a strong classifier"), while other models are weak ("weak classifiers").

There are two basic types of ensemble models: random forest and gradient boosting. With these, decision trees are usually taken as the weak prediction models.

Averaging works better when models differ and compensate one another's errors. But if you average similar models whose flaws are similar, you won't get this effect.

Random forests

The first type of ensemble is called random forest. It generates many mutually independent trees in slightly different ways (taking different subsets or features) and reaches a final decision based on their answers. The random forest algorithm averages the responses of all trees (in regression) or uses voting (in classification) to select the answer that most of the trees determine to be true.

Bagging (from "bootstrap aggregation") involves averaging models trained on different subsets. The approach itself, which consists in estimating various characteristics or predictions using a multitude of different subsets obtained from the initial set, is called bootstrapping.

When working with random forests, remember that the trees are trained in parallel, or independently of one another. The training of each tree doesn't depend on the results of others. This represents a conceptual difference between random forests and gradient boosting.

Gradient boosting

The underlying concept is pretty much the same: we collect many simple models (trees) so that they make up for one another's errors and randomness. Only in this case the errors are compensated through training, rather than averaging.

In gradient boosting, trees are trained consecutively: each tree is built with reference to the results of the preceding one. In gradient boosting we have a model at a given step, and it makes a prediction with some error. Then we build another model to predict the error of the first model instead of the original value and adjust the final prediction on this basis. Then a third model predicts the error of the second model, and so on. Finally, we use the entire sequence to make a prediction for a specific observation. We say that the first model produced this result, but we added to it the correction from the second model, then from the third, and so on, right up to the very last model. Thus, with each step (each new model), we improve our previous predictions. "Boosting" refers to this process of improvement, and "gradient" boosting means the improvement takes place step by step.
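A toy sketch of this idea with just two trees (regression with squared error; the generated data and tree depth are arbitrary assumptions). Real gradient boosting also scales each correction by a learning rate and repeats the process for many trees:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# first model: predicts the target directly
tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
residual = y - tree1.predict(X)

# second model: predicts the error (residual) of the first one
tree2 = DecisionTreeRegressor(max_depth=2).fit(X, residual)

# the combined prediction: the first tree's answer plus the second tree's correction
y_pred = tree1.predict(X) + tree2.predict(X)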

Using ensembles

In sklearn, random forests are implemented in the ensemble module. The module's most popular algorithms are RandomForestClassifier() and RandomForestRegressor(). We need them to solve classification and regression problems, respectively.

Take a look at the syntax for declaring, training, and predicting in a regression problem:

from sklearn.ensemble import RandomForestRegressor

# defining the new model's algorithm based on the random forest algorithm
rf_model = RandomForestRegressor(n_estimators = 100)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_val)

When declaring our model, we defined n_estimators, the number of trees in our forest. For the remaining parameters of the trees (their depth max_depth, the size of the feature subset max_features, and the minimum number of objects per leaf min_samples_leaf), we leave the default values unchanged.

Gradient boosting is implemented in sklearn's ensemble module. It has GradientBoostingClassifier() and GradientBoostingRegressor() algorithms for solving classification and regression problems, respectively.

from sklearn.ensemble import GradientBoostingRegressor

# defining the new model's algorithm based on the gradient boosting algorithm
gb_model = GradientBoostingRegressor(n_estimators = 100)
gb_model.fit(X_train, y_train)
y_pred = gb_model.predict(X_val)

xgboost is a gradient boosting algorithm from a separate library, and it's often used in machine learning competitions. You can read more about it in its official documentation.

Unsupervised Learning Algorithms: Clustering

In general, unsupervised learning is about finding the similarity between objects. Once the model has detected it, you can group the objects themselves, i.e. solve a clustering problem. You may find similarities between features, too. This will be a dimensionality reduction problem.

Let's take a closer look at clustering. It's very common in business. For instance, clustering is used:

  • For customer segmentation
  • For product segmentation
  • For topic modeling

What Does All This Have to Do with Distance?

Similar objects share similar features. We just substitute the concept of "similarity" with that of "distance," which is easier to work with on an intuitive level.

The first step in a clustering problem is to determine the distance between the objects, i.e. to describe their proximity in numbers. Then you can group the objects on the basis of this proximity. We'll define the function for distance between objects as:

\rho(x_1, x_2)

Let each country have only two features. We can plot a graph with the nature ratings along the X axis and the number of cultural institutions along the Y axis. We'll end up with something like this:

Feature visualization of standardized data may look like this:

Apply the standard distance formula (sometimes called Euclidean distance):

\rho(x_1, x_2) = \sqrt{\sum_{i=1}^{n}{(x_{1_i} - x_{2_i})^2}}

Here n is the number of dimensions (features). Then, if we only have two features, the distance between two countries will be defined as follows:

\rho(x_1, x_2) = \sqrt{(x_{1_1} - x_{2_1})^2 + (x_{1_2} - x_{2_2})^2}
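A minimal sketch of this calculation with numpy (the two feature vectors are made-up examples of [nature rating, number of cultural institutions]):

import numpy as np

country_1 = np.array([4.0, 120.0])
country_2 = np.array([3.5, 80.0])

distance = np.sqrt(np.sum((country_1 - country_2) ** 2))
print(distance) # the same value as np.linalg.norm(country_1 - country_2)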

K-Means and Agglomerative Hierarchical Clustering

The most popular clustering algorithms are:

  • K-means
  • Agglomerative hierarchical clustering

K-means

K-means groups objects step by step. The algorithm is based on the assumption that the number of clusters (groups) is already known. This is a rather strong assumption, and finding the optimal number of clusters is often a separate task in itself.

Here's the algorithm for K-means:

  1. We have K clusters. The algorithm takes their centers one by one and places each object in the cluster whose center is closest.
  2. The centers are corrected (they are moved) until the mean distance between the objects within each cluster and its center is minimized.
  3. When the distance from objects to the center stops decreasing or decreases only insignificantly, the algorithm stops and fixes the division, recognizing it as the optimal one.

Now that we know how the algorithm works, let's run it in Python.

The algorithm called KMeans is implemented in the sklearn.cluster module. Here's how its syntax looks:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# the obligatory standardization of data before passing it to the algorithm
sc = StandardScaler()
X_sc = sc.fit_transform(X)

km = KMeans(n_clusters = 5) # setting the number of clusters to 5
labels = km.fit_predict(X_sc) # applying the algorithm to the data and assigning cluster labels

The variable labels stores the indices of the groups suggested by the algorithm. The most important thing here is that objects with the same index refer to the same cluster.

Agglomerative hierarchical clustering

When we define the distance function, we can calculate the matrix of distances between all objects. Its cells will contain the distance between pairs of objects. Moreover, it will take into account all of the objects' features, not just two of them.

We can use this distance matrix to merge nearby clusters in a consistent way. The distance between the objects and the agglomerative hierarchical clustering itself can be visualized with special plots called dendrograms.

The distance between clusters goes on the Y axis and the objects on the X axis. Each horizontal link corresponds to the distance between the objects that are united. If we move from the bottom to the top of the graph, first we'll see links that unite separate objects. These are followed by links that unite objects with groups, or groups with groups. If you don't stop this process intentionally, it will end when one giant cluster is left. That's how agglomerative hierarchical clustering actually works. This clustering method is really intuitive. At each step, the algorithm enlarges clusters by adding neighboring ones. That's why it's called agglomerative: it turns clusters into agglomerations. (That's also what we call a group of neighboring residential areas.) The important thing is to define the moment when we need to stop the process of combining.

We can use dendrograms to visually estimate the necessary number of clusters. We can also get a sense of the distance beyond which we stop uniting objects. Here's an example of how we can select different distance thresholds for a large number of objects:

The K value is the number of lines crossed by the dashed line.

Let's plot a dendrogram in Python. First we'll import the linkage() and dendrogram() functions from the scipy.cluster.hierarchy clustering module:

from scipy.cluster.hierarchy import dendrogram, linkage

Then we'll standardize the data and pass the resulting table as a parameter to the linkage() function. To make our plot more representative, we'll pass 'ward' to the method parameter:

from sklearn.preprocessing import StandardScaler

# obligatory standardization of data before passing it to the algorithm
sc = StandardScaler()
X_sc = sc.fit_transform(X)

linked = linkage(X_sc, method = 'ward')

The variable linked stores the table with the linked bundles of objects. It can be visualized as a dendrogram:

import matplotlib.pyplot as plt

plt.figure(figsize=(15, 10))
dendrogram(linked, orientation='top')
plt.title('Hierarchical clustering for GYM')
plt.show()

We'll get the following plot:

The suggested optimal number of clusters (5) corresponds to the five different colors on the plot.

The complexity of agglomerative clustering has to do with the computations the machine makes to plot a dendrogram, rather than the arrangement of the algorithm as such. Computing paired distances can take a long time, so when solving a clustering problem, you can plot a dendrogram on a random subset, estimate the optimal number of clusters, and launch the faster K-means algorithm.

Metrics for Unsupervised Learning Problems

If we take the distance between objects as the basis of "similarity," we can calculate metrics in terms of distance, which will help evaluate how well our models split objects.

You've divided customers into groups using K-means and hierarchical clustering. How can we determine which algorithm did a better job? Assume you have only two features and the final clustering looks like this:

You're going to look at how "distinguishable" the classes are. Is there any visual difference between the objects from different clusters in these images?

To achieve a more objective solution, we need intracluster and intercluster distances. Intracluster distance is the distance between objects within one cluster, while intercluster distance is the distance between objects from different clusters. The greater the difference between intracluster and intercluster distances, the better the clustering.

There's a popular metric called silhouette score. The silhouette score shows the extent to which an object from a cluster is similar to its cluster, rather than to another one. Here's the formula:

\text{Silhouette} = \frac{1}{n} * \sum_{c_k\in{C}}{\sum_{x_i\in{c_k}}{\frac{b(x_i,c_k) - a(x_i, c_k)}{\max[a(x_i, c_k), b(x_i, c_k)]}}}

where n is the number of observations,

c_k is a specific cluster,

x_i is a specific object from c_k,

a(x_i, c_k) is the mean distance from x_i to the other objects in the cluster (a measure of cluster density), and

b(x_i, c_k) is the mean distance from x_i to objects from another cluster (a measure of cluster separability).

The syntax of this method is as follows:

from sklearn.metrics import silhouette_score

silhouette_score(x_sc, labels)

The input data is the normalized or standardized feature matrix and a list of labels predicted by our clustering algorithm.

The silhouette score takes values from -1 to 1. The closer to 1, the better the clustering.
