Knowledge Base

Gradient Boosting

Ensembles and boosting

An ensemble is a set of models for solving the same problem. The strength of ensembles is that the average error of a group of models is smaller than their individual errors.
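As a rough illustration (a synthetic sketch with made-up, normally distributed errors), averaging the predictions of several imperfect models gives a smaller error than the models give on average:

import numpy as np

# Toy illustration: 100 imperfect "models" predict the same true value,
# each with its own random error
rng = np.random.default_rng(0)
y_true = 10.0
predictions = y_true + rng.normal(loc=0.0, scale=2.0, size=100)

mean_individual_error = np.mean(np.abs(predictions - y_true))
ensemble_error = np.abs(predictions.mean() - y_true)

print("Average error of a single model:", mean_individual_error)
print("Error of the averaged prediction:", ensemble_error)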

Another approach to ensemble building is boosting, where each subsequent model takes into account the errors of the previous ones, and the final prediction combines the forecasts of all base learners. Take a look:

a_N(x)=\sum_{k=1}^N \gamma_k b_k(x)

where a_N(x) is the ensemble prediction, N is the number of base learners, b_k(x) is the prediction of the k-th base learner, and \gamma_k is the model weight.
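To make the formula concrete, here is a minimal sketch (with made-up base learner predictions and weights) of how the ensemble prediction is assembled as a weighted sum:

import numpy as np

# b_k(x) for three hypothetical base learners on five objects
base_predictions = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],    # b_1(x)
    [0.5, 0.4, 0.3, 0.2, 0.1],    # b_2(x)
    [0.1, -0.2, 0.0, 0.3, -0.1],  # b_3(x)
])
gammas = np.array([1.0, 0.8, 0.5])  # model weights gamma_k

# a_N(x) = sum over k of gamma_k * b_k(x)
ensemble_prediction = gammas @ base_predictions
print(ensemble_prediction)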

For example, we are dealing with a regression task. We have n observations with features x and correct answers y. Our task is to minimize the MSE loss function:

\text{MSE}(y,a)=\frac{1}{n}\sum_{i=1}^n(a(x_i)-y_i)^2\rightarrow\min_{a(x)}
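The same loss written directly in code, on made-up answers and predictions:

import numpy as np

# MSE(y, a) = (1/n) * sum_i (a(x_i) - y_i)^2
y = np.array([3.0, 5.0, 2.0])  # correct answers
a = np.array([2.5, 5.5, 2.0])  # hypothetical model predictions
print(np.mean((a - y) ** 2))   # 0.1666...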

For convenience, equate the model weights to unity:

\gamma_k=1, \text{ for all } k=1,\dots,N

We get:

a_N(x)=\sum_{k=1}^N b_k(x)

Now we create an ensemble of sequential models.

First, build the base learner b_1 by solving the minimization task:

b_1=\argmin_b\frac{1}{n}\sum_{i=1}^n(b(x_i)-y_i)^2

The result is this ensemble:

a_1(x)=b_1(x)

Define the residual: it is the difference between the correct answers and the prediction at the first step:

e_{1,i}=y_i-b_1(x_i)

At the second step, we build the model like this:

b_2=\argmin_b\frac{1}{n}\sum_{i=1}^n(b(x_i)-e_{1,i})^2

The ensemble will take the following form:

a_2(x)=\sum_{k=1}^2 b_k(x)=b_1(x)+b_2(x)

At each subsequent step, the algorithm minimizes the ensemble error from the preceding step.

Let's summarize the formulas. At step N-1, the residual is calculated as follows:

e_{N-1,i}=y_i-a_{N-1}(x_i)

The ensemble itself is represented as the sum of predictions of all the base learners combined up to this step:

a_{N-1}(x)=\sum_{k=1}^{N-1}b_k(x)

So, at step N, the algorithm picks the model that best corrects the ensemble error from step N-1:

b_N(x)=\argmin_b\frac{1}{n}\sum_{i=1}^n(b(x_i)-e_{N-1,i})^2
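To tie these steps together, here is a minimal from-scratch sketch of boosting on residuals for the MSE loss. It assumes unit weights (gamma_k = 1) and shallow decision trees as base learners; the dataset and the parameter values are illustrative:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

n_estimators = 10
ensemble_prediction = np.zeros_like(y)

for _ in range(n_estimators):
    residual = y - ensemble_prediction      # e_{N-1,i} = y_i - a_{N-1}(x_i)
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residual)                   # b_N learns to predict the residual
    ensemble_prediction += tree.predict(X)  # a_N(x) = a_{N-1}(x) + b_N(x)

print("MSE of the ensemble:", np.mean((ensemble_prediction - y) ** 2))

Each iteration shrinks the training error because the new tree corrects whatever the previous trees got wrong.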

Gradient boosting

Suppose our loss function is L(y, a) and it is differentiable. Let's recall the ensemble formula:

a_N(x)=a_{N-1}(x)+\gamma_N b_N(x)

At each step, select the answers that will minimize the function:

L(y,a(x))\rightarrow\min_a

Minimize the function with gradient descent. To do so, at each step, calculate the negative gradient of the loss function with respect to the prediction, g_N:

g_N(x)=-\nabla L(y,a_{N-1}(x)+a)

To push the predictions towards the correct answers, the base learner learns to predict g_N:

b_N(x)=\argmin_b\frac{1}{n}\sum_{i=1}^n(b(x_i)-g_N(x_i))^2

Obtain the weight for b_N by solving a minimization task, iterating over various values:

\gamma_N=\argmin_\gamma L(y,a_{N-1}(x)+\gamma b_N(x))

It is the coefficient for the base learner that helps adjust the ensemble to make predictions as accurate as possible.

Gradient boosting is suitable for any loss function that has a derivative: for example, the mean squared error in a regression task or the logarithmic loss in a binary classification task.
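Here is a sketch of one gradient boosting step for the MSE loss: the negative gradient is computed explicitly, and the weight gamma_N is found by iterating over candidate values, as described above. The function names, the candidate range, and the synthetic data are all illustrative:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def mse_loss(y, a):
    return np.mean((a - y) ** 2)

def negative_gradient(y, a):
    # For MSE(y, a) = (1/n) * sum_i (a_i - y_i)^2 the derivative with respect
    # to a_i is 2 * (a_i - y_i) / n, so the negative gradient points from the
    # current prediction towards the correct answer
    return -2.0 * (a - y) / len(y)

def boosting_step(X, y, current_prediction):
    g = negative_gradient(y, current_prediction)
    # The base learner is trained to predict the negative gradient g_N
    base_learner = DecisionTreeRegressor(max_depth=3)
    base_learner.fit(X, g)
    b = base_learner.predict(X)
    # gamma_N is picked by iterating over candidate values and keeping the one
    # with the smallest loss (a crude stand-in for a proper line search)
    candidates = np.linspace(0.0, 2.0 * len(y), 401)
    gamma = min(candidates, key=lambda c: mse_loss(y, current_prediction + c * b))
    return gamma, current_prediction + gamma * b

# Illustrative usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(size=200)
prediction = np.zeros_like(y)
for _ in range(5):
    gamma, prediction = boosting_step(X, y, prediction)
print("MSE after 5 steps:", mse_loss(y, prediction))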

Gradient boosting regularization

Regularization can be used to reduce overfitting in gradient boosting. While in linear regression regularization meant reducing the weights, in gradient boosting it is achieved by:

  1. step size reduction
  2. adjustment of tree parameters
  3. subsample randomization for the base learners b_i.

Reduce the step size. Revise the formula for calculating predictions at step N:

a_N(x)=a_{N-1}(x)+\gamma_N b_N(x)

Introduce the \eta coefficient. It controls the learning rate and can be used to reduce the step size:

a_N(x)=a_{N-1}(x)+\eta\times\gamma_N b_N(x)

The value for this coefficient is picked by iterating over different values in the range from 0 to 1. A smaller value means a smaller step towards the negative gradient and a higher accuracy of the ensemble. But if the learning rate is too low, the training process will take too long.
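In the from-scratch sketch above, the learning rate enters the update like this (the value 0.1 and the number of iterations are illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

eta = 0.1  # learning rate picked from the range (0, 1]
ensemble_prediction = np.zeros_like(y)

for _ in range(100):
    residual = y - ensemble_prediction
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    # a_N(x) = a_{N-1}(x) + eta * b_N(x): smaller steps, so more iterations are needed
    ensemble_prediction += eta * tree.predict(X)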

Another way to regularize gradient boosting is to adjust tree parameters. We can limit the tree depth or number of elements in each node, try different values, and see how it affects the result.
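For instance, with scikit-learn's gradient boosting implementation (used here purely as an illustration), the tree parameters are limited like this:

from sklearn.ensemble import GradientBoostingRegressor

# Shallower trees with larger leaves give simpler base learners
# that are less prone to overfitting
model = GradientBoostingRegressor(
    max_depth=3,          # limit the tree depth
    min_samples_leaf=10,  # require at least 10 objects in each leaf
    n_estimators=100,
)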

A third method of regularization is working with subsamples: the algorithm trains each base learner on a random subsample of the data instead of the whole set. This version of the algorithm is similar to stochastic gradient descent (SGD) and is called stochastic gradient boosting.
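In scikit-learn, for example, this variant is switched on by setting the subsample parameter below 1 (shown as an illustration):

from sklearn.ensemble import GradientBoostingRegressor

# Each base learner is trained on a random half of the training set,
# which is the stochastic gradient boosting variant described above
model = GradientBoostingRegressor(subsample=0.5, n_estimators=100)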

Libraries for gradient boosting

  1. XGBoost (extreme gradient boosting) is a popular gradient boosting library on Kaggle. Open source. Released in 2014.
  2. LightGBM (light gradient boosting machine). Developed by Microsoft. Fast and accurate gradient boosting training. Directly works with categorical features. Released in 2017. Comparison with XGBoost: https://lightgbm.readthedocs.io/en/latest/Experiments.html
  3. CatBoost (categorical boosting). Developed by Yandex. Often superior to other algorithms in terms of evaluation metrics, according to the developers' benchmarks. Applies various encoding techniques for categorical features (LabelEncoding, One-Hot Encoding). Released in 2017. Comparison with XGBoost and LightGBM: https://catboost.ai/#benchmark

Import CatBoostClassifier from the library and create a model. Since we have a classification problem, specify the logistic loss function. Take 10 iterations so that we don't have to wait too long.

from catboost import CatBoostClassifier

model = CatBoostClassifier(loss_function="Logloss", iterations=10)

Train the model with the fit() method. In addition to target and features, pass the categorical features to the model:

# cat_features - categorical features

model.fit(features_train, target_train, cat_features=cat_features)

When we have many iterations and don't want to output information for each one, use the verbose argument:

model = CatBoostClassifier(loss_function="Logloss", iterations=50)
model.fit(features_train, target_train, cat_features=cat_features, verbose=10)

Calculate the prediction with the predict() method:

pred_valid = model.predict(features_valid)
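If class probabilities are needed instead of labels, CatBoostClassifier also provides the predict_proba() method. The quality check below is a sketch that assumes a target_valid variable holding the correct answers for the validation set:

from sklearn.metrics import accuracy_score

# Probability of the positive class for each object in the validation set
prob_valid = model.predict_proba(features_valid)[:, 1]

# Accuracy of the label predictions computed above
print("Accuracy:", accuracy_score(target_valid, pred_valid))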