Knowledge Base

Gradient Boosting

Ensembles and boosting

An ensemble is a set of models for solving the same problem. The strength of ensembles is that the average error of a group of models is smaller than their individual errors.
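As a rough illustration (a synthetic sketch with made-up, normally distributed errors), averaging the predictions of several imperfect models gives a smaller error than the models give on average:

import numpy as np

# Toy illustration: 100 imperfect "models" predict the same true value,
# each with its own random error
rng = np.random.default_rng(0)
y_true = 10.0
predictions = y_true + rng.normal(loc=0.0, scale=2.0, size=100)

mean_individual_error = np.mean(np.abs(predictions - y_true))
ensemble_error = np.abs(predictions.mean() - y_true)

print("Average error of a single model:", mean_individual_error)
print("Error of the averaged prediction:", ensemble_error)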

Another approach to ensemble building is boosting, where each subsequent model takes into account the errors of the previous ones, and the final prediction combines the forecasts of all base learners. Take a look:

a_N(x)=\sum_{k=1}^N \gamma_k b_k(x)

where a_N(x) is the ensemble prediction, N is the number of base learners, b_k(x) is the prediction of the k-th base learner, and \gamma_k is the model weight.
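To make the formula concrete, here is a minimal sketch (with made-up base learner predictions and weights) of how the ensemble prediction is assembled as a weighted sum:

import numpy as np

# b_k(x) for three hypothetical base learners on five objects
base_predictions = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],    # b_1(x)
    [0.5, 0.4, 0.3, 0.2, 0.1],    # b_2(x)
    [0.1, -0.2, 0.0, 0.3, -0.1],  # b_3(x)
])
gammas = np.array([1.0, 0.8, 0.5])  # model weights gamma_k

# a_N(x) = sum over k of gamma_k * b_k(x)
ensemble_prediction = gammas @ base_predictions
print(ensemble_prediction)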

For example, we are dealing with a regression task. We have n observations with features x and correct answers y. Our task is to minimize the MSE loss function:

\text{MSE}(y,a)=\frac{1}{n}\sum_{i=1}^n(a(x_i)-y_i)^2\rightarrow\min_{a(x)}
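The same loss written directly in code, on made-up answers and predictions:

import numpy as np

# MSE(y, a) = (1/n) * sum_i (a(x_i) - y_i)^2
y = np.array([3.0, 5.0, 2.0])  # correct answers
a = np.array([2.5, 5.5, 2.0])  # hypothetical model predictions
print(np.mean((a - y) ** 2))   # 0.1666...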

For convenience, equate the model weights to unity:

\gamma_k=1, \text{ for all } k=1,\dots,N

We get:

a_N(x)=\sum_{k=1}^N b_k(x)

Now we create an ensemble of sequential models.

First, build the base learner b_1 by solving the minimization task:

b_1=\argmin_b\frac{1}{n}\sum_{i=1}^n(b(x_i)-y_i)^2

The result is this ensemble:

a_1(x)=b_1(x)

Define the residual: it is the difference between the correct answers and the prediction at the first step:

e_{1,i}=y_i-b_1(x_i)

At the second step, we build the model like this:

b_2=\argmin_b\frac{1}{n}\sum_{i=1}^n(b(x_i)-e_{1,i})^2

The ensemble will take the following form:

a_2(x)=\sum_{k=1}^2 b_k(x)=b_1(x)+b_2(x)

At each subsequent step, the algorithm minimizes the ensemble error from the preceding step.

Let's summarize the formulas. At step N-1, the residual is calculated as follows:

e_{N-1,i}=y_i-a_{N-1}(x_i)

The ensemble itself is represented as the sum of predictions of all the base learners combined up to this step:

a_{N-1}(x)=\sum_{k=1}^{N-1}b_k(x)

So, at step N, the algorithm picks the model that best corrects the ensemble error from step N-1:

b_N(x)=\argmin_b\frac{1}{n}\sum_{i=1}^n(b(x_i)-e_{N-1,i})^2
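To tie these steps together, here is a minimal from-scratch sketch of boosting on residuals for the MSE loss. It assumes unit weights (gamma_k = 1) and shallow decision trees as base learners; the dataset and the parameter values are illustrative:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

n_estimators = 10
ensemble_prediction = np.zeros_like(y)

for _ in range(n_estimators):
    residual = y - ensemble_prediction      # e_{N-1,i} = y_i - a_{N-1}(x_i)
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residual)                   # b_N learns to predict the residual
    ensemble_prediction += tree.predict(X)  # a_N(x) = a_{N-1}(x) + b_N(x)

print("MSE of the ensemble:", np.mean((ensemble_prediction - y) ** 2))

Each iteration shrinks the training error because the new tree corrects whatever the previous trees got wrong.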

Gradient boosting

Suppose our loss function is L(y, a) and it is differentiable. Let's recall the ensemble formula:

a_N(x)=a_{N-1}(x)+\gamma_N b_N(x)

At each step, select the answers that will minimize the function:

L(y,a(x))\rightarrow\min_a

Minimize the function with gradient descent. To do so, at each step, calculate the negative gradient of the loss function with respect to the prediction, g_N:

g_N(x)=-\nabla L(y,a_{N-1}(x)+a)

To push the predictions towards the correct answers, the base learner learns to predict g_N:

b_N(x)=\argmin_b\frac{1}{n}\sum_{i=1}^n(b(x_i)-g_N(x_i))^2

Obtain the weight for b_N by solving a minimization task, iterating over various values:

\gamma_N=\argmin_\gamma L(y,a_{N-1}(x)+\gamma b_N(x))

It is the coefficient for the base learner that helps adjust the ensemble to make predictions as accurate as possible.

Gradient boosting is suitable for any loss function that has a derivative: for example, the mean squared error in a regression task or the logarithmic loss in a binary classification task.
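Here is a sketch of one gradient boosting step for the MSE loss: the negative gradient is computed explicitly, and the weight gamma_N is found by iterating over candidate values, as described above. The function names, the candidate range, and the synthetic data are all illustrative:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def mse_loss(y, a):
    return np.mean((a - y) ** 2)

def negative_gradient(y, a):
    # For MSE(y, a) = (1/n) * sum_i (a_i - y_i)^2 the derivative with respect
    # to a_i is 2 * (a_i - y_i) / n, so the negative gradient points from the
    # current prediction towards the correct answer
    return -2.0 * (a - y) / len(y)

def boosting_step(X, y, current_prediction):
    g = negative_gradient(y, current_prediction)
    # The base learner is trained to predict the negative gradient g_N
    base_learner = DecisionTreeRegressor(max_depth=3)
    base_learner.fit(X, g)
    b = base_learner.predict(X)
    # gamma_N is picked by iterating over candidate values and keeping the one
    # with the smallest loss (a crude stand-in for a proper line search)
    candidates = np.linspace(0.0, 2.0 * len(y), 401)
    gamma = min(candidates, key=lambda c: mse_loss(y, current_prediction + c * b))
    return gamma, current_prediction + gamma * b

# Illustrative usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(size=200)
prediction = np.zeros_like(y)
for _ in range(5):
    gamma, prediction = boosting_step(X, y, prediction)
print("MSE after 5 steps:", mse_loss(y, prediction))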

Gradient boosting regularization

Regularization can be used to reduce overfitting in gradient boosting. While in linear regression regularization meant reducing the weights, in gradient boosting it is achieved by:

  1. step size reduction
  2. adjustment of tree parameters
  3. subsample randomization for the base learners b_i.

Reduce the step size. Revise the formula for calculating predictions at step N:

a_N(x)=a_{N-1}(x)+\gamma_N b_N(x)

Introduce the \eta coefficient. It controls the learning rate and can be used to reduce the step size:

a_N(x)=a_{N-1}(x)+\eta\times\gamma_N b_N(x)

The value for this coefficient is picked by iterating over different values in the range from 0 to 1. A smaller value means a smaller step towards the negative gradient and a higher accuracy of the ensemble. But if the learning rate is too low, the training process will take too long.
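In the from-scratch sketch above, the learning rate enters the update like this (the value 0.1 and the number of iterations are illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

eta = 0.1  # learning rate picked from the range (0, 1]
ensemble_prediction = np.zeros_like(y)

for _ in range(100):
    residual = y - ensemble_prediction
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    # a_N(x) = a_{N-1}(x) + eta * b_N(x): smaller steps, so more iterations are needed
    ensemble_prediction += eta * tree.predict(X)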

Another way to regularize gradient boosting is to adjust tree parameters. We can limit the tree depth or number of elements in each node, try different values, and see how it affects the result.
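For instance, with scikit-learn's gradient boosting implementation (used here purely as an illustration), the tree parameters are limited like this:

from sklearn.ensemble import GradientBoostingRegressor

# Shallower trees with larger leaves give simpler base learners
# that are less prone to overfitting
model = GradientBoostingRegressor(
    max_depth=3,          # limit the tree depth
    min_samples_leaf=10,  # require at least 10 objects in each leaf
    n_estimators=100,
)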

A third method of regularization is working with subsamples: the algorithm trains each base learner on a random subsample of the data instead of the whole set. This version of the algorithm is similar to stochastic gradient descent (SGD) and is called stochastic gradient boosting.
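In scikit-learn, for example, this variant is switched on by setting the subsample parameter below 1 (shown as an illustration):

from sklearn.ensemble import GradientBoostingRegressor

# Each base learner is trained on a random half of the training set,
# which is the stochastic gradient boosting variant described above
model = GradientBoostingRegressor(subsample=0.5, n_estimators=100)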

Libraries for gradient boosting

  1. XGBoost (extreme gradient boosting) is a popular gradient boosting library on Kaggle. Open source. Released in 2014.
  2. LightGBM (light gradient boosting machine). Developed by Microsoft. Fast and accurate gradient boosting training. Directly works with categorical features. Released in 2017. Comparison with XGBoost: https://lightgbm.readthedocs.io/en/latest/Experiments.html
  3. CatBoost (categorical boosting). Developed by Yandex. Often superior to other algorithms in terms of evaluation metrics, according to the developers' benchmarks. Applies various encoding techniques for categorical features (LabelEncoding, One-Hot Encoding). Released in 2017. Comparison with XGBoost and LightGBM: https://catboost.ai/#benchmark

Import CatBoostClassifier from the library and create a model. Since we have a classification problem, specify the logistic loss function. Take 10 iterations so that we don't have to wait too long.

from catboost import CatBoostClassifier

model = CatBoostClassifier(loss_function="Logloss", iterations=10)

Train the model with the fit() method. In addition to target and features, pass the categorical features to the model:

# cat_features - categorical features

model.fit(features_train, target_train, cat_features=cat_features)

When we have many iterations and don't want to output information for each one, use the verbose argument:

model = CatBoostClassifier(loss_function="Logloss", iterations=50)
model.fit(features_train, target_train, cat_features=cat_features, verbose=10)

Calculate the prediction with the predict() method:

pred_valid = model.predict(features_valid)
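If class probabilities are needed instead of labels, CatBoostClassifier also provides the predict_proba() method. The quality check below is a sketch that assumes a target_valid variable holding the correct answers for the validation set:

from sklearn.metrics import accuracy_score

# Probability of the positive class for each object in the validation set
prob_valid = model.predict_proba(features_valid)[:, 1]

# Accuracy of the label predictions computed above
print("Accuracy:", accuracy_score(target_valid, pred_valid))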