Knowledge Base

Gradient Descent

Loss function minimization

Denote the loss function as L(y, a), where vector y represents the correct answers and vector a represents the predictions. The model training task can then be written in terms of the loss function as follows:

w = \arg\min_w L(y, a)

To calculate MSE, or the quadratic loss function, square the differences between the correct answers and the predictions and sum them:

L(y, a) = \sum_{i=1}^{n} (a_i - y_i)^2

If we want the function in the task to be less sensitive to outliers, we use the absolute loss function (MAE), which you are already familiar with, instead of MSE. Just like MSE, we use it here as a loss function, not an evaluation metric.

L(y, a) = \sum_{i=1}^{n} |a_i - y_i|
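The two loss functions above can be sketched in NumPy. The answer and prediction vectors below are made-up values for illustration; the last observation is an outlier that dominates the quadratic loss but not the absolute loss:

```python
import numpy as np

# Made-up correct answers and predictions; the last pair is an outlier.
y = np.array([1.0, 2.0, 3.0, 10.0])
a = np.array([1.5, 2.0, 2.5, 4.0])

def quadratic_loss(y, a):
    # L(y, a) = sum of squared differences
    return ((a - y) ** 2).sum()

def absolute_loss(y, a):
    # L(y, a) = sum of absolute differences
    return np.abs(a - y).sum()

print(quadratic_loss(y, a))  # 36.5 — the outlier contributes 36 of it
print(absolute_loss(y, a))   # 7.0 — the outlier contributes only 6
```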

For classification problems, the accuracy metric is often used. But it is rarely suitable as a loss function. This is because the accuracy function does not have a derivative that shows how the function changes with small changes in the argument.

Replace accuracy with the negative logarithmic likelihood, or the logistic loss function. This loss function adds up the logarithms of probabilities, depending on the observation: if the correct answer is one, log₁₀(a_i) is added; if it is zero, log₁₀(1 - a_i) is added. The logarithm of a number is the power to which the base (in our case, 10) must be raised to obtain that number.

L(y, a) = -\sum_{i=1}^{n} \begin{cases} \log_{10} a_i & \text{if } y_i = 1 \\ \log_{10}(1 - a_i) & \text{if } y_i = 0 \end{cases}

a_i is the probability of class 1 for the observation with index i. That is, the value of a_i should be as high as possible for a positive-class observation, and as low as possible for a negative-class observation.

The likelihood function calculates the probability of the model giving correct answers for all observations, given that it outputs the probabilities a_i.

\text{Likelihood}(y, a) = \prod_{i=1}^{n} \begin{cases} a_i & \text{if } y_i = 1 \\ 1 - a_i & \text{if } y_i = 0 \end{cases}
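To see the connection between the likelihood and the logistic loss, here is a small sketch with made-up labels and predicted probabilities (the values are assumptions for illustration):

```python
import numpy as np

y = np.array([1, 0, 1, 1])          # made-up correct answers
a = np.array([0.9, 0.2, 0.8, 0.6])  # made-up probabilities of class 1

# Likelihood: product of a_i where y_i = 1 and (1 - a_i) where y_i = 0.
likelihood = np.prod(np.where(y == 1, a, 1 - a))

# Logistic loss: negative sum of base-10 logarithms of the same factors.
logloss = -np.sum(np.where(y == 1, np.log10(a), np.log10(1 - a)))

# The two agree: the loss is the negative logarithm of the likelihood.
print(np.isclose(logloss, -np.log10(likelihood)))  # True
```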

Taking the logarithm "stretches" the values over a wider range.

Unlike accuracy, the logarithmic loss function has a derivative.

Function gradient

The loss function minimum cannot always be found manually. The function gradient can help to find the direction.

The loss function depends on the parameters of the algorithm. It is a function of a vector argument: it takes a vector and returns a scalar.

The gradient of such a function is a vector consisting of the partial derivatives of the output with respect to each argument. It is denoted as ∇. The gradient of a function f of an n-dimensional vector x is calculated as follows:

\nabla f(x) = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n}\right)

where ∂f/∂x_i is the partial derivative of the function f with respect to the argument x_i. For a function of one argument, the gradient is simply the derivative.
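A minimal sketch of this definition, using the assumed example function f(x) = x_1² + 3·x_2, with the analytic gradient checked against a finite-difference approximation:

```python
import numpy as np

def f(x):
    # assumed example function: f(x) = x1^2 + 3*x2
    return x[0] ** 2 + 3 * x[1]

def grad_f(x):
    # vector of partial derivatives: df/dx1 = 2*x1, df/dx2 = 3
    return np.array([2 * x[0], 3.0])

x = np.array([2.0, 5.0])
print(grad_f(x))  # [4. 3.]

# Check each partial derivative against a small finite-difference step.
eps = 1e-6
numeric = np.array([
    (f(x + np.array([eps, 0.0])) - f(x)) / eps,
    (f(x + np.array([0.0, eps])) - f(x)) / eps,
])
print(np.allclose(grad_f(x), numeric, atol=1e-4))  # True
```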

The gradient indicates the direction in which the function grows the fastest. We need the opposite vector, the one that shows the fastest decrease — the negative gradient. It can be found as follows:

-\nabla f(x)

Gradient descent

Gradient descent is an iterative algorithm for finding the loss function minimum. It moves in the direction of the negative gradient and gradually approaches the minimum.

To begin the descent, pick the initial value of the argument, the vector x^0. From there, the first gradient descent step is made. The next point, x^1, is calculated as follows: add the negative gradient multiplied by the size of the gradient descent step (μ) to the point x^0.

x^1 = x^0 + \mu \times (-\nabla f(x^0))

The value μ\mu controls the size of the gradient descent step. If the step is small, the descent will take many iterations, but each one will bring us closer to the loss function minimum. When the step is too big, we can miss the minimum.
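The effect of the step size can be sketched on the one-dimensional function f(x) = x², whose gradient is 2x (an assumed example):

```python
def gradient(x):
    # gradient of f(x) = x^2 is 2x (one-dimensional example)
    return 2 * x

def descend(x, step_size, iterations):
    # repeatedly step in the direction of the negative gradient
    for _ in range(iterations):
        x = x - step_size * gradient(x)
    return x

x0 = 5.0
print(descend(x0, 0.1, 100))  # small step: ends up very close to the minimum at 0
print(descend(x0, 1.1, 100))  # step too big: each update overshoots and diverges
```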

Repeat to obtain the values of the argument at subsequent iterations. The iteration number is denoted as t. To get the new value x^t, multiply the negative gradient at the previous point by the step size and add the result to the previous point:

x^t = x^{t-1} - \mu \times \nabla f(x^{t-1})

Gradient descent is complete when the algorithm has performed the required number of iterations, or when the value of x stops changing.

Gradient descent in Python

  1. In the arguments of the algorithm, set the initial value, x^0.

  2. Calculate the gradient of the loss function (the vector of partial derivatives with respect to each component of the input vector x).

  3. Find the new value using the formula:

    x^t = x^{t-1} - \mu \times \nabla f(x^{t-1})

    where μ is the step size, set in the arguments of the algorithm.

  4. Perform the number of iterations specified in the arguments.

    import numpy as np

    def func(x):
        # function to be minimized
        ...

    def gradient(x):
        # gradient of the func function
        ...

    def gradient_descent(initialization, step_size, iterations):
        x = initialization
        for i in range(iterations):
            x = x - step_size * gradient(x)
        return x
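Filling the template with an assumed example function, the sum of squared components (whose gradient is 2x), the descent converges to the minimum at zero:

```python
import numpy as np

def func(x):
    # assumed example function to minimize: sum of squared components
    return (x ** 2).sum()

def gradient(x):
    # analytic gradient of func: 2x
    return 2 * x

def gradient_descent(initialization, step_size, iterations):
    x = initialization
    for i in range(iterations):
        x = x - step_size * gradient(x)
    return x

minimum = gradient_descent(np.array([4.0, -3.0]), 0.1, 200)
print(minimum)  # very close to [0, 0], the minimum of func
```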