Knowledge Base

Gradient Descent

Loss function minimization

Denote the loss function as L(y, a), where vector y represents the correct answers and vector a represents the predictions. The model training task can then be written in terms of the loss function as follows:

w = \arg\min_w L(y, a)

To calculate MSE, or the quadratic loss function, square the differences between the correct answers and the predictions and sum them:

L(y, a) = \sum_{i=1}^{n} (a_i - y_i)^2

If we want the function in the task to be less sensitive to outliers, we use the absolute loss function (MAE), which you are already familiar with, instead of MSE. Just like MSE, we use it here as a loss function, not an evaluation metric.

L(y, a) = \sum_{i=1}^{n} |a_i - y_i|
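The two loss functions above can be sketched in NumPy. The answer and prediction vectors below are made-up values for illustration; the last observation is an outlier that dominates the quadratic loss but not the absolute loss:

```python
import numpy as np

# Made-up correct answers and predictions; the last pair is an outlier.
y = np.array([1.0, 2.0, 3.0, 10.0])
a = np.array([1.5, 2.0, 2.5, 4.0])

def quadratic_loss(y, a):
    # L(y, a) = sum of squared differences
    return ((a - y) ** 2).sum()

def absolute_loss(y, a):
    # L(y, a) = sum of absolute differences
    return np.abs(a - y).sum()

print(quadratic_loss(y, a))  # 36.5 — the outlier contributes 36 of it
print(absolute_loss(y, a))   # 7.0 — the outlier contributes only 6
```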

For classification problems, the accuracy metric is often used. But it is rarely suitable as a loss function. This is because the accuracy function does not have a derivative that shows how the function changes with small changes in the argument.

Replace accuracy with the negative logarithmic likelihood, or the logistic loss function. This loss function adds up the logarithms of probabilities, depending on the observation: if the correct answer is one, log₁₀(a_i) is added; if it is zero, log₁₀(1 - a_i) is added. The logarithm of a number is the power to which the base (in our case, 10) must be raised to obtain that number.

L(y, a) = -\sum_{i=1}^{n} \begin{cases} \log_{10} a_i & \text{if } y_i = 1 \\ \log_{10}(1 - a_i) & \text{if } y_i = 0 \end{cases}

a_i is the probability of class 1 for the observation with index i. That is, the value of a_i should be as high as possible for a positive-class observation, and as low as possible for a negative-class observation.

The likelihood function calculates the probability of the model giving correct answers for all observations, given that it outputs the probabilities a_i.

\text{Likelihood}(y, a) = \prod_{i=1}^{n} \begin{cases} a_i & \text{if } y_i = 1 \\ 1 - a_i & \text{if } y_i = 0 \end{cases}
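To see the connection between the likelihood and the logistic loss, here is a small sketch with made-up labels and predicted probabilities (the values are assumptions for illustration):

```python
import numpy as np

y = np.array([1, 0, 1, 1])          # made-up correct answers
a = np.array([0.9, 0.2, 0.8, 0.6])  # made-up probabilities of class 1

# Likelihood: product of a_i where y_i = 1 and (1 - a_i) where y_i = 0.
likelihood = np.prod(np.where(y == 1, a, 1 - a))

# Logistic loss: negative sum of base-10 logarithms of the same factors.
logloss = -np.sum(np.where(y == 1, np.log10(a), np.log10(1 - a)))

# The two agree: the loss is the negative logarithm of the likelihood.
print(np.isclose(logloss, -np.log10(likelihood)))  # True
```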

Taking the logarithm "stretches" the values over a wider range.

Unlike accuracy, the logarithmic loss function has a derivative.

Function gradient

The loss function minimum cannot always be found manually. The function gradient can help to find the direction.

The loss function depends on the parameters of the algorithm. It is a function of a vector argument: it takes a vector and returns a scalar.

The gradient of such a function is a vector consisting of the partial derivatives of the output with respect to each argument. It is denoted as ∇. The gradient of a function f of an n-dimensional vector x is calculated as follows:

\nabla f(x) = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n}\right)

where ∂f/∂x_i is the partial derivative of the function f with respect to the argument x_i. For a function of one argument, the gradient is simply the derivative.
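A minimal sketch of this definition, using the assumed example function f(x) = x_1² + 3·x_2, with the analytic gradient checked against a finite-difference approximation:

```python
import numpy as np

def f(x):
    # assumed example function: f(x) = x1^2 + 3*x2
    return x[0] ** 2 + 3 * x[1]

def grad_f(x):
    # vector of partial derivatives: df/dx1 = 2*x1, df/dx2 = 3
    return np.array([2 * x[0], 3.0])

x = np.array([2.0, 5.0])
print(grad_f(x))  # [4. 3.]

# Check each partial derivative against a small finite-difference step.
eps = 1e-6
numeric = np.array([
    (f(x + np.array([eps, 0.0])) - f(x)) / eps,
    (f(x + np.array([0.0, eps])) - f(x)) / eps,
])
print(np.allclose(grad_f(x), numeric, atol=1e-4))  # True
```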

The gradient indicates the direction in which the function grows the fastest. We need the opposite vector, the one that shows the fastest decrease — the negative gradient. It can be found as follows:

-\nabla f(x)

Gradient descent

Gradient descent is an iterative algorithm for finding the loss function minimum. It moves in the direction of the negative gradient and gradually approaches the minimum.

To begin the descent, pick the initial value of the argument, the vector x^0. From there, the first gradient descent step is made. The next point, x^1, is calculated as follows: add the negative gradient multiplied by the size of the gradient descent step (μ) to the point x^0.

x^1 = x^0 + \mu \times (-\nabla f(x^0))

The value μ\mu controls the size of the gradient descent step. If the step is small, the descent will take many iterations, but each one will bring us closer to the loss function minimum. When the step is too big, we can miss the minimum.
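The effect of the step size can be sketched on the one-dimensional function f(x) = x², whose gradient is 2x (an assumed example):

```python
def gradient(x):
    # gradient of f(x) = x^2 is 2x (one-dimensional example)
    return 2 * x

def descend(x, step_size, iterations):
    # repeatedly step in the direction of the negative gradient
    for _ in range(iterations):
        x = x - step_size * gradient(x)
    return x

x0 = 5.0
print(descend(x0, 0.1, 100))  # small step: ends up very close to the minimum at 0
print(descend(x0, 1.1, 100))  # step too big: each update overshoots and diverges
```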

Repeat to obtain the values of the argument at subsequent iterations. The iteration number is denoted as t. To get the new value x^t, multiply the negative gradient at the previous point by the step size and add the result to the previous point:

x^t = x^{t-1} - \mu \times \nabla f(x^{t-1})

Gradient descent is complete when the algorithm has performed the required number of iterations, or when the value of x stops changing.

Gradient descent in Python

  1. In the arguments of the algorithm, set the initial value, x^0.

  2. Calculate the gradient of the loss function (the vector of partial derivatives with respect to each component of the input vector x).

  3. Find the new value using the formula:

    x^t = x^{t-1} - \mu \times \nabla f(x^{t-1})

    where μ is the step size, set in the arguments of the algorithm.

  4. Perform the number of iterations specified in the arguments.

    import numpy as np

    def func(x):
        # function to be minimized
        ...

    def gradient(x):
        # gradient of the func function
        ...

    def gradient_descent(initialization, step_size, iterations):
        x = initialization
        for i in range(iterations):
            x = x - step_size * gradient(x)
        return x
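Filling the template with an assumed example function, the sum of squared components (whose gradient is 2x), the descent converges to the minimum at zero:

```python
import numpy as np

def func(x):
    # assumed example function to minimize: sum of squared components
    return (x ** 2).sum()

def gradient(x):
    # analytic gradient of func: 2x
    return 2 * x

def gradient_descent(initialization, step_size, iterations):
    x = initialization
    for i in range(iterations):
        x = x - step_size * gradient(x)
    return x

minimum = gradient_descent(np.array([4.0, -3.0]), 0.1, 200)
print(minimum)  # very close to [0, 0], the minimum of func
```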