Gradient Descent Training
Gradient descent for linear regression
Write down the loss function in vector form to find its gradient. Express MSE as a scalar product of the difference of vectors:

$$\text{MSE}(y, a) = \frac{1}{n} (y - a)^\top (y - a)$$
where $y$ is the correct answer vector, and $a$ is the prediction vector.
Transposing the first vector turns it into a row vector, which we can then multiply by the second (column) vector to obtain a scalar.
Combine the MSE and linear regression formulas ($a = Xw$):

$$L(w) = \frac{1}{n} (Xw - y)^\top (Xw - y)$$
Find the gradient of this function with respect to the parameter vector $w$. Gradients of scalar functions of a vector are calculated in much the same way as ordinary derivatives. By the chain rule, only the factor $X^\top$ remains from the first parenthesis $(Xw - y)^\top$:

$$\nabla L(w) = \frac{2}{n} X^\top (Xw - y)$$
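As a quick illustration, the gradient formula translates directly into NumPy. The helper name `mse_gradient` and the toy data are ours:

```python
import numpy as np

def mse_gradient(X, y, w):
    # (2 / n) * X^T (Xw - y)
    n = X.shape[0]
    return 2 * X.T.dot(X.dot(w) - y) / n

# toy example: 4 observations, 2 features
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 0.5], [0.5, 3.0]])
y = np.array([5.0, 4.0, 4.5, 6.5])
w = np.zeros(2)
print(mse_gradient(X, y, w))
```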
Stochastic gradient descent
We can calculate the gradient using batches, or mini-batches. For the algorithm to "see" the whole training set, the batches should be re-formed randomly at each iteration. This method is called mini-batch stochastic gradient descent, or simply stochastic gradient descent (SGD).
One batch typically contains around 100–200 observations (the batch size). When the SGD algorithm has gone through all the batches once, one epoch has ended. The number of batches equals the number of iterations needed to complete one epoch.
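For example, with a hypothetical training set of 1,000 observations and a batch size of 100, one epoch takes 10 iterations:

```python
n = 1000         # observations in the training set (made-up number)
batch_size = 100
iterations_per_epoch = n // batch_size
print(iterations_per_epoch)  # 10
```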
Here's how SGD works:
- Input hyperparameters: batch size, number of epochs, and step size
- Define the initial values of the model weights
- Split the training set into batches for each epoch (reshuffling the observations randomly, as shown in the sketch after this list)
- For each batch:
- Calculate the loss function gradient
- Update the model weights (add the negative gradient multiplied by the step size to the current weights)
- The algorithm returns the final model weights
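Here's a minimal sketch of one such epoch, assuming `X` and `y` are NumPy arrays; `np.random.permutation` provides the random reshuffling, and the function name `sgd_epoch` is ours:

```python
import numpy as np

def sgd_epoch(X, y, w, step_size, batch_size):
    # reshuffle the observations so batches differ between epochs
    order = np.random.permutation(X.shape[0])
    X, y = X[order], y[order]

    batches_count = X.shape[0] // batch_size
    for i in range(batches_count):
        X_batch = X[i * batch_size:(i + 1) * batch_size]
        y_batch = y[i * batch_size:(i + 1) * batch_size]
        # MSE gradient for the batch
        gradient = 2 * X_batch.T.dot(X_batch.dot(w) - y_batch) / batch_size
        # update the weights: step against the gradient
        w = w - step_size * gradient
    return w
```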
You can find the computational complexity of SGD with the following definitions:
- $n$ — the number of observations in the whole training set
- $b$ — the batch size
- $p$ — the number of features

Computing the gradient for one batch takes on the order of $b \cdot p$ operations, and one epoch consists of $n / b$ batches, so a full epoch takes on the order of $n \cdot p$ operations.
SGD in Python
Declare the model class and add the class initializer (`__init__`):
```python
class SGDLinearRegression:
    def __init__(self):
        ...
```
Add one hyperparameter, `step_size`, to the class initializer:
```python
class SGDLinearRegression:
    def __init__(self, step_size):
        ...
```
Now we can pass the step size to the model when creating an instance of the class:
```python
# you can choose the step size arbitrarily
model = SGDLinearRegression(0.01)
```
Save the step size as an attribute:
```python
class SGDLinearRegression:
    def __init__(self, step_size):
        self.step_size = step_size
```
Here's the full implementation of the SGD algorithm:
```python
import numpy as np

class SGDLinearRegression:
    def __init__(self, step_size, epochs, batch_size):
        self.step_size = step_size
        self.epochs = epochs
        self.batch_size = batch_size

    def fit(self, train_features, train_target):
        # prepend a column of ones so that w[0] acts as the bias term
        X = np.concatenate(
            (np.ones((train_features.shape[0], 1)), train_features),
            axis=1
        )
        y = train_target
        w = np.zeros(X.shape[1])

        for _ in range(self.epochs):
            batches_count = X.shape[0] // self.batch_size
            for i in range(batches_count):
                begin = i * self.batch_size
                end = (i + 1) * self.batch_size
                X_batch = X[begin:end, :]
                y_batch = y[begin:end]

                # MSE gradient for the current batch
                gradient = 2 * X_batch.T.dot(X_batch.dot(w) - y_batch) / X_batch.shape[0]

                # step against the gradient
                w -= self.step_size * gradient

        self.w = w[1:]
        self.w0 = w[0]

    def predict(self, test_features):
        return test_features.dot(self.w) + self.w0
```
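A quick usage sketch on made-up data (the hyperparameter values are arbitrary):

```python
import numpy as np

np.random.seed(42)
train_features = np.random.rand(1000, 3)
# targets generated from known weights [2, -1, 0.5] and bias 3
train_target = train_features.dot(np.array([2.0, -1.0, 0.5])) + 3.0

model = SGDLinearRegression(step_size=0.01, epochs=100, batch_size=100)
model.fit(train_features, train_target)
print(model.predict(train_features[:5]))
print(train_target[:5])
```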
Linear regression regularization
Regularization reduces overfitting. For a linear regression model, regularization means limiting the magnitude of the weights. To see how large the weights are, we calculate the distance between the weight vector and the zero vector.
To limit the weight values, include the scalar product of the weights in the loss function formula:

$$L(w) = \frac{1}{n} (Xw - y)^\top (Xw - y) + w^\top w$$
The derivative of $w^\top w$ with respect to $w$ is equal to $2w$. Calculate the loss function gradient:

$$\nabla L(w) = \frac{2}{n} X^\top (Xw - y) + 2w$$
To control the magnitude of regularization, add the regularization weight to the loss function formula. It is denoted as $\lambda$:

$$L(w) = \frac{1}{n} (Xw - y)^\top (Xw - y) + \lambda\, w^\top w$$
The regularization weight also appears in the gradient calculation formula:

$$\nabla L(w) = \frac{2}{n} X^\top (Xw - y) + 2\lambda w$$
When we use the Euclidean distance for weight regularization, we call the model ridge regression.
Here's how the SGD implementation looks if we take regularization into account:
```python
import numpy as np

class SGDLinearRegression:
    def __init__(self, step_size, epochs, batch_size, reg_weight):
        self.step_size = step_size
        self.epochs = epochs
        self.batch_size = batch_size
        self.reg_weight = reg_weight

    def fit(self, train_features, train_target):
        # prepend a column of ones so that w[0] acts as the bias term
        X = np.concatenate(
            (np.ones((train_features.shape[0], 1)), train_features),
            axis=1)
        y = train_target
        w = np.zeros(X.shape[1])

        for _ in range(self.epochs):
            batches_count = X.shape[0] // self.batch_size
            for i in range(batches_count):
                begin = i * self.batch_size
                end = (i + 1) * self.batch_size
                X_batch = X[begin:end, :]
                y_batch = y[begin:end]

                gradient = 2 * X_batch.T.dot(X_batch.dot(w) - y_batch) / X_batch.shape[0]
                # regularization term 2 * lambda * w; the bias w[0] is not regularized
                reg = 2 * w.copy()
                reg[0] = 0
                gradient += self.reg_weight * reg

                w -= self.step_size * gradient

        self.w = w[1:]
        self.w0 = w[0]

    def predict(self, test_features):
        return test_features.dot(self.w) + self.w0
```
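Setting the regularization weight to zero turns regularization off and recovers the previous model. A quick sketch, reusing the made-up data from the earlier example:

```python
# nonzero reg_weight shrinks the weights toward zero (ridge regression)
ridge = SGDLinearRegression(step_size=0.01, epochs=100, batch_size=100, reg_weight=0.1)
ridge.fit(train_features, train_target)

# reg_weight=0 reproduces plain SGD linear regression
plain = SGDLinearRegression(step_size=0.01, epochs=100, batch_size=100, reg_weight=0.0)
plain.fit(train_features, train_target)
```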
Basics of neural networks
Here's an example of a neural network with three inputs ($x_1$, $x_2$, $x_3$) and two outputs ($a_1$, $a_2$):
The value at each output, or neuron, is calculated in the same way as a linear regression prediction:

$$a_1 = x_1 w_{11} + x_2 w_{21} + x_3 w_{31}$$
$$a_2 = x_1 w_{12} + x_2 w_{22} + x_3 w_{32}$$

Each output value has its own weights ($w_{11}, w_{21}, w_{31}$ for $a_1$ and $w_{12}, w_{22}, w_{32}$ for $a_2$).
Here's another example. The network has three inputs ($x_1$, $x_2$, $x_3$), two hidden variables ($h_1$ and $h_2$), and one output ($a$).
Values $h_1$ and $h_2$ are passed to the logistic function $\sigma(x)$:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
In a neural network, the logistic function plays the role of the activation function. It is applied inside a neuron after the input values are multiplied by the weights, before the neuron's output becomes an input for other neurons.
Each hidden variable is equal to the weighted sum of the input values:

$$h_1 = x_1 w_{11} + x_2 w_{21} + x_3 w_{31}$$
$$h_2 = x_1 w_{12} + x_2 w_{22} + x_3 w_{32}$$
For convenience, we denote the hidden variables $h_1$ and $h_2$ as the vector $h$, and the output neuron's weights as the vector $v$. Here's the formula for calculating the neural network prediction:

$$a = \sigma(h^\top v)$$
If we put the weights of several neurons into matrices, we can get an even more complex network, for example:

$$a = \sigma(\sigma(\sigma(x W_1)\, W_2)\, W_3)$$

where:
- $x$ — input vector with dimension $p$ (the number of features)
- $W_1$ — matrix with dimension $p \times m$ ($m$ is the number of neurons in the first hidden layer)
- $W_2$ — matrix with dimension $m \times k$ ($k$ is the number of neurons in the second hidden layer)
- $W_3$ — matrix with dimension $k \times 1$
- $a$ — model prediction (a single number)
When such a neural network calculates a prediction, it sequentially performs all the operations:

$$h^{(1)} = \sigma(x W_1), \quad h^{(2)} = \sigma(h^{(1)} W_2), \quad a = \sigma(h^{(2)} W_3)$$
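Here's a minimal NumPy sketch of this forward pass; the layer sizes ($p = 3$, $m = 4$, $k = 2$) and the random weights are made-up values for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
p, m, k = 3, 4, 2              # made-up layer sizes
W1 = rng.normal(size=(p, m))
W2 = rng.normal(size=(m, k))
W3 = rng.normal(size=(k, 1))

x = np.array([1.0, 2.0, 3.0])  # input vector with p features

# apply each layer in sequence: multiply by the weights, then activate
h1 = sigmoid(x.dot(W1))
h2 = sigmoid(h1.dot(W2))
a = sigmoid(h2.dot(W3))
print(a)  # single-number prediction
```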
Training of neural networks
To train a neural network, we need to state the training objective. Any neural network can be written as a function of its input vector and its parameters. Let's define the following:
- $X$ — training set features
- $P$ — the set of all neural network parameters
- $N(X, P)$ — the neural network function
Let's take this neural network:
Neural network parameters are the weights in its neurons:

$$P = (W_1, W_2, W_3)$$
Here's the neural network function:

$$N(X, P) = \sigma(\sigma(\sigma(X W_1)\, W_2)\, W_3)$$
Let’s also define:
- $y$ — training set answers
- $L$ — loss function (for example, MSE)
Then we can state the objective of neural network training as follows:

$$\min_P L\big(y,\, N(X, P)\big)$$
The minimum of this function can also be found using the SGD algorithm.
The neural network learning algorithm is the same as the SGD algorithm for linear regression; the only difference is that instead of the linear regression gradient, we calculate the gradient of the loss with respect to all the neural network parameters.
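As an illustration, here's a minimal sketch of one SGD step for a hypothetical single-neuron network $a = \sigma(Xw)$ trained with MSE; the gradient in the comment comes from applying the chain rule through the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sgd_step(X_batch, y_batch, w, step_size):
    # forward pass
    a = sigmoid(X_batch.dot(w))
    # chain rule: d(MSE)/dw = X^T [2(a - y) * a * (1 - a)] / n
    gradient = X_batch.T.dot(2 * (a - y_batch) * a * (1 - a)) / X_batch.shape[0]
    return w - step_size * gradient

# made-up data: 100 observations, 3 features, binary targets
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (X.sum(axis=1) > 0).astype(float)

w = np.zeros(3)
for _ in range(200):
    w = sgd_step(X, y, w, step_size=0.5)
```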