Gradient Descent Training
Gradient descent for linear regression
Write down the loss function in vector form to find its gradient. Express MSE as a scalar product of the difference of vectors:

$$\text{MSE}(y, a) = \frac{1}{n} (y - a)^\top (y - a)$$
where $y$ is the correct answer vector, and $a$ is the prediction vector.
Transposing the first vector turns it into a row vector, which we can then multiply by the second (column) vector to obtain a scalar.
Combine the MSE and linear regression formulas ($a = Xw$):

$$L(w) = \frac{1}{n} (Xw - y)^\top (Xw - y)$$
Find the gradient of this function with respect to the parameter vector $w$. Gradients of scalar functions of a vector are calculated in much the same way as ordinary derivatives. By the chain rule, only the factor $X^\top$ remains from the first parenthesis $(Xw - y)^\top$:

$$\nabla L(w) = \frac{2}{n} X^\top (Xw - y)$$
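As a quick illustration, the gradient formula translates directly into NumPy. The helper name `mse_gradient` and the toy data are ours:

```python
import numpy as np

def mse_gradient(X, y, w):
    # (2 / n) * X^T (Xw - y)
    n = X.shape[0]
    return 2 * X.T.dot(X.dot(w) - y) / n

# toy example: 4 observations, 2 features
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 0.5], [0.5, 3.0]])
y = np.array([5.0, 4.0, 4.5, 6.5])
w = np.zeros(2)
print(mse_gradient(X, y, w))
```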
Stochastic gradient descent
We can calculate the gradient using batches, or mini-batches. For the algorithm to "see" the whole training set, the batches should be re-formed randomly at each iteration. This method is called mini-batch stochastic gradient descent, or simply stochastic gradient descent (SGD).
One batch typically contains around 100–200 observations (the batch size). When the SGD algorithm has gone through all the batches once, one epoch has ended. The number of batches equals the number of iterations needed to complete one epoch.
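For example, with a hypothetical training set of 1,000 observations and a batch size of 100, one epoch takes 10 iterations:

```python
n = 1000         # observations in the training set (made-up number)
batch_size = 100
iterations_per_epoch = n // batch_size
print(iterations_per_epoch)  # 10
```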
Here's how SGD works:
- Input hyperparameters: batch size, number of epochs, and step size
- Define the initial values of the model weights
- Split the training set into batches for each epoch (reshuffling the observations randomly, as shown in the sketch after this list)
- For each batch:
- Calculate the loss function gradient
- Update the model weights (add the negative gradient multiplied by the step size to the current weights)
- The algorithm returns the final model weights
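Here's a minimal sketch of one such epoch, assuming `X` and `y` are NumPy arrays; `np.random.permutation` provides the random reshuffling, and the function name `sgd_epoch` is ours:

```python
import numpy as np

def sgd_epoch(X, y, w, step_size, batch_size):
    # reshuffle the observations so batches differ between epochs
    order = np.random.permutation(X.shape[0])
    X, y = X[order], y[order]

    batches_count = X.shape[0] // batch_size
    for i in range(batches_count):
        X_batch = X[i * batch_size:(i + 1) * batch_size]
        y_batch = y[i * batch_size:(i + 1) * batch_size]
        # MSE gradient for the batch
        gradient = 2 * X_batch.T.dot(X_batch.dot(w) - y_batch) / batch_size
        # update the weights: step against the gradient
        w = w - step_size * gradient
    return w
```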
You can find the computational complexity of SGD with the following definitions:
- $n$ — the number of observations in the whole training set
- $b$ — the batch size
- $p$ — the number of features

Computing the gradient for one batch takes on the order of $b \cdot p$ operations, and one epoch consists of $n / b$ batches, so a full epoch takes on the order of $n \cdot p$ operations.
SGD in Python
Declare the model class and add the class initializer (`__init__`):
```python
class SGDLinearRegression:
    def __init__(self):
        ...
```
Add one hyperparameter, `step_size`, to the class initializer:
```python
class SGDLinearRegression:
    def __init__(self, step_size):
        ...
```
Now we can pass the step size to the model when creating an instance of the class:
```python
# you can choose the step size arbitrarily
model = SGDLinearRegression(0.01)
```
Save the step size as an attribute:
```python
class SGDLinearRegression:
    def __init__(self, step_size):
        self.step_size = step_size
```
Here's the full implementation of the SGD algorithm:
```python
import numpy as np

class SGDLinearRegression:
    def __init__(self, step_size, epochs, batch_size):
        self.step_size = step_size
        self.epochs = epochs
        self.batch_size = batch_size

    def fit(self, train_features, train_target):
        # prepend a column of ones so that w[0] acts as the bias term
        X = np.concatenate(
            (np.ones((train_features.shape[0], 1)), train_features),
            axis=1
        )
        y = train_target
        w = np.zeros(X.shape[1])

        for _ in range(self.epochs):
            batches_count = X.shape[0] // self.batch_size
            for i in range(batches_count):
                begin = i * self.batch_size
                end = (i + 1) * self.batch_size
                X_batch = X[begin:end, :]
                y_batch = y[begin:end]

                # MSE gradient for the current batch
                gradient = 2 * X_batch.T.dot(X_batch.dot(w) - y_batch) / X_batch.shape[0]

                # step against the gradient
                w -= self.step_size * gradient

        self.w = w[1:]
        self.w0 = w[0]

    def predict(self, test_features):
        return test_features.dot(self.w) + self.w0
```
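A quick usage sketch on made-up data (the hyperparameter values are arbitrary):

```python
import numpy as np

np.random.seed(42)
train_features = np.random.rand(1000, 3)
# targets generated from known weights [2, -1, 0.5] and bias 3
train_target = train_features.dot(np.array([2.0, -1.0, 0.5])) + 3.0

model = SGDLinearRegression(step_size=0.01, epochs=100, batch_size=100)
model.fit(train_features, train_target)
print(model.predict(train_features[:5]))
print(train_target[:5])
```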
Linear regression regularization
Regularization reduces overfitting. For a linear regression model, regularization means limiting the magnitude of the weights. To see how large the weights are, we calculate the distance between the weight vector and the zero vector.
To limit the weight values, include the scalar product of the weights in the loss function formula:

$$L(w) = \frac{1}{n} (Xw - y)^\top (Xw - y) + w^\top w$$
The derivative of $w^\top w$ with respect to $w$ is equal to $2w$. Calculate the loss function gradient:

$$\nabla L(w) = \frac{2}{n} X^\top (Xw - y) + 2w$$
To control the magnitude of regularization, add the regularization weight to the loss function formula. It is denoted as $\lambda$:

$$L(w) = \frac{1}{n} (Xw - y)^\top (Xw - y) + \lambda\, w^\top w$$
The regularization weight also appears in the gradient calculation formula:

$$\nabla L(w) = \frac{2}{n} X^\top (Xw - y) + 2\lambda w$$
When we use the Euclidean distance for weight regularization, we call the model ridge regression.
Here's how the SGD implementation looks if we take regularization into account:
```python
import numpy as np

class SGDLinearRegression:
    def __init__(self, step_size, epochs, batch_size, reg_weight):
        self.step_size = step_size
        self.epochs = epochs
        self.batch_size = batch_size
        self.reg_weight = reg_weight

    def fit(self, train_features, train_target):
        # prepend a column of ones so that w[0] acts as the bias term
        X = np.concatenate(
            (np.ones((train_features.shape[0], 1)), train_features),
            axis=1)
        y = train_target
        w = np.zeros(X.shape[1])

        for _ in range(self.epochs):
            batches_count = X.shape[0] // self.batch_size
            for i in range(batches_count):
                begin = i * self.batch_size
                end = (i + 1) * self.batch_size
                X_batch = X[begin:end, :]
                y_batch = y[begin:end]

                gradient = 2 * X_batch.T.dot(X_batch.dot(w) - y_batch) / X_batch.shape[0]
                # regularization term 2 * lambda * w; the bias w[0] is not regularized
                reg = 2 * w.copy()
                reg[0] = 0
                gradient += self.reg_weight * reg

                w -= self.step_size * gradient

        self.w = w[1:]
        self.w0 = w[0]

    def predict(self, test_features):
        return test_features.dot(self.w) + self.w0
```
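Setting the regularization weight to zero turns regularization off and recovers the previous model. A quick sketch, reusing the made-up data from the earlier example:

```python
# nonzero reg_weight shrinks the weights toward zero (ridge regression)
ridge = SGDLinearRegression(step_size=0.01, epochs=100, batch_size=100, reg_weight=0.1)
ridge.fit(train_features, train_target)

# reg_weight=0 reproduces plain SGD linear regression
plain = SGDLinearRegression(step_size=0.01, epochs=100, batch_size=100, reg_weight=0.0)
plain.fit(train_features, train_target)
```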
Basics of neural networks
Here's an example of a neural network with three inputs ($x_1$, $x_2$, $x_3$) and two outputs ($a_1$, $a_2$):
The value at each output, or neuron, is calculated in the same way as a linear regression prediction:

$$a_1 = x_1 w_{11} + x_2 w_{21} + x_3 w_{31}$$
$$a_2 = x_1 w_{12} + x_2 w_{22} + x_3 w_{32}$$

Each output value has its own weights ($w_{11}, w_{21}, w_{31}$ for $a_1$ and $w_{12}, w_{22}, w_{32}$ for $a_2$).
Here's another example. The network has three inputs ($x_1$, $x_2$, $x_3$), two hidden variables ($h_1$ and $h_2$), and one output ($a$).
Values $h_1$ and $h_2$ are passed to the logistic function $\sigma(x)$:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
In a neural network, the logistic function plays the role of the activation function. It is applied inside a neuron after the input values are multiplied by the weights, before the neuron's output becomes an input for other neurons.
Each hidden variable is equal to the weighted sum of the input values:

$$h_1 = x_1 w_{11} + x_2 w_{21} + x_3 w_{31}$$
$$h_2 = x_1 w_{12} + x_2 w_{22} + x_3 w_{32}$$
For convenience, we denote the hidden variables $h_1$ and $h_2$ as the vector $h$, and the output neuron's weights as the vector $v$. Here's the formula for calculating the neural network prediction:

$$a = \sigma(h^\top v)$$
If we put the weights of several neurons into matrices, we can get an even more complex network, for example:

$$a = \sigma(\sigma(\sigma(x W_1)\, W_2)\, W_3)$$

where:
- $x$ — input vector with dimension $p$ (the number of features)
- $W_1$ — matrix with dimension $p \times m$ ($m$ is the number of neurons in the first hidden layer)
- $W_2$ — matrix with dimension $m \times k$ ($k$ is the number of neurons in the second hidden layer)
- $W_3$ — matrix with dimension $k \times 1$
- $a$ — model prediction (a single number)
When such a neural network calculates a prediction, it sequentially performs all the operations:

$$h^{(1)} = \sigma(x W_1), \quad h^{(2)} = \sigma(h^{(1)} W_2), \quad a = \sigma(h^{(2)} W_3)$$
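Here's a minimal NumPy sketch of this forward pass; the layer sizes ($p = 3$, $m = 4$, $k = 2$) and the random weights are made-up values for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
p, m, k = 3, 4, 2              # made-up layer sizes
W1 = rng.normal(size=(p, m))
W2 = rng.normal(size=(m, k))
W3 = rng.normal(size=(k, 1))

x = np.array([1.0, 2.0, 3.0])  # input vector with p features

# apply each layer in sequence: multiply by the weights, then activate
h1 = sigmoid(x.dot(W1))
h2 = sigmoid(h1.dot(W2))
a = sigmoid(h2.dot(W3))
print(a)  # single-number prediction
```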
Training of neural networks
To train a neural network, we need to state the training objective. Any neural network can be written as a function of its input vector and its parameters. Let's define the following:
- $X$ — training set features
- $P$ — the set of all neural network parameters
- $N(X, P)$ — the neural network function
Let's take this neural network:
Neural network parameters are the weights in its neurons:

$$P = (W_1, W_2, W_3)$$
Here's the neural network function:

$$N(X, P) = \sigma(\sigma(\sigma(X W_1)\, W_2)\, W_3)$$
Let’s also define:
- $y$ — training set answers
- $L$ — loss function (for example, MSE)
Then we can state the objective of neural network training as follows:

$$\min_P L\big(y,\, N(X, P)\big)$$
The minimum of this function can also be found using the SGD algorithm.
The neural network learning algorithm is the same as the SGD algorithm for linear regression; the only difference is that instead of the linear regression gradient, we calculate the gradient of the loss with respect to all the neural network parameters.
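As an illustration, here's a minimal sketch of one SGD step for a hypothetical single-neuron network $a = \sigma(Xw)$ trained with MSE; the gradient in the comment comes from applying the chain rule through the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sgd_step(X_batch, y_batch, w, step_size):
    # forward pass
    a = sigmoid(X_batch.dot(w))
    # chain rule: d(MSE)/dw = X^T [2(a - y) * a * (1 - a)] / n
    gradient = X_batch.T.dot(2 * (a - y_batch) * a * (1 - a)) / X_batch.shape[0]
    return w - step_size * gradient

# made-up data: 100 observations, 3 features, binary targets
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (X.sum(axis=1) > 0).astype(float)

w = np.zeros(3)
for _ in range(200):
    w = sgd_step(X, y, w, step_size=0.5)
```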