Linear Regression From the Inside
Linear regression model
In linear regression, the features are a vector of numbers in n-dimensional space (let's call it $x$). The model's prediction ($a$) is calculated as follows: take the dot product of the feature vector and the weight vector ($w$), then add the prediction bias ($w_0$) to this product:

$$a = x \cdot w + w_0$$
The vector $w$ and the scalar $w_0$ are the parameters of the model: there are $n$ parameters in the vector $w$, and one more in $w_0$.

If the length of the feature vector is equal to one, there is only one feature in the sample.
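As an illustration, here's a minimal NumPy sketch of this prediction; the feature values and weights below are made-up numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # feature vector, n = 3
w = np.array([0.5, -0.2, 0.1])  # weight vector: n parameters
w0 = 4.0                        # prediction bias: one more parameter

a = np.dot(x, w) + w0           # dot product plus bias
print(a)                        # 4.4
```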
With one feature, the prediction plot of linear regression is a straight line given by the equation:

$$a = w x + w_0$$

By changing the parameters $w$ and $w_0$, you can obtain any straight line.
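A quick sketch of this idea, with made-up parameter pairs: each pair of $w$ and $w_0$ defines its own line.

```python
import numpy as np

x = np.linspace(0, 10, 5)                # single feature values
for w, w0 in [(1.0, 0.0), (2.0, -3.0)]:  # made-up (slope, bias) pairs
    a = w * x + w0                       # each pair gives a different line
    print(w, w0, a)
```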
Training objective
Let's analyze the learning algorithm. Our quality metric will be MSE: the model should achieve its lowest value on the test data. The training objective is formulated as follows: find the model parameters for which the value of the loss function on the training set is minimal.
Let's write the training objective in vector form. The training set is represented as a matrix $X$, in which the rows correspond to observations and the columns to features. Denote the linear regression parameters as $w$ and $w_0$. To get the prediction vector $a$, multiply the matrix $X$ by the vector $w$ and add the prediction bias:

$$a = Xw + w_0$$
To shorten it, let's change the notation: add a column consisting only of ones to the matrix $X$ (it becomes column 0), and prepend the parameter $w_0$ to the vector $w$:

$$X = \begin{pmatrix} 1 & x_{11} & \dots & x_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m1} & \dots & x_{mn} \end{pmatrix}, \qquad w = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_n \end{pmatrix}$$

Then multiply the matrix by the vector: the prediction bias $w_0$ is multiplied by the column of ones (column zero), and we get the prediction vector $a$:

$$a = Xw$$
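In NumPy this reshaping might look like the sketch below; the data is a toy example, and np.hstack is used to prepend the ones column:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])           # 3 observations, 2 features
w = np.array([0.5, -0.5])
w0 = 1.0

a_long = X @ w + w0                  # a = Xw + w0

# prepend a column of ones and fold w0 into the weight vector
X1 = np.hstack([np.ones((X.shape[0], 1)), X])
w1 = np.concatenate([[w0], w])

a_short = X1 @ w1                    # a = Xw in the new notation
print(np.allclose(a_long, a_short))  # True
```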
Now we can introduce one more notation: $y$ is the vector of target values for the training set.
Write the training objective for linear regression with the MSE loss function:

$$w = \arg\min_w \text{MSE}(Xw, y)$$

The argmin() function finds the minimum of a function and returns the value of the argument (here, the weight vector $w$) at which that minimum is reached.
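To make the objective concrete, here's a small sketch that evaluates MSE for a candidate weight vector on toy data; the mse helper is our own, not a library function:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return np.mean((y_true - y_pred) ** 2)

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])   # column 0 is the ones column
y = np.array([2.0, 3.0, 4.0])

w = np.array([1.0, 1.0])     # candidate parameters
print(mse(y, X @ w))         # 0.0 -- this w minimizes the loss here
```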
Inverse matrix
An identity matrix is a square matrix with ones on the main diagonal and zeros elsewhere. If any matrix $A$ is multiplied by the identity matrix $E$, we get the same matrix $A$:

$$AE = EA = A$$

The inverse matrix of a square matrix $A$ is the matrix $A^{-1}$ (denoted with a superscript $-1$) whose product with $A$ is equal to the identity matrix. The multiplication can be performed in either order:

$$A A^{-1} = A^{-1} A = E$$
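The identity property is easy to check with NumPy's np.eye(); the matrix here is a made-up example:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
E = np.eye(2)                 # 2x2 identity matrix

print(np.allclose(A @ E, A))  # True: AE = A
print(np.allclose(E @ A, A))  # True: EA = A
```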
Matrices for which an inverse exists are called invertible. But not every matrix has an inverse; such a matrix is called non-invertible (singular).
Non-invertible matrices are rare. If you generate a random matrix with the numpy.random.normal() function, the probability of getting a non-invertible matrix is close to zero.
To find the inverse matrix, call the numpy.linalg.inv() function. It will also help you check a matrix for invertibility: if the matrix is non-invertible, the function raises an error.
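For example, a sketch with one invertible and one singular matrix (np.linalg.inv raises LinAlgError for the latter):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))  # True: A A^{-1} = E

B = np.array([[1.0, 2.0],
              [2.0, 4.0]])                # rows are linearly dependent
try:
    np.linalg.inv(B)
except np.linalg.LinAlgError as err:
    print('non-invertible:', err)         # Singular matrix
```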
Training linear regression
The formula for training linear regression is:

$$w = \arg\min_w \text{MSE}(Xw, y)$$

The minimum MSE value is obtained when the weights are equal to this value:

$$w = (X^T X)^{-1} X^T y$$
Here's how this formula is computed, step by step (see the sketch after the list):
- The transposed feature matrix $X^T$ is multiplied by itself: $X^T X$;
- The matrix inverse to the result is calculated: $(X^T X)^{-1}$;
- The inverse matrix is multiplied by the transposed feature matrix: $(X^T X)^{-1} X^T$;
- The result is multiplied by the vector of target values: $w = (X^T X)^{-1} X^T y$.
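A minimal sketch of this training step in NumPy, assuming $X$ already contains the ones column (the function name fit_linear_regression is ours):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Normal equation: w = (X^T X)^{-1} X^T y."""
    return np.linalg.inv(X.T @ X) @ X.T @ y

# toy data generated by the line y = 1 + 2x
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])   # column 0 is the ones column
y = np.array([3.0, 5.0, 7.0])

print(fit_linear_regression(X, y))  # [1. 2.] -- bias and slope recovered
```

In practice, np.linalg.solve(X.T @ X, X.T @ y) is numerically preferable to computing the inverse explicitly, but the version above mirrors the formula step by step.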