Knowledge Base

Distance Between Vectors

Scalar product

If we multiply the corresponding components of two vectors of the same size and then add up the products, we obtain the scalar product, or dot product. The result of this operation is a single number, called a scalar.

Here's the formula for the scalar product of two vectors $a=[x_1, x_2,\dots, x_n]$ and $b=[y_1, y_2,\dots, y_n]$:

a \cdot b = x_1 y_1 + x_2 y_2 + \dots + x_n y_n

The scalar product of vectors $a$ and $b$ is usually denoted by parentheses $(a, b)$ or a dot: $a \cdot b$.

In numpy, we can find the scalar product using the numpy.dot() function:

import numpy as np

vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])

dot_value = np.dot(vector1, vector2)

The matrix multiplication operator allows us to calculate the scalar product even more easily. The operator is denoted by @:

import numpy as np

volume = np.array([0.1, 0.3, 0.1])
content = np.array([0.4, 0.0, 0.1])

dot_value = volume @ content

There's also element-by-element multiplication. Unlike the scalar product, the result will be a vector:

import numpy as np

vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])

vector3 = vector1 * vector2

Planar distance

The vector length, or magnitude, equals the square root of the scalar product of the vector with itself.

|a|=\sqrt{(a,a)}=\sqrt{x^2+y^2}
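To illustrate, here's a sketch that computes the length of a made-up vector from the dot product and compares it with numpy's built-in np.linalg.norm():

```python
import numpy as np

a = np.array([3, 4])

# length as the square root of the dot product of the vector with itself
length = np.dot(a, a) ** 0.5
print(length)  # 5.0 (the 3-4-5 right triangle)

# np.linalg.norm() computes the same Euclidean length
print(np.linalg.norm(a))  # 5.0
```

Both approaches give the same result; np.dot(a, a) ** 0.5 simply spells out the formula above.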

To measure the distance between two points, we find the Euclidean distance. The Euclidean distance is the length of the shortest path between the points, calculated using the Pythagorean theorem.

The Euclidean distance can be written as $d_2(a, b)$. The $d$ carries subscript 2 to indicate that the coordinate differences are raised to the second power.

The distance between points $a(x_1, y_1)$ and $b(x_2, y_2)$ is calculated by the formula:

\begin{aligned}d_2(a, b)&=\sqrt{(b-a) \cdot (b-a)} \\&= \sqrt{(x_2-x_1)^2+(y_2-y_1)^2}\end{aligned}

Let's find the Euclidean distance between points $a=(5, 6)$ and $b=(1, 3)$:

import numpy as np

a = np.array([5, 6])
b = np.array([1, 3])
d = np.dot(a-b, a-b)**0.5
print('Distance between a and b is equal to', d)

scipy has a dedicated distance library for distance calculations. To find the Euclidean distance, call the distance.euclidean() function:

import numpy as np
from scipy.spatial import distance

a = np.array([5, 6])
b = np.array([1, 3])
d = distance.euclidean(a, b)
print('Distance between a and b is equal to', d)

The calculation results are the same.

Manhattan distance

The Manhattan distance, or city block distance, is the sum of the absolute values of the coordinate differences.

d_1(a,b)= |x_1 -x_2| + |y_1-y_2|

The Manhattan distance is written as $d_1(a, b)$. The $d$ carries subscript 1 to indicate that the coordinate differences are raised to the first power.

In scipy, the function for calculating the Manhattan distance is called distance.cityblock():

import numpy as np
from scipy.spatial import distance

a = np.array([5, 6])
b = np.array([1, 3])
d = distance.cityblock(a, b)
print('Distance between a and b is equal to', d)
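As a check, the same result can be computed directly from the formula by summing the absolute coordinate differences:

```python
import numpy as np

a = np.array([5, 6])
b = np.array([1, 3])

# |5-1| + |6-3| = 4 + 3 = 7
d = np.abs(a - b).sum()
print(d)  # 7
```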

To find the index of the minimum element in a numpy array, call the argmin() method.

index = np.array(distances).argmin() # minimum element index
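Putting these pieces together, here's a sketch with made-up points that finds which of several points lies closest to a target point:

```python
import numpy as np
from scipy.spatial import distance

target = np.array([0, 0])
points = [np.array([5, 6]), np.array([1, 3]), np.array([2, 2])]

# distance from the target to every candidate point
distances = [distance.euclidean(target, p) for p in points]

# index of the smallest distance, i.e. the nearest point
index = np.array(distances).argmin()
print('Nearest point:', points[index])  # [2 2]
```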

Distances in multidimensional space

In machine learning, vectors represent the features of observations. Usually the vectors are multidimensional rather than two-dimensional.

The Euclidean distance between vectors $a=(x_1, x_2,\dots, x_n)$ and $b=(y_1, y_2,\dots, y_n)$ is the square root of the sum of the squared coordinate differences.

\begin{aligned} d_2(a,b)&=\sqrt{(x_1-y_1)^2+(x_2-y_2)^2+\dots+(x_n-y_n)^2}\\ &=\sqrt{\sum_{i=1}^n(x_i-y_i)^2} \end{aligned}

The Manhattan distance is the sum of the absolute values of the coordinate differences.

\begin{aligned} d_1(a,b)&=|x_1-y_1|+|x_2-y_2|+\dots+|x_n-y_n|\\&=\sum_{i=1}^n |x_i-y_i| \end{aligned}

When the number of coordinates is more than two, we still use the familiar functions distance.euclidean() and distance.cityblock() to calculate distances in multidimensional space.
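For example, with made-up four-dimensional vectors, the same two function calls work unchanged:

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1, 2, 3, 4])
b = np.array([4, 2, 1, 0])

# sqrt((1-4)^2 + 0 + (3-1)^2 + (4-0)^2) = sqrt(29)
euclidean = distance.euclidean(a, b)

# |1-4| + 0 + |3-1| + |4-0| = 9
manhattan = distance.cityblock(a, b)

print(euclidean, manhattan)
```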

Nearest neighbors algorithm

How can we predict the class of a new observation? We can find the nearest object in the sample and take the answer from it. That's how the nearest neighbors algorithm works. Usually we look for the nearest observation in the training set.

The algorithm works both on the plane and in multidimensional space, in which case the distances are calculated using multidimensional formulas.
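The idea can be sketched in a few lines. This is a minimal illustration with made-up training data, not a production implementation; the function name nearest_neighbor_predict is our own:

```python
import numpy as np
from scipy.spatial import distance

def nearest_neighbor_predict(features_train, target_train, new_point):
    # distance from the new observation to every training observation
    distances = [distance.euclidean(new_point, row) for row in features_train]
    # the answer is the class of the closest training observation
    best_index = np.array(distances).argmin()
    return target_train[best_index]

features_train = np.array([[1.0, 1.0], [5.0, 5.0], [1.2, 0.8]])
target_train = np.array([0, 1, 0])

prediction = nearest_neighbor_predict(features_train, target_train, np.array([1.1, 1.1]))
print(prediction)  # 0, since [1.0, 1.0] is the closest training point
```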

Creating a model class

A class is a new data type with its own methods and attributes.

To create a new class, use the keyword class followed by the name of the class:

class ConstantRegression:
    # class content with four-space indentation
    # ...

To train the model, use the fit() method. It's a function inside the class, and its first parameter is always self. self is a variable that stores the model itself; it is needed for working with the attributes. The two other parameters are the features and target of the training set.

class ConstantRegression:
    def fit(self, features_train, target_train):
        # function content with 4+4-space indentation
        # ...

During training, we need to save the mean value of the target. To create a new attribute, add self and a dot to the beginning of the variable name. This way, we indicate that the variable belongs to the class:

class ConstantRegression:
    def fit(self, features_train, target_train):
        self.value = target_train.mean()

Let's use the predict() method to predict the answer, which is the saved mean:

import pandas as pd

class ConstantRegression:
    def fit(self, features_train, target_train):
        self.value = target_train.mean()

    def predict(self, new_features):
        answer = pd.Series(self.value, index=new_features.index)
        return answer
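Putting it all together, here's a self-contained usage sketch with made-up training data; the model simply predicts the training-set mean for every new observation:

```python
import pandas as pd

class ConstantRegression:
    def fit(self, features_train, target_train):
        # remember the mean of the target
        self.value = target_train.mean()

    def predict(self, new_features):
        # return the saved mean for every new observation
        return pd.Series(self.value, index=new_features.index)

features_train = pd.DataFrame({'x': [1, 2, 3]})
target_train = pd.Series([10, 20, 30])
features_test = pd.DataFrame({'x': [4, 5]})

model = ConstantRegression()
model.fit(features_train, target_train)
predictions = model.predict(features_test)
print(predictions)  # 20.0 for every row
```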