Distance Between Vectors
Scalar product
If we multiply the corresponding components of two vectors and then add up the products, we obtain the scalar product, or dot product. The result of this operation on two vectors of the same size is a single number rather than a vector. Such a number is called a scalar.
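Here's the formula for the scalar product of two vectors $a = (x_1, x_2, \dots, x_n)$ and $b = (y_1, y_2, \dots, y_n)$:
$$a \cdot b = x_1 y_1 + x_2 y_2 + \dots + x_n y_n$$
The scalar product of vectors $a$ and $b$ is usually denoted by parentheses, $(a, b)$, or a dot, $a \cdot b$.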
In numpy, we can find the scalar product using the numpy.dot() function:
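    import numpy as np

    vector1 = np.array([0.1, 0.3, 0.1])  # two example vectors of the same size
    vector2 = np.array([0.4, 0.0, 0.1])
    dot_value = np.dot(vector1, vector2)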
The matrix multiplication operator allows us to calculate the scalar product even more easily. The operator is denoted by @:
    import numpy as np

    vector1 = np.array([0.1, 0.3, 0.1])
    vector2 = np.array([0.4, 0.0, 0.1])

    dot_value = vector1 @ vector2
There's also element-by-element multiplication. Unlike the scalar product, the result will be a vector:
    import numpy as np

    vector3 = vector1 * vector2
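As a quick check of how these operations relate, here is a minimal sketch that reuses the example values from above (the result variable names are only illustrative). Summing the element-by-element products should give the same number as np.dot() and the @ operator, up to floating-point rounding.

```python
import numpy as np

vector1 = np.array([0.1, 0.3, 0.1])
vector2 = np.array([0.4, 0.0, 0.1])

elementwise = vector1 * vector2          # a vector of pairwise products
dot_from_sum = elementwise.sum()         # adding them up gives the scalar product
dot_from_dot = np.dot(vector1, vector2)  # the same value via numpy.dot()
dot_from_operator = vector1 @ vector2    # the same value via the @ operator

print(elementwise)
print(dot_from_sum, dot_from_dot, dot_from_operator)
```

The first print shows a three-element vector, while the second shows three scalars that agree with each other.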
Planar distance
The length of a vector (also called its magnitude or modulus) equals the square root of the scalar product of the vector with itself: $|a| = \sqrt{a \cdot a}$.
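A minimal sketch of this definition, using an arbitrary example vector (3, 4) that does not come from the text. It also calls np.linalg.norm(), numpy's built-in function for the same length, for comparison.

```python
import numpy as np

a = np.array([3, 4])  # arbitrary example vector

length_from_dot = np.dot(a, a) ** 0.5  # square root of the scalar product of a with itself
length_from_norm = np.linalg.norm(a)   # numpy's built-in Euclidean norm

print(length_from_dot, length_from_norm)
```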
To measure the distance between two points, we find the Euclidean distance. It is the length of the straight line segment between the points, that is, the shortest distance, and it is calculated using the Pythagorean theorem.
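The Euclidean distance can be written as $d_2(a, b)$. The $d$ carries subscript 2 to indicate that the coordinate differences are raised to the second power.
The distance between points $a(x_1, y_1)$ and $b(x_2, y_2)$ is calculated by the formula:
$$d_2(a, b) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$
Let's find the Euclidean distance between points $a = (5, 6)$ and $b = (1, 3)$: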
    import numpy as np

    a = np.array([5, 6])
    b = np.array([1, 3])
    d = np.dot(a - b, a - b) ** 0.5
    print('Distance between a and b is equal to', d)
scipy has a dedicated distance module for distance calculations. Call the distance.euclidean() function to find the Euclidean distance:
    import numpy as np
    from scipy.spatial import distance

    a = np.array([5, 6])
    b = np.array([1, 3])
    d = distance.euclidean(a, b)
    print('Distance between a and b is equal to', d)
The calculation results are the same.
Manhattan distance
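Manhattan distance, or city block distance, is the sum of the absolute values of the coordinate differences.
For points $a(x_1, y_1)$ and $b(x_2, y_2)$, the Manhattan distance is written as $d_1(a, b) = |x_1 - x_2| + |y_1 - y_2|$. The $d$ carries subscript 1 to indicate that the coordinate differences are raised to the first power.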
In scipy, the function for Manhattan distance calculation is called distance.cityblock():
    import numpy as np
    from scipy.spatial import distance

    a = np.array([5, 6])
    b = np.array([1, 3])
    d = distance.cityblock(a, b)
    print('Distance between a and b is equal to', d)
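To connect the function call with the definition, the sketch below computes the same Manhattan distance by hand with numpy and compares it with distance.cityblock(). The variable names manual and from_scipy are only illustrative.

```python
import numpy as np
from scipy.spatial import distance

a = np.array([5, 6])
b = np.array([1, 3])

manual = np.abs(a - b).sum()           # |5 - 1| + |6 - 3|
from_scipy = distance.cityblock(a, b)  # the same sum computed by scipy

print(manual, from_scipy)
```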
To find the index of the minimum element in a numpy array, call the argmin() method.
    index = np.array(distances).argmin()  # minimum element index
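Here is a small sketch of how argmin() can be combined with the distance functions. The point, the candidate coordinates, and their names are made up for illustration. We build a list of Manhattan distances from one point to several candidates and take the index of the smallest one.

```python
import numpy as np
from scipy.spatial import distance

point = np.array([5, 6])
candidates = np.array([[1, 3], [4, 8], [9, 9]])  # hypothetical points

# Manhattan distance from the point to every candidate
distances = []
for candidate in candidates:
    distances.append(distance.cityblock(point, candidate))

index = np.array(distances).argmin()  # index of the nearest candidate
print('Distances:', distances)
print('Nearest candidate index:', index)
```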
Distances in multidimensional space
In machine learning, each observation is described by a vector of its features. These vectors are usually multidimensional rather than two-dimensional.
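The Euclidean distance between vectors $a = (x_1, x_2, \dots, x_n)$ and $b = (y_1, y_2, \dots, y_n)$ is the square root of the sum of the squared coordinate differences:
$$d_2(a, b) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
The Manhattan distance is the sum of the absolute values of the coordinate differences:
$$d_1(a, b) = |x_1 - y_1| + |x_2 - y_2| + \dots + |x_n - y_n| = \sum_{i=1}^{n} |x_i - y_i|$$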
When the number of coordinates is more than two, we still use the familiar functions distance.euclidean() and distance.cityblock() to calculate distances in multidimensional space.
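As an illustration, the sketch below applies both functions to two four-dimensional vectors and compares the results with a direct numpy calculation. The vectors are hypothetical values chosen for the example.

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1, 2, 3, 4])  # hypothetical observations with four features
b = np.array([2, 4, 3, 0])

euclidean_manual = ((a - b) ** 2).sum() ** 0.5  # square root of the sum of squared differences
manhattan_manual = np.abs(a - b).sum()          # sum of absolute differences

print(euclidean_manual, distance.euclidean(a, b))
print(manhattan_manual, distance.cityblock(a, b))
```

Each line should print two matching values, which shows that the scipy functions do not depend on the number of coordinates.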
Nearest neighbors algorithm
Take a look at the picture below. How can we predict the class of a new observation? We can find the nearest object in the sample and use its class as the answer. That's how the nearest neighbors algorithm works. Usually we look for the nearest observation in the training set.
The algorithm works both on the plane and in multidimensional space, in which case the distances are calculated using multidimensional formulas.
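A minimal sketch of this idea, assuming Euclidean distance and a single nearest neighbor. The function name nearest_neighbor_predict and the tiny training set are hypothetical and serve only to illustrate the steps: compute the distance to every training observation, find the smallest one, and return its target.

```python
import numpy as np


def nearest_neighbor_predict(train_features, train_target, new_features):
    """Return the target of the training observation closest to new_features."""
    differences = train_features - new_features        # one row per training observation
    distances = (differences ** 2).sum(axis=1) ** 0.5  # Euclidean distance to each row
    return train_target[distances.argmin()]


# Hypothetical training set: three features per observation, two classes.
train_features = np.array([
    [1.0, 1.0, 1.0],
    [1.5, 0.5, 1.0],
    [8.0, 9.0, 7.0],
    [9.0, 8.5, 8.0],
])
train_target = np.array([0, 0, 1, 1])

prediction = nearest_neighbor_predict(
    train_features, train_target, np.array([8.5, 8.0, 7.5])
)
print(prediction)
```

Swapping the distance line for a sum of absolute differences would give the Manhattan version of the same sketch.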
Creating model class
A class is a new data type with its own methods and attributes.
To create a new class, write the keyword class followed by the name of the class:
    class ConstantRegression:
        # class content with four-space offset
        # ...
        pass  # placeholder so that the empty class body is valid Python
To train the model, use the fit() method.
It is a function inside the class, and its first parameter is always self. self is a variable that stores the model itself. It is needed for working with the attributes. The two other parameters are the features and the target of the training set.
    class ConstantRegression:
        def fit(self, features_train, target_train):
            # function content with 4+4-offset
            # ...
            pass  # placeholder so that the empty method body is valid Python
In the process of training, we need to save the mean value of the target. To create a new attribute named value, add self with a dot to the beginning of the variable name. This way, we indicate that the variable is inside the class:
    class ConstantRegression:
        def fit(self, features_train, target_train):
            self.value = target_train.mean()
Let's use the predict() method to predict the answer, which is the saved mean:
    import pandas as pd

    class ConstantRegression:
        def fit(self, features_train, target_train):
            self.value = target_train.mean()

        def predict(self, new_features):
            # repeat the saved mean for every row of the new features
            answer = pd.Series(self.value, index=new_features.index)
            return answer
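To show how such a class might be used, here is a short sketch that trains the model and makes predictions. The column name area and all the numbers are made up for illustration, and the class is repeated so that the example runs on its own.

```python
import pandas as pd


class ConstantRegression:
    def fit(self, features_train, target_train):
        self.value = target_train.mean()

    def predict(self, new_features):
        return pd.Series(self.value, index=new_features.index)


# Hypothetical data: the column name and values are made up for illustration.
features_train = pd.DataFrame({'area': [30, 45, 60]})
target_train = pd.Series([100.0, 150.0, 200.0])
new_features = pd.DataFrame({'area': [50, 80]}, index=[10, 11])

model = ConstantRegression()
model.fit(features_train, target_train)      # saves the mean of the target
predictions = model.predict(new_features)    # one prediction per new row
print(predictions)
```

Every prediction equals the saved mean, and the resulting Series keeps the index of new_features, so the answers stay aligned with the rows they describe.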