Cluster Analysis
Task of cluster analysis
Cluster analysis (clustering) is the task of combining similar observations into groups, or clusters.
Most clustering methods determine the similarity or difference of observations based on the distance between them: the further the observations are from each other, the less similar they are, and vice versa. Clustering should not be confused with classification, where the classes are known in advance and every training observation already has a label.
Clustering is used in business for various situations:
- It can be used for segmenting users or products. This can be closely connected to recommendation systems.
- It can also detect fraud by spotting atypical behavior (for example, generating fake clicks or likes on social media).
K-means clustering
The most popular clustering algorithm is the k-means method. The key concept of this algorithm is a centroid, or the center of a cluster. Which cluster an observation ends up in depends on which centroid it is closest to. Each cluster has its own centroid, which is calculated as the arithmetic mean of the clustered observations.
K-means segments the observations step by step, so it is an iterative algorithm. Let's analyze how it works for a given number of clusters, k:
- The algorithm randomly assigns a cluster number, from 1 to k, to each observation.
- The centroid of each cluster is calculated.
- Each observation is assigned the number of the cluster whose centroid is closest to the observation.
- The last two steps are repeated until the cluster assignments stop changing.
Another condition for stopping the algorithm's work is the parameter for the maximum number of iterations (max_iter).
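Here is a minimal sketch of this loop in plain NumPy, just to make the steps concrete. It assumes data is a 2-D array of observations and ignores edge cases such as empty clusters; in practice we'll use sklearn's KMeans, shown below.

import numpy as np

def k_means_sketch(data, k, max_iter=300, seed=12345):
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign a cluster number (0 to k-1) to each observation
    labels = rng.integers(0, k, size=len(data))
    centroids = None
    for _ in range(max_iter):
        # Step 2: the centroid of each cluster is the mean of its observations
        centroids = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        # Step 3: reassign each observation to the cluster with the closest centroid
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop when the assignments no longer change (or when max_iter is reached)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids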
The average value of vectors is the product of the sum of all the vectors and $\frac{1}{n}$, where $n$ is the number of vectors. The centroid value ($c$) is calculated the same way:

$$c = \frac{1}{n} \sum_{i=1}^{n} a_i$$

Here, the vectors are added coordinate by coordinate: the first coordinate of the resulting vector is the sum of the first coordinates of all the vectors, the second coordinate is the sum of the second coordinates, and so on up to coordinate $m$ (the vector dimension).
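For instance, here is a toy centroid calculation for three made-up three-dimensional vectors (not tied to the lesson's dataset):

import numpy as np

vectors = np.array([
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [5.0, 6.0, 7.0],
])
# Sum the vectors coordinate by coordinate and multiply by 1/n
centroid = vectors.sum(axis=0) / len(vectors)
print(centroid)               # [3. 4. 5.]
print(vectors.mean(axis=0))   # the same result with mean()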
To solve the clustering task, we're going to use approximate feature values of a typical observation in each segment. We need them to tell the algorithm where to look for clusters. We will pass the values of these features as the input to k-means in order to set the initial centroids (an optional parameter).
If you pass initial centroids to the k-means algorithm, Python will return a RuntimeWarning. To avoid it, block this output with filterwarnings(). We will deal with the warning later, but for now let's add these lines to the code:
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)
Train the k-means algorithm:
from sklearn.cluster import KMeans
import numpy as np
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)

# n_clusters - number of clusters
model = KMeans(n_clusters=3, random_state=12345)
model.fit(data)

# Obtaining cluster centroids
print("Cluster centroids:")
print(model.cluster_centers_)

# init - initial centroids
centers = np.array([[20, 80, 8], [50, 20, 5], [20, 30, 10]])
model = KMeans(n_clusters=3, init=centers, random_state=12345)
model.fit(data)

print("Cluster centroids of the model with initial centroids:")
print(model.cluster_centers_)
Objective function
In supervised learning tasks, the loss function shows the difference between the prediction and the correct answer. When the correct answers are unknown, an objective function is needed to assess the quality of the model in unsupervised learning.
In supervised learning, you have to find the model parameters that give the smallest value of the loss function on the training set. The loss function is a special case of an objective function: we need to find its minimum value on the available training set.
With the k-means algorithm, we need to distribute observations in such a way that the distances between them inside one cluster (intra-cluster distances) are minimal. The objective function of the algorithm is defined as the sum of intra-cluster distances, and the goal of training is to minimize this sum.
Let's calculate the objective function of the k-means method. First, however, we will replace the mean over an arbitrary set of observations in that formula with the mean over the observations of one cluster. That is, the centroid of each cluster ($c_j$) is calculated as the average of all observations in the cluster:

$$c_j = \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i$$

- $|S_j|$ is the number of points in the cluster.
- $x_i$ defines observations; in this case, they are vectors of size $m$.
- $x_i \in S_j$ signifies that the observation $x_i$ belongs to the cluster $S_j$.
In order to find the nearest centroid for each observation and assign a cluster number to it, the Euclidean distance from the observation to the centroid of each cluster is calculated ($d(x_i, c_j)$ equals the root of the sum of the squared differences of the coordinates). Then the smallest distance is chosen.
The formula is written as follows:

$$d(x_i, c_j) = \sqrt{\sum_{l=1}^{m} (x_{il} - c_{jl})^2}, \qquad j^\ast = \arg\min_{j}\, d(x_i, c_j)$$

The value of $j$ at which the minimum is reached will be the number of the nearest cluster.
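As a small illustration, here is that rule applied by hand to one made-up observation and two of the centroid values used earlier in this lesson (a sketch, not part of the training code):

import numpy as np

x = np.array([20, 30, 10])                        # one observation
centroids = np.array([[20, 80, 8], [50, 20, 5]])  # centroids of clusters 0 and 1

# Euclidean distance from the observation to each centroid
distances = np.sqrt(((x - centroids) ** 2).sum(axis=1))
print(distances)
# The index of the smallest distance is the number of the nearest cluster
print(distances.argmin())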
When the observation is assigned the number of the nearest cluster, $j^\ast$, you can proceed to the calculation of the intra-cluster distances, calculated as a sum of squared distances from each observation in the cluster to the centroid:

$$W_j = \sum_{x_i \in S_j} d(x_i, c_j)^2$$
At the beginning of the algorithm, cluster numbers are assigned to observations randomly, and there is significant intra-cluster deviation. At the end of the algorithm, all observations with the same cluster number are combined, and the boundaries of the clusters become visible.
The final objective function of the algorithm is calculated as the sum of intra-cluster deviations:

$$J = \sum_{j=1}^{k} W_j = \sum_{j=1}^{k} \sum_{x_i \in S_j} d(x_i, c_j)^2$$
The value of the objective function is stored in the inertia_ attribute.
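If you'd like to verify the formula, you can recompute the sum of squared distances by hand and compare it with inertia_. A sketch, assuming data is the numeric feature table used above and model is a fitted KMeans instance:

import numpy as np

# Centroid of the cluster each observation was assigned to
assigned_centroids = model.cluster_centers_[model.labels_]
# Sum of squared distances from each observation to its cluster's centroid
manual_objective = ((np.asarray(data) - assigned_centroids) ** 2).sum()

print(manual_objective)
print(model.inertia_)  # should be (almost) the same value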
Local minimum
Let's take a closer look at the RuntimeWarning. If you remove the warning filter, training the model looks like this:
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3, init=centers, random_state=12345)
model.fit(data)

print("The objective function of the model with initial centroids:")
print(model.inertia_)
The objective function of the model with initial centroids:
74253.20363562103
/usr/local/lib/python3.6/site-packages/sklearn/cluster/k_means_.py:969:
RuntimeWarning: Explicit initial center position passed: performing only one init
in k-means instead of n_init=10
  return_n_iter=True)
The filterwarnings() lock on the warning output hides the message that the number of algorithm launches (n_init) with initial centroids is now equal to one.
Each launch starts with the algorithm randomly assigning cluster numbers to observations, so every launch begins from its own initial set of observations in each cluster. At the next launch, the same observations are assigned new numbers, so the value of the objective function produced by the run changes as well.
At each launch of the algorithm, the objective function is obtained as a minimum for a specific initial set of observations in the cluster. This minimum is called local. When starting on another initial set, the function value may change. This will be the new local minimum.
The n_init parameter is set to 10 by default. The algorithm is launched 10 times with different initial clusters, and the lowest of all the local minima is selected.
That's why the objective function of the model with initial centroids turns out larger: the algorithm is launched only once. With a 10-fold launch, there's a higher chance of finding more suitable initial centroids.
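To see the difference on your own data, you can compare a single launch from the fixed centroids with the default ten random launches (a sketch reusing the data and centers variables defined above; the warning about a single init is expected for the first model):

from sklearn.cluster import KMeans

# One launch, starting from the manually chosen centroids
model_single = KMeans(n_clusters=3, init=centers, random_state=12345)
model_single.fit(data)

# Ten launches with different random initializations (the default behavior)
model_multi = KMeans(n_clusters=3, n_init=10, random_state=12345)
model_multi.fit(data)

print("Single launch:", model_single.inertia_)
print("Best of 10 launches:", model_multi.inertia_)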
Let's analyze one more parameter: max_iter. The more iterations there are, the closer we get to the local minimum.
Here are the results of two launches of the algorithm on synthetic data with the max_iter parameter set to 1. The data is generated as two groups of points. The initial centroids are marked on the graphs, influencing the distribution of points in the clusters.
On the left, the initial centroids ended up roughly in the cluster centers. On the right, the initial centroids have moved away from the clusters, and the clusters have shifted as well. The objective function is larger than the one on the left.
As the number of iterations increases to four, the objective function values become the same:
The default number of iterations is 300, which is enough for most tasks.
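If you want to check how max_iter affects the result on your data, a quick sketch is to train the same model with several values and compare the objective function (with max_iter=1 a convergence warning may appear):

from sklearn.cluster import KMeans

for max_iter in (1, 4, 300):
    test_model = KMeans(n_clusters=3, max_iter=max_iter, random_state=12345)
    test_model.fit(data)
    print(max_iter, test_model.inertia_)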
Data visualization
In order to visualize the data about the service's users, let's draw a chart using the pairplot function from the seaborn library. The feature distributions are on the diagonal: purchase, timespent, and months. There are also scatter plots for all pairs of features. The type of the diagonal graph is determined by the diag_kind parameter:
import seaborn as sns
sns.pairplot(data, diag_kind='hist')
The points grouped together on the graphs are the future clusters.
- Four groups of points are clearly distinct on the projection of the feature pair purchase (average check) and timespent (average session time).
- The other projections have two or three groups of points.
- The cluster of observations with high purchase values is noticeably separated from the other points.
Let's add a cluster fill to the graph of a model trained without initial centroids. Fill rules are set by the hue parameter. It accepts an array of string values as input, so let's convert the cluster numbers to an array of strings. Then let's name the centroids so we can add them to the graph, and merge all the data.
import pandas as pd
from sklearn.cluster import KMeans
import seaborn as sns

centroids = pd.DataFrame(model.cluster_centers_, columns=data.columns)
# Add a column with the cluster number
data['label'] = model.labels_.astype(str)
centroids['label'] = ['0 centroid', '1 centroid', '2 centroid']
# An index reset will be needed later
data_all = pd.concat([data, centroids], ignore_index=True)

# Plot the graph
sns.pairplot(data_all, hue='label', diag_kind='hist')
Our observations were confirmed! A group of points with high purchase values ended up in a separate cluster. We have identified a premium segment.
The basic and advanced segments are more complicated. We'll need to add an additional layer with the initial centroids to the chart to determine the clusters suitable for them. To do this, save the PairGrid instance corresponding to the graph:
pairgrid = sns.pairplot(data_all, hue='label', diag_kind='hist')
Pass the additional values needed for plotting the graph using the data attribute:
pairgrid.data = pd.DataFrame(
    [
        [20, 80, 8],
        [50, 20, 5],
        [20, 30, 10]
    ],
    columns=data.drop(columns=['label']).columns)
Let's call one more method: map_offdiag. It plots the data from pairgrid.data on the projections outside the diagonal. The func parameter defines the graph type, s sets the point size, marker determines the shape of the points, and color sets their color:
pairgrid.map_offdiag(func=sns.scatterplot, s=200, marker='*', color='red')
Optimal number of clusters
The objective function of the k-means method decreases as the number of clusters increases. If each observation has its own cluster, the intra-cluster distances are equal to zero, and the objective function reaches its minimum.
There are only three features in our task, which is why we've got six paired projections on the pairplot graph that show us the data in general. There can be tens or even hundreds of features in other tasks, which makes it difficult to cover them all with a pairplot graph. Also, our initial data had a structure (i.e. you could already see groups of points). As a result of the algorithm's work, they became clusters.
But the data is not always distributed so clearly; the elbow method can help us here. To plot a graph using this method, you'll need a list of objective function values for different numbers of clusters, for example from 1 to 10 (usually fewer than 20). Train the model several times and save the objective function value for each model in the distortion list:
distortion = []
K = range(1, 8)
for k in K:
    model = KMeans(n_clusters=k, random_state=12345)
    model.fit(data)
    distortion.append(model.inertia_)
Let's display the resulting list on the graph:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

plt.figure(figsize=(12, 8))
plt.plot(K, distortion, 'bx-')
plt.xlabel('Number of clusters')
plt.ylabel('Objective function value')
plt.show()
The value of the objective function decreases sharply at first, and then eases into a plateau. This moment of transition signifies the optimal number of clusters.
In this graph, the plateau starts after the fourth cluster, although the second and third clusters are acceptable as well. After these, the decline of the objective function is not significant.
Let's move on to our task and calculate the values of the objective function for models trained on different numbers of clusters.
Result interpretation
By working with effectively segmented data, you've learned how the number of clusters is connected to the objective function. Now it's time to figure out how k-means works on different types of data.
Below are synthetic data and clustering algorithms from the sklearn documentation.
K-means is in the leftmost column. This method works best on well-defined data groups of similar size, like those in the fifth row. K-means finds clusters even where there aren't any:
The third row shows how some of the objects from the large cluster move to a smaller segment nearby. This is called cluster splitting. The rest of the rows show that the method poorly separates close-lying elongated groups.
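You can reproduce the last effect yourself on synthetic data, for example with make_moons from sklearn (a rough sketch; the documentation comparison uses several different generators):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two close-lying, elongated (crescent-shaped) groups of points
X, _ = make_moons(n_samples=500, noise=0.05, random_state=12345)

moons_model = KMeans(n_clusters=2, random_state=12345)
labels = moons_model.fit_predict(X)

# K-means separates the crescents with a straight boundary instead of following their shape
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.show()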
The sklearn library takes these characteristics of the k-means algorithm into account. For example, if the algorithm fits high-dimensional data poorly, you can increase the max_iter parameter.
You can determine the number of iterations the algorithm performed by calling the n_iter_ attribute.
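For example, for any fitted KMeans instance from this lesson:

# How many iterations the (best) launch of the algorithm performed
print(model.n_iter_)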
Structuring data
Clustering can also be used in tasks with data labeling. It will allow you to see the structure in the data and understand which features are more important.
In this dataset for classifier training, you know the target feature used to distribute observations into clusters. The distribution of observations by groups looks like this:
You can see some insights in the graph. To analyze it further, we can train the k-means algorithm without the column with countries of manufacture.
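A possible sketch of that step, assuming the dataset is loaded into a DataFrame named df and the country of manufacture is stored in a column named 'country' (both names, and the number of clusters, are hypothetical here):

from sklearn.cluster import KMeans

# Drop the known target feature so the algorithm sees only the remaining attributes
features = df.drop(columns=['country'])   # hypothetical column name

model = KMeans(n_clusters=3, random_state=12345)  # the cluster count is an assumption
model.fit(features)
print(model.cluster_centers_)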