Anomaly Detection

Anomalies

Anomalies, or outliers, are observations with abnormal properties. They can point to a problem in the data or signal that something out of the ordinary has happened. Some anomalies can be spotted by eye or anticipated in advance.

Since outliers are usually unpredictable and uncommon, few or no anomalies show up in the training data.

Boxplot

A boxplot is also called a box-and-whisker plot because of the lines that extend from the boxes like whiskers.

The lower and upper boundaries of the box mark the first and third quartiles (25% and 75% of the values); the median (50% of the values) is drawn inside the box. The "whiskers" stretch up and down from the box's borders to a distance of at most 1.5 interquartile ranges (IQR) and indicate variability outside the lower and upper quartiles. Points beyond the whiskers are plotted individually as outliers.

The IQR is calculated this way:

\text{IQR} = \text{Q}_3 - \text{Q}_1

The formula for the lower fence (here, k is the coefficient for interquartile ranges, typically set to 1.5):

\text{L} = \text{Q}_1 - k \cdot \text{IQR}

The formula for the upper fence:

\text{R} = \text{Q}_3 + k \cdot \text{IQR}

The higher the coefficient k, the further out the fences lie and the fewer points get flagged as outliers.
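
As a quick illustration, the fences can be computed directly with NumPy (a minimal sketch with made-up numbers):

import numpy as np

values = np.array([2, 3, 4, 5, 5, 6, 7, 8, 30])  # 30 looks suspicious

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
k = 1.5

# Everything outside [lower_fence, upper_fence] is flagged as an outlier
lower_fence = q1 - k * iqr
upper_fence = q3 + k * iqr
print(values[(values < lower_fence) | (values > upper_fence)])  # [30]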

The diagram also stores information about all of the outliers. It can be found in the "fliers" list inside the dictionary that plt.boxplot() returns. We can see how many outliers there are by calling the get_data() method; the x and y values it returns are separated by indices.

import matplotlib.pyplot as plt

boxplot = plt.boxplot(df['column'].values)

# "fliers" holds the outlier points; get_data() returns the (x, y)
# arrays, and index 1 selects the y values, i.e. the outliers themselves
outliers = list(boxplot["fliers"][0].get_data()[1])
print("Outliers:", len(outliers))

Isolation forest

The isolation forest is an ensemble method. Its calculations are based on the averaged estimates of several decision trees. The tree nodes contain the decision rules that assign each observation to a specific branch.

This method relies on the fact that anomalies can be isolated from the rest by a small number of decision rules.

An isolation tree is built like a decision tree, but its decision rules are chosen at random. Observations that end up isolated at a shallow depth were easy to separate and are considered anomalous, while the rest are considered normal. The anomaly estimates are collected from all the trees and averaged.
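
To build intuition, here is a toy one-dimensional sketch of the isolation idea (made-up data and a deliberately simplified procedure, not sklearn's implementation): a point is repeatedly separated from the rest by random splits, and the number of splits needed is its depth.

import random

def isolation_depth(values, target, max_depth=50):
    # Split at random thresholds, keeping only the side containing
    # `target`, until it is alone; fewer splits = easier to isolate
    current = list(values)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        split = random.uniform(min(current), max(current))
        current = [v for v in current if (v < split) == (target < split)]
        depth += 1
    return depth

toy_data = [10, 11, 12, 13, 14, 15, 16, 95]  # 95 is the obvious outlier

# Averaged over many runs, the outlier isolates at a much shallower depth
for point in (13, 95):
    avg = sum(isolation_depth(toy_data, point) for _ in range(200)) / 200
    print(point, round(avg, 1))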

Here is how an isolation forest can be trained with the sklearn library. Import the IsolationForest() class from the sklearn.ensemble module:

from sklearn.ensemble import IsolationForest

Let's create a model and record the number of trees in the n_estimators parameter. The more trees there are, the more accurate the results will be:

isolation_forest = IsolationForest(n_estimators=100)

If the model is trained on a single feature, sklearn expects a two-dimensional array, so reshape the column first:

sales = df['column'].values.reshape(-1, 1)

Selecting anomalies by one feature won't give you an accurate representation of the entire dataset. An isolation forest detects outliers based on several features:

data = df[['column1', 'column2']]

Further model training is the same for both one-dimensional and multidimensional data. Now let's train the model on sales and profit data with fit():

isolation_forest.fit(data)

Calling decision_function() will show us how exactly the model scored the observations:

anomaly_scores = isolation_forest.decision_function(data)

Anomaly estimates vary between -0.5 and 0.5. A lower estimate indicates a higher chance that the observation is an outlier.
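
For example, you can sort the scores to look at the most suspicious observations first (a small sketch, assuming data is the DataFrame from above):

import numpy as np

# argsort() in ascending order puts the lowest (most anomalous) scores first
most_anomalous = np.argsort(anomaly_scores)[:5]
print(data.iloc[most_anomalous])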

To calculate the number of anomalies, let's call the predict() method, which classifies observations and distinguishes the normal ones from the outliers. If the observation class is 1, the observation is normal; if it's -1, it's an outlier.

estimator = isolation_forest.predict(data)
print("Anomalies:", (estimator == -1).sum())

If the anomaly estimates aren't necessary, you can train the model and get the classification by using fit_predict():

estimator = isolation_forest.fit_predict(data)

KNN-based anomaly detection method

There is another way to find anomalies in multidimensional data: the k-nearest neighbors (KNN) algorithm. It treats each observation in the dataset as a vector and searches for anomalies in multidimensional space. The further an observation is from its neighbors, the higher the probability that it's an outlier.

The KNN() class is located in the PyOD (Python toolkit for detecting outlying objects) library. Import it from the pyod.models.knn module:

from pyod.models.knn import KNN

Let's create the model and train it on the set by calling the fit() method:

model = KNN()
model.fit(data)
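
If you need more control, KNN() accepts parameters such as n_neighbors (how many neighbors to consider, 5 by default) and contamination (the expected share of outliers); after fitting, the raw outlier scores for the training set are stored in the decision_scores_ attribute. A short sketch:

model = KNN(n_neighbors=5, contamination=0.1)
model.fit(data)

# The larger the distance to the neighbors, the higher the score
print(model.decision_scores_[:5])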

Now let's classify the observations by calling the predict() method:

predictions = model.predict(data)

The predict() method will return a list where "1" indicates an anomaly and "0" means the observation is part of the general trend.
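
Since anomalies are labeled 1, counting them takes one line (using the predictions from above):

print("Anomalies:", predictions.sum())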
