Knowledge Base

Chapter Summary: Solving Tasks Related to Machine Learning

Task Statements

Formulating a business task

When you're formulating a task, the main questions to be answered are:

  • What are we forecasting and what business task does this solve?
  • What data do we have?
  • Who is going to use our model and how (how is it integrated into business processes)?
  • What results do you hope to achieve?
    • How has the problem been solved before now? Was that method successful?
    • What business effect could your model have?
    • Do you have enough resources (time, people, finance, computing resources)?
    • What metrics will you use to evaluate your model's performance?
    • Are there any successful benchmarks (existing examples in the market) for the application of machine learning for similar tasks?

Defining the task and translating it into the language of machine learning

If you're able to give clear answers to all these questions and confirm that you really need to use machine learning, then translate the business statement into the ML language. This involves:

  • Identifying the type of the task (supervised/unsupervised learning, classification/regression/clustering)
  • Features:
    • What data to select as features?
    • What's the sample size?
    • What's the time period?
    • How good is this data?
    • Does the data have a time structure?
  • Metrics you'll use to optimize and evaluate your model
  • Whether the model has limitations:
    • Speed
    • Result interpretability
    • Accuracy
    • Time needed for development
  • The algorithms you're going to use (informed by all the previous points)

EDA: Analyzing the Quality of Features

First, you define the task and the data you're going to use, then obtain the data itself (perhaps as a CSV file). Next, you study it.

The goals of exploratory data analysis are:

  1. To evaluate the quality of the data and the volume of preprocessing required
  2. To examine the distributions and mutual correlations and detect anomalies, if they're present
  3. To formulate initial hypotheses regarding the features or the target variable

Here are some useful questions that you need to answer when evaluating the quality of data:

  • How large is the dataset?
  • What features does it include? If the feature names are anonymized, it's still worth finding out what each feature actually means in order to get deeper into the business process.
  • What types of features do you have? Usually they're either numerical or categorical.
  • Does the target variable have a time structure? This determines which methods for splitting data into train and validation sets you'll be able to apply in future stages and which derivative features you'll be able to use.
  • How many missing values are there? At this stage you can decide how to process missing values, either to remove them from the dataset or fill them in from the "past" or "future."
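A minimal sketch of these first checks in pandas; the file name 'data.csv' is hypothetical:

    import pandas as pd

    df = pd.read_csv('data.csv')  # hypothetical file name

    print(df.shape)         # how large is the dataset?
    print(df.dtypes)        # which features are numerical and which are categorical?
    print(df.head())        # a first look at the features themselves
    print(df.isna().sum())  # number of missing values per column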

EDA: Formulating Hypotheses

Once you've gotten a rough sense of your features, you can now take a closer look at them.

  • Study the distribution of numeric features. Plot histograms for them (see the sketch at the end of this list).

  • If your data has a time dimension, plot the distribution graphs of values over time.

  • Use the distribution of features and target variables to determine if outliers are present.

  • Calculate a correlation matrix and plot a heatmap based on it.

    You can calculate a correlation matrix with just one line of code using the DataFrame method corr():

    cm = df.corr()

    Now the variable cm stores the correlation matrix. To present it visually, use heatmap() from the seaborn library:

    import seaborn as sns
    sns.heatmap(cm, annot=True, square=True)

    At this stage you can already see:

    • The features that have the strongest correlation with the target variable.

    • The features that strongly correlate with each other.

    • How paired graphs look.

      • feature-feature

      • feature-target variable

        Paired graphs can be rendered with the seaborn library's scatterplot():

        sns.scatterplot(x=df['Feature 1'], y=df['Feature 2'])
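For the first two bullets in this list (histograms and values over time), a minimal sketch might look like this; 'Feature 1' and 'date' are placeholder column names:

    import matplotlib.pyplot as plt

    # Histogram of a numeric feature
    df['Feature 1'].hist(bins=30)
    plt.show()

    # The same feature plotted over time, assuming the data has a 'date' column
    df.sort_values('date').plot(x='date', y='Feature 1')
    plt.show()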

Once you've spent some time meditating on your graphs, you'll be able to formulate preliminary hypotheses:

  • What features might be the most valuable for the model, based on correlations?
  • What other useful features can we generate on the basis of existing ones?
  • In general, how useful can a model with data of this quality be?

Data Preprocessing

Data preprocessing generally consists of the following stages:

  1. Processing missing values
  2. Processing outliers
  3. Converting categorical variables
  4. Standardizing data (for linear models or models that are sensitive to distance, such as clustering)

Sometimes the selection of features is also considered part of preprocessing, but this can also be done at later stages in the development of a model.

Processing missing values

There are several ways to deal with missing values:

  • Simply removing observations with missing values.
  • Replacing them with neighboring values from the past.
  • Replacing them with mean values.
  • Replacing them with zeros.
  • In some cases we can replace missing values with the indicator "unspecified."

The general approach to missing values goes like this: try to process them carefully, and make sure the filled-in values don't result in anomalies when you're training the model. If they do, you'll probably have to get rid of features for which there are a lot of missing values.
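A minimal sketch of these options in pandas (all column names here are placeholders):

    # Remove observations with missing values
    df_clean = df.dropna()

    # Fill from the neighboring value in the past (forward fill)
    df['sales'] = df['sales'].ffill()

    # Fill with the column's mean value
    df['age'] = df['age'].fillna(df['age'].mean())

    # Fill with zeros
    df['bonus'] = df['bonus'].fillna(0)

    # Mark missing categories as "unspecified"
    df['city'] = df['city'].fillna('unspecified')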

Processing outliers

The majority of algorithms are resistant to outliers, but some aren't. For a linear model, for example, a single extreme point can noticeably shift the relation fitted to the rest of the point cloud.

The standard procedure for working with outliers is as follows:

  • Define the threshold beyond which an observation is considered an outlier.
  • Remove such observations from the sample or replace them with the mean or maximum/minimum value for this feature.
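A minimal sketch using the 1.5 × IQR rule as the threshold (a common convention; the text above doesn't prescribe a specific rule, and 'price' is a placeholder column name):

    # Define the outlier thresholds for a feature
    q1 = df['price'].quantile(0.25)
    q3 = df['price'].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Option 1: remove the outliers from the sample
    df_filtered = df[(df['price'] >= lower) & (df['price'] <= upper)]

    # Option 2: replace them with the threshold (minimum/maximum) values
    df['price'] = df['price'].clip(lower, upper)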

Converting categorical variables

These are the two most common methods for transforming categorical variables:

  1. Converting them into numeric values without any significant transformation. There are two different conceptual approaches:

    • Replacing categories with numeric values (label encoding)

      For example, converting the string values "Moscow," "Berlin," and "Paris" to the numeric 0, 1, and 2. Here we'll need the LabelEncoder() class from the sklearn.preprocessing module:

      from sklearn.preprocessing import LabelEncoder

      print(df['City'].head())

      encoder = LabelEncoder()  # create an instance of the LabelEncoder class
      df['City'] = encoder.fit_transform(df['City'])  # use the encoder to transform strings into numbers

      print(df['City'].head())

      The "City" feature is now numeric instead of categorical. But this approach has one significant disadvantage: it imposes an artificial "greater than"/"less than" ordering on the categories.

    • Transforming a categorical field into a set of binary ones (one-hot encoding)

      Here, instead of the "City" field, we'll have the fields "Moscow," "Berlin," "Paris," and others, and these new features will take the values 0 or 1. To make this happen, we'll need the pandas.get_dummies() function. It takes the whole DataFrame as input, identifies categorical variables, transforms them into new features (with convenient names) called dummy variables, and returns an updated DataFrame:

      print(df.head())
      df = pd.get_dummies(df)
      print(df.head())

    Binary substitution of categorical features is almost always a good idea. But if there are too many categorical features, or if one of them has a lot of unique values (or both), the transformation will make the matrix (DataFrame) too big. In that case you can apply label encoding instead or simply remove particular features.

  2. Creating new features based on existing ones.
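The text doesn't name a specific technique for this second method, but one common example (an assumption on my part, not something stated above) is frequency encoding: adding a feature that counts how often each category appears.

    # Hypothetical example: frequency encoding of the 'City' column
    city_counts = df['City'].value_counts()             # number of rows per city
    df['City_frequency'] = df['City'].map(city_counts)  # new numeric feature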

Standardizing data

Standardization is absolutely mandatory for two families of ML methods:

  • Linear regression
  • Clustering and methods based on mutual distance between objects

Standardizing a certain feature involves the following steps:

  • Calculating the feature's mean over the sample and subtracting it from each observation (to center the values around 0)
  • Dividing each resulting value by the feature's standard deviation (to scale the spread to 1)
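In formula form: each value x of the feature becomes z = (x − mean) / standard deviation, so the standardized feature has a mean of 0 and a standard deviation of 1.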

Thus, within the model development pipeline, standardization goes like this:

  • Split the data into train and validation sets.

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    In this example we have a classic 80/20 split.

  • Train and apply the "standardizer" on the train set.

    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_train_st = scaler.fit_transform(X_train)

    Here the standardizer remembers the mean and standard deviation of your train set and, taking this knowledge into account, applies standardization to it.

  • Apply the standardizer to the validation data.

    X_test_st = scaler.transform(X_test)

Here you standardize the test data and store the transformed feature values in the variable X_test_st (as a NumPy matrix rather than a DataFrame).

This may seem illogical: the validation data is transformed using the train set's mean and standard deviation, so after the transformation its distribution still won't be perfectly standardized.

There are two reasons for that. First, when testing a model in real life, you don't know the future distribution of your feature values. Second, if the distributions in the validation set (and thus the standard deviation and the mean) are too different from what was in the train set, your model just won't perform very well.

Random and Time Split

Your goal is not just to build a model, but to choose the best model from among several. This can be done by comparing the metrics you get from different models for the validation data. Here we must answer three questions:

  1. How will you split the data to evaluate the model's performance with validation data?
  2. What models are you going to choose from?
  3. Which metrics will you use to judge models?

There are two approaches to splitting:

  • Randomly
  • Based on time

You can use random splitting when you don't need to predict a time series.

Take a time-based approach when you're predicting the target variable value for consecutive observations. The features will be the current values of variables plus other features that are based on time, including ones directly related to the target variable:

  • Values from the past (lags).
  • The values of aggregate functions (sum, mean, standard deviation, median) for the features and target variable over sliding time windows.
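For example, lags and sliding-window aggregates can be built in pandas like this, assuming the data is sorted by time and the target column is called 'target' (a placeholder name):

    # Lag features: the target's values from previous time steps
    df['target_lag_1'] = df['target'].shift(1)
    df['target_lag_7'] = df['target'].shift(7)

    # Sliding-window aggregates over the previous 7 observations;
    # shift(1) keeps the current value out of its own window
    df['target_rolling_mean_7'] = df['target'].shift(1).rolling(7).mean()
    df['target_rolling_std_7'] = df['target'].shift(1).rolling(7).std()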

Each observation (point in time) is linked to information from the past, so random splitting doesn't fit here: a randomly selected validation set would let the model "peek into the future."

Use a time (or time-based) split instead. When you're training on time-series data and want the validation set to model a true hold-out set from the "future" for an algorithm trained on the "past," build the validation set from observations at the very end of the available historical period rather than from randomly selected points in time.
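A minimal sketch of such a split, assuming the DataFrame has a 'date' column and using the same 80/20 proportion as the earlier random split:

    # Sort chronologically, then hold out the last 20% of observations
    df = df.sort_values('date')
    split_index = int(len(df) * 0.8)

    train, test = df.iloc[:split_index], df.iloc[split_index:]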

Selecting Metrics

To evaluate your models and select the best one, you need to define the metrics that describe the essence of your task.

There are two important things we haven't discussed yet:

  1. Having selected the priority metric, you can tune the training process so that it optimizes that metric.

    from sklearn.ensemble import RandomForestRegressor
    model = RandomForestRegressor(criterion='mae')  # called 'absolute_error' in newer scikit-learn versions

    This will make your algorithm less sensitive to cases when the model makes large errors for particular observations; that is, you'll optimize the training by minimizing the error's absolute value. But if you define the model like this:

    model = RandomForestRegressor(criterion='mse')  # called 'squared_error' in newer scikit-learn versions

    then the model will incur higher penalties for large errors, since here we square the difference between the real and predicted values rather than simply taking its absolute value.

    For this algorithm, these were long the only two possible values of the criterion parameter; newer scikit-learn versions rename them (and add a few more options), and other algorithms have more options still.

    This is why it's important to read the description of the algorithm in the documentation and find out how to optimize it.

  2. For certain business tasks, you'll have to create your own metrics based on existing ones to evaluate a model. One simple and common example is the "share of false responses": the ratio of the number of false positives (FP) to the total number of positive responses the model gives. If each response costs you something, you want your model to give the smallest possible number of false responses. Since precision is the share of correct responses among all positive responses, this metric can be calculated directly from precision_score:

    from sklearn.metrics import precision_score
    false_positive_rate = 1 - precision_score(y_true, y_pred)  # share of false responses
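For instance, with these toy labels (hypothetical values), the model gives three positive responses, one of which is false:

    from sklearn.metrics import precision_score

    y_true = [1, 0, 1, 1, 0]
    y_pred = [1, 1, 0, 1, 0]

    # precision = 2/3, so the share of false responses is 1/3
    print(1 - precision_score(y_true, y_pred))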

The Importance of Features

Here the focus shifts to the interpretability of the model.

We need at least a basic understanding not only of "what" the model said, but also of "how" and "why" it did so. This is crucial for getting to the core of the process being modeled and for evaluating the model's performance in general. So, we need to pay special attention to an analysis of feature importance.

Things are simplest with linear models. As long as the features are standardized, the coefficient of each one tells you its importance: the larger the coefficient's absolute value, the stronger the feature's impact on the prediction.

y = 132 + 37*x₁ + 105*x₂ + 0.3*x₃ - 79*x₄

What happens if we change the value of the third feature, x₃, from 0 to 1? The result barely changes, which means the third feature doesn't have a strong impact on the model's prediction and its importance isn't great. The second feature is different: changing it even slightly (e.g., from 0 to 1) significantly affects the result, so this feature is very important for the model. The coefficient of the fourth feature is -79. Since it's negative, increasing the feature's value decreases the predicted value, and by a fair amount, since the absolute value of the coefficient is high. The fourth feature is thus also important.

The coefficients of linear regression are stored in the .coef_ attribute of the trained model:

    feature_weights = model.coef_

Here we stored the coefficients of all the features in the variable feature_weights.

To print the null coefficient, i.e. the intercept (the value of the prediction when all features are 0), use the .intercept_ attribute:

    weight_0 = model.intercept_

With a single decision tree, each split generally uses the feature that most strongly predetermines the predictions in the subsequent nodes. Feature importance here is defined by rank rather than by coefficients.

An important nuance: the closer a feature is to the root of the tree (the top), the greater its importance and impact on predicted results.

Feature importance for trees (for both regression and classification algorithms) is stored in the .feature_importances_ attribute of the trained model:

    importances = model.feature_importances_

Determining feature importance isn't the most obvious task even when we have a single tree. When we have ensembles of models (gradient boosting and random forest), it's an entire field of study, for which special algorithms are developed. But in real life you probably won't have to get that deep into the theoretical aspects of computing importance. The random forest and gradient boosting implementations in sklearn also store feature importance in the .feature_importances_ attribute.
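A minimal sketch for viewing the importances next to the feature names (assuming the features were kept in a DataFrame X, as in the earlier train_test_split example):

    import pandas as pd

    # Pair each feature name with its importance and sort, most important first
    feature_importance = pd.Series(model.feature_importances_, index=X.columns)
    print(feature_importance.sort_values(ascending=False))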
