Chapter Summary: Business Tasks Involving Machine Learning

What Is Learning?

Learning consists in searching for relationships. When the relations are searched for by an algorithm instead of a human, we call it machine learning (ML).

A variable is called a target if it can be predicted or forecast based on specific features. Forecasts are based on observations: existing examples for which feature and target-variable values are already known.

To make successful forecasts you find interrelations, which are based on empirical data. The ability to find certain interrelations in order to forecast events is called learning.

Introduction to Forecasting and Machine Learning

Let's define the set of target variable values as follows:

\vec{y}

This is not a single number, but a set of several numbers. Such sets are called vectors. $\vec{y}$ is the vector of the target variable. Each object (observation) has a unique identifier, or index.

For each index $i$ , you have a set of feature values, or feature vector.

\vec{x_i} = (x_{i_1}, x_{i_2},\dots, x_{i_N} )

where $(x_{i_1}, x_{i_2},\dots, x_{i_N} )$ are the values of $N$ parameters for the $i^{\text{th}}$ object.

$M$ objects and $N$ parameters can be presented as a table, or matrix. The rows will contain observations, or objects. The columns will contain features. Let's define an object-feature matrix as $X$ :

X = \begin{pmatrix} x_{1_1} & \cdots & {x_{1_N}}\\ \vdots & \ddots & \vdots \\ x_{M_1} & \cdots & {x_{M_N}} \end{pmatrix}

You use this data to create a function that can faithfully reflect the interrelation of natural phenomena and that can identify the relationship between the feature values and the target variable. We'll call this function $f$ and its output, the forecast, $\hat{y}$.

If you apply this function to the feature values of a particular object, you'll get the target variable forecast for this observation. It should be close to the population size observed in the past:

f(\vec{x}) = f(x_1, x_2, ..., x_N) = \hat{y}

f(\vec{x_i}) = f(x_{i_{1}}, x_{i_{2}}, ..., x_{i_{N}}) = \hat{y_i}

\hat{y_i} \sim y_i

When your formula yields a forecast or estimate of the target variable's value that's close to the historical value in the majority of cases, you've found the hidden relation between the values.

\hat{y_{i}} = w_{0} + w_{1} * x_{i_{1}} + ... + w_{N} * x_{i_{N}}

In the above formula the forecasted number of penguins is the sum of the "null" coefficient (the number of penguins when all the features are equal to 0) and the values of the features multiplied by coefficients.
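As a minimal sketch of this formula, the snippet below computes forecasts for several observations at once with NumPy. The feature matrix and the weight values are hypothetical numbers chosen only for illustration; in practice the weights are found by a training algorithm.

```python
import numpy as np

# Hypothetical object-feature matrix X: M = 3 observations, N = 2 features.
X = np.array([
    [1.0, 2.0],
    [3.0, 0.0],
    [0.0, 0.0],
])

# Hypothetical weights: w0 is the "null" coefficient, w holds w_1 ... w_N.
w0 = 10.0
w = np.array([2.0, -1.0])

# y_hat_i = w0 + w_1 * x_i1 + ... + w_N * x_iN, computed for every row of X.
y_hat = w0 + X @ w
print(y_hat)
```

The last observation has all features equal to 0, so its forecast is exactly the "null" coefficient.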

How do we choose the $w_0, w_1, ..., w_N$ values, also called function parameters, or weights, so that what we deduce works with all our observations? You can always delegate the search for interrelations to a machine (an algorithm or program).

Machine learning is the process of searching by a machine for interrelations between values based on various objects. As the algorithm processes huge arrays of data, it learns from them and builds a world model that reflects hidden relationships between processes or objects. The more observations there are, the better.

Supervised Learning

Supervised learning models are given a large number of observations as input, where each observation has feature values ( $X$ ) and the target variable value ( $y$ ), or label. The task of the model is to uncover the relation between $X$ and $y$ and learn to predict $y$ for new objects that only have the vector of feature values ( $X$ ). Here you are the one training (supervising) the model, since you determine what counts as a feature and what to consider the target variable, or correct answer.

Supervised learning models are broken down into classification and regression models.

Classification is used when we need to get the name of the class of each new object as the model's answer. There may be any number of classes ( $N$ ), but it must be finite and more than 1. When $N=2$ , the classification will be binary.

If $N>2$ , it's a multiclass classification.

With regression, the answer will be a continuous variable.
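To make the difference concrete, here is a small sketch that trains one classifier and one regressor on the same feature values. The toy data and the class names are hypothetical, and decision trees are used only because they appear later in this summary; any sklearn classifier and regressor would illustrate the same point.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical toy data: one feature, four observations.
X = [[1.0], [2.0], [10.0], [11.0]]

# Classification: the target is a class name (two classes, so it is binary).
y_class = ["small", "small", "large", "large"]
classifier = DecisionTreeClassifier(random_state=0).fit(X, y_class)
class_predictions = classifier.predict([[1.5], [10.5]])
print(class_predictions)

# Regression: the target is a continuous number.
y_number = [1.1, 1.9, 10.2, 10.8]
regressor = DecisionTreeRegressor(random_state=0).fit(X, y_number)
number_predictions = regressor.predict([[1.5], [10.5]])
print(number_predictions)
```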

Unsupervised Learning

You can pass a large number of observations to an algorithm without an answer ( $y$ ) and train it to build interrelations between the objects themselves. This process is called unsupervised learning, and is the second major type of machine learning.

Dividing objects into groups is called clustering. The main difference between clustering and classification is that there are no given classes or "right answers."

With clustering, we don't have given classes; it's the algorithm that forms them.
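A minimal clustering sketch follows. It uses k-means from sklearn, which is one possible clustering algorithm and is not named in this summary; the points are hypothetical. Note that no target vector is passed: the algorithm forms the groups itself.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: two well-separated groups of points, with no labels given.
X = np.array([
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
    [5.0, 5.1], [5.2, 5.0], [5.1, 5.2],
])

# Ask for two clusters. n_init is set explicitly because its default
# value has changed between sklearn versions.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which observations end up in the same group.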

Unsupervised learning is also applied in dimensionality reduction of feature matrices.

One of the reasons for dimensionality reduction is that algorithms don't function well when there are too many features and not enough observations.
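As an illustration, the sketch below reduces a feature matrix from 10 columns to 3. It assumes principal component analysis (PCA) as the reduction technique, which is only one of several options, and uses random numbers in place of real data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical feature matrix: 20 observations, 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))

# Keep 3 components; the number 3 is an arbitrary choice for this sketch.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```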

The last major type of ML is reinforcement learning. Here the model learns step by step and slightly alters its operating algorithm at each stage, using hints from the outside environment.

Training a Model in Python: sklearn

A model is a system of interrelations between features and a target variable or between observations that reflects reality to a high degree of accuracy. Models are trained using existing observations and built with various methods, or algorithms. Usually "algorithm" refers to an abstract approach to model training, and when applied to computer languages, we call it algorithm implementation.

Here are two convenient Python libraries used for machine learning tasks:

pandas — for data analysis and preprocessing
scikit-learn (sklearn) — for the implementation of machine learning algorithms

The sklearn library has lots of tools for working with data and models, so they're grouped into subsections. The tree module contains the decision tree. DecisionTreeClassifier is a data structure designed for classification with a decision tree.

Let's import the structure from the library:


from sklearn.tree import DecisionTreeClassifier

Then we'll create an object having this data structure.


model = DecisionTreeClassifier()

The model variable will store our model. To train it, we need to launch the training algorithm.

The models receive a set of feature values, $X$ , and a target variable, $y$ , as input. To define an object-feature matrix $X$ and a target variable vector $y$ , we'll call the drop() method from the pandas library:


y = data['target']
X = data.drop(['target'], axis=1)

The method receives a list with the names of the columns to be deleted. The axis = 1 parameter indicates that it's a column we want to remove.

Now we can build an interrelation and use it to predict $y$ from the new $X$ . To start training, call the fit() method and pass it the data as a parameter:


model.fit(X, y)

To train the model, you pass it the matrix with features ( $X$ ) and the vector with the target variable values ( $y$ ).

To make predictions for a set of data, you just need to call the predict() method:


predictions = model.predict(X)
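Put together, the steps above can be sketched as one runnable script. The DataFrame contents and the feature column names are hypothetical; only the 'target' column name comes from the snippets above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset: two feature columns and a 'target' column.
data = pd.DataFrame({
    "feature_1": [1, 2, 8, 9],
    "feature_2": [0, 1, 0, 1],
    "target": [0, 0, 1, 1],
})

# Separate the target vector from the object-feature matrix.
y = data["target"]
X = data.drop(["target"], axis=1)

# Create the model and train it on the observations.
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Predict for the same observations the model was trained on.
predictions = model.predict(X)
print(predictions)
```

Predicting on the training observations only shows the mechanics; it says little about how the model will handle new data.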

Train, Validation, and Test Data

Before you pass a model real input data, you need to make sure it works well. You can train the model (using train data) and then compare its predictions with the real target values starting from the following month. Data given to the model during training in order to fine-tune it is called validation data. Finally, we try out the model on test or hold-out data.

Suppose we take 150,000 observations and divide them into two unequal parts: 100,000 and 50,000. We'll pass the first part to the model and train the algorithm on it. Then we'll use the model to predict the answers for the second part of the data and compare the results with the real target values. This will allow us to fine-tune the model.

The 100,000 observations used to train the model are train data. The 50,000 used to test the final fit are validation data.

Underfitting and Overfitting

When we train a model and "fit" weights using train data, we measure at every step how close the answer with the chosen weights is to the real state of things. Thus, we estimate the training error even during the training stage.

It's almost impossible to choose the appropriate weights for a linear model or any other algorithm so that it gives a correct answer for each observation. The model will be wrong for some observations. This is called training error.

Our goal is to minimize it. If we don't, and the algorithm we end up choosing is wrong for half of all observations, this means our model is underfitted. The error that occurs in such cases is called bias. A biased function fails to account for all the relations within a given set of data. The most common reasons for bias are:

The number of samples or features is too small
The function is too simple
A faulty approach to selecting options for the target relation

A model with bias will yield poor results with both train and test data.

Ideally a model (function, algorithm) not only rarely makes errors when being trained, but also performs well with new data it didn't "see" when we were "fitting" the weights and searching for the best possible relation. In other words, the model must have a high generalization capacity. Then, using machine learning will be really helpful.

When the model demonstrates far worse results on validation data than it did during training, it's called overfitting. This kind of error is called variance. It implies that the model is trying too hard to fit the data and doesn't ignore the noise: when it was trained, it took into account excessive information in addition to the actual relations within the data.
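The sketch below shows both effects on synthetic data. Everything in it is an assumption made for illustration: the sine-shaped relation, the noise level, and the use of a decision tree whose depth controls how closely it can fit the train data. The score() method of sklearn regressors returns R-squared, a metric described later in this summary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy data: the target follows a sine curve plus random noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=300)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# depth 1 is a very simple function; None lets the tree grow until it
# reproduces the train data, noise included.
for depth in (1, 4, None):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_r2 = model.score(X_train, y_train)
    valid_r2 = model.score(X_valid, y_valid)
    print(f"max_depth={depth}: train R2={train_r2:.2f}, validation R2={valid_r2:.2f}")
```

A very shallow tree tends to score poorly on both sets (underfitting), while the unrestricted tree scores almost perfectly on train data and noticeably worse on validation data (overfitting).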

Divide and Validate

The terms "validation data" and "test data" are sometimes used interchangeably, and there's still no perfect consensus regarding their definitions. And we encounter the same kind of thing in Python libraries: the function that splits data is called train_test_split(), even though we can use it to get validation data. In each case, if you consider the context you should be able to figure out what's meant.

No matter what you call the portions of data, the basic questions remain the same:

In what proportions do we split the data?
How do we divide it?

A quick way is to call the train_test_split() function from the sklearn library's model_selection module. The syntax is as follows:


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Here the input data for the function is the feature matrix $X$ , the target variable vector $y$ , and the test_size parameter, which controls how much of the data will be split off.

The function returns two feature matrices and two target variable vectors, obtained by dividing the original data according to the proportion defined in test_size.

By default, train_test_split() randomly divides the data in the proportion that you set. If you run the code several times, you'll get different matrices and vectors.

To get the same split every time you run the code, fix the random_state parameter, for example by giving it a value of zero when dividing the data:


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
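A quick way to convince yourself that a fixed random_state makes the split reproducible is to split the same data twice and compare the results. The arrays here are hypothetical placeholders for real features and targets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 observations with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two independent calls with the same random_state.
first = train_test_split(X, y, test_size=0.2, random_state=0)
second = train_test_split(X, y, test_size=0.2, random_state=0)

# Compare X_train, X_test, y_train, y_test pairwise.
same = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same)
```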

The random_state parameter is present in other functions, too, and is similarly responsible for "randomness". For example, it can be used when defining the model's algorithm:


from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(random_state=0)

Assessing the Quality of a Model

After you've split the data the right way and in the right proportions, train the model with the train data and check its performance with the validation data.

Let's look at the syntax using the coefficient of determination, or R-squared, as an example. This metric shows how accurately the model's forecast represents the target variable. If the forecasts are perfect, the R-squared value will be 1; if they're no better than always predicting the mean of the target variable, it'll be 0 (for even worse forecasts, r2_score() can return a negative value). The formula of the sample coefficient of determination looks as follows:

R^2 = 1 - \frac{\sum_{i=1}^{n}{(y_i-{\hat {y}}_{i})^{2}}}{\sum_{i=1}^{n}{(y_i-\overline {y})^{2}}}

The numerator is the sum of the squared errors, so if your forecasts are usually accurate, the fraction will be close to 0, and the metric will be close to 1.

The denominator contains the sum of the squared differences between the values and their mean. It normalizes your error by the real variance of the target variable. If the error is large but the value itself varies a lot, this will partially make up for the difference between forecast and fact.

The R-squared (r2_score) syntax is simple: pass the real target variable vector and the forecast one as parameters:


from sklearn import metrics
metrics.r2_score(y_train, y_pred)

The function returns one value: the coefficient of determination.
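To connect the formula with the function, the sketch below computes R-squared by hand with NumPy and compares it with the value returned by r2_score(). The four target values and forecasts are hypothetical.

```python
import numpy as np
from sklearn import metrics

# Hypothetical real values and forecasts.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Numerator: sum of squared errors.
ss_res = np.sum((y_true - y_pred) ** 2)
# Denominator: sum of squared differences between the values and their mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

r2_sklearn = metrics.r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)

# Forecasting the mean for every observation makes the numerator equal
# to the denominator, so the metric drops to 0.
mean_forecast = np.full_like(y_true, y_true.mean())
r2_mean = metrics.r2_score(y_true, mean_forecast)
print(r2_mean)
```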

The Machine Learning Pipeline

The basic stages of machine learning are covered one by one in the subsections below.

The sequence may differ depending on the task and algorithm. However, there are some general principles that are worth paying attention to.

Defining the task

At this stage, you determine your model's destiny. You translate a specific business task into math, analytical tools, and machine learning. Make sure you thoroughly understand the problem the business is facing, since that determines the choice of model, algorithm, and metrics.

EDA

Before you pass the input data to your model to get forecasts, you need to carry out exploratory data analysis. Sometimes it's useful to plot histograms of features. Use the histplot() function from the seaborn library (it replaces distplot(), the distribution plot function, which has been deprecated since seaborn 0.11):


import seaborn as sns

sns.histplot(df['feature_1'], kde=True)

Exploratory data analysis allows you to formulate initial hypotheses about the quality of data and the presence of anomalies.
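Besides plots, a couple of pandas calls already give a first impression of data quality. This is a minimal sketch on a hypothetical DataFrame that contains one missing value and one suspiciously large value.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value and a possible outlier in feature_1.
df = pd.DataFrame({
    "feature_1": [1.0, 2.0, 2.5, np.nan, 100.0],
    "feature_2": [10, 12, 11, 13, 12],
})

# Summary statistics: count, mean, spread, minimum, quartiles, maximum.
print(df.describe())

# Number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

A maximum far above the upper quartile, as in feature_1 here, is a hint to check for anomalies.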

Data preprocessing

Data is often preprocessed, meaning:

We get rid of missing values.
Some features are transformed.
Data is normalized or standardized.
New features are created on the basis of existing ones. This process is also called feature engineering.
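The steps in this list can be sketched as follows. The column names, the choice to fill the gap with the median (rather than drop the row), and the engineered feature are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data with one missing height.
df = pd.DataFrame({
    "height_cm": [170.0, np.nan, 180.0, 160.0],
    "weight_kg": [70.0, 80.0, 90.0, 50.0],
})

# Missing values: fill the gap with the column median.
df["height_cm"] = df["height_cm"].fillna(df["height_cm"].median())

# Feature engineering: build a new feature from the existing ones.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# Standardization: each column gets zero mean and unit variance.
# In a real project, fit the scaler on train data only.
scaled = StandardScaler().fit_transform(df)
print(scaled.round(2))
```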

Choosing a validation strategy

Now you should choose how you'll create the validation set based on the type of data and the task.

It's also important at this stage to make sure that the distribution of values in the train data is close to the distribution the model is going to deal with.

Choosing an algorithm

You have a toolbox of algorithms that you can choose from. Some algorithms are more accurate but harder to interpret, others are faster but weaker. Here are the basic criteria for choosing an algorithm:

Accuracy
Speed
Interpretability
Individual algorithms' characteristics

Even the simplest algorithm has a number of parameters that can be adjusted. Usually you choose parameters in iterations: you train a model that has certain parameters, estimate the metrics, see that the results are poor and change the parameters, and do the whole procedure again.
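This iterative procedure can be sketched as a simple loop. The example assumes a decision tree whose max_depth parameter is being tuned, synthetic data, and R-squared on the validation set as the metric; all three are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=300)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Train a model for each candidate value and record its validation score.
scores = {}
for depth in range(1, 11):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    scores[depth] = model.score(X_valid, y_valid)

# Keep the parameter value with the best validation score.
best_depth = max(scores, key=scores.get)
print(best_depth, round(scores[best_depth], 3))
```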

Choosing metrics

Before you move on to training your algorithm, determine how you'll evaluate its performance.

There's a standard set of metrics for each type of task, but it's important not to simply run the results through this set, but to understand which metric best reflects the essence of a business process.

At this stage, it's worth finding out what methods your company already uses for these tasks. That way you'll be able to compare the efficacy of machine learning with an established baseline.

Training and forecasting

At the fit stage, you pass the model the train set so it can identify relations between the features and the target variable. Then you move on to predicting. Here you take the features alone, give them to the trained model as input, and save the predicted values.

Evaluating the quality of results and choosing the best model

At this stage, you look at the difference between the predicted and real values for the objects from the validation set. Often, analysts evaluate several algorithms and choose the one with the best metrics.
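Here is a minimal sketch of such a comparison. The two candidate algorithms, the synthetic data (where the target depends on the square of the first feature), and R-squared as the metric are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data: the target is nonlinear in the first feature.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.3, size=400)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

candidates = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Train each candidate on train data and score it on validation data.
results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    results[name] = r2_score(y_valid, model.predict(X_valid))

best_name = max(results, key=results.get)
print(results, best_name)
```

On this particular data the random forest wins because a straight line cannot capture the squared term; on other data the simpler model may be the better choice.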

Analyzing the importance of features

You need to confirm once more that the model reflects the right patterns and interrelations within the data. You can do this by analyzing the importance of features, which lets you evaluate not only the predictions themselves, but also the reasons the model made them.
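For tree-based sklearn models, one way to look at feature importance is the feature_importances_ attribute of a trained model; other model types need other techniques. The sketch below uses hypothetical data in which only one of the two features affects the target.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: 'informative' drives the target, 'noise' does not.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "informative": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
y = 3 * X["informative"] + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Importances are normalized to sum to 1; higher means more influential.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

If a feature you know to be irrelevant came out on top, that would be a reason to revisit the data and the model.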

What's next?

After going through the whole pipeline once, you'll probably need to return to earlier stages, make changes, and see what the effects are.

Why Isn't Machine Learning Universal?

Now it's time to learn when machine learning shouldn't be used. Before passing the data to a machine learning model, answer these key questions:

Is the sample big enough?
Is the data high-quality?
Is your model capable of making realistic forecasts?

Samples

There's no definitive answer as to what sample can be considered big enough. Experts are often guided by empirical rules when deciding on the size of the sample. They take into account the number of features, the variety of target variable values, and the details of the algorithms themselves.

According to the first rule, the minimum required number of observations in a sample is linearly related to the number of features. The minimum size of a sample can be found with the formula $s = k * n$ , where $n$ is the number of features for each observation and $k$ is a constant, which, experience shows, is often equal to 10.

Another empirical rule concerns the number of target classes. The more classes you have, the harder it is to tell them apart on the basis of available features. If you calculate the minimum required number of observations using the previous rule and then increase the number of target classes by some factor, you'll have to increase the resulting value by the same factor.
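The first two rules of thumb can be written down as a tiny helper. The function name and the class_factor parameter are hypothetical; the default k = 10 is the value mentioned above, and the result is a rough lower bound, not a guarantee.

```python
def min_sample_size(n_features: int, k: int = 10, class_factor: int = 1) -> int:
    """Rule-of-thumb minimum number of observations.

    s = k * n for n features, multiplied by the factor by which
    the number of target classes has grown.
    """
    return k * n_features * class_factor


# 20 features with the default k = 10.
print(min_sample_size(20))

# The same features, but with three times as many target classes.
print(min_sample_size(20, class_factor=3))
```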

The last rule takes as its starting point the family to which the algorithm belongs.

In short: if the number of customers or observations is not measured in the thousands, you don't really need machine learning.

The quality of data

In machine learning, there is a rule called GIGO: garbage in, garbage out. If the model's input data is low-quality, you'll get bad results even if you choose the right algorithm. You might encounter the following problems in data:

Noise
Missing values
Errors and outliers
Changes in data distribution over time

Try to find out whether your dataset has these kinds of problems in the EDA stage. You can solve some of them by means of preprocessing.

But there may be cases when you can't affect the quality of data.

Variability in data is another problem you might encounter. The distributions of features in train, validation, and test data must be similar, or else the model will be useless.

Low-quality models

It may seem that you have a lot of high-quality data, but the metrics might tell you that you can't predict the chosen target variable using the data you have. Why? Because your features aren't related to your target variable, so you can't forecast the value.

Data on the number of visitors to US restaurant chains won't help you forecast the African penguin population, no matter how accurate the data is. You'll probably get relative error near 100% and an R-squared value close to 0.
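You can reproduce this situation with synthetic data. In the sketch below, the features and the target are generated independently of each other, so there is nothing for the model to learn; the linear regression model and the sample size are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data: the target is generated independently of the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = rng.normal(size=2000)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_valid, model.predict(X_valid))
print(round(r2, 3))
```

The validation R-squared stays close to 0 no matter how clean or plentiful the data is.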

Or maybe the data will be too noisy and volatile. It's almost impossible to find a "good signal" in such a set. In cases like these, machine learning won't do; you'll need to use other methods of mathematical modeling.