Data Collection
Data Sources
There are many data sources that you can use to train models.
One source is a company's data warehouse.
Often, though, a company cannot provide the data itself. If the task is a common one, datasets can be found in open sources:
- Kaggle: a data science competition platform;
- UC Irvine Machine Learning Repository;
- U.S. Government’s open database;
- FiveThirtyEight: open data on opinion poll analysis, politics, economics, and more.
For some tasks, data can be collected from the Internet: download the pages of the required portal and use a crawler or scraper to extract the data.
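As a minimal sketch of scraping, assuming the requests and BeautifulSoup libraries and a placeholder URL (any real portal will need its own parsing logic):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL: substitute the page of the portal you need
url = "https://example.com/catalog"
response = requests.get(url)

# Parse the HTML and pull the text out of every table cell on the page
soup = BeautifulSoup(response.text, "html.parser")
cells = [cell.get_text(strip=True) for cell in soup.find_all("td")]
print(cells)
```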
Data Labeling
Data can arrive unlabeled but still be usable. To get a training set from it, we perform data labeling, also called data annotation: assigning the correct target value to each observation.
There are dedicated online crowdsourcing services for labeling; popular examples include Amazon Mechanical Turk and Yandex Toloka.
Labeling Quality Control
Label quality can be improved after annotation by using labeling quality control methods: all observations, or a subset of them, are labeled several times, and the final label is formed from these repeated answers (for example, by majority vote).
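As a minimal sketch of one common aggregation rule, majority voting, with made-up annotator answers:

```python
from collections import Counter

# Made-up answers: each inner list holds the labels that three
# annotators assigned to the same observation
answers = [
    ["cat", "cat", "dog"],
    ["dog", "dog", "dog"],
    ["cat", "dog", "cat"],
]

# Majority vote: the most frequent label for each observation wins
final_labels = [Counter(labels).most_common(1)[0][0] for labels in answers]
print(final_labels)  # ['cat', 'dog', 'cat']
```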
Target Leakage
Target leakage occurs when information about the target accidentally leaks into the features, for example when a feature is computed from data that won't be available at prediction time. The model then scores well in validation but fails in production.
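Here is a small illustrative sketch (with synthetic data) of how a leaked feature inflates the score: one feature is pure noise, the other is in effect a copy of the target, as can happen when a column computed after the outcome sneaks into the features:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
target = rng.integers(0, 2, size=1000)

# A legitimate feature (pure noise) and a leaked feature that is
# almost an exact copy of the target
noise = rng.normal(size=(1000, 1))
leaked = target.reshape(-1, 1) + rng.normal(scale=0.01, size=(1000, 1))
features = np.hstack([noise, leaked])

X_train, X_test, y_train, y_test = train_test_split(
    features, target, random_state=42)
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# The near-perfect score comes entirely from the leak
print(model.score(X_test, y_test))
```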
Cross-Validation
Cross-validation helps to train and test the model on several differently formed samples of the data.
We split all the data into a training set and a test set. We hold out the test set until the final evaluation and randomly split the training set into K equal blocks. The split method itself is called K-Fold, where K is the number of blocks, or folds.
We are "crossing" the data, each time taking a new block for validation. And the mean of all scores obtained through cross-validation is our model's evaluation score.
The cross-validation method resembles bootstrapping in that several samples are formed, but it differs in that cross-validation uses blocks with fixed content that doesn't change between stages of training and validation. Each observation passes through both the training set and the validation set.
Cross-validation is useful when we need to compare models, select hyperparameters, or evaluate the usefulness of features. It minimizes the randomness of data splitting and gives a more accurate result. The only drawback of cross-validation is the computing time.
Cross-Validation in Sklearn
To evaluate the model with cross-validation, we will use the cross_val_score function from the sklearn.model_selection module.
```python
from sklearn.model_selection import cross_val_score

cross_val_score(model, features, target, cv=3)
```
The function takes several arguments:
- model — the model for cross-validation. It is trained in the process of cross-validation, so we have to pass it untrained:

  ```python
  from sklearn.tree import DecisionTreeClassifier

  model = DecisionTreeClassifier()
  ```

- features — the features of the dataset;
- target — the target of the dataset;
- cv — the number of blocks for cross-validation (by default, it's 3; note that in recent versions of scikit-learn the default is 5).
The function does not require you to split the data into blocks for training and validation; it does this itself. It returns a list of model evaluation scores, one from each validation. Each score is what model.score() returns for the validation sample.
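Putting it all together on synthetic data (substitute your own features and target):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration
rng = np.random.default_rng(42)
features = rng.normal(size=(100, 5))
target = rng.integers(0, 2, size=100)

model = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(model, features, target, cv=3)

print(scores)         # one score per validation block
print(scores.mean())  # the final cross-validation score
```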