Data Collection
Data Sources
There are many data sources that you can use to train models.
One source is a company's data warehouse.
Often, though, a company cannot provide the data itself. If the task is a common one, datasets can be found in open sources:
- Kaggle: a data science competition platform;
- UC Irvine Machine Learning Repository;
- U.S. Government’s open database;
- FiveThirtyEight: open data on opinion poll analysis, politics, economics, and more.
For some tasks, data can be collected from the Internet: download the pages of the required portal and use a crawler or scraper to extract the data.
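As a minimal sketch of scraping, assuming the requests and BeautifulSoup libraries and a placeholder URL (any real portal will need its own parsing logic):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL: substitute the page of the portal you need
url = "https://example.com/catalog"
response = requests.get(url)

# Parse the HTML and pull the text out of every table cell on the page
soup = BeautifulSoup(response.text, "html.parser")
cells = [cell.get_text(strip=True) for cell in soup.find_all("td")]
print(cells)
```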
Data Labeling
Data can arrive unlabeled but still be usable. To get a training set from it, we perform data labeling, also called data annotation: assigning the correct target value to each observation.
There are dedicated online crowdsourcing services for labeling; popular examples include Amazon Mechanical Turk and Yandex Toloka.
Labeling Quality Control
Label quality can be improved after annotation by using labeling quality control methods: all observations, or a subset of them, are labeled several times, and the final label is formed from these repeated answers (for example, by majority vote).
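As a minimal sketch of one common aggregation rule, majority voting, with made-up annotator answers:

```python
from collections import Counter

# Made-up answers: each inner list holds the labels that three
# annotators assigned to the same observation
answers = [
    ["cat", "cat", "dog"],
    ["dog", "dog", "dog"],
    ["cat", "dog", "cat"],
]

# Majority vote: the most frequent label for each observation wins
final_labels = [Counter(labels).most_common(1)[0][0] for labels in answers]
print(final_labels)  # ['cat', 'dog', 'cat']
```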
Target Leakage
Target leakage occurs when information about the target accidentally leaks into the features, for example when a feature is computed from data that won't be available at prediction time. The model then scores well in validation but fails in production.
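Here is a small illustrative sketch (with synthetic data) of how a leaked feature inflates the score: one feature is pure noise, the other is in effect a copy of the target, as can happen when a column computed after the outcome sneaks into the features:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
target = rng.integers(0, 2, size=1000)

# A legitimate feature (pure noise) and a leaked feature that is
# almost an exact copy of the target
noise = rng.normal(size=(1000, 1))
leaked = target.reshape(-1, 1) + rng.normal(scale=0.01, size=(1000, 1))
features = np.hstack([noise, leaked])

X_train, X_test, y_train, y_test = train_test_split(
    features, target, random_state=42)
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# The near-perfect score comes entirely from the leak
print(model.score(X_test, y_test))
```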
Cross-Validation
Cross-validation helps to train and test the model on several differently formed samples of the data.
We split all the data into a training set and a test set. We hold out the test set until the final evaluation and randomly split the training set into K equal blocks. The split method itself is called K-Fold, where K is the number of blocks, or folds.
We are "crossing" the data, each time taking a new block for validation. And the mean of all scores obtained through cross-validation is our model's evaluation score.
The cross-validation method resembles bootstrapping in that several samples are formed, but it differs in that cross-validation uses blocks with fixed content that doesn't change between stages of training and validation. Each observation passes through both the training set and the validation set.
Cross-validation is useful when we need to compare models, select hyperparameters, or evaluate the usefulness of features. It minimizes the randomness of data splitting and gives a more accurate result. The only drawback of cross-validation is the computing time.
Cross-Validation in Sklearn
To evaluate the model with cross-validation, we will use the cross_val_score function from the sklearn.model_selection module.
```python
from sklearn.model_selection import cross_val_score

cross_val_score(model, features, target, cv=3)
```
The function takes several arguments:
- model — the model for cross-validation. It is trained in the process of cross-validation, so we have to pass it untrained:

  ```python
  from sklearn.tree import DecisionTreeClassifier

  model = DecisionTreeClassifier()
  ```

- features — the features of the dataset;
- target — the target of the dataset;
- cv — the number of blocks for cross-validation (by default, it's 3; note that in recent versions of scikit-learn the default is 5).
The function does not require you to split the data into blocks for training and validation; it does this itself. It returns a list of model evaluation scores, one from each validation. Each score is what model.score() returns for the validation sample.
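Putting it all together on synthetic data (substitute your own features and target):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration
rng = np.random.default_rng(42)
features = rng.normal(size=(100, 5))
target = rng.integers(0, 2, size=100)

model = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(model, features, target, cv=3)

print(scores)         # one score per validation block
print(scores.mean())  # the final cross-validation score
```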