Data Collection

Glossary

Crawler/scraper: software used to extract data from websites.

Cross-validation: a method of training and testing a model in which the training set is split into $K$ equal blocks. At each of the $K$ stages, block $i$ is used for validation and the remaining $K - 1$ blocks are used for training.

Data labeling/data annotation: the process of determining the target values.

Labeled data: data with a known target value.

Target leakage: a situation in which information about the target accidentally leaks into the features.

Unlabeled data: data that lacks the target value.
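
The cross-validation definition above can be sketched in plain Python. This is only an illustration of how the $K$ blocks rotate between validation and training; the helper names `k_fold_indices` and `cross_validation_splits` are hypothetical, and the contiguous, unshuffled split is an assumption made for readability.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous blocks of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for i in range(k):
        # The first `extra` blocks get one additional sample.
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validation_splits(n_samples, k):
    """Yield (train_indices, valid_indices) for each of the k stages."""
    folds = k_fold_indices(n_samples, k)
    for i in range(k):
        valid = folds[i]
        # Every block except block i goes into the training part.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


for train_idx, valid_idx in cross_validation_splits(6, 3):
    print("train:", train_idx, "valid:", valid_idx)
```

Each observation lands in the validation block exactly once, so every part of the training set is used for both training and validation across the $K$ stages.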

Practice


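# Cross-validation
# model: untrained model for cross-validation
# features, target: the training data to be split into blocks
# cv: number of blocks for cross-validation
#     (5 by default since scikit-learn 0.22; 3 in earlier versions)

from sklearn.model_selection import cross_val_score

# Returns an array with one score per block
scores = cross_val_score(model, features, target, cv=3)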
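
The call above leaves `model`, `features`, and `target` undefined. A minimal runnable sketch, assuming a synthetic dataset from `make_classification` and a `LogisticRegression` model (both chosen here purely for illustration, not taken from the original material), might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data stands in for a real labeled dataset (assumption).
features, target = make_classification(
    n_samples=200, n_features=5, random_state=0
)

# Any untrained scikit-learn estimator can be passed as the model.
model = LogisticRegression(max_iter=1000)

# One score per block; for classifiers the default metric is accuracy.
scores = cross_val_score(model, features, target, cv=3)

# The mean of the block scores is a common single-number summary.
final_score = scores.mean()
print("Scores per block:", scores)
print("Average score:", final_score)
```

`cross_val_score` trains a fresh copy of the model at each stage, so the `model` object passed in remains untrained after the call.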