Data Collection
Glossary
Crawler/scraper: software used to extract data from websites.
Cross-validation: a method of model training and testing in which the training set is split into equal blocks. At each stage, one block is used for validation and the remaining blocks for training, so every block serves as the validation set exactly once.
Data labeling/data annotation: the process of determining the target values.
Labeled data: data with a known target value.
Target leakage: a situation in which information about the target accidentally leaks into the features.
Unlabeled data: data that lacks the target value.
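The cross-validation entry in this glossary describes a block-by-block procedure that is easier to see when written out by hand. The following is a minimal sketch in plain Python, not how any library implements it: the helper name k_fold_indices and the toy sizes (6 observations, 3 blocks) are assumptions chosen for illustration.

```python
def k_fold_indices(n_samples, n_blocks):
    """Yield (train_idx, valid_idx) index lists, one pair per stage.

    Hypothetical helper for illustration only.
    """
    # Block sizes differ by at most one when n_samples
    # is not evenly divisible by n_blocks.
    base, extra = divmod(n_samples, n_blocks)
    start = 0
    for block in range(n_blocks):
        size = base + (1 if block < extra else 0)
        stop = start + size
        # The current block is held out for validation...
        valid_idx = list(range(start, stop))
        # ...and every other observation is used for training.
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, valid_idx
        start = stop


splits = list(k_fold_indices(n_samples=6, n_blocks=3))
for train_idx, valid_idx in splits:
    print("train:", train_idx, "valid:", valid_idx)
# train: [2, 3, 4, 5] valid: [0, 1]
# train: [0, 1, 4, 5] valid: [2, 3]
# train: [0, 1, 2, 3] valid: [4, 5]
```

Each observation lands in the validation block exactly once, which is what lets the per-stage scores be averaged into a single estimate.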
Practice
# Cross-validation
# model: untrained model for cross-validation
# cv: number of blocks for cross-validation
#     (the default is 5 in scikit-learn 0.22 and later; it was 3 in earlier versions)

from sklearn.model_selection import cross_val_score

# Returns an array with one score per block
cross_val_score(model, features, target, cv=3)
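The snippet above assumes that model, features, and target already exist. As a self-contained sketch, the example below fills them in with assumptions that are not part of the original material: a synthetic dataset from make_classification and a LogisticRegression model. Any untrained scikit-learn estimator could take its place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, used here only so the example runs on its own.
features, target = make_classification(
    n_samples=300, n_features=5, random_state=12345
)

# An untrained model; cross_val_score fits a fresh copy for each block.
model = LogisticRegression()

# One score per block; for classifiers the default metric is accuracy.
scores = cross_val_score(model, features, target, cv=3)

print("scores per block:", scores)
print("mean score:", scores.mean())
```

Averaging the per-block scores gives a single quality estimate that depends less on any one particular split.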