Preparing Dataset for ML
Understanding Types of Tasks
Supervised learning
In a supervised learning task, you have a training dataset and a target feature that you need to predict using the rest of the features.
All variables and features are either categorical or numerical, and the target is no exception.
Classification tasks deal with categorical targets. If there are only two categories, the task is binary classification.
If the target is numerical, then it's a regression task. In both cases, the data is used to find relationships between the features and the target, and predictions are made based on those relationships.
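The rule above can be sketched as a small helper that looks at the target column and guesses the task type. The function name guess_task_type and the sample values are hypothetical, and the check is only a heuristic: categories that are encoded as numbers (for example 0 and 1) would be mistaken for a regression target.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def guess_task_type(target: pd.Series) -> str:
    """Hypothetical helper: guess the task type from the target column."""
    # pandas treats booleans as numeric, so exclude them explicitly.
    if is_numeric_dtype(target) and not is_bool_dtype(target):
        return "regression"
    # Categorical target: two categories means binary classification.
    if target.nunique() == 2:
        return "binary classification"
    return "multiclass classification"


prices = pd.Series([100.5, 250.0, 99.9])    # numerical target
answers = pd.Series(["yes", "no", "yes"])   # two categories
animals = pd.Series(["cat", "dog", "bird"]) # more than two categories

print(guess_task_type(prices))
print(guess_task_type(answers))
print(guess_task_type(animals))
```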
There are other types of ML tasks as well:
- unsupervised learning: there is no target
- semi-supervised learning: the target is known only for a portion of the training data
- recommendation: users and items replace features and observations (items are things that you can recommend, such as movies or neighborhoods)
Parts of Dataset
Training dataset
In machine learning, rows and columns represent observations and features respectively. The feature that we need to predict is called the target.
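As a minimal sketch, here is how the features and the target could be separated with pandas. The toy table and its column names (area, rooms, price) are made up for illustration.

```python
import pandas as pd

# Hypothetical toy dataset: each row is an observation, each column is a feature.
df = pd.DataFrame({
    "area": [30, 45, 60, 80],
    "rooms": [1, 2, 2, 3],
    "price": [100, 150, 190, 260],  # the column we want to predict
})

# Everything except the target column is used as features.
features = df.drop(columns=["price"])
target = df["price"]

print(features.shape)  # (4, 2)
print(target.shape)    # (4,)
```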
Test dataset
To test if our model makes accurate predictions even when it faces new data, we are going to use the test dataset.
Validation dataset
For quality evaluation to be reliable, we need a validation dataset.
The validation dataset is separated from the source dataset before the model is trained. Validation shows how the model behaves in the field and helps to reveal overfitting.
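To see how a validation set can reveal overfitting, we can compare a model's accuracy on the data it was trained on with its accuracy on held-out data. The sketch below is only an illustration under assumptions: the data is synthetic (generated with make_classification, with some label noise), and an unrestricted decision tree is chosen because it tends to memorize the training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration); flip_y adds label noise.
features, target = make_classification(
    n_samples=400, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)

# Hold out 25% of the observations before training.
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=0
)

# A tree with no depth limit can fit the training data almost perfectly.
model = DecisionTreeClassifier(random_state=0)
model.fit(features_train, target_train)

train_accuracy = model.score(features_train, target_train)
valid_accuracy = model.score(features_valid, target_valid)
print("training accuracy:  ", train_accuracy)
print("validation accuracy:", valid_accuracy)
```

A large gap between the two numbers suggests that the model has memorized the training data rather than learned relationships that carry over to new observations.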
The portion of the data to be assigned to the validation set depends on the number of observations and features, as well as the data variation. Here are the two most common scenarios:
- The test set exists (or will exist in the near future), but is unavailable for the time being. The preferred ratio of training to validation data is 3:1. This means 75% for the training set and 25% for the validation set.
- The test set doesn't exist. In that case, the source data has to be split into three parts: training, validation, and test. The sizes of the validation set and the test set are usually equal. This scenario gives us a 3:1:1 ratio, that is, 60% for training, 20% for validation, and 20% for testing.
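One way to get a 3:1:1 split is to call scikit-learn's train_test_split() twice: first set aside 20% of the data as the test set, then take 25% of the remaining 80% as the validation set (0.25 × 0.8 = 0.2 of the source data). This is a sketch on a made-up 100-row table; the column names and the random_state value are arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset with 100 observations.
df = pd.DataFrame({"feature": range(100), "target": [i % 2 for i in range(100)]})

# Step 1: set aside 20% of the source data as the test set.
df_rest, df_test = train_test_split(df, test_size=0.2, random_state=12345)

# Step 2: 25% of the remaining 80% equals 20% of the source data.
df_train, df_valid = train_test_split(df_rest, test_size=0.25, random_state=12345)

print(len(df_train), len(df_valid), len(df_test))  # 60 20 20
```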
Splitting data into two sets
Scikit-learn has a dedicated function, train_test_split(), for this purpose. It can split any dataset in two, and it is named so because it is usually used to split data into training and test sets. We are going to use this function to obtain a training set and a validation set.
from sklearn.model_selection import train_test_split
Before splitting, we need to set two parameters:
- Name of the dataset that we are going to split.
- Size of the validation set (test_size). The size is expressed as a decimal from 0 to 1 that represents a fraction of the source dataset.
The train_test_split() function returns two sets of data: training and validation.
df_train, df_valid = train_test_split(df, test_size=0.25, random_state=54321)
Note: we can assign any integer to random_state. If it is left as None, the split is not reproducible and will differ from run to run.
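The effect of a fixed random_state can be checked with a short experiment: splitting the same data twice with the same value should produce the same training and validation sets. The 20-row table below is a made-up example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset with 20 observations.
df = pd.DataFrame({"x": range(20)})

# Two splits with the same random_state value.
train_a, valid_a = train_test_split(df, test_size=0.25, random_state=54321)
train_b, valid_b = train_test_split(df, test_size=0.25, random_state=54321)

# The same rows end up in the validation set both times.
same_split = valid_a.index.equals(valid_b.index)
print(same_split)                   # True
print(len(train_a), len(valid_a))   # 15 5
```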