Takeaway Sheet: Business Tasks Involving Machine Learning

Glossary

Classification is an instance of supervised learning where the target variable takes a value from a limited set of possible values.

Clustering is an instance of unsupervised learning where objects are split into groups based on interrelations between them.

Dimensionality reduction is an instance of unsupervised learning where feature vectors are transformed into vectors with fewer features.

Machine learning is the process in which a model searches for interrelations between values, based on a set of example objects.

Model is a system of interrelations between features and a target variable, or between observations, that faithfully reflects reality.

Observations are existing examples for which feature and target-variable values are already known.

Regression is an instance of supervised learning where the target variable is a continuous value.

Supervised learning consists in establishing interrelations between features and target variables on the basis of already existing labeled data.

Target variable is the variable to be predicted or forecast based on specific features.

Test data is used to evaluate the model after training, on examples the model has not seen before.

Train data is the data on which the model is trained.

Unsupervised learning consists in establishing interrelations between objects with target variables being unknown.

Validation data is used to tune the model's settings (hyperparameters) during training, before the final check on test data.
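
The three kinds of data defined above (train, validation, test) can be obtained by splitting one dataset twice. The snippet below is a minimal sketch, not a prescribed recipe: the toy arrays, the 60/20/20 proportions, and the names X_rest, X_valid are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 observations, 3 features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# First split off 20% as test data, kept aside until the very end
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Then split the remaining 80% into train data (60% of the total)
# and validation data (20% of the total): 0.25 * 80% = 20%
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_valid), len(X_test))  # 60 20 20
```

The model would be fitted on the train part, tuned against the validation part, and checked once on the test part.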

Practice


# import the classifier class
from sklearn.tree import DecisionTreeClassifier
# create a classifier object
model = DecisionTreeClassifier()
# train the model
model.fit(X, y)
# get predictions
predictions = model.predict(X)
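
The other task types from the glossary follow a similar create/fit pattern in scikit-learn. The sketch below is illustrative only: the random toy data and the particular estimators (DecisionTreeRegressor, KMeans, PCA) are assumptions picked as familiar examples, not choices made by this sheet.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: 60 observations with 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
# Continuous target for the regression example (made up for illustration)
y_continuous = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=60)

# Regression: supervised, the target variable is a continuous value
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X, y_continuous)
predicted_values = regressor.predict(X)

# Clustering: unsupervised, no target variable is passed to the model
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = clusterer.fit_predict(X)

# Dimensionality reduction: unsupervised, 4 features become 2
reducer = PCA(n_components=2)
X_reduced = reducer.fit_transform(X)

print(predicted_values.shape, cluster_labels.shape, X_reduced.shape)
# (60,) (60,) (60, 2)
```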


# divide the DataFrame into a target variable and training features
y = data['target']
X = data.drop(['target'], axis=1)
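
To make the split above concrete, here is a small sketch on a made-up DataFrame. The column names age and income and all the values are hypothetical; only the 'target' column name comes from the snippet above.

```python
import pandas as pd

# Hypothetical data: each row is one observation
data = pd.DataFrame({
    'age': [25, 40, 31, 58],
    'income': [30000, 72000, 45000, 61000],
    'target': [0, 1, 0, 1],
})

# the target variable is a single column
y = data['target']
# the features are every remaining column
X = data.drop(['target'], axis=1)

print(list(X.columns))  # ['age', 'income']
print(X.shape, y.shape)  # (4, 2) (4,)
```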


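# import the splitting function
from sklearn.model_selection import train_test_split
# divide the set into train and validation data
# test_size - the share of the dataset to be split off for validation
# random_state - the seed that makes the split reproducible
# (the split-off part is named X_test, y_test here, but it serves as validation data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)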
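
The snippets in this section can be combined into one run: split the data, train on the train part only, and compare quality on both parts. This is a sketch under stated assumptions: the Iris dataset stands in for the sheet's own data, and accuracy is just one possible metric.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: a built-in example dataset replaces the sheet's `data`
X, y = load_iris(return_X_y=True)

# split off 20% of the observations
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# train the model on the train part only
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# share of correct predictions on data the model has and has not seen
train_accuracy = accuracy_score(y_train, model.predict(X_train))
holdout_accuracy = accuracy_score(y_test, model.predict(X_test))
print(train_accuracy, holdout_accuracy)
```

A large gap between the two numbers would suggest that the model has memorized the train data rather than learned general interrelations.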