Takeaway Sheet: Machine Learning Algorithms
Glossary
Global minimum the function's smallest value.
Gradient a vector of the function's partial derivatives.
Gradient descent a method for finding a local minimum of a function by taking repeated steps against the gradient; see the sketch after this glossary.
Hyperparameters parameters whose values are set before training begins and do not change during it. A model can have several hyperparameters.
Local minimum the function's smallest value within a given range.
Multicollinearity a strong correlation among several features.
Normalization converting the feature values to the range from 0 to 1.
Regularization any additional constraint on a model, or an action that decreases the model's complexity and reduces overfitting. It can be thought of as a penalty for the model's complexity.
Standardization converting the feature values so that they have a mean of 0 and a variance of 1, like the standard normal distribution.
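To make the gradient descent entry concrete, here is a minimal sketch on a toy one-dimensional function; the function, learning rate, and step count are illustrative choices, not values from this sheet.

```python
# minimizing the toy function f(x) = (x - 3)**2 with gradient descent
def gradient_descent(derivative, x0, learning_rate=0.1, n_steps=100):
    x = x0
    for _ in range(n_steps):
        x -= learning_rate * derivative(x)  # step against the gradient
    return x

# the derivative of (x - 3)**2 is 2 * (x - 3), so the minimum is at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # approximately 3.0
```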
Practice
```python
# standardizing features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()  # creating a scaler object
scaler.fit(X)  # training the standardizer
X_sc = scaler.transform(X)  # transforming the data set
```
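The snippet above covers standardization; for the normalization defined in the glossary, a hedged sketch using sklearn's MinMaxScaler, which rescales each feature to the range from 0 to 1 (assuming the same feature matrix X as above):

```python
# normalizing features to the range from 0 to 1
from sklearn.preprocessing import MinMaxScaler

normalizer = MinMaxScaler()
normalizer.fit(X)  # learning each feature's minimum and maximum
X_norm = normalizer.transform(X)  # rescaling the data set to [0, 1]
```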
```python
# calculating the correlation matrix
cm = df.corr()
```
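The correlation matrix is the usual starting point for spotting the multicollinearity defined in the glossary. A minimal sketch that lists strongly correlated feature pairs; the 0.9 threshold is an illustrative choice, not a fixed rule:

```python
# flagging strongly correlated feature pairs from the matrix above
import numpy as np

# keep only the upper triangle to skip the diagonal and duplicate pairs
upper = cm.abs().where(np.triu(np.ones(cm.shape, dtype=bool), k=1))
strong_pairs = upper.stack()
print(strong_pairs[strong_pairs > 0.9])  # pairs with |corr| > 0.9
```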
```python
# initializing linear regression models
from sklearn import linear_model

linear_regression = linear_model.LinearRegression()
# a linear regression model with built-in L1 weight regularization
lasso = linear_model.Lasso()
# a linear regression model with built-in L2 weight regularization
ridge = linear_model.Ridge()

# training a model
linear_regression.fit(X_train, y_train)
# getting predictions
y_pred = linear_regression.predict(X_val)
# printing the linear model's coefficients
print(linear_regression.coef_)
# printing the regression's intercept
print(linear_regression.intercept_)
```
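Lasso and Ridge take a regularization-strength hyperparameter alpha; a short sketch, where alpha=1.0 is only an illustrative starting value:

```python
# stronger alpha means stronger regularization and simpler models
lasso = linear_model.Lasso(alpha=1.0)
lasso.fit(X_train, y_train)
# L1 regularization drives some coefficients to exactly zero
print(lasso.coef_)
```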
```python
# calculating the MAE value for a trained model
# MSE (mean_squared_error) and R2 (r2_score) are found similarly
from sklearn.metrics import mean_absolute_error

print('MAE:', mean_absolute_error(y_true, y_pred))
```

---

```python
# initializing a logistic regression model (an algorithm for binary classification)
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)  # training the model

# obtaining the probability that an object belongs to each class
y_probas = model.predict_proba(X_test)
```
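predict_proba returns probabilities rather than labels; one common way to get labels is to threshold the positive-class column. The 0.5 threshold below is the conventional default, not a value from this sheet:

```python
# converting class probabilities into class labels
y_pred = (y_probas[:, 1] > 0.5).astype(int)
# equivalently, model.predict(X_test) applies the default threshold
```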
```python
# building a confusion matrix for classification
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # flattening the matrix to get the four counts
```
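The four counts returned by ravel() are enough to recompute the basic metrics by hand, which reproduces precision_score and recall_score in the binary case:

```python
# recovering precision and recall from the confusion matrix counts
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
```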
```python
# calculating classification metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

acc = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
# ROC-AUC is computed from positive-class probabilities, not labels
roc_auc = roc_auc_score(y_true, probabilities[:, 1])
```
```python
# training a decision tree classifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

tree_model = DecisionTreeClassifier()
tree_model.fit(X_train, y_train)
y_pred = tree_model.predict(X_test)

# visualizing a trained decision tree
plt.figure(figsize=(20, 15))  # set the figure size to get a larger image
plot_tree(tree_model, filled=True, feature_names=X_train.columns, class_names=['not fault', 'fault'])
plt.show()
```
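An unconstrained decision tree tends to overfit; limiting its depth is a simple form of regularization in the glossary's sense. The max_depth=5 below is only an illustrative value:

```python
# max_depth is a hyperparameter that caps the model's complexity
tree_model = DecisionTreeClassifier(max_depth=5)
tree_model.fit(X_train, y_train)
```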
```python
# training a random forest regressor
from sklearn.ensemble import RandomForestRegressor

# n_estimators is the number of decision trees
rf_model = RandomForestRegressor(n_estimators=100)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_val)
```
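A trained random forest also exposes feature_importances_; a short sketch pairing them with column names, assuming X_train is a pandas DataFrame as in the decision tree example:

```python
# printing each feature's contribution to the forest's predictions
for name, importance in zip(X_train.columns, rf_model.feature_importances_):
    print(name, importance)
```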
```python
# training a gradient boosting regressor
from sklearn.ensemble import GradientBoostingRegressor

# n_estimators is the number of simple models
gb_model = GradientBoostingRegressor(n_estimators=100)
gb_model.fit(X_train, y_train)
y_pred = gb_model.predict(X_val)
```
```python
# clustering with KMeans
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# obligatory standardization of the data before passing it to the algorithm
sc = StandardScaler()
X_sc = sc.fit_transform(X)

km = KMeans(n_clusters=5)  # setting the number of clusters to 5
labels = km.fit_predict(X_sc)  # applying the algorithm and forming the vector of cluster labels
```
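The fitted model stores the cluster centers in the standardized space; a small sketch mapping them back to the original feature scale with the scaler's inverse_transform:

```python
# cluster centers expressed in the original (unstandardized) units
centers = sc.inverse_transform(km.cluster_centers_)
print(centers)
```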
```python
# plotting a dendrogram
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# obligatory standardization of the data before passing it to the algorithm
sc = StandardScaler()
X_sc = sc.fit_transform(X)

linked = linkage(X_sc, method='ward')

plt.figure(figsize=(15, 10))
dendrogram(linked, orientation='top')
plt.title('Hierarchical clustering for GYM')
plt.show()
```
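A dendrogram only visualizes the hierarchy; to get flat cluster labels from the same linkage, SciPy's fcluster can cut it at a fixed number of clusters. Five clusters below mirrors the KMeans example and is an illustrative choice:

```python
# cutting the dendrogram into at most 5 flat clusters
from scipy.cluster.hierarchy import fcluster

labels = fcluster(linked, t=5, criterion='maxclust')
```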
```python
# estimating the silhouette score of a clustering
from sklearn.metrics import silhouette_score

silhouette_score(X_sc, labels)
```
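The silhouette score ranges from -1 to 1, and higher values mean better-separated clusters, so it can guide the choice of the cluster count. A sketch scanning an illustrative range of values:

```python
# comparing silhouette scores for several cluster counts
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 11):
    labels_k = KMeans(n_clusters=k).fit_predict(X_sc)
    print(k, silhouette_score(X_sc, labels_k))
```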