Takeaway Sheet: Machine Learning Algorithms
Glossary
Global minimum the function's smallest value.
Gradient a vector of the function's partial derivatives.
Gradient descent a method for finding a local minimum of a function by taking repeated steps against the gradient; see the sketch after this glossary.
Hyperparameters parameters whose values are set before training begins and do not change during it. A model can have several hyperparameters.
Local minimum the function's smallest value within a given range.
Multicollinearity a strong correlation among several features.
Normalization converting the feature values to the range from 0 to 1.
Regularization any additional constraint on a model, or an action that decreases the model's complexity and reduces overfitting. It can be thought of as a penalty for the model's complexity.
Standardization converting the feature values so that they have a mean of 0 and a variance of 1, like the standard normal distribution.
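To make the gradient descent entry concrete, here is a minimal sketch on a toy one-dimensional function; the function, learning rate, and step count are illustrative choices, not values from this sheet.

```python
# minimizing the toy function f(x) = (x - 3)**2 with gradient descent
def gradient_descent(derivative, x0, learning_rate=0.1, n_steps=100):
    x = x0
    for _ in range(n_steps):
        x -= learning_rate * derivative(x)  # step against the gradient
    return x

# the derivative of (x - 3)**2 is 2 * (x - 3), so the minimum is at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # approximately 3.0
```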
Practice
```python
# standardizing features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()  # creating a scaler object
scaler.fit(X)  # training the standardizer
X_sc = scaler.transform(X)  # transforming the data set
```
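The snippet above covers standardization; for the normalization defined in the glossary, a hedged sketch using sklearn's MinMaxScaler, which rescales each feature to the range from 0 to 1 (assuming the same feature matrix X as above):

```python
# normalizing features to the range from 0 to 1
from sklearn.preprocessing import MinMaxScaler

normalizer = MinMaxScaler()
normalizer.fit(X)  # learning each feature's minimum and maximum
X_norm = normalizer.transform(X)  # rescaling the data set to [0, 1]
```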
```python
# calculating the correlation matrix
cm = df.corr()
```
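The correlation matrix is the usual starting point for spotting the multicollinearity defined in the glossary. A minimal sketch that lists strongly correlated feature pairs; the 0.9 threshold is an illustrative choice, not a fixed rule:

```python
# flagging strongly correlated feature pairs from the matrix above
import numpy as np

# keep only the upper triangle to skip the diagonal and duplicate pairs
upper = cm.abs().where(np.triu(np.ones(cm.shape, dtype=bool), k=1))
strong_pairs = upper.stack()
print(strong_pairs[strong_pairs > 0.9])  # pairs with |corr| > 0.9
```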
```python
# initializing linear regression models
from sklearn import linear_model

linear_regression = linear_model.LinearRegression()
# a linear regression model with built-in L1 weight regularization
lasso = linear_model.Lasso()
# a linear regression model with built-in L2 weight regularization
ridge = linear_model.Ridge()

# training a model
linear_regression.fit(X_train, y_train)
# getting predictions
y_pred = linear_regression.predict(X_val)
# printing the linear model's coefficients
print(linear_regression.coef_)
# printing the regression's intercept
print(linear_regression.intercept_)
```
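Lasso and Ridge take a regularization-strength hyperparameter alpha; a short sketch, where alpha=1.0 is only an illustrative starting value:

```python
# stronger alpha means stronger regularization and simpler models
lasso = linear_model.Lasso(alpha=1.0)
lasso.fit(X_train, y_train)
# L1 regularization drives some coefficients to exactly zero
print(lasso.coef_)
```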
```python
# calculating the MAE value for a trained model
# MSE (mean_squared_error) and R2 (r2_score) are found similarly
from sklearn.metrics import mean_absolute_error

print('MAE:', mean_absolute_error(y_true, y_pred))
```

---

```python
# initializing a logistic regression model (an algorithm for binary classification)
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)  # training the model

# obtaining the probability that an object belongs to each class
y_probas = model.predict_proba(X_test)
```
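predict_proba returns probabilities rather than labels; one common way to get labels is to threshold the positive-class column. The 0.5 threshold below is the conventional default, not a value from this sheet:

```python
# converting class probabilities into class labels
y_pred = (y_probas[:, 1] > 0.5).astype(int)
# equivalently, model.predict(X_test) applies the default threshold
```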
```python
# building a confusion matrix for classification
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # flattening the matrix to get the four counts
```
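The four counts returned by ravel() are enough to recompute the basic metrics by hand, which reproduces precision_score and recall_score in the binary case:

```python
# recovering precision and recall from the confusion matrix counts
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
```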
```python
# calculating classification metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

acc = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
# ROC-AUC is computed from positive-class probabilities, not labels
roc_auc = roc_auc_score(y_true, probabilities[:, 1])
```
```python
# training a decision tree classifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

tree_model = DecisionTreeClassifier()
tree_model.fit(X_train, y_train)
y_pred = tree_model.predict(X_test)

# visualizing a trained decision tree
plt.figure(figsize=(20, 15))  # set the figure size to get a larger image
plot_tree(tree_model, filled=True, feature_names=X_train.columns, class_names=['not fault', 'fault'])
plt.show()
```
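An unconstrained decision tree tends to overfit; limiting its depth is a simple form of regularization in the glossary's sense. The max_depth=5 below is only an illustrative value:

```python
# max_depth is a hyperparameter that caps the model's complexity
tree_model = DecisionTreeClassifier(max_depth=5)
tree_model.fit(X_train, y_train)
```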
```python
# training a random forest regressor
from sklearn.ensemble import RandomForestRegressor

# n_estimators is the number of decision trees
rf_model = RandomForestRegressor(n_estimators=100)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_val)
```
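A trained random forest also exposes feature_importances_; a short sketch pairing them with column names, assuming X_train is a pandas DataFrame as in the decision tree example:

```python
# printing each feature's contribution to the forest's predictions
for name, importance in zip(X_train.columns, rf_model.feature_importances_):
    print(name, importance)
```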
```python
# training a gradient boosting regressor
from sklearn.ensemble import GradientBoostingRegressor

# n_estimators is the number of simple models
gb_model = GradientBoostingRegressor(n_estimators=100)
gb_model.fit(X_train, y_train)
y_pred = gb_model.predict(X_val)
```
```python
# clustering with KMeans
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# obligatory standardization of the data before passing it to the algorithm
sc = StandardScaler()
X_sc = sc.fit_transform(X)

km = KMeans(n_clusters=5)  # setting the number of clusters to 5
labels = km.fit_predict(X_sc)  # applying the algorithm and forming the vector of cluster labels
```
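The fitted model stores the cluster centers in the standardized space; a small sketch mapping them back to the original feature scale with the scaler's inverse_transform:

```python
# cluster centers expressed in the original (unstandardized) units
centers = sc.inverse_transform(km.cluster_centers_)
print(centers)
```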
```python
# plotting a dendrogram
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# obligatory standardization of the data before passing it to the algorithm
sc = StandardScaler()
X_sc = sc.fit_transform(X)

linked = linkage(X_sc, method='ward')

plt.figure(figsize=(15, 10))
dendrogram(linked, orientation='top')
plt.title('Hierarchical clustering for GYM')
plt.show()
```
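A dendrogram only visualizes the hierarchy; to get flat cluster labels from the same linkage, SciPy's fcluster can cut it at a fixed number of clusters. Five clusters below mirrors the KMeans example and is an illustrative choice:

```python
# cutting the dendrogram into at most 5 flat clusters
from scipy.cluster.hierarchy import fcluster

labels = fcluster(linked, t=5, criterion='maxclust')
```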
```python
# estimating the silhouette score of a clustering
from sklearn.metrics import silhouette_score

silhouette_score(X_sc, labels)
```
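The silhouette score ranges from -1 to 1, and higher values mean better-separated clusters, so it can guide the choice of the cluster count. A sketch scanning an illustrative range of values:

```python
# comparing silhouette scores for several cluster counts
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 11):
    labels_k = KMeans(n_clusters=k).fit_predict(X_sc)
    print(k, silhouette_score(X_sc, labels_k))
```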