Preparing Features

One-Hot Encoding (OHE)

One-hot encoding replaces a categorical feature with a separate 0/1 column for each of its values. One value can be left without its own column; it is then encoded by zeros in all the new columns.
A column is assigned 1 if the observation belongs to that category and 0 otherwise.

You can get OHE dummy variables with the pandas get_dummies() function.


pd.get_dummies(df['column'])

Set drop_first to True to drop the first category's column and avoid the dummy variable trap (perfectly collinear columns).


pd.get_dummies(df['column'], drop_first=True)
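
To make the effect of drop_first concrete, here is a minimal sketch on toy data. The DataFrame and the column name 'color' are assumptions made up for illustration; only get_dummies() and drop_first come from the notes above.

```python
import pandas as pd

# Hypothetical toy data; the column name 'color' is an illustrative assumption.
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# Full encoding: one indicator column per category (blue, green, red).
full = pd.get_dummies(df['color'])

# Reduced encoding: the first category in sorted order ('blue') is dropped
# and is represented by zeros in all remaining columns.
reduced = pd.get_dummies(df['color'], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
print(reduced)
```

In the reduced table, the row for 'blue' contains no 1 at all, which is how the dropped category is still recoverable.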

Ordinal Encoding

Ordinal encoding replaces each category with a number.
The numbers are written into the original columns, so the number of columns does not change.


from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()               # creating a new instance of the encoder
encoder.fit(data)                        # learning the categories of each feature from data
data_ordinal = encoder.transform(data)   # encoding the data with the fitted encoder

transform() returns a NumPy array, so wrap the result in DataFrame() to restore the column names.


data_ordinal = pd.DataFrame(encoder.transform(data), columns=data.columns)

If you need to transform the data only once, you can call the fit_transform() method, which combines fit() and transform().


data_ordinal = pd.DataFrame(encoder.fit_transform(data), columns=data.columns)
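
The steps above can be tried end to end with a small sketch. The toy DataFrame and its column names ('size', 'city') are assumptions for illustration only; the encoder calls are the ones shown in the notes.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data; column names and values are illustrative assumptions.
data = pd.DataFrame({
    'size': ['small', 'large', 'medium', 'small'],
    'city': ['Paris', 'Oslo', 'Paris', 'Rome'],
})

encoder = OrdinalEncoder()

# fit_transform() learns the categories and encodes the data in one call;
# DataFrame() restores the column names lost in the returned NumPy array.
data_ordinal = pd.DataFrame(encoder.fit_transform(data), columns=data.columns)

# With default settings the categories of each column are sorted,
# and each value is replaced by its position in that sorted list.
print(encoder.categories_)
print(data_ordinal)
```

Note that with default settings the numbers follow the sorted (here alphabetical) order of the categories, so 'large' gets 0 and 'small' gets 2. The codes do not reflect any meaningful ranking unless the category order is specified explicitly.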

Feature Scaling

One way to scale the features is to standardize the data.

Assume that all features are normally distributed, with the mean (M) and variance (Var) estimated from the sample. Feature values are converted by the formula:

X' = \frac{X - M}{\sqrt{\text{Var}}} \quad \text{where } X \text{ is the original value and } X' \text{ is the scaled value}

For the new feature, the mean becomes 0 and the variance becomes 1.


from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df)                     # computing the mean and variance of each feature
df_scaled = scaler.transform(df)   # scaling the data with the fitted scaler
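
As a quick check of the formula, the sketch below compares StandardScaler with a manual calculation. The toy DataFrame and its column names are assumptions for illustration; the manual version uses the population variance (ddof=0), which is what StandardScaler computes.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    'age': [20, 30, 40, 50],
    'income': [1000, 2000, 3000, 6000],
})

scaler = StandardScaler()
scaler.fit(df)
df_scaled = pd.DataFrame(scaler.transform(df), columns=df.columns)

# Manual standardization: subtract the mean, divide by the square root of the variance.
manual = (df - df.mean()) / df.std(ddof=0)

print(np.allclose(df_scaled.values, manual.values))
print(df_scaled.mean().round(6))
print(df_scaled.var(ddof=0).round(6))
```

After scaling, each column has a mean of 0 and a variance of 1 (up to floating-point error), regardless of the original units.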