Preparing Features
One-Hot Encoding (OHE)
- It adds a separate column for each feature value. When one value is dropped (see drop_first), that value is encoded by zeros in all the new columns.
- If the category fits the observation, 1 is assigned; otherwise, 0 is assigned.
You can get OHE dummy variables using the get_dummies() function.
pd.get_dummies(df['column'])
Set drop_first to True to avoid the dummy variable trap: with a column for every value, one column is redundant because it can be derived from the others.
pd.get_dummies(df['column'], drop_first=True)
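To make the difference between the two calls concrete, here is a minimal sketch on toy data. The DataFrame and its 'color' column are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column name 'color' is an assumption.
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# Full one-hot encoding: one column per category, in sorted order.
ohe_full = pd.get_dummies(df['color'])

# drop_first=True drops the first category in sorted order ('blue').
# Rows with 'blue' are then encoded by zeros in all remaining columns.
ohe_reduced = pd.get_dummies(df['color'], drop_first=True)

print(list(ohe_full.columns))     # ['blue', 'green', 'red']
print(list(ohe_reduced.columns))  # ['green', 'red']
print(ohe_reduced)
```

Depending on the pandas version, the dummy columns hold booleans or 0/1 integers; the encoding is the same either way.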
Ordinal Encoding
- It encodes each class with a number.
- The numbers are put in the columns.
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()  # create a new instance of the encoder
encoder.fit(data)  # learn the list of categories for each feature in data
data_ordinal = encoder.transform(data)  # encode the data using the fitted encoder
Use DataFrame() to add column names.
data_ordinal = pd.DataFrame(encoder.transform(data), columns=data.columns)
If you need to transform the data only once, you can call the fit_transform() method, which combines fit() and transform().
data_ordinal = pd.DataFrame(encoder.fit_transform(data), columns=data.columns)
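As a rough illustration of what the encoder produces, the sketch below runs it on a small made-up DataFrame. The column names and values are assumptions, not data from this guide.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data; column names and values are assumptions.
data = pd.DataFrame({
    'size': ['small', 'large', 'medium', 'small'],
    'color': ['red', 'blue', 'red', 'green'],
})

encoder = OrdinalEncoder()
data_ordinal = pd.DataFrame(encoder.fit_transform(data), columns=data.columns)

# categories_ holds the classes learned for each column, in sorted order.
# The number assigned to a class is its position in that list.
print(encoder.categories_)
print(data_ordinal)
```

Note that with default settings the numbers follow alphabetical order ('large' = 0, 'medium' = 1, 'small' = 2), not any natural ranking of the values. If the order matters, OrdinalEncoder accepts an explicit categories argument.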
Feature Scaling
One way to scale the features is to standardize the data.
Suppose that all features are normally distributed, and the mean (M) and variance (Var) of each feature are determined from the sample. Feature values are converted by the formula:
New feature value = (Old feature value - M) / sqrt(Var)
For the new feature, the mean becomes 0 and the variance equals 1.
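from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()  # create a new instance of the scaler
scaler.fit(df)  # compute the mean and variance of each feature
df_scaled = scaler.transform(df)  # standardize the features using the fitted scaler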
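To check that the scaler matches the standardization formula, here is a minimal sketch that compares StandardScaler with a manual calculation. The DataFrame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; column names and values are assumptions.
df = pd.DataFrame({
    'age': [20.0, 30.0, 40.0, 50.0],
    'income': [1000.0, 3000.0, 2000.0, 6000.0],
})

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)  # returns a NumPy array by default

# Manual version: subtract the mean, divide by the square root of the variance.
# StandardScaler uses the population variance (ddof=0), as np.var does by default.
values = df.to_numpy()
manual = (values - values.mean(axis=0)) / np.sqrt(values.var(axis=0))

print(np.allclose(df_scaled, manual))  # True
print(df_scaled.mean(axis=0))          # close to 0 for each column
print(df_scaled.var(axis=0))           # close to 1 for each column
```

Because the result is a NumPy array, wrap it in pd.DataFrame(df_scaled, columns=df.columns) if you want to keep the column names.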