Feature Engineering
Adding a Feature Using Arithmetic Operators
In pandas, you can perform arithmetic operations on columns: addition, subtraction, multiplication, and division. For example:
    df['column1'] = df['column2'] + df['column3']
    df['column1'] = df['column2'] - df['column3']
    df['column1'] = df['column2'] * df['column3']
    df['column1'] = df['column2'] / df['column3']
Creating a new feature based on regional sales
    import pandas as pd

    df = pd.read_csv('/datasets/vg_sales.csv')

    df['total_sales'] = df['na_sales'] + df['eu_sales'] + df['jp_sales']
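To try the same idea without the CSV file, here is a minimal sketch that uses a small in-memory DataFrame. The rows are made-up values for illustration only; the column names mirror those used above, and na_share is a hypothetical extra feature that is not part of the original notes.

```python
import pandas as pd

# Made-up rows for illustration; not taken from vg_sales.csv
df = pd.DataFrame({
    'na_sales': [10.0, 4.0, 0.5],
    'eu_sales': [6.0, 2.0, 0.25],
    'jp_sales': [4.0, 2.0, 0.25],
})

# New feature: element-wise sum of the three regional columns
df['total_sales'] = df['na_sales'] + df['eu_sales'] + df['jp_sales']

# Hypothetical second feature: North America's share of total sales
df['na_share'] = df['na_sales'] / df['total_sales']

print(df)
```

The operators work element-wise, row by row, so no explicit loop is needed.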
Adding a Boolean Feature
Say we want a column to indicate whether something is true. We can create it using the comparison operators ==, <, >=, etc.
    import pandas as pd

    df = pd.read_csv('/datasets/vg_sales.csv')

    df['is_nintendo'] = df['publisher'] == 'Nintendo'
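As a rough illustration of how comparison operators produce Boolean columns, the sketch below uses a small made-up DataFrame instead of the CSV file. The is_hit column and its 1.0 threshold are assumptions chosen only to show the >= operator.

```python
import pandas as pd

# Made-up rows for illustration; not taken from vg_sales.csv
df = pd.DataFrame({
    'publisher': ['Nintendo', 'Sega', 'Nintendo'],
    'na_sales': [10.0, 0.5, 1.5],
})

# Each comparison returns a boolean Series, one value per row
df['is_nintendo'] = df['publisher'] == 'Nintendo'
df['is_hit'] = df['na_sales'] >= 1.0  # hypothetical threshold

# Summing a boolean column counts the True values
n_nintendo = df['is_nintendo'].sum()

print(df)
print(n_nintendo)
```

Because True counts as 1 and False as 0, sum() on a Boolean column is a quick way to count matching rows.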
Categorization by Classification
One way to categorize data is to combine values into a few categories, e.g., age groups for age values: 18 or under, 19-64, and 65 or over. Classification rules like these can be conveniently represented in Python as functions that take a value as a parameter and return a category.
The get_age_group() function shown below and the apply() method can be used together to return a column of groups based on a column of ages. Note that we don't include parentheses when writing get_age_group inside apply(): we pass the function itself rather than the result of calling it.
    def get_age_group(age):
        """
        The function returns the age group according to the age value, using the following rules:
        - 'children' for age <= 18
        - 'adult' for 19 <= age <= 64
        - 'retired' for all other cases
        """

        if age <= 18:
            return 'children'
        if age <= 64:
            return 'adult'
        return 'retired'

    df['age_group'] = df['age'].apply(get_age_group)
Testing the function above with a small DataFrame
    df = pd.DataFrame([15, 18, 19, 67, 64, 87], columns=['age'])

    df['age'].apply(get_age_group)
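For reference, a self-contained version of this test might look like the sketch below. It repeats the get_age_group() definition with a shortened docstring so that it runs on its own; the groups variable is a name introduced here for illustration, and the expected result in the comment follows directly from the rules in the function.

```python
import pandas as pd

def get_age_group(age):
    """Return 'children' for age <= 18, 'adult' for 19-64, else 'retired'."""
    if age <= 18:
        return 'children'
    if age <= 64:
        return 'adult'
    return 'retired'

df = pd.DataFrame([15, 18, 19, 67, 64, 87], columns=['age'])

# Pass the function itself (no parentheses); apply() calls it once per value
groups = df['age'].apply(get_age_group)

print(groups.tolist())
# ['children', 'children', 'adult', 'retired', 'adult', 'retired']
```

The sample ages include the boundary values 18, 19, and 64, which makes them useful for checking that each rule switches over at the intended point.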
Row-Level Functions
When the value from a single column is insufficient for categorization, the function can be passed the contents of the whole row as a Series. A function that receives a whole row can then read the value of any specific column by its name.
When processing rows instead of single values, the apply() method differs in two ways:
- The apply() method is called on a DataFrame, not just on one column.
- To pass rows into the function, we need to call the apply() method with the parameter axis=1.
    def get_bmi(row):

        height = row['height']
        weight = row['weight']

        # we don't calculate BMI for children under 2
        if row['age'] < 2:
            return None

        bmi = weight / height**2

        return bmi

    # apply the defined function to each row
    df['bmi'] = df.apply(get_bmi, axis=1)
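To see the row-level version run end to end, here is a minimal sketch with a small made-up DataFrame. It uses a condensed form of get_bmi() and assumes that height is in metres and weight in kilograms, which the notes above do not state.

```python
import pandas as pd

def get_bmi(row):
    # Skip children under 2, as in the function above
    if row['age'] < 2:
        return None
    return row['weight'] / row['height'] ** 2

# Made-up rows; height assumed to be in metres, weight in kilograms
df = pd.DataFrame({
    'age': [1, 30, 70],
    'height': [0.8, 2.0, 1.5],
    'weight': [10.0, 80.0, 54.0],
})

# axis=1 passes each row to get_bmi as a Series
df['bmi'] = df.apply(get_bmi, axis=1)

print(df)
```

The first row ends up with a missing value in the bmi column because the function returns None for ages under 2; the other two rows get 80 / 2.0**2 and 54 / 1.5**2.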