Knowledge Base

Feature Engineering

Adding Feature Using Arithmetic Operators

In pandas, you can perform arithmetic operations on columns: addition, subtraction, multiplication, and division. For example:

df['column1'] = df['column2'] + df['column3']
df['column1'] = df['column2'] - df['column3']
df['column1'] = df['column2'] * df['column3']
df['column1'] = df['column2'] / df['column3']
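These operations work element-wise: row i of the result is computed from row i of each operand. A minimal sketch, using a made-up DataFrame whose columns price and quantity are hypothetical and not taken from any dataset in this article:

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({'price': [10.0, 4.0, 2.5], 'quantity': [2, 5, 4]})

# Each operation is applied row by row and yields a new column
df['sum'] = df['price'] + df['quantity']
df['difference'] = df['price'] - df['quantity']
df['product'] = df['price'] * df['quantity']
df['ratio'] = df['price'] / df['quantity']

print(df)
```

Note that dividing by a zero value does not raise an error in pandas; the result is inf (or NaN for 0/0), so it can be worth checking the denominator column first.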

Creating a new feature based on regional sales

import pandas as pd

df = pd.read_csv('/datasets/vg_sales.csv')

df['total_sales'] = df['na_sales'] + df['eu_sales'] + df['jp_sales']
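To try the same idea without the CSV file, the sketch below builds a small DataFrame with invented sales figures (the column names match the example above; the values are hypothetical). It also shows one alternative, sum(axis=1), which skips missing values by default, whereas the + operator returns NaN for any row with a missing operand.

```python
import numpy as np
import pandas as pd

# Invented figures, for illustration only
df = pd.DataFrame({
    'na_sales': [1.5, 2.0, np.nan],
    'eu_sales': [0.5, 1.0, 1.0],
    'jp_sales': [1.0, 0.0, 2.0],
})

regions = ['na_sales', 'eu_sales', 'jp_sales']

# The + operator propagates NaN: the third row gets no total
df['total_sales'] = df['na_sales'] + df['eu_sales'] + df['jp_sales']

# sum(axis=1) skips NaN by default: the third row totals the known regions
df['total_sales_skipna'] = df[regions].sum(axis=1)

print(df)
```

Which behavior is appropriate depends on whether a missing regional value should make the total unknown or be treated as zero.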

Adding Boolean Feature

Suppose we want a column that indicates whether a condition holds for each row. We can create it with comparison operators such as ==, <, and >=; the result is a column of True/False values.

import pandas as pd

df = pd.read_csv('/datasets/vg_sales.csv')

df['is_nintendo'] = df['publisher'] == 'Nintendo'
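Comparisons can also be combined. As a sketch on invented data (the year column and the 2000 threshold are assumptions made for this example), the operators & and | combine conditions element-wise. Each condition needs its own parentheses because & and | bind more tightly than the comparison operators.

```python
import pandas as pd

# Invented rows, for illustration only
df = pd.DataFrame({
    'publisher': ['Nintendo', 'Sega', 'Nintendo'],
    'year': [1985, 1991, 2006],
})

# A single comparison yields a column of True/False values
df['is_nintendo'] = df['publisher'] == 'Nintendo'

# Two conditions combined with & (element-wise "and")
df['is_nintendo_2000s'] = (df['publisher'] == 'Nintendo') & (df['year'] >= 2000)

# Boolean columns can be summed to count the True values
nintendo_count = df['is_nintendo'].sum()
print(nintendo_count)
```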

Categorization by Classification

One way to categorize data is to combine values into a few categories, e.g. to turn age values into age groups: 18 or under, 19 to 64, and 65 or over. Classification rules like these can be conveniently represented in Python as functions that take a value as a parameter and return a category.

The get_age_group() function defined below, together with the apply() method, produces a new column of groups from the values in the age column. Note that we don't include parentheses when writing get_age_group inside apply(): we pass the function itself rather than calling it.

def get_age_group(age):
    """
    The function returns the age group according to the age value, using the following rules:
    - 'children' for age <= 18
    - 'adult' for 19 <= age <= 64
    - 'retired' for all other cases
    """

    if age <= 18:
        return 'children'
    if age <= 64:
        return 'adult'
    return 'retired'

# df is assumed to be an existing DataFrame with an 'age' column
df['age_group'] = df['age'].apply(get_age_group)

Testing the function above with a small DataFrame

df = pd.DataFrame([15, 18, 19, 67, 64, 87], columns=['age'])

df['age'].apply(get_age_group)
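By the rules in the docstring, the test above should label ages 15 and 18 as 'children', 19 and 64 as 'adult', and 67 and 87 as 'retired'. The sketch below repeats that test in self-contained form and, as an alternative technique that the original text does not use, reproduces the same groups with pd.cut(). The bin edges are chosen here to mirror the rules in get_age_group().

```python
import numpy as np
import pandas as pd

def get_age_group(age):
    """Return 'children' (age <= 18), 'adult' (19 to 64), or 'retired' (otherwise)."""
    if age <= 18:
        return 'children'
    if age <= 64:
        return 'adult'
    return 'retired'

df = pd.DataFrame([15, 18, 19, 67, 64, 87], columns=['age'])

# Approach from the text: apply the function to each value
df['age_group'] = df['age'].apply(get_age_group)

# Alternative: pd.cut with right-closed bins (-inf, 18], (18, 64], (64, inf)
df['age_group_cut'] = pd.cut(
    df['age'],
    bins=[-np.inf, 18, 64, np.inf],
    labels=['children', 'adult', 'retired'],
)

print(df)
```

pd.cut() returns a categorical column, while apply() with this function returns plain strings. The function-based approach remains the more flexible choice when the rules are not simple numeric ranges.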

Row-Level Functions

When the value from a single column is insufficient for categorization, the function can instead receive the whole row as a Series. Inside the function, the value from any specific column is then available by its name, e.g. row['age'].

When processing rows instead of single values, the use of apply() differs in two ways:

  1. The apply() method is called on the whole DataFrame, not just on one column.
  2. To pass rows into the function, we need to call apply() with the parameter axis=1.

def get_bmi(row):

    height = row['height']
    weight = row['weight']

    # we don't calculate BMI for children under 2
    if row['age'] < 2:
        return None

    bmi = weight / height**2

    return bmi

# applying the function to each row
df['bmi'] = df.apply(get_bmi, axis=1)
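A self-contained sketch of the row-level approach, using invented measurements. The units (height in metres, weight in kilograms) are an assumption, since the original does not state them. The last step shows a vectorized alternative with Series.where(), which is not part of the original text but gives the same values without calling a Python function for every row.

```python
import pandas as pd

def get_bmi(row):
    """Return weight / height**2, or None for children under 2."""
    if row['age'] < 2:
        return None
    return row['weight'] / row['height'] ** 2

# Invented data; units assumed: height in metres, weight in kilograms
df = pd.DataFrame({
    'age': [1, 30, 70],
    'height': [0.75, 2.0, 1.6],
    'weight': [10.0, 80.0, 64.0],
})

# axis=1 passes each row to the function as a Series
df['bmi'] = df.apply(get_bmi, axis=1)

# Vectorized alternative: compute for all rows, then blank out ages under 2
df['bmi_vectorized'] = (df['weight'] / df['height'] ** 2).where(df['age'] >= 2)

print(df)
```

The first row has no BMI in either column because the age is under 2. On large DataFrames the vectorized form is usually much faster, while apply() with axis=1 is easier to read when the logic has many branches.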