Categorization
Classifying by Type
Categories in a dataset might be stored as strings of varying lengths.
What are the implications of storing them like this?
- The table isn't easy to process visually.
- File size and data processing time are greater than they need to be.
- To filter data by type, we need to input the entire name (without any typos!).
- Creating new categories and changing old ones can take a lot of time.
To store information about categories in the best way possible, use a dictionary that maps each category name to a number. This number will be used in place of the category name in the table.
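As a minimal sketch of this idea, the example below builds such a dictionary with pandas and swaps the category names for numbers. The table contents and the names data, type, type_id, type_ids, and type_dict are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: the column names and category names are made up.
data = pd.DataFrame({
    'item': ['apple', 'carrot', 'pear', 'potato'],
    'type': ['fruit', 'vegetable', 'fruit', 'vegetable'],
})

# Build the dictionary: each distinct category name gets an integer ID,
# numbered in order of first appearance.
type_ids = {name: idx for idx, name in enumerate(data['type'].unique())}

# Keep the dictionary as its own small table so IDs can be decoded later.
type_dict = pd.DataFrame({
    'type_id': list(type_ids.values()),
    'type': list(type_ids.keys()),
})

# Replace the long strings in the main table with the compact IDs.
data['type_id'] = data['type'].map(type_ids)
data = data.drop(columns='type')

# Filtering now compares numbers instead of full category names.
fruit_rows = data[data['type_id'] == type_ids['fruit']]
```

Renaming a category then means changing one row in type_dict rather than every matching row in the main table.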
Classifying by Age Group
Often there is only one entry with a particular value (for example, a particular age). A single entry is too little data to support statistical conclusions. This is why such data must be categorized, that is, combined into groups.
One way to categorize the data is to split it into age groups. For example: 18 or under, 19 to 64, and 65 or over.
Classification rules like these can be conveniently represented in Python as functions that take parameters and return a category value.
The age_group() function that we wrote and the apply() method can be used to create a new column of group labels from the values of another column. Note that we don't include parentheses when writing age_group inside apply(), because we pass the function itself rather than the result of calling it.
def age_group(age):
    """
    The function returns the age group according to the age value, using the following rules:
    - 'children' for age <= 18
    - 'adult' for 19 <= age <= 64
    - 'retired' for all other cases
    """

    if age <= 18:
        return 'children'
    if age <= 64:
        return 'adult'
    return 'retired'

data['column_group'] = data['column'].apply(age_group)
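To see the effect on a concrete table, here is a small self-contained sketch. The sample ages and the column names age and age_group are assumptions made for the illustration; the function repeats the rules given above.

```python
import pandas as pd

def age_group(age):
    """Return 'children', 'adult', or 'retired' for the given age."""
    if age <= 18:
        return 'children'
    if age <= 64:
        return 'adult'
    return 'retired'

# Hypothetical sample: every age occurs only once, so counting by age
# says little, while counting by group gives usable totals.
data = pd.DataFrame({'age': [5, 18, 19, 40, 64, 65, 80]})

# Pass the function object itself (no parentheses) to apply().
data['age_group'] = data['age'].apply(age_group)

# Number of entries in each category.
group_counts = data['age_group'].value_counts()
```

Ages on the boundaries (18, 19, 64, 65) are included on purpose to check which group each one falls into.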
Row-Level Functions
When the value from a single column is insufficient for categorization, the function can receive the contents of the whole row as a Series. A function that is given a whole row can then read the values of specific columns from it.
When processing rows instead of single values, the apply() method differs in two ways:
- The apply() method is called for the data DataFrame, not just for the ['age'] column.
- By default, pandas passes columns to the function given to apply(). To pass rows into the function, we need to call the apply() method with the parameter axis=1.
# defining a row-level function

def process_record(row):
    info1 = row['column1']
    info2 = row['column2']

    # Then the values are processed
# applying the defined function to rows
# (apply() is called on the whole DataFrame, since axis=1 passes rows)

data.apply(process_record, axis=1)
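The template above can be filled in as follows. This is only a sketch: the columns age and employed, the category labels, and the function name age_status are hypothetical and are not taken from any particular dataset.

```python
import pandas as pd

def age_status(row):
    """Categorize one row using two of its columns (hypothetical rules)."""
    age = row['age']
    employed = row['employed']

    if age >= 65 and employed:
        return 'working retiree'
    if age >= 65:
        return 'retired'
    if employed:
        return 'employed'
    return 'not employed'

# Hypothetical sample data.
data = pd.DataFrame({
    'age': [30, 70, 68, 25],
    'employed': [True, False, True, False],
})

# Call apply() on the DataFrame; axis=1 passes each row as a Series.
data['status'] = data.apply(age_status, axis=1)
```

Without axis=1, pandas would pass each column to age_status instead, and the lookup row['age'] would fail because a column Series is indexed by row labels rather than column names.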