Categorization

Classifying by Type

Categories in a dataset might be stored as strings of varying lengths.

What are the implications of storing them like this?

The table isn't easy to process visually.
File size and data processing time are greater than they need to be.
To filter data by type, we need to input the entire name (without any typos!).
Creating new categories and changing old ones can take a lot of time.

To store information about categories in the best way possible, use a dictionary that maps each category name to a number. This number will be used in place of the category name in the table.
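
The mapping described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the table, the column names `item`, `type`, and `type_id`, and the sample values are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical sample data: category names stored as long strings
data = pd.DataFrame({
    'item': ['A', 'B', 'C', 'D'],
    'type': ['consumer electronics', 'home appliances',
             'consumer electronics', 'office supplies'],
})

# Build the dictionary: each unique category name gets a number
type_dict = {name: number for number, name in enumerate(data['type'].unique())}

# Use the number in place of the category name in the table
data['type_id'] = data['type'].map(type_dict)
data = data.drop(columns='type')

# Filtering now compares numbers; the full name appears only in the dictionary lookup
electronics = data[data['type_id'] == type_dict['consumer electronics']]
```

Renaming a category then means changing one dictionary key rather than every matching row in the table.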

Classifying by Age Group

A specific value, such as an exact age, often appears in only one entry. It's impossible to draw statistical conclusions from isolated entries like this. This is why the data must be categorized, that is, combined into categories.

One way to categorize the data is to group it by age. For example: 18 or under, 19-64, and 65 or over.

Classification rules like these can be conveniently represented in Python as functions that take parameters and return a category value.

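The age_group() function below and the apply() method can be used together to build a column of groups from a column of values: apply() calls the function for each value and returns the results as a new column. Note that we don't include parentheses when writing age_group inside apply(), because we pass the function itself rather than call it.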


def age_group(age):
    """
    The function returns the age group according to the age value, using the following rules:
    - 'children' for age <= 18
    - 'adult' for 19 <= age <= 64
    - 'retired' for all other cases
    """

    if age <= 18:
        return 'children'
    if age <= 64:
        return 'adult'
    return 'retired'

data['column_group'] = data['column'].apply(age_group)
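
As a quick check, the same approach can be tried on a small made-up table. The `age` and `age_group` column names and the sample values are assumptions chosen for illustration; the boundary ages 18, 19, 64, and 65 are included to show where each group starts and ends.

```python
import pandas as pd

def age_group(age):
    """Return the age group for a single age value."""
    if age <= 18:
        return 'children'
    if age <= 64:
        return 'adult'
    return 'retired'

# Hypothetical sample data
data = pd.DataFrame({'age': [7, 18, 19, 40, 64, 65, 80]})

# Pass the function itself (no parentheses); apply() calls it for each value
data['age_group'] = data['age'].apply(age_group)

# Count how many entries fall into each group
group_counts = data['age_group'].value_counts()
```

Once the values are combined into groups, each group holds several entries, so counts and other statistics become meaningful.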

Row-Level Functions

When the value from a single column is insufficient for categorization, the whole row can be passed to the function as a Series. A function that receives a whole row can read the value of any specific column from it by name.

When processing rows instead of single values, the apply() method differs in two ways:

The apply() method is called on the whole data DataFrame, not on a single column such as data['age'].
By default, pandas passes columns to the function. To pass rows instead, call apply() with the parameter axis=1.


# defining a row-level function

def process_record(row):
    info1 = row['column1']
    info2 = row['column2']

    # Then the values are processed


# applying the defined function to rows

data.apply(process_record, axis=1)
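
To make the row-level pattern concrete, here is a minimal sketch with a complete function. The `employment_group` function, the `age` and `employed` columns, the group names, and the sample values are all hypothetical and serve only to show a rule that needs two columns at once.

```python
import pandas as pd

def employment_group(row):
    """Hypothetical rule that combines two columns of one row."""
    age = row['age']
    employed = row['employed']

    if age <= 18:
        return 'children'
    if age <= 64:
        return 'working adult' if employed else 'non-working adult'
    return 'retired'

# Hypothetical sample data
data = pd.DataFrame({
    'age': [15, 30, 45, 70],
    'employed': [False, True, False, False],
})

# Called on the whole DataFrame; axis=1 passes each row to the function as a Series
data['group'] = data.apply(employment_group, axis=1)
```

Without axis=1, pandas would pass each column to the function, and looking up row['age'] would fail because a column Series is indexed by row labels rather than by column names.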