Knowledge Base

Categorization

Classifying by Type

Categories in a dataset might be stored as strings of varying lengths.

What are the implications of storing them like this?

  • The table isn't easy to process visually.
  • File size and data processing time are greater than they need to be.
  • To filter data by type, we need to input the entire name (without any typos!).
  • Creating new categories and changing old ones can take a lot of time.

A better way to store categories is a separate dictionary that maps each category name to a number. That number is then used in place of the category name in the main table.
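The idea can be sketched with pandas as follows. The sample data, the column names 'type' and 'type_id', and the variable names are assumptions made for illustration; this is one possible way to build the dictionary, not the only one.

```python
import pandas as pd

# Hypothetical sample data: the 'type' column holds long category names.
data = pd.DataFrame({
    'item': ['a', 'b', 'c', 'd'],
    'type': ['Consumer electronics', 'Household goods',
             'Consumer electronics', 'Books'],
})

# Build the dictionary: each distinct category name gets an integer ID,
# numbered in order of first appearance.
type_ids = {name: i for i, name in enumerate(data['type'].unique())}

# Replace the long strings in the main table with their IDs.
data['type_id'] = data['type'].map(type_ids)
data = data.drop(columns='type')

# Keep the dictionary as its own small lookup table.
type_dict = pd.DataFrame({
    'type_id': list(type_ids.values()),
    'type': list(type_ids.keys()),
})
```

Renaming a category now means changing a single row in type_dict, and filtering by type only requires the short number.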

Classifying by Age Group

Often only one record has a given exact value, such as a specific age. Statistical conclusions can't be drawn from isolated records like these, so the data must be categorized, that is, combined into groups.

One way to categorize the data is to group it by age. For example: 18 or under, 19-64, and 65 or over.

Classification rules like these can be conveniently represented in Python as functions that take parameters and return a category value.

The age_group() function below, together with the apply() method, creates a new column of categories from an existing column of values. Note that age_group is written without parentheses inside apply(): we pass the function itself rather than calling it.

def age_group(age):
    """
    Return the age group for an age value, using the following rules:
    - 'children' for age <= 18
    - 'adult' for 19 <= age <= 64
    - 'retired' for all other cases
    """
    if age <= 18:
        return 'children'
    if age <= 64:
        return 'adult'
    return 'retired'

data['column_group'] = data['column'].apply(age_group)
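To see the function at work, here is a small sketch on made-up data. The DataFrame contents and the column names 'age' and 'age_group' are assumptions for illustration; the function body repeats the rules of age_group() so that the example runs on its own.

```python
import pandas as pd

def age_group(age):
    """Same rules as age_group() in the text."""
    if age <= 18:
        return 'children'
    if age <= 64:
        return 'adult'
    return 'retired'

# Hypothetical ages chosen to hit each boundary of the rules.
data = pd.DataFrame({'age': [5, 18, 19, 64, 65, 90]})

# Pass the function itself (no parentheses); pandas calls it once per value.
data['age_group'] = data['age'].apply(age_group)

# Count how many rows fall into each category.
group_counts = data['age_group'].value_counts()
```

With the values grouped, each category holds several rows, so counts and other statistics become meaningful.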

Row-Level Functions

When the value from a single column is insufficient for categorization, the function can receive the whole row as a Series. A function that is given a whole row can then read the value of any column in that row.

When processing rows instead of single values, the call to apply() differs in two ways:

  1. The apply() method is called on the whole data DataFrame, not on a single column such as data['age'].
  2. By default, pandas passes columns to the function. To pass rows into the function, we need to call apply() with the parameter axis=1.

# defining a row-level function

def process_record(row):
    info1 = row['column1']
    info2 = row['column2']

    # Then the values are processed

# applying the defined function to rows

data.apply(process_record, axis=1)
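A complete row-level example might look like the sketch below. The function status_group(), the columns 'age' and 'employed', and the category names are all hypothetical, chosen only to show a rule that needs two columns at once.

```python
import pandas as pd

def status_group(row):
    """Hypothetical rule that combines two columns of one row."""
    age = row['age']
    employed = row['employed']

    if age >= 65:
        return 'retired'
    if employed:
        return 'employed'
    return 'not employed'

# Made-up data for illustration.
data = pd.DataFrame({
    'age': [30, 45, 70],
    'employed': [True, False, False],
})

# Called on the DataFrame with axis=1, so each row arrives as a Series.
data['status'] = data.apply(status_group, axis=1)
```

Calling the same function on a single column would fail, because a single value has no 'age' or 'employed' keys to look up.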