Knowledge Base

Data Transformations

Grouping Data

Doing a groupby() operation changes the row index of the result to the keys we're grouping by. To group by multiple columns, we pass a list to the groupby() method.

The DataFrameGroupBy object is part of a data processing strategy called split-apply-combine:

  1. split the data into groups
  2. apply a statistical aggregation function to each group
  3. combine the results for each group

Example. We split the data into groups with df.groupby(['platform', 'genre']), then apply the mean() method to each group and combine the results into a Series with grp['critic_score'].mean().

import pandas as pd

df = pd.read_csv('/datasets/vg_sales.csv')
df.dropna(inplace=True)

grp = df.groupby(['platform', 'genre'])
mean_scores = grp['critic_score'].mean()
print(mean_scores)
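Since the grouping keys become the index, mean_scores above is a Series with a MultiIndex of (platform, genre) pairs. A small sketch of the same computation followed by reset_index(), which turns the index levels back into ordinary columns:

import pandas as pd

df = pd.read_csv('/datasets/vg_sales.csv')
df.dropna(inplace=True)

# The grouping keys become the index, so mean_scores is a Series
# with a MultiIndex of (platform, genre) pairs
mean_scores = df.groupby(['platform', 'genre'])['critic_score'].mean()

# reset_index() turns the index levels back into regular columns,
# giving a flat DataFrame with platform, genre, and critic_score columns
print(mean_scores.reset_index().head())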

Processing Grouped Data with agg()

The agg() method takes a dictionary as input, where the keys are column names and the corresponding values are the aggregate functions we want to apply to them. It is useful for getting different summary statistics for different columns. We can even apply our own custom functions with agg().

import pandas as pd

df = pd.read_csv('/datasets/vg_sales.csv')
df.dropna(inplace=True)

agg_dict = {'critic_score': 'mean', 'jp_sales': 'sum'}

grp = df.groupby(['platform', 'genre'])
print(grp.agg(agg_dict))
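As a sketch of the custom-function point, here we pass a user-defined aggregation (the spread between the highest and lowest critic score in each group, a function we have made up for illustration) alongside a built-in one:

import pandas as pd

df = pd.read_csv('/datasets/vg_sales.csv')
df.dropna(inplace=True)

# A custom aggregation: the spread between the best and worst score in a group
def score_range(scores):
    return scores.max() - scores.min()

# Mix a built-in aggregation with the custom function
agg_dict = {'critic_score': score_range, 'jp_sales': 'sum'}

grp = df.groupby(['platform', 'genre'])
print(grp.agg(agg_dict))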

Aggregating Data with pivot_table()

pandas also offers pivot tables as an alternative method to group and analyze data. Pivot tables are a great tool to synthesize datasets and explore their different dimensions. They're very popular in spreadsheet apps like Excel, but it's even more powerful to make them programmatically with pandas.

The pivot_table() method's common parameters are:

  • index= — the column whose values become the index in the pivot table
  • columns= — the column whose values become columns in the pivot table
  • values= — the column whose values we want to aggregate in the pivot table
  • aggfunc= — the aggregate function we want to apply to values in each row-column group

Example. Compute the total EU sales for every genre/platform combination.

import pandas as pd

df = pd.read_csv('/datasets/vg_sales.csv')
df.dropna(inplace=True)

pivot_data = df.pivot_table(index='genre',
                            columns='platform',
                            values='eu_sales',
                            aggfunc='sum')
print(pivot_data)

Using a pivot table here is convenient because it lets us easily exclude all of the columns from df that we aren't interested in for our analysis. It can also be easier for people to read than the equivalent groupby()-based code.

When working with data, it's important to choose the proper averaging method, since the choice can affect the findings. In some cases the arithmetic mean describes the data well, while in others it can give a misleading result, making it necessary to calculate the median instead. The pivot_table() method accepts various aggregate functions for the aggfunc parameter, e.g.:

  • median
  • count (number of values)
  • sum
  • min
  • max
  • first (the first value from the group)
  • last (the last value from the group)

When calling pivot_table(), we can pass several functions to the aggfunc parameter at once. For example, aggfunc=['median', 'count'] will calculate both the median and the number of values, and they will show up in neighboring columns in the resulting table.
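For instance, a small sketch of the EU sales pivot from above computed with two aggregate functions at once:

import pandas as pd

df = pd.read_csv('/datasets/vg_sales.csv')
df.dropna(inplace=True)

# Passing a list to aggfunc produces one set of columns per function
pivot_data = df.pivot_table(index='genre',
                            columns='platform',
                            values='eu_sales',
                            aggfunc=['median', 'count'])
print(pivot_data)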

Combining DataFrames 'vertically' with concat()

The concat() function is typically used to combine rows from separate DataFrames.

import pandas as pd

df = pd.read_csv('/datasets/vg_sales.csv')

rpgs = df[df['genre'] == 'Role-Playing']
platformers = df[df['genre'] == 'Platform']

df_all_games = pd.concat([rpgs, platformers])

Two DataFrames are stitched into one! Remember that this works here because both smaller DataFrames have the same columns.
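Continuing the example above, a quick sketch of checking that no rows were lost; passing ignore_index=True is an optional extra that rebuilds a fresh 0-based index for the combined DataFrame:

# The row counts of the two slices should add up to the combined total
print(len(rpgs), len(platformers), len(df_all_games))

# By default concat() keeps the original row labels; passing
# ignore_index=True builds a fresh 0..n-1 index instead
df_all_games = pd.concat([rpgs, platformers], ignore_index=True)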

Combining DataFrames 'horizontally' Using merge()

The merge() method combines two DataFrames on one or more key columns.

df_1.merge(df_2, on='user_id', how='left')

There are several types of merges:

  • 'inner': only the records present in both DataFrames are kept (the logical conjunction of the two tables).
  • 'left': all the rows from the left DataFrame are kept in the merged DataFrame. Rows from the right DataFrame are only kept where their values match the specified column in the left DataFrame.
  • 'right': works identically to a left merge, except that the merged DataFrame keeps all the rows from the right DataFrame instead.
  • 'outer': all the values in the specified column are kept from both original DataFrames, and the merged DataFrame has missing values wherever there isn't a match.

The merging mode is set with the how parameter.
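To make the modes concrete, here is a small self-contained sketch with toy DataFrames (the names visits and orders and their contents are invented purely for illustration):

import pandas as pd

# Toy data invented for illustration
visits = pd.DataFrame({'user_id': [1, 2, 3], 'visits': [5, 3, 7]})
orders = pd.DataFrame({'user_id': [2, 3, 4], 'orders': [1, 2, 1]})

# 'inner' keeps only the user_ids present in both DataFrames (2 and 3)
print(visits.merge(orders, on='user_id', how='inner'))

# 'left' keeps every row from visits; user 1 gets NaN in the orders column
print(visits.merge(orders, on='user_id', how='left'))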

pandas automatically adds suffixes to column names when columns have the same name in the merged DataFrames. The default suffixes are _x and _y. We can set more descriptive suffixes by passing a list of suffix strings to the suffixes= parameter in merge():

both_pupils = first_pupil_df.merge(second_pupil_df,
                                   on='author',
                                   suffixes=['_1st_student', '_2nd_student'])

If the columns used to merge on have different names in the two DataFrames, we can specify them separately with the left_on= and right_on= parameters:

both_pupils = first_pupil_df.merge(second_pupil_df,
                                   left_on='author',
                                   right_on='Author')

If an index column is given a name, the name can also be passed to the on parameter. Combining several columns at once is also possible; just pass a list of them to the on argument.
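As a small sketch of merging on several columns at once (again with toy DataFrames invented for illustration):

import pandas as pd

# Toy data invented for illustration: the keys are (user_id, month) pairs
visits = pd.DataFrame({'user_id': [1, 1, 2],
                       'month': ['Jan', 'Feb', 'Jan'],
                       'visits': [5, 2, 3]})
orders = pd.DataFrame({'user_id': [1, 2, 2],
                       'month': ['Jan', 'Jan', 'Feb'],
                       'orders': [1, 2, 1]})

# Pass a list to on= to match rows on both columns
print(visits.merge(orders, on=['user_id', 'month'], how='inner'))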

The join() method is similar to the merge() method; it can be thought of as a shortcut for merge() that aligns on the index by default.
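A minimal sketch of the typical join() pattern, using toy DataFrames invented for illustration:

import pandas as pd

# Toy data invented for illustration, indexed by user_id
visits = pd.DataFrame({'visits': [5, 3, 7]}, index=[1, 2, 3])
orders = pd.DataFrame({'orders': [1, 2, 1]}, index=[2, 3, 4])

# join() aligns on the index by default and performs a left join,
# so this keeps every row of visits (user 1 gets NaN for orders)
print(visits.join(orders))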
