Joining/Merging DataFrames
Combining Data from Two Tables
When working on tasks, it’s important to choose the proper averaging method, since that could impact the findings. In some cases, the arithmetic mean describes the data more precisely, while in others it can yield an incorrect result, making it necessary to calculate the median.
Remember that the pivot_table()
method groups data, while the aggfunc
argument defines how to process the data in each group. For example, there's:
median
count
(number of values)sum
min
max
first
(the first value from the group)last
(the last value from the group)
When calling pivot_table()
, we can pass the aggfunc
parameter several functions at once. For example, aggfunc=['median', 'count']
will calculate both the median and the number of values. They will show up in neighboring columns in the resulting table.
Combining Columns Using the merge()
and join()
Methods
When you have to add several new columns to an existing DataFrame, joining or merging is more efficient than adding them one by one.
Inner merging results in the logical conjunction of both tables (the records that are present in both DataFrames). inner
is the merge()
method's default merging method. Outer merging (outer
) gives the logical disjunction of both tables (the records that are present in either of the two DataFrames). The merging mode is set in the how
parameter.
1data1.merge(data2, on='column', how='inner')
The combination mode left
indicates that all the rows from the left DataFrame should always enter the merge result. If there are missing values in the right DataFrame, they will be replaced by NaN
. You can probably guess that right
gives you all the matching rows and all other entries from the right DataFrame (the right DataFrame being the one inside parentheses).
In the table we got above using the merge()
**method, _x
and _y
were added to column names. Column name endings are specified in the suffixes
argument.
If an index column is given a name, the name can also be passed to the on
parameter. Combining several columns at once is also possible; just pass a list of them to the on
argument.
The join()
method is similar to the merge()
method. Without the on
parameter, join()
will automatically seek matches based on the indices in the first and second DataFrames. If a column is passed to the on
parameter, join()
finds it in the first DataFrame and begins comparing it to the index of the second DataFrame. In join()
, unlike merge()
, the merge type how=‘left’
is set the default. But the suffixes
parameter is divided into two independent ones: lsuffix
("left suffix") and rsuffix
("right suffix"). It's also possible to combine more than two tables using the join()
method: they're passed as a list instead of the second DataFrame.
The merge()
Method
Data can be stored across multiple sheets in Excel tables. Before you can use all the data, you have to join the tables together.
We can combine multiple tables by using the merge()
method.
Arguments:
right
: the name of the DataFrame or Series we’re joining with the source tableon
: a field present in both tables that’s used to combine themhow
: type of merge:left
- table data must not have missing values, data2 can haveNaN
right
- table data2 must not have missing values, data can haveNaN
1data.merge(data2, on='merge_column', how='left')2data.merge(data2, on='merge_column', how='right')