Knowledge Base

Joining/Merging DataFrames

Combining Data from Two Tables

When working on tasks, it’s important to choose the proper averaging method, since that could impact the findings. In some cases, the arithmetic mean describes the data more precisely, while in others it can yield an incorrect result, making it necessary to calculate the median.

Remember that the pivot_table() method groups data, while the aggfunc argument defines how to process the data in each group. For example, there's:

  • median
  • count (number of values)
  • sum
  • min
  • max
  • first (the first value from the group)
  • last (the last value from the group)

When calling pivot_table(), we can pass the aggfunc parameter several functions at once. For example, aggfunc=['median', 'count'] will calculate both the median and the number of values. They will show up in neighboring columns in the resulting table.

Combining Columns Using the merge() and join() Methods

When you have to add several new columns to an existing DataFrame, joining or merging is more efficient than adding them one by one.

Inner merging results in the logical conjunction of both tables (the records that are present in both DataFrames). inner is the merge() method's default merging method. Outer merging (outer) gives the logical disjunction of both tables (the records that are present in either of the two DataFrames). The merging mode is set in the how parameter.

1data1.merge(data2, on='column', how='inner')

The combination mode left indicates that all the rows from the left DataFrame should always enter the merge result. If there are missing values in the right DataFrame, they will be replaced by NaN. You can probably guess that right gives you all the matching rows and all other entries from the right DataFrame (the right DataFrame being the one inside parentheses).

In the table we got above using the merge() **method, _x and _y were added to column names. Column name endings are specified in the suffixes argument.

If an index column is given a name, the name can also be passed to the on parameter. Combining several columns at once is also possible; just pass a list of them to the on argument.

The join() method is similar to the merge() method. Without the on parameter, join() will automatically seek matches based on the indices in the first and second DataFrames. If a column is passed to the on parameter, join() finds it in the first DataFrame and begins comparing it to the index of the second DataFrame. In join(), unlike merge(), the merge type how=‘left’ is set the default. But the suffixes parameter is divided into two independent ones: lsuffix ("left suffix") and rsuffix ("right suffix"). It's also possible to combine more than two tables using the join() method: they're passed as a list instead of the second DataFrame.

The merge() Method

Data can be stored across multiple sheets in Excel tables. Before you can use all the data, you have to join the tables together.

We can combine multiple tables by using the merge() method.

Arguments:

  • right: the name of the DataFrame or Series we’re joining with the source table
  • on: a field present in both tables that’s used to combine them
  • how: type of merge:
    • left - table data must not have missing values, data2 can have NaN
    • right - table data2 must not have missing values, data can have NaN
1data.merge(data2, on='merge_column', how='left')
2data.merge(data2, on='merge_column', how='right')
Send Feedback
close
  • Bug
  • Improvement
  • Feature
Send Feedback
,