Basics of Working with DataFrames
Catching Unexpected Errors (try-except)
When you’re loading data from multiple systems, be prepared for surprises:
- Incorrectly formatted data can make the code crash at runtime. You already have some experience with this: if the numbers in the dataset are stored as strings for whatever reason, you'll need to convert them with the to_numeric() function.
- Errors can occur toward the end of a file, when the code fails on rows with incorrect values. That means we lose the calculations already made for the previous, error-free rows.
- Data sometimes changes. For example, a company might start working with a new partner that sends faulty data for accounting, causing the code to crash.
Use the try-except construction for such cases.
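As a minimal sketch of the idea, the snippet below sums a column whose values arrive as strings, one of which is faulty. The DataFrame, the revenue column, and the values in it are made up for illustration. The try-except block lets the loop skip the bad row instead of crashing and losing the total computed so far.

```python
import pandas as pd

# Hypothetical data: numbers stored as strings, one of them unparseable.
df = pd.DataFrame({'revenue': ['100', '250', 'n/a', '75']})

total = 0.0
bad_rows = []

for index, value in df['revenue'].items():
    try:
        total += float(value)
    except ValueError:
        # float() raises ValueError for 'n/a'; record the row and move on.
        bad_rows.append(index)

print(total)     # 425.0
print(bad_rows)  # [2]

# Alternative without a loop: errors='coerce' turns unparseable values into NaN.
numeric = pd.to_numeric(df['revenue'], errors='coerce')
```

Catching only ValueError, rather than every exception, keeps unrelated bugs visible instead of hiding them.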
Special Values
NaN ("not a number") - a special float value used when a computation has no defined result (e.g. 0/0) or when a numeric value is missing
None - a special NoneType value used when a value is missing
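The difference between the two values can be checked directly. This is a small sketch with made-up values; it shows the types involved and how pandas treats both as missing.

```python
import numpy as np
import pandas as pd

nan_value = float('nan')

nan_type = type(nan_value)            # float
none_type = type(None)                # NoneType
nan_equals_itself = nan_value == nan_value  # False: NaN is not equal to anything

# In a float Series, pandas stores None as NaN; isna() flags both as missing.
s = pd.Series([1.0, None, np.nan])
missing = s.isna().tolist()           # [False, True, True]
```

Because NaN never equals itself, use isna() rather than == to look for missing values.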
Copying Columns Between DataFrames
To copy a column from df1 to df2, create a new column in df2 and assign it the values of the df1 column:
df2['new_column'] = df1['some_column']
If df2 already had a column named new_column, all of its values would be replaced with the new ones.
It seems relatively simple: pandas copies a column from df1 and puts it in df2.
However, if you take a closer look, things aren't so simple. For each row of the target DataFrame (df2), pandas looks for a "mate," a row with the same index in the source DataFrame (df1), and takes the value from that row. In our case, the indices in df1 and df2 are the same, so this is a trivial case: all the values get copied in the same order in which they're positioned. If the indices are different, however, df2 gets NaN values in the rows whose index is absent from df1.
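A small sketch of this matching, using made-up DataFrames whose indices are deliberately shuffled and only partly overlapping:

```python
import pandas as pd

# Hypothetical data: df1's index is in a different order than df2's,
# and label 5 exists only in df2.
df1 = pd.DataFrame({'some_column': [10, 20, 30]}, index=[2, 0, 1])
df2 = pd.DataFrame({'name': ['a', 'b', 'c']}, index=[0, 1, 5])

df2['new_column'] = df1['some_column']

# Values are matched by index label, not by position:
# row 0 gets 20, row 1 gets 30, row 5 has no mate in df1 and gets NaN.
print(df2)
```

Note that the column becomes float once it contains NaN, even though the source values were integers.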
Note that our DataFrames don't have to have the same number of rows. If df1 doesn't have as many rows as df2, the rows of df2 with no match in df1 get NaN values. If df1 has more rows, the extra rows simply won't become part of df2.
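Both situations can be sketched with two hypothetical source DataFrames of different lengths, each using the default 0, 1, 2, ... index:

```python
import pandas as pd

short_source = pd.DataFrame({'value': [1, 2]})            # index 0, 1
long_source = pd.DataFrame({'value': [10, 20, 30, 40]})   # index 0, 1, 2, 3

# Source has fewer rows than the target: unmatched target rows get NaN.
target_long = pd.DataFrame({'id': ['a', 'b', 'c', 'd']})
target_long['copied'] = short_source['value']

# Source has more rows than the target: the extra source rows are ignored.
target_short = pd.DataFrame({'id': ['a', 'b']})
target_short['copied'] = long_source['value']

print(target_long)   # 'copied' is 1.0, 2.0, NaN, NaN
print(target_short)  # 'copied' is 10, 20
```

In both cases the target keeps its own number of rows; only the copied column's contents depend on the source.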
Columns can also exist separately, outside of DataFrames. A single column can be saved in a Series object – an array of values with indices. Since Series have indices, the assignment of a Series to, say, a column of a DataFrame, will work in the same way that we saw earlier – the values will be copied on the basis of matching indices.
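As an illustration, the sketch below builds a standalone Series with its own index and assigns it to a DataFrame column. The names and values are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({'item': ['x', 'y', 'z']})   # index 0, 1, 2

# A standalone Series: values plus index labels, no DataFrame around it.
scores = pd.Series([10, 30], index=[2, 0])

# Assignment aligns on the index: row 0 gets 30, row 2 gets 10, row 1 gets NaN.
df['score'] = scores

print(df)
```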
Renaming Columns with Hierarchical Names
We've dealt with a table with two-level column names. This is a MultiIndex, a kind of hierarchical indexing structure that we see when each label in an index is made up of several values rather than a single value.
But what if we don't want those complex column names? Then we need to rename our columns using the columns attribute:
df.columns = ['column_name_1', 'column_name_2', 'column_name_3']
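One common way to end up with two-level column names is a grouped aggregation with several functions per column. The sketch below assumes that scenario, with made-up data and made-up flat names, and then replaces the MultiIndex with a plain list.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'group': ['a', 'a', 'b'], 'value': [1, 2, 3]})

# Aggregating one column with two functions yields two-level column names:
# ('value', 'sum') and ('value', 'mean').
agg = df.groupby('group').agg({'value': ['sum', 'mean']})
levels_before = agg.columns.nlevels   # 2

# Replace the hierarchical names with flat ones, one name per column.
agg.columns = ['value_sum', 'value_mean']

print(agg)
```

The list assigned to columns must contain exactly one name per existing column, in the same order; otherwise pandas raises an error.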