Working with Missing Values
NaN
and None
- Explicit Indicators of Missing Values
NaN
and None
indicate that there is no value in a cell. NaN
means "not a number" and is a float type, so you can perform mathematical operations on it. None
is NoneType, meaning you can’t perform mathematical operations with it. NaN
values can lead to incorrect results when grouping data. Don’t simply delete rows with these values: missing values can often be restored.
In pandas, the value_counts()
method returns unique values and their counts.
The isnull()
method returns a Boolean list, in which True
indicates that a value is missing in the column.
To substitute a value for a missing one, use the fillna()
method with the value
argument.
When to fill missing values and when not to
Sometimes you'll see hard rules, like "If more than 20% of a variable is missing, delete the variable entirely." But we hesitate to give any hard rules because so much depends on the context of the problem. As a data analyst, part of your job is to take nuances into account when making decisions. With that said, here are some guidelines:
Never fill the main variable of interest, and never use the main variable of interest to fill missing values.
Often your final goal is to understand the relationships between a main variable and other variables. This is the objective, not a data preprocessing step.
Always document when, where, why, and how missing values were filled.
Always have a justifiable reason for any filling you do. If asked why, you must have an answer.
Filled values should not have a significant impact on your analysis.
If in doubt, run your analysis twice, once with the filled values and once with missing values removed. If the results are significantly different, the filled values are driving the change. You do not want this. This is an example of sensitivity analysis.
Before we talk about how to replace missing categorical values, get into a power pose and repeat after me: "I will not allow filled values to drastically change the results of my analysis."
Dealing with missing values
So, you've determined that a dataset has missing values. What do you do?
Report the issue and find out if there's a way to get the full data. If there isn't, proceed to step 2.
Determine how many missing values there are — call the
value_counts()
method andprint()
.1print(file_name['column_name'].value_counts())Determine how significant their absence is for the dataset. What proportion of the data do they represent? In most cases, if it's not much (say, 5-10%, depending on the situation), you can delete them.
Check how significant their absence is for their category or column — call the
isnull()
andcount()
methods and print.1print(file_name[file_name['row_name'].isnull()].count())Determine whether the missing values belong to categorical or quantitative variables.
If categorical:
Determine whether the missing values exhibit a pattern — whether their appearance in the dataset is random or not. If you can't detect a correlation with other values in the rows where they appear (say, for respondents under 21, a question about alcohol is unanswered), then they're probably random. There are three types of missing values:
- Missing completely at random (MCAR)
- Missing at random (MAR)
- Missing not at random (MNAR)
Depending on the pattern, decide how to handle them:
- If the values are MCAR or MAR, there's no pattern, so you can replace them with default values — an empty string, a particular word. Use the
loc[]
method and boolean indexing. Thefillna()
method may also work, but not in every case. - With MNAR values, there's a pattern. This is the most complex case, and we won't dive into its intricacies in this chapter.
- If the values are MCAR or MAR, there's no pattern, so you can replace them with default values — an empty string, a particular word. Use the
If quantitative:
- Determine whether your data has significant outliers.
- If there aren't any significant outliers, calculate the mean of your data — apply the
mean()
method to the column or the entire dataset. - If your data has significant outliers, calculate the median of your data — apply the
median()
method to the column or the entire dataset. - Replace missing values with the mean or median using the
fillna()
method.
How to delete records with missing data?
To delete rows with missing values, call the dropna()
method. To renumber the index, call the reset_index()
method with the argument drop=True
.