Working with Missing Values
Importance of Missing Data
Don’t simply delete rows with missing values unless there is a good justification. The other values in those rows may still carry useful information, and the missing values can often be restored from them.
Explicit Indicators of Missing Values: NaN and None
Both `NaN` and `None` indicate that there is no value in a cell.
`NaN` is the IEEE 754 floating-point representation of Not a Number, so it typically appears alongside float data, and it is compatible with mathematical operations. It usually occurs as the `np.nan` constant.

`None` is a built-in constant in Python; we cannot perform mathematical operations with it.
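A quick way to see the difference, assuming NumPy and pandas are available:

```python
import numpy as np
import pandas as pd

# NaN is a float: arithmetic works, but the result is NaN
print(np.nan + 1)    # nan
print(type(np.nan))  # <class 'float'>

# None is a plain Python object: arithmetic raises TypeError
try:
    None + 1
except TypeError as err:
    print(err)

# In a numeric Series, pandas converts None to NaN as well
s = pd.Series([1.0, np.nan, None])
print(s.isnull().sum())  # 2
```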
Implicit Indicators of Missing Values
There are many possible cases; some examples:
- `'-'` for the `gender` column
- `-1` for the `age` column
- `999 999 9999` for the `phone_number` column

and so on.
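A common approach is to count such sentinel values and then convert them into explicit `NaN` so that pandas tools can see them. A sketch, with hypothetical data matching the examples above:

```python
import numpy as np
import pandas as pd

# Hypothetical data containing implicit missing values
df = pd.DataFrame({
    'gender': ['F', 'M', '-', 'F'],
    'age': [34, -1, 27, -1],
})

# Count the sentinel values
print((df['gender'] == '-').sum())  # 1
print((df['age'] == -1).sum())      # 2

# Convert them to NaN so they are recognized as missing
df['gender'] = df['gender'].replace('-', np.nan)
df['age'] = df['age'].replace(-1, np.nan)
print(df.isnull().sum())
```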
Checking for Missing Values with info()
The `info()` method prints the number of non-missing values in each column. Null values are missing values, while non-null values are non-missing values. Be aware that `info()` only recognizes "true" missing values (`None` or `NaN`) as null; missing values represented by anything else will not be counted as missing.
It’s never a bad idea to call info() on a new dataset.
```python
df.info()
```
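For example, on a small hypothetical DataFrame with one missing value per column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'price': [10.0, np.nan, 7.5],
    'source': ['email', None, 'ad'],
})

# Reports 2 non-null values in each column out of 3 rows
df.info()
```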
Checking for Missing Values with isnull()
The `isnull()` method returns a Boolean Series in which `True` indicates that a value is missing. Summing the result gives the number of missing values.
```python
df['price'].isnull()
df['price'].isnull().sum()
```
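A runnable version, assuming a hypothetical `price` column with two missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [10.0, np.nan, 7.5, np.nan]})

print(df['price'].isnull())        # True for rows 1 and 3
print(df['price'].isnull().sum())  # 2
```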
Checking for Missing Values with value_counts()
The `value_counts()` method, called on a Series, returns the number of times each unique value occurs in a column. Calling it with `dropna=False` includes `None` and `NaN` in the counts.
```python
df['price'].value_counts(dropna=False)
```
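To see the effect of `dropna=False`, compare both calls on a hypothetical Series:

```python
import numpy as np
import pandas as pd

s = pd.Series(['email', 'ad', 'email', np.nan, None])

# By default, missing values are excluded from the counts
print(s.value_counts())

# With dropna=False, missing values get their own row in the result
print(s.value_counts(dropna=False))
```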
When To Fill In Missing Values And When Not To
We should not allow filled values to drastically change the results of our analysis.
Sometimes you'll see hard rules, like "If more than 20% of a variable is missing, delete the variable entirely." But we hesitate to give any hard rules because so much depends on the context of the problem. As a data analyst, part of your job is to take nuances into account when making decisions. With that said, here are some guidelines:
Never fill the main variable of interest, and never use the main variable of interest to fill missing values. Often your final goal is to understand the relationships between a main variable and other variables. This is the objective, not a data preprocessing step.
Always document when, where, why, and how missing values were filled. Always have a justifiable reason for any filling you do. If asked why, you must have an answer.
Filled values should not have a significant impact on your analysis. If in doubt, run your analysis twice, once with the filled values and once with missing values removed. If the results are significantly different, the filled values are driving the change. You do not want this. This is an example of sensitivity analysis.
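A minimal sketch of such a sensitivity check, with hypothetical data: compute the statistic of interest both ways and compare.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, 30, np.nan, 40, np.nan, 35]})

# Version 1: missing values removed
mean_dropped = df['age'].dropna().mean()

# Version 2: missing values filled with the median
mean_filled = df['age'].fillna(df['age'].median()).mean()

# If these differ substantially, the filled values are driving the result
print(mean_dropped, mean_filled)
```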
Strategies for Filling In Missing Data
- Constant.
- Typical value (the mode, the mean, the median, etc.).
- Conditionally, based on the values of other variables.
Filling In NaN/None with a Constant
To substitute a value for a missing one represented with `NaN`/`None`, use the `fillna()` method with the `value` argument.
```python
df['source'] = df['source'].fillna('email')
```
Substituting Missing Data with a Constant
To substitute a value for a missing one represented by a regular value (an implicit indicator), use the `replace()` method.
```python
df['source'] = df['source'].replace('', 'email')
```
Filling In NaN/None with Statistics
```python
user_age_avg = users['age'].mean()
users['age'] = users['age'].fillna(user_age_avg)
```
Filling In NaN/None Conditionally
```python
desktop_users = users['device_type'] == 'desktop'
desktop_avg = users.loc[desktop_users, 'time'].mean()
users.loc[desktop_users, 'time'] = users.loc[desktop_users, 'time'].fillna(desktop_avg)
```
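A self-contained version with hypothetical data, filling each device type's missing session times with that group's mean:

```python
import numpy as np
import pandas as pd

users = pd.DataFrame({
    'device_type': ['desktop', 'mobile', 'desktop', 'mobile'],
    'time': [120.0, 35.0, np.nan, np.nan],
})

# Fill missing times group by group with the group mean
for device in users['device_type'].unique():
    mask = users['device_type'] == device
    group_avg = users.loc[mask, 'time'].mean()
    users.loc[mask, 'time'] = users.loc[mask, 'time'].fillna(group_avg)

print(users)
```

The same effect can be achieved in one line with `users['time'].fillna(users.groupby('device_type')['time'].transform('mean'))`.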
Deleting Records With Missing Data?
To delete rows with missing values, call the `dropna()` method.
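A short sketch of the common options, on hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'price': [10.0, np.nan, 7.5],
    'source': ['email', 'ad', None],
})

# Drop rows where any value is missing (keeps only row 0)
print(df.dropna())

# Drop rows only when a specific column is missing (keeps rows 0 and 2)
print(df.dropna(subset=['price']))

# Drop rows where all values are missing (keeps all three rows here)
print(df.dropna(how='all'))
```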