Working with Missing Values
Importance of Missing Data
Don’t simply delete rows with missing values unless there is a good justification. The other values in those rows may still carry useful information, and the missing values can often be restored from them.
Explicit Indicators of Missing Values: NaN and None
Both `NaN` and `None` indicate that there is no value in a cell.
`NaN` is the IEEE 754 floating-point representation of Not a Number, so it typically appears alongside float data, and it is compatible with mathematical operations. It usually occurs as the `np.nan` constant.

`None` is a built-in constant in Python; we cannot perform mathematical operations with it.
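A quick way to see the difference, assuming NumPy and pandas are available:

```python
import numpy as np
import pandas as pd

# NaN is a float: arithmetic works, but the result is NaN
print(np.nan + 1)    # nan
print(type(np.nan))  # <class 'float'>

# None is a plain Python object: arithmetic raises TypeError
try:
    None + 1
except TypeError as err:
    print(err)

# In a numeric Series, pandas converts None to NaN as well
s = pd.Series([1.0, np.nan, None])
print(s.isnull().sum())  # 2
```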
Implicit Indicators of Missing Values
There are many possible cases; some examples:
- `'-'` for the `gender` column
- `-1` for the `age` column
- `999 999 9999` for the `phone_number` column

and so on.
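A common approach is to count such sentinel values and then convert them into explicit `NaN` so that pandas tools can see them. A sketch, with hypothetical data matching the examples above:

```python
import numpy as np
import pandas as pd

# Hypothetical data containing implicit missing values
df = pd.DataFrame({
    'gender': ['F', 'M', '-', 'F'],
    'age': [34, -1, 27, -1],
})

# Count the sentinel values
print((df['gender'] == '-').sum())  # 1
print((df['age'] == -1).sum())      # 2

# Convert them to NaN so they are recognized as missing
df['gender'] = df['gender'].replace('-', np.nan)
df['age'] = df['age'].replace(-1, np.nan)
print(df.isnull().sum())
```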
Checking for Missing Values with info()
The `info()` method prints the number of non-missing values in each column. Null values are missing values, while non-null values are non-missing values. Be aware that `info()` only recognizes "true" missing values (`None` or `NaN`) as null; missing values represented by anything else will not be counted as missing.
It’s never a bad idea to call info() on a new dataset.
```python
df.info()
```
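For example, on a small hypothetical DataFrame with one missing value per column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'price': [10.0, np.nan, 7.5],
    'source': ['email', None, 'ad'],
})

# Reports 2 non-null values in each column out of 3 rows
df.info()
```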
Checking for Missing Values with isnull()
The `isnull()` method returns a Boolean Series in which `True` indicates that a value is missing. Summing the result gives the number of missing values.
```python
df['price'].isnull()
df['price'].isnull().sum()
```
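A runnable version, assuming a hypothetical `price` column with two missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [10.0, np.nan, 7.5, np.nan]})

print(df['price'].isnull())        # True for rows 1 and 3
print(df['price'].isnull().sum())  # 2
```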
Checking for Missing Values with value_counts()
The `value_counts()` method, called on a Series, returns the number of times each unique value occurs in a column. Calling it with `dropna=False` includes `None` and `NaN` in the counts.
```python
df['price'].value_counts(dropna=False)
```
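To see the effect of `dropna=False`, compare both calls on a hypothetical Series:

```python
import numpy as np
import pandas as pd

s = pd.Series(['email', 'ad', 'email', np.nan, None])

# By default, missing values are excluded from the counts
print(s.value_counts())

# With dropna=False, missing values get their own row in the result
print(s.value_counts(dropna=False))
```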
When To Fill In Missing Values And When Not To
We should not allow filled values to drastically change the results of our analysis.
Sometimes you'll see hard rules, like "If more than 20% of a variable is missing, delete the variable entirely." But we hesitate to give any hard rules because so much depends on the context of the problem. As a data analyst, part of your job is to take nuances into account when making decisions. With that said, here are some guidelines:
Never fill the main variable of interest, and never use the main variable of interest to fill missing values. Often your final goal is to understand the relationships between a main variable and other variables. This is the objective, not a data preprocessing step.
Always document when, where, why, and how missing values were filled. Always have a justifiable reason for any filling you do. If asked why, you must have an answer.
Filled values should not have a significant impact on your analysis. If in doubt, run your analysis twice, once with the filled values and once with missing values removed. If the results are significantly different, the filled values are driving the change. You do not want this. This is an example of sensitivity analysis.
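A minimal sketch of such a sensitivity check, with hypothetical data: compute the statistic of interest both ways and compare.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, 30, np.nan, 40, np.nan, 35]})

# Version 1: missing values removed
mean_dropped = df['age'].dropna().mean()

# Version 2: missing values filled with the median
mean_filled = df['age'].fillna(df['age'].median()).mean()

# If these differ substantially, the filled values are driving the result
print(mean_dropped, mean_filled)
```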
Strategies for Filling In Missing Data
- Constant.
- Typical value (the mode, the mean, the median, etc.).
- Conditionally, based on the values of other variables.
Filling In NaN/None with a Constant
To substitute a value for a missing one represented with `NaN`/`None`, use the `fillna()` method with the `value` argument.
```python
df['source'] = df['source'].fillna('email')
```
Substituting Missing Data with a Constant
To substitute a value for a missing one represented by a regular value (an implicit indicator), use the `replace()` method.
```python
df['source'] = df['source'].replace('', 'email')
```
Filling In NaN/None with Statistics
```python
user_age_avg = users['age'].mean()
users['age'] = users['age'].fillna(user_age_avg)
```
Filling In NaN/None Conditionally
```python
desktop_users = users['device_type'] == 'desktop'
desktop_avg = users.loc[desktop_users, 'time'].mean()
users.loc[desktop_users, 'time'] = users.loc[desktop_users, 'time'].fillna(desktop_avg)
```
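A self-contained version with hypothetical data, filling each device type's missing session times with that group's mean:

```python
import numpy as np
import pandas as pd

users = pd.DataFrame({
    'device_type': ['desktop', 'mobile', 'desktop', 'mobile'],
    'time': [120.0, 35.0, np.nan, np.nan],
})

# Fill missing times group by group with the group mean
for device in users['device_type'].unique():
    mask = users['device_type'] == device
    group_avg = users.loc[mask, 'time'].mean()
    users.loc[mask, 'time'] = users.loc[mask, 'time'].fillna(group_avg)

print(users)
```

The same effect can be achieved in one line with `users['time'].fillna(users.groupby('device_type')['time'].transform('mean'))`.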
Deleting Records With Missing Data?
To delete rows with missing values, call the `dropna()` method.
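A short sketch of the common options, on hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'price': [10.0, np.nan, 7.5],
    'source': ['email', 'ad', None],
})

# Drop rows where any value is missing (keeps only row 0)
print(df.dropna())

# Drop rows only when a specific column is missing (keeps rows 0 and 2)
print(df.dropna(subset=['price']))

# Drop rows where all values are missing (keeps all three rows here)
print(df.dropna(how='all'))
```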