Working with Duplicate Values
Duplicate entries can lead to flawed conclusions.
Finding Duplicate Data, Method 1: duplicated()
We can use the duplicated() method together with sum() to get the number of duplicate values in a single column or duplicate rows in a DataFrame. Remember that if you call duplicated() without sum(), you get a Boolean Series of the same length as the DataFrame, with True where a row is a duplicate and False where it isn't.
df.duplicated().sum()
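As a quick sketch of how this works (using a small made-up DataFrame, where the last row repeats the first):

```python
import pandas as pd

# Hypothetical toy data: the third row is an exact copy of the first
df = pd.DataFrame({'name': ['Anna', 'Ben', 'Anna'],
                   'city': ['Rome', 'Oslo', 'Rome']})

print(df.duplicated())        # Boolean Series: True marks a repeated row
print(df.duplicated().sum())  # number of fully duplicated rows → 1
```

Only the second occurrence is flagged as a duplicate; the first occurrence of each row counts as the "original".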
Finding Duplicate Data, Method 2: value_counts()
The value_counts() method identifies all the unique values in a column and counts how many times each one appears. Applied to a Series, it returns value-frequency pairs in descending order, so the most frequently duplicated entries appear at the top of the list.
df['name'].value_counts()
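For example, with a made-up 'name' column where one value repeats:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Anna', 'Ben', 'Anna', 'Cleo', 'Anna']})

# Frequencies in descending order; the most repeated value comes first
counts = df['name'].value_counts()
print(counts)
```

'Anna' appears three times, so it tops the list, immediately pointing us at the most duplicated entry.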
Finding Duplicate Data, Method 3: unique()
The unique() method returns the unique values of a column, e.g. df['column'].unique(). If there are fewer unique values than expected, that's a sign the column contains duplicates.
len(df['name'].unique()) == len(df)
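Here's a minimal sketch of the check (again assuming a small toy DataFrame): if the number of unique values is smaller than the number of rows, the column has duplicates.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Anna', 'Ben', 'Anna']})

unique_names = df['name'].unique()
print(unique_names)                     # array of distinct values
print(len(unique_names) == len(df))     # False → the column has duplicates
```

Note that we compare len(df['name'].unique()) with len(df); comparing the array itself to len(df) wouldn't answer the question.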
Special Case with Strings
Duplicates in string data demand special attention because of case sensitivity. From Python's point of view, a capital 'A' and a lowercase 'a' are different characters. To catch duplicate entries like those, we can convert all string values to a single case, e.g. to lowercase with the str.lower() method.
df['name_lower'] = df['name'].str.lower()
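To see why this matters, here is a sketch with a made-up column where two entries differ only by case:

```python
import pandas as pd

# 'ANNA' and 'anna' are the same name, but not the same string
df = pd.DataFrame({'name': ['ANNA', 'anna', 'Ben']})

print(df['name'].duplicated().sum())         # 0: no exact duplicates

df['name_lower'] = df['name'].str.lower()
print(df['name_lower'].duplicated().sum())   # 1: the case-insensitive duplicate
```

Without the lowercase column, the duplicate would slip through unnoticed.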
Handling Duplicates
The drop_duplicates() method removes fully duplicated rows. If you only want to consider duplicates in one (or some) of the columns rather than entire rows, use the subset= parameter. Pass it the column name (or list of column names) where you want to look for duplicates:
df.drop_duplicates(subset='col_1')
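A small sketch (with hypothetical columns col_1 and col_2) showing the effect of subset=: rows that repeat a col_1 value are dropped even though their col_2 values differ.

```python
import pandas as pd

df = pd.DataFrame({'col_1': ['a', 'a', 'b'],
                   'col_2': [1, 2, 3]})

# Keep only the first row for each value of col_1
deduped = df.drop_duplicates(subset='col_1')
print(deduped)
```

By default the first occurrence is kept; pass keep='last' to keep the last one instead, or keep=False to drop all occurrences of a duplicated value.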