Knowledge Base

Working with Duplicates

Looking for Duplicates by Hand

You'll often come across duplicate entries when analyzing data. If you don't catch them in time, you might end up drawing flawed conclusions.

There are several ways to look for duplicate data.

Method 1: duplicated()

We can use the duplicated() method together with sum() to get the number of duplicate rows. Remember that if you call duplicated() without summing the result, you'll get a boolean Series instead: it shows True for every row that repeats an earlier one and False for every row that doesn't.
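Here's a minimal sketch; the DataFrame data and its 'product' and 'price' columns are made up for illustration:

import pandas as pd

data = pd.DataFrame({
    'product': ['apple', 'pear', 'apple', 'plum'],
    'price': [1.0, 2.0, 1.0, 3.0],
})

# Boolean Series: True for each row that repeats an earlier one
print(data.duplicated())

# Total number of duplicate rows (here: 1)
print(data.duplicated().sum())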

Method 2: value_counts()

Call the value_counts() method. Applied to a Series object (a column), it finds all the unique values and counts how often each one appears. The result is a list of value-frequency pairs sorted in descending order, so the most frequently repeated entries appear at the top of the list.
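For example, on a hypothetical Series of product names:

import pandas as pd

products = pd.Series(['apple', 'pear', 'apple', 'plum', 'apple'])

# Unique values and their frequencies, most frequent first:
# 'apple' appears 3 times, 'pear' and 'plum' once each
print(products.value_counts())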

Method 3: unique()

The unique() method is used to get the unique values of a column, e.g. data['column'].unique(). If there are fewer of them than expected, that's a sign the column contains duplicates.
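A quick sketch with a made-up 'color' column:

import pandas as pd

data = pd.DataFrame({'color': ['red', 'blue', 'red', 'green']})

# The distinct values in the column
print(data['color'].unique())        # ['red' 'blue' 'green']

# If this number is smaller than expected, the column has duplicates
print(len(data['color'].unique()))   # 3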

Strings: Looking for Duplicates by Hand with Case Sensitivity

Duplicates in string data demand special attention. From Python's point of view, a capital 'A' and a lowercase 'a' are different characters.

To catch duplicate entries like those, we can change all the characters in the string to lowercase by calling the lower() method.
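For example, in plain Python:

print('Apple' == 'apple')          # False: the comparison is case sensitive
print('Apple'.lower() == 'apple')  # True once both sides are lowercase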

Over in pandas, we change characters to lowercase with a method that follows a similar syntax: str.lower().
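A small sketch with a made-up 'product' column:

import pandas as pd

data = pd.DataFrame({'product': ['Apple', 'apple', 'APPLE']})

# Lowercase the column, then count duplicates again
data['product'] = data['product'].str.lower()
print(data.duplicated().sum())  # 2: all three rows now match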
