Pandas Drop Duplicates

Removing duplicates is an essential skill for getting accurate counts, because we often don’t want to count the same thing more than once. In Python, we can do this with the pandas library, whose DataFrame has a method called drop_duplicates.

Let’s understand how to use it with the help of a few examples.

Dropping Duplicate Names

Let’s say we have a DataFrame of vet visits, and the vet’s office wants to know how many dogs of each breed have visited. However, some dogs, like Max and Stella, have visited the vet more than once, so they appear in several rows of the dataset. Hence, we cannot just count the occurrences of each breed in the breed column.

To fix this, we remove every row whose dog name has already appeared earlier in the dataset. In other words, we keep only one row per name.

We do this with the drop_duplicates method. It takes an argument, subset, which is the column we want to find duplicates based on. In this case, we want all the unique names.
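As a minimal sketch of this step, the snippet below builds a small, made-up vet_visits DataFrame (the column names and values are assumptions for illustration, not the original dataset) and drops rows whose name has already appeared.

```python
import pandas as pd

# Hypothetical vet-visit records; names, breeds and dates are invented for illustration.
vet_visits = pd.DataFrame(
    {
        "date": ["2018-09-02", "2019-06-07", "2018-01-17",
                 "2019-10-19", "2019-03-30", "2019-01-04"],
        "name": ["Max", "Stella", "Stella", "Max", "Max", "Cooper"],
        "breed": ["Labrador", "Chihuahua", "Chihuahua",
                  "Chow Chow", "Labrador", "Labrador"],
    }
)

# Keep only the first row seen for each name (keep="first" is the default).
unique_names = vet_visits.drop_duplicates(subset="name")
print(unique_names)
```

Note that only one Max survives here, even though this sample data contains two different dogs called Max.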

But what if we have different dogs with the same name?

Dropping Duplicate Pairs

In that case, we need to consider more than just the name when dropping duplicates. Since the two Maxes are different breeds, we can drop only the rows whose name and breed pair has already appeared earlier in the dataset.

To base our duplicate dropping on multiple columns, we can pass a list of column names to the subset argument: in this case, name and breed.
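A short sketch of the multi-column version follows. It reuses the same made-up vet_visits data as an assumption; only the subset argument changes.

```python
import pandas as pd

# Hypothetical vet-visit records; names, breeds and dates are invented for illustration.
vet_visits = pd.DataFrame(
    {
        "date": ["2018-09-02", "2019-06-07", "2018-01-17",
                 "2019-10-19", "2019-03-30", "2019-01-04"],
        "name": ["Max", "Stella", "Stella", "Max", "Max", "Cooper"],
        "breed": ["Labrador", "Chihuahua", "Chihuahua",
                  "Chow Chow", "Labrador", "Labrador"],
    }
)

# A row is a duplicate only if BOTH name and breed match an earlier row.
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])
print(unique_dogs)
```

With this sample data, the Labrador named Max and the Chow Chow named Max are both kept, while repeat visits by the same dog are dropped.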

Now both Maxes have been included.
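To come back to the vet’s original question, one possible way to count dogs per breed after deduplicating is value_counts on the breed column. The data below is the same invented sample, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical vet-visit records; names, breeds and dates are invented for illustration.
vet_visits = pd.DataFrame(
    {
        "date": ["2018-09-02", "2019-06-07", "2018-01-17",
                 "2019-10-19", "2019-03-30", "2019-01-04"],
        "name": ["Max", "Stella", "Stella", "Max", "Max", "Cooper"],
        "breed": ["Labrador", "Chihuahua", "Chihuahua",
                  "Chow Chow", "Labrador", "Labrador"],
    }
)

# Counting visits directly overcounts dogs that came in more than once.
visit_counts = vet_visits["breed"].value_counts()

# Drop repeat visits first, then count one row per (name, breed) pair.
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])
breed_counts = unique_dogs["breed"].value_counts()
print(breed_counts)
```

In this sample, counting raw visits would report three Labradors, whereas counting after drop_duplicates reports two. This sketch assumes that a name and breed pair identifies a single dog, which may not hold in a real dataset.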
