Removing duplicates is an essential skill for getting accurate counts, because we often don't want to count the same thing multiple times. In Python, this can be accomplished with the pandas library, which provides a method called drop_duplicates.
Let’s understand how to use it with the help of a few examples.
Dropping Duplicate Names
Let's say we have a DataFrame that contains vet visits, and the vet's office wants to know how many dogs of each breed have visited. However, some dogs, like Max and Stella, have visited the vet more than once in the dataset, so we cannot simply count the occurrences of each breed in the breed column.
To fix this, we remove rows that contain a dog name already listed earlier in the dataset; in other words, we keep each dog name only once.
We do this using the drop_duplicates method. It takes an argument, subset, which is the column we want to find duplicates based on; in this case, we want all the unique names.
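As a sketch of this step, with hypothetical visit data (the names and breeds below are made up for illustration):

```python
import pandas as pd

# Hypothetical vet-visit data; Max and Stella each appear more than once
vet_visits = pd.DataFrame({
    "name":  ["Max", "Stella", "Max", "Bella", "Stella"],
    "breed": ["Labrador", "Poodle", "Labrador", "Beagle", "Poodle"],
})

# Keep only the first row for each dog name
unique_dogs = vet_visits.drop_duplicates(subset="name")
print(unique_dogs)
```

By default, drop_duplicates keeps the first occurrence of each name and discards the later ones.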
But what if we have two different dogs with the same name?
Dropping Duplicate Pairs
In that case, we need to consider more than just name when dropping duplicates. Since the two dogs named Max are different breeds, we can instead drop rows whose pair of name and breed is listed earlier in the dataset.
To base our duplicate dropping on multiple columns, we can pass a list of column names to the subset argument; in this case, name and breed.
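A sketch of this variant, again with made-up data that includes two different dogs both named Max:

```python
import pandas as pd

# Hypothetical data: two distinct dogs named Max, of different breeds
vet_visits = pd.DataFrame({
    "name":  ["Max", "Max", "Stella", "Max"],
    "breed": ["Labrador", "Chow Chow", "Poodle", "Labrador"],
})

# A row is dropped only when its (name, breed) pair has appeared before
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])
print(unique_dogs)
```

Only the repeated Max-the-Labrador row is dropped; Max the Chow Chow survives because his name-breed pair is unique.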
Now both Maxes have been included.