Image for post
Image for post

We’ll learn techniques on how to clean messy data in SQL, which is a must-have skill for any Data Scientist

Real world data is almost always messy. As a data scientist or a data analyst or even as a developer, if you need to discover facts about data, it’s vital to ensure that data is tidy enough for doing that.

In this tutorial, we will be practicing some of the most common data cleaning techniques in SQL. We will create our own dummy dataset, but the techniques can be applied to the real world data (of the tabular form) as well. The contents of this tutorial are as follows:

  • Different data types & their messy values
  • Problems that can raise from messy numbers
  • Cleaning numeric values
  • Messy strings
  • Cleaning string values
  • Messy date values & cleaning them
  • Duplications & removing them

Different Data Types, Messy Values, & Remedies

Messy Numbers

Numbers can be in messy forms in a number of ways. Here, we will be introduced to the most common ones:

  • Undesired type/Type mismatch: Consider there is a column named age in a dataset we are working with. We see the values that are present in that column are of float type - the sample values are like 23.0, 45.0, 34.0 & so on. In this case, you don't need the age column to be of float type.
  • Null values: While this is particularly common with all of the data-types mentioned above, null values here merely means that the values are not available/blank. However, null values can be present in other forms as well.

Let’s now study the problems that can get raised from these issues & how to deal with them.

Problems With Messy Numbers & Dealing With Them

Let’s now take a look at the most common problems that we may face if we don’t clean the messy data

1.) Data Aggregation

Suppose we have null entries for a numeric column & we are calculating summary statistics (like mean, maximum, minimum values) on that column. The results will not get conveyed accurately in this case. There are several ways on how to address this problem:

Image for post
Image for post

Let’s now get hands-on with these problems & the second option for combating null values.

Consider the following PostgreSQL table named entries:

Image for post
Image for post

We can see two null entries in the above table. Suppose we want to the get the average weight value from this table & we executed the following query:

Image for post
Image for post

We got 90.45 as the output. Is this correct? So, what can be done? Let's fill the null entry with this average value with the help of the COALESCE() function.

Let’s fill the missing values first with COALESCE() (remember that COALESCE() does not change the values in the original table, it just returns a temporary view of the table with the values changed):

Image for post
Image for post

We should get an output like:

Image for post
Image for post

Now we can apply the AVG() again:

Image for post
Image for post

This is a much more accurate result than the earlier one. Let’s now study another problem that can take place if we have mismatches in the column data-types.

2.) Table Joins

Consider we are working with the following tables student_metadata & department_details:

Image for post
Image for post

We can see in the student_mtadata table, dept_id is of integer type, & in the department_details table, it is present in text type. Now, suppose, we want to join these two tables & want to produce a report which will contain the following columns:

  • id
  • name
  • dept_name

To do this, we run this query:

Image for post
Image for post

We will encounter this error, then:

ERROR: operator does not exist: smallint = text”

Here’s an amazing infographic which depicts this problem:

Image for post
Image for post

This is happening because the data-types are not getting matched while joining the two tables. Here, we can CAST the dept_id column in the department_details table to integer while joining the tables. Here's how to do that:

Image for post
Image for post

And you get your desired report:

Image for post
Image for post

Let’s now discuss how strings can be present in messy forms, their problems & ways to deal with them.

Messy Strings & Cleaning Theme

String values are also very common. Let’s start this section by looking at the values of a column dept_name (denoting department names) taken from a table named student_details:

Image for post
Image for post

String values like the above can cause a lot of unexpected problems. I.T, Information Technology & i.t all mean the same department, i.e. Information Technology & suppose the specification document requires the values to be present as I.T only. Now, say, you want to count the number of students belonging to the department of I.T. & you run this query:

Image for post
Image for post

And we get:

Image for post
Image for post

Is this an accurate report? — No! So, how can we address this problem?

Let’s first identify the problem in a more detailed way:

  • We have Information Technology as a value which should be converted to I.T
  • We have i.t as another value which should be converted to I.T.

In the first case, we can REPLACE the value Information Technology to I.T, & in the second case, you convert the character to UPPER case. We can accomplish this in a single query though it is advised to address this kind of problems in a step-by-step fashion. Here's the query to address the problem:

Image for post
Image for post

And the report:

Image for post
Image for post

Let’s now discuss some example where date values can be messy and what we can do to clean them.

Messy Dates & Cleaning Them

Consider we are working with a table named employees which contains a column called birthdate but not in an appropriate date type. Now, we want to execute queries with dedicated date functions like DATE_PART(). We will not be able to do that until & unless we CAST the birthdate column to date type. Let's see this in action.

Consider the birthdates to be in the YYYY-MM-DD format.

Here’s what the employees table looks like:

Image for post
Image for post

Now, we run the following query to extract the months from the birthdates:

Image for post
Image for post

And we instantly get this error:

ERROR: function date_part(unknown, text) does not exist”

Along with the error, we get a very good hint also:

HINT: No function matches the given name and argument types. You might need to add explicit type casts.”

Let’s follow the hint and CAST the birthdate to the appropriate date type & then apply DATE_PART():

Image for post
Image for post

We should get the result:

Image for post
Image for post

Let’s now proceed to the pre-concluding section of this tutorial where we will study the effects of data duplications & how we can tackle them.

Data Duplications: Causes, Effects & Solutions

In this section, we will be studying some of the most common causes which lead to data duplications. We will also see their effects & some of the ways using which prevent them. Consider the following two tables band_details & some_festival_record:

Image for post
Image for post

The table band_details conveys information about musical bands, it contains their identifiers, names & the total shows they have delivered. On the other hand, the table some_festival_recordportrays a hypothetical music festival & contains records about the bands performed there.

Now, suppose we want to produce a report which should contain band names, their show counts, & the total number of times they have performed at the festival. INNER joining is needed here. We run the following query:

Image for post
Image for post

And the query produces:

Image for post
Image for post

Don’t we think the total_shows values are erroneous here? Because from the band_details table, we know that Band_1 has delivered a total of 36 shows. Then what went wrong here? Duplicates!

While joining the two tables, we mistakenly aggregated total_show_count column which caused a data duplication in the intermediate join results. If we remove the aggregation & modify the query accordingly we should get the desired results:

Image for post
Image for post

We get our expected results now:

Image for post
Image for post

There is another way to prevent data duplication, i.e. add another field in our JOIN clause so that the tables get joined on stricter conditions.

Taking It To The Next Step

The tutorial introduced us to one of the most vital steps in the data analysis pipeline — data cleaning. We saw different forms of messy data & ways to tackle them. However, there are more advanced techniques to deal with more complex data cleaning problems.

RELATED LINKS

Written by

Data Scientist & Machine Learning Engineer

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store