We’ll learn techniques for cleaning messy data in SQL, a must-have skill for any data scientist.
Real-world data is almost always messy. As a data scientist, a data analyst, or even as a developer, if you need to discover facts about data, it’s vital to ensure that the data is tidy enough for doing that.
In this tutorial, we will practice some of the most common data cleaning techniques in SQL. We will create our own dummy dataset, but the techniques can be applied to real-world tabular data as well. The contents of this tutorial are as follows:
- Different data types & their messy values
- Problems that can arise from messy numbers
- Cleaning numeric values
- Messy strings
- Cleaning string values
- Messy date values & cleaning them
- Duplications & removing them
Different Data Types, Messy Values, & Remedies
Numbers can be messy in a number of ways. Here, we will be introduced to the most common ones:
- Undesired type/Type mismatch: Consider there is a column named age in a dataset we are working with. We see the values that are present in that column are of float type - the sample values are like 23.0, 45.0, 34.0 & so on. In this case, you don't need the age column to be of float type; an integer type is more suitable.
- Null values: While this can occur with any data-type, null values here merely mean that the values are not available/blank. However, null values can be present in other forms as well.
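The type-mismatch case above can be handled with a simple cast. A minimal sketch, assuming a hypothetical table named students with a float-typed age column:

```sql
-- Convert float ages such as 23.0, 45.0 to plain integers
SELECT CAST(age AS INTEGER) AS age
FROM students;
```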
Let’s now study the problems that can arise from these issues & how to deal with them.
Problems With Messy Numbers & Dealing With Them
Let’s now take a look at the most common problems we may face if we don’t clean the messy data.
1.) Data Aggregation
Suppose we have null entries for a numeric column & we are calculating summary statistics (like the mean, maximum, or minimum value) on that column. The results will not get conveyed accurately in this case. There are two common ways to address this problem: exclude the rows containing null values from the calculation, or fill the null entries with a suitable value before aggregating.
Let’s now get hands-on with these problems & the second option for combating null values.
Consider the following PostgreSQL table.
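Since the original table isn’t reproduced here, a minimal sketch of what it could look like (the table name weight_log & the sample values are assumptions, chosen so that the non-null weights average to 90.45 as in the output discussed next):

```sql
-- Hypothetical table: two of the six rows deliberately have NULL weights
CREATE TABLE weight_log (
    log_id  SERIAL PRIMARY KEY,
    weight  NUMERIC(5, 2)
);

INSERT INTO weight_log (weight)
VALUES (80.50), (92.30), (NULL), (100.00), (NULL), (89.00);
```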
We can see two null entries in the above table. Suppose we want to get the average weight value from this table & we executed the following query:
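The query (using the hypothetical weight_log table & weight column names) would look like:

```sql
-- AVG() in PostgreSQL silently skips NULL rows
SELECT AVG(weight) FROM weight_log;
```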
We get 90.45 as the output. Is this correct? Not quite: AVG() simply ignores the null entries, so the result reflects only the rows that actually have a weight value. So, what can be done? Let's fill the null entries with this average value with the help of the COALESCE() function (remember that COALESCE() does not change the values in the original table; it just returns a temporary view of the table with the values changed):
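A sketch of the fill, again assuming the hypothetical weight_log table & weight column:

```sql
-- Show the original column next to the filled version
SELECT weight,
       COALESCE(weight, 90.45) AS weight_filled
FROM weight_log;
```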
We should get an output in which the two null entries appear as 90.45. Now we can apply the AVG() function on the filled values:
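Combining the two steps in a single query (hypothetical names as before):

```sql
-- Average over the COALESCE-filled column, so every row contributes
SELECT AVG(COALESCE(weight, 90.45)) AS avg_weight
FROM weight_log;
```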
This is a much more accurate result than the earlier one. Let’s now study another problem that can take place if we have mismatches in the column data-types.
2.) Table Joins
Consider we are working with two tables: one containing student records & another named department_details containing department information. We can see that the dept_id column is of integer type in the student table, while in the department_details table it is present in text type. Now, suppose we want to join these two tables & produce a report that contains columns from both of them.
To do this, we run this query:
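Such a query might look like this (the table name student_details & the selected columns are assumptions; only dept_id & department_details come from the text):

```sql
SELECT s.student_name,
       d.dept_name
FROM student_details s
INNER JOIN department_details d
        ON s.dept_id = d.dept_id;
```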
We will then encounter this error:
“ERROR: operator does not exist: smallint = text”
This is happening because the data-types do not match while joining the two tables. Here, we can CAST the dept_id column in the department_details table to integer while joining the tables. Here's how to do that:
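The same join, now with an explicit cast on the text column (hypothetical names as before):

```sql
SELECT s.student_name,
       d.dept_name
FROM student_details s
INNER JOIN department_details d
        ON s.dept_id = CAST(d.dept_id AS INTEGER);
```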
And you get your desired report:
Let’s now discuss how strings can be present in messy forms, their problems, & ways to deal with them.
Messy Strings & Cleaning Them
String values are also very common. Let’s start this section by looking at the values of a column dept_name (denoting department names) taken from a table of student records; the column contains values such as I.T, Information Technology, & i.t.
String values like the above can cause a lot of unexpected problems.
Information Technology & i.t both mean the same department as I.T, & suppose the specification document requires the values to be present as I.T only. Now, say, you want to count the number of students belonging to the department of I.T & you run this query:
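A naive count (the table name students is an assumption):

```sql
SELECT COUNT(*)
FROM students
WHERE dept_name = 'I.T';
```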
And we get:
Is this an accurate report? — No! So, how can we address this problem?
Let’s first identify the problem in a more detailed way:
- We have Information Technology as a value which should be converted to I.T.
- We have i.t as another value which should be converted to I.T.
In the first case, we can REPLACE the value Information Technology with I.T, & in the second case, we can convert the characters to UPPER case. We can accomplish this in a single query, though it is advisable to address this kind of problem in a step-by-step fashion. Here's the query to address the problem:
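One way to write it (table name students assumed; note that UPPER('i.t') yields 'I.T', & the REPLACE handles the spelled-out form):

```sql
SELECT COUNT(*)
FROM students
WHERE UPPER(REPLACE(dept_name, 'Information Technology', 'I.T')) = 'I.T';
```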
And the report:
Let’s now discuss some examples where date values can be messy and what we can do to clean them.
Messy Dates & Cleaning Them
Consider we are working with a table named employees which contains a column called birthdate, but not in an appropriate date type. Now, we want to execute queries with dedicated date functions like DATE_PART(). We will not be able to do that until & unless we CAST the birthdate column to the date type. Let's see this in action.
Suppose the birthdates are stored as plain text values.
Here’s what the employees table looks like:
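Since the table isn’t reproduced here, a minimal sketch (the extra columns & sample birthdates are assumptions; note the text type of birthdate):

```sql
CREATE TABLE employees (
    emp_id    SERIAL PRIMARY KEY,
    emp_name  VARCHAR(50),
    birthdate TEXT  -- dates stored as text, e.g. '1991-04-23'
);

INSERT INTO employees (emp_name, birthdate)
VALUES ('Employee_1', '1991-04-23'),
       ('Employee_2', '1988-12-01');
```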
Now, we run the following query to extract the months from the birthdates:
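The attempt (column names as described in the text):

```sql
SELECT DATE_PART('month', birthdate) AS birth_month
FROM employees;
```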
And we instantly get this error:
“ERROR: function date_part(unknown, text) does not exist”
Along with the error, we get a very good hint also:
“HINT: No function matches the given name and argument types. You might need to add explicit type casts.”
Let’s follow the hint, CAST the birthdate to the appropriate date type, & then apply DATE_PART():
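The corrected query, with the explicit cast the hint asks for:

```sql
-- CAST the text column to DATE before extracting the month
SELECT DATE_PART('month', CAST(birthdate AS DATE)) AS birth_month
FROM employees;
```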
We should get the result:
Let’s now proceed to the pre-concluding section of this tutorial where we will study the effects of data duplications & how we can tackle them.
Data Duplications: Causes, Effects & Solutions
In this section, we will study some of the most common causes which lead to data duplications. We will also see their effects & some of the ways to prevent them. Consider the following two tables:
band_details conveys information about musical bands: it contains their identifiers, names, & the total number of shows they have delivered. On the other hand, the table some_festival_record portrays a hypothetical music festival & contains records of the bands that performed there.
Now, suppose we want to produce a report which should contain band names, their show counts, & the total number of times they have performed at the festival.
An INNER JOIN is needed here. We run the following query:
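A first attempt that aggregates the show count inside the join (the join-key column band_id is an assumption; the table & other column names come from the text):

```sql
-- SUM() over total_show_count is the mistake: the value repeats once
-- per matching festival row, so it gets multiplied in the result
SELECT b.band_name,
       SUM(b.total_show_count) AS total_shows,
       COUNT(f.band_id)        AS times_performed
FROM band_details b
INNER JOIN some_festival_record f
        ON b.band_id = f.band_id
GROUP BY b.band_name;
```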
And the query produces:
Don’t we think the total_shows values are erroneous here? From the band_details table, we know that Band_1 has delivered a total of 36 shows. So what went wrong here? Duplicates!
While joining the two tables, we mistakenly aggregated the total_show_count column, whose value is repeated across the intermediate join results, which caused the data duplication. If we remove the aggregation & modify the query accordingly, we should get the desired results:
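The corrected query (same assumed names as before) selects total_show_count directly instead of summing it:

```sql
SELECT b.band_name,
       b.total_show_count AS total_shows,
       COUNT(f.band_id)   AS times_performed
FROM band_details b
INNER JOIN some_festival_record f
        ON b.band_id = f.band_id
GROUP BY b.band_name, b.total_show_count;
```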
We get our expected results now:
There is another way to prevent data duplication: add another field to our JOIN clause so that the tables get joined on stricter conditions.
Taking It To The Next Step
The tutorial introduced us to one of the most vital steps in the data analysis pipeline — data cleaning. We saw different forms of messy data & ways to tackle them. However, there are more advanced techniques to deal with more complex data cleaning problems.