Cleaning Data in SQL

We’ll learn techniques for cleaning messy data in SQL, a must-have skill for any data scientist.
Real-world data is almost always messy. As a data scientist, a data analyst, or even a developer, if you need to discover facts about data, it’s vital to ensure that the data is tidy enough for doing that.
In this tutorial, we will practice some of the most common data cleaning techniques in SQL. We will create our own dummy dataset, but the techniques can be applied to real-world data (in tabular form) as well. The contents of this tutorial are as follows:
- Different data types & their messy values
- Problems that can arise from messy numbers
- Cleaning numeric values
- Messy strings
- Cleaning string values
- Messy date values & cleaning them
- Duplications & removing them
Different Data Types, Messy Values, & Remedies
Messy Numbers
Numbers can be messy in several ways. Here, we will look at the most common ones:
- Undesired type/type mismatch: Consider there is a column named `age` in a dataset we are working with. We see that the values present in that column are of `float` type; the sample values look like 23.0, 45.0, 34.0 & so on. In this case, we don't need the `age` column to be of `float` type.
- Null values: While this is common across all of the data types mentioned above, a null value here merely means that the value is not available (blank). However, null values can be present in other forms as well.
Let’s now study the problems that can arise from these issues & how to deal with them.
Problems With Messy Numbers & Dealing With Them
Let’s now take a look at the most common problems we may face if we don’t clean messy data.
1.) Data Aggregation
Suppose we have null entries in a numeric column & we are calculating summary statistics (like the mean, maximum, or minimum value) on that column. The results will not be conveyed accurately in this case. There are several ways to address this problem:
- Ignore the rows containing null values while aggregating (this is what most aggregate functions, including AVG(), do by default).
- Fill the null values with a suitable substitute, such as the column's average.
Let’s now get hands-on with these problems & the second option for combating null values.
Consider the following PostgreSQL table named `entries`:
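The table's contents aren't shown in the source; here is a hypothetical `entries` table (names & weights invented for illustration) with two null weights, consistent with the discussion below:

```sql
-- Hypothetical data: names & weights are invented for illustration.
-- The four non-null weights average to 90.45, matching the output below.
CREATE TABLE entries (
    name   VARCHAR(50),
    weight NUMERIC(5, 2)
);

INSERT INTO entries (name, weight) VALUES
    ('A', 80.30),
    ('B', 92.10),
    ('C', 93.20),
    ('D', 96.20),
    ('E', NULL),
    ('F', NULL);
```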

We can see two null entries in the above table. Suppose we want to get the average weight value from this table & we execute the following query:
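A query along these lines would compute it:

```sql
SELECT AVG(weight) AS average_weight
FROM entries;
```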

We got 90.45 as the output. Is this correct? Not quite: AVG() silently skips null entries, so the two missing weights were not accounted for at all. So, what can be done? Let's fill the null entries with this average value with the help of the COALESCE() function.
Let’s fill the missing values first with COALESCE() (remember that COALESCE() does not change the values in the original table; it just returns a temporary view of the table with the values changed):
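Assuming the fill value is the average we just computed (90.45), the query might look like:

```sql
-- COALESCE() returns its first non-null argument, so null
-- weights appear as 90.45 in the result set only.
SELECT name,
       COALESCE(weight, 90.45) AS weight
FROM entries;
```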

We should get an output like:

Now we can apply AVG() again:
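One way to do this is to wrap the same COALESCE() expression inside AVG() (again assuming 90.45 as the fill value):

```sql
SELECT AVG(COALESCE(weight, 90.45)) AS average_weight
FROM entries;
```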

The average now accounts for every row of the table, not just the rows with a weight value. Let’s now study another problem that can take place if we have mismatches in the column data-types.
2.) Table Joins
Consider we are working with the following tables `student_metadata` & `department_details`:
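The table definitions aren't included in the source; a hypothetical pair of schemas reproducing the type mismatch (SMALLINT on one side, TEXT on the other) might be:

```sql
CREATE TABLE student_metadata (
    id      SMALLINT,
    name    VARCHAR(50),
    dept_id SMALLINT      -- stored as an integer type
);

CREATE TABLE department_details (
    dept_id   TEXT,       -- the same identifier stored as text
    dept_name VARCHAR(50)
);
```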

We can see that in the `student_metadata` table, `dept_id` is of integer type, while in the `department_details` table it is present as text. Now, suppose we want to join these two tables & produce a report which will contain the following columns:
- `id`
- `name`
- `dept_name`
To do this, we run this query:
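A straightforward join such as the following sketch would attempt to compare the mismatched columns directly:

```sql
SELECT s.id, s.name, d.dept_name
FROM student_metadata s
INNER JOIN department_details d
        ON s.dept_id = d.dept_id;   -- SMALLINT compared with TEXT
```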

We will then encounter this error:
“ERROR: operator does not exist: smallint = text”
Here’s an amazing infographic which depicts this problem:

This is happening because the data-types do not match when joining the two tables. Here, we can CAST the `dept_id` column in the `department_details` table to integer while joining the tables. Here's how to do that:
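A sketch of such a query, casting `dept_id` to SMALLINT to match the other table:

```sql
SELECT s.id, s.name, d.dept_name
FROM student_metadata s
INNER JOIN department_details d
        ON s.dept_id = CAST(d.dept_id AS SMALLINT);
```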

And you get your desired report:

Let’s now discuss how strings can be present in messy forms, their problems & ways to deal with them.
Messy Strings & Cleaning Them
String values are also very common. Let’s start this section by looking at the values of a column `dept_name` (denoting department names) taken from a table named `student_details`:
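The source doesn't show the table itself; a hypothetical `student_details` table containing the three problematic spellings could look like:

```sql
CREATE TABLE student_details (
    id        SMALLINT,
    name      VARCHAR(50),
    dept_name VARCHAR(50)
);

-- Student names are invented; the dept_name spellings come from the text.
INSERT INTO student_details (id, name, dept_name) VALUES
    (1, 'Student_1', 'I.T'),
    (2, 'Student_2', 'Information Technology'),
    (3, 'Student_3', 'i.t');
```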

String values like the above can cause a lot of unexpected problems. `I.T`, `Information Technology` & `i.t` all mean the same department, i.e. Information Technology, & suppose the specification document requires the values to be present as `I.T` only. Now, say, we want to count the number of students belonging to the department of `I.T` & we run this query:
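The count query would be something like:

```sql
SELECT COUNT(*) AS student_count
FROM student_details
WHERE dept_name = 'I.T';
```

Only rows spelled exactly `I.T` are counted, so the other spellings are missed.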

And we get:

Is this an accurate report? — No! So, how can we address this problem?
Let’s first identify the problem in a more detailed way:
- We have `Information Technology` as a value, which should be converted to `I.T`.
- We have `i.t` as another value, which should be converted to `I.T`.
In the first case, we can REPLACE the value `Information Technology` with `I.T`, & in the second case, we can convert the characters to UPPER case. We can accomplish this in a single query, though it is advised to address these kinds of problems in a step-by-step fashion. Here's the query to address the problem:
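A single query combining REPLACE() & UPPER() could look like this:

```sql
-- REPLACE() rewrites 'Information Technology' as 'I.T';
-- UPPER() then normalizes 'i.t' to 'I.T'.
SELECT id,
       name,
       UPPER(REPLACE(dept_name, 'Information Technology', 'I.T')) AS dept_name
FROM student_details;
```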

And the report:

Let’s now discuss an example where `date` values can be messy and what we can do to clean them.
Messy Dates & Cleaning Them
Consider we are working with a table named `employees` which contains a column called `birthdate`, but not in an appropriate date type. Now, we want to execute queries with dedicated `date` functions like DATE_PART(). We will not be able to do that unless we CAST the `birthdate` column to the `date` type. Let's see this in action.
Consider the `birthdate`s to be in the YYYY-MM-DD format.
Here’s what the `employees` table looks like:
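A hypothetical `employees` table (names & dates invented) with `birthdate` stored as text:

```sql
CREATE TABLE employees (
    id        SMALLINT,
    name      VARCHAR(50),
    birthdate TEXT          -- e.g. '1992-07-21', stored as text
);

INSERT INTO employees (id, name, birthdate) VALUES
    (1, 'Employee_1', '1992-07-21'),
    (2, 'Employee_2', '1988-11-03');
```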

Now, we run the following query to extract the months from the birthdates:
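The month-extraction query might be:

```sql
SELECT name,
       DATE_PART('month', birthdate) AS birth_month
FROM employees;   -- fails: birthdate is text, not date
```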

And we instantly get this error:
“ERROR: function date_part(unknown, text) does not exist”
Along with the error, we get a very good hint also:
“HINT: No function matches the given name and argument types. You might need to add explicit type casts.”
Let’s follow the hint, CAST the birthdate to the appropriate `date` type & then apply DATE_PART():
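The cast version of the query:

```sql
SELECT name,
       DATE_PART('month', CAST(birthdate AS DATE)) AS birth_month
FROM employees;
```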

We should get the result:

Let’s now proceed to the pre-concluding section of this tutorial where we will study the effects of data duplications & how we can tackle them.
Data Duplications: Causes, Effects & Solutions
In this section, we will study some of the most common causes of data duplication. We will also see their effects & some of the ways to prevent them. Consider the following two tables `band_details` & `some_festival_record`:
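The source doesn't include the tables; hypothetical versions consistent with the discussion (Band_1 with 36 total shows, performing three times at the festival) might be:

```sql
CREATE TABLE band_details (
    id          SMALLINT,
    band_name   VARCHAR(50),
    total_shows SMALLINT
);

CREATE TABLE some_festival_record (
    band_id   SMALLINT,
    show_date DATE
);

-- Band_1's 36 total shows come from the text; the rest is invented.
INSERT INTO band_details VALUES (1, 'Band_1', 36);
INSERT INTO some_festival_record VALUES
    (1, '2018-06-01'),
    (1, '2018-06-02'),
    (1, '2018-06-03');
```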

The table `band_details` conveys information about musical bands: it contains their identifiers, names & the total number of shows they have delivered. The table `some_festival_record`, on the other hand, portrays a hypothetical music festival & contains records of the bands that have performed there.
Now, suppose we want to produce a report containing the band names, their show counts, & the total number of times they have performed at the festival. An INNER join is needed here. We run the following query:
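The flawed query might have looked like this sketch, which aggregates `total_shows` even though it is already a per-band total:

```sql
SELECT b.band_name,
       SUM(b.total_shows) AS total_shows,   -- the mistake: total_shows is
                                            -- repeated once per join row
       COUNT(f.band_id)   AS times_performed
FROM band_details b
INNER JOIN some_festival_record f
        ON b.id = f.band_id
GROUP BY b.band_name;
```

If a band appears three times in the festival table, SUM(b.total_shows) triples its show count.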

And the query produces:

Don’t we think the `total_shows` values are erroneous here? From the `band_details` table, we know that `Band_1` has delivered a total of 36 shows. So what went wrong here? Duplicates!
While joining the two tables, we mistakenly aggregated the `total_shows` column, which caused data duplication in the intermediate join results. If we remove the aggregation & modify the query accordingly, we should get the desired results:
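The corrected query keeps `total_shows` as a grouping column instead of aggregating it:

```sql
SELECT b.band_name,
       b.total_shows,                       -- no aggregation here
       COUNT(f.band_id) AS times_performed
FROM band_details b
INNER JOIN some_festival_record f
        ON b.id = f.band_id
GROUP BY b.band_name, b.total_shows;
```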

We get our expected results now:

There is another way to prevent data duplication: add another field to our JOIN clause so that the tables are joined on stricter conditions.
Taking It To The Next Step
The tutorial introduced us to one of the most vital steps in the data analysis pipeline — data cleaning. We saw different forms of messy data & ways to tackle them. However, there are more advanced techniques to deal with more complex data cleaning problems.