# Cleaning Data in SQL

We’ll learn techniques for cleaning messy data in SQL, a must-have skill for any data scientist.

Real-world data is almost always messy. As a data scientist, a data analyst, or even a developer, if you need to discover facts about data, it’s vital to ensure that the data is tidy enough to do so.

In this tutorial, we will practice some of the most common data cleaning techniques in SQL. We will create our own dummy dataset, but the techniques can be applied to real-world (tabular) data as well. The contents of this tutorial are as follows:

• Different data types & their messy values
• Problems that can arise from messy numbers
• Cleaning numeric values
• Messy strings
• Cleaning string values
• Messy date values & cleaning them
• Duplications & removing them

# Different Data Types, Messy Values, & Remedies

Messy Numbers

Numbers can be in messy forms in a number of ways. Here, we will be introduced to the most common ones:

• Undesired type/type mismatch: Consider a column named `age` in a dataset we are working with. We see that the values present in that column are of `float` type; the sample values look like 23.0, 45.0, 34.0 & so on. In this case, we don't want the `age` column to be of `float` type.
• Null values: While these can appear in columns of any data type, null values here merely mean that the values are blank/not available. However, null values can be present in other forms as well (e.g. as placeholder strings such as `'N/A'`).

Let’s now study the problems that can arise from these issues & how to deal with them.

Problems With Messy Numbers & Dealing With Them

Let’s now take a look at the most common problems that we may face if we don’t clean messy data.

1.) Data Aggregation

Suppose we have null entries in a numeric column & we are calculating summary statistics (like mean, maximum, or minimum values) on that column. The results will not be conveyed accurately in this case, because aggregate functions simply skip null values. There are several ways to address this problem: we can remove the rows containing nulls, or we can fill the null entries with a sensible value.

Let’s now get hands-on with these problems & the second of those options: filling null values in.

Consider the following PostgreSQL table named `entries`:
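For illustration, assume a minimal schema with an `id` column & a numeric `weight` column; the sample values below are hypothetical, chosen to be consistent with the figures quoted later:

```sql
-- Hypothetical 'entries' table: 10 rows, 2 of them with NULL weights
CREATE TABLE entries (
    id     smallint,
    weight numeric(5, 2)   -- weight in kg
);

INSERT INTO entries (id, weight) VALUES
    (1, 85.30), (2, 92.10), (3, 88.40), (4, 95.00), (5, NULL),
    (6, 90.20), (7, 89.70), (8, 91.50), (9, NULL), (10, 91.40);
```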

We can see two null entries in the above table. Suppose we want to get the average weight from this table & we execute the following query:
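Assuming the column is named `weight`, the query would look like:

```sql
SELECT AVG(weight) AS avg_weight
FROM entries;
```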

We got `90.45` as the output. Is this correct? Not quite: `AVG()` skips null rows entirely, so this figure is the average of the non-null entries only. So, what can be done? Let's fill the null entries with this average value with the help of the `COALESCE()` function.

Let’s fill the missing values first with `COALESCE()` (remember that `COALESCE()` does not change the values in the original table, it just returns a temporary view of the table with the values changed):
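A sketch of the `COALESCE()` query, substituting the computed average (`90.45`) for every null `weight` (column names as assumed above):

```sql
SELECT id,
       COALESCE(weight, 90.45) AS weight
FROM entries;
```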

We should get an output like:
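With the hypothetical values used in this tutorial's sketches, the two null rows would now display the filled value; for instance:

```
 id | weight
----+--------
  1 |  85.30
  2 |  92.10
  ...
  5 |  90.45   <- filled null
  ...
```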

Now we can apply the `AVG()` again:
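Wrapping the same `COALESCE()` expression in `AVG()`:

```sql
SELECT AVG(COALESCE(weight, 90.45)) AS avg_weight
FROM entries;
```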

Now every row contributes to the aggregate. (Note that filling nulls with the mean leaves `AVG()` itself essentially unchanged; the benefit shows up in aggregates like `SUM()` & `COUNT()`, which now reflect the full table.) Let’s now study another problem that can take place if we have mismatches in the column data types.

2.) Table Joins

Consider we are working with the following tables `student_metadata` & `department_details`:
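Assuming schemas along these lines (column names are illustrative; note the mismatched types of `dept_id`):

```sql
CREATE TABLE student_metadata (
    id      smallint,
    name    text,
    dept_id smallint   -- integer type
);

CREATE TABLE department_details (
    dept_id   text,    -- same identifier, but stored as text
    dept_name text
);
```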

We can see that in the `student_metadata` table, `dept_id` is of integer type, while in the `department_details` table it is stored as text. Now, suppose we want to join these two tables & produce a report containing the following columns:

• id
• name
• dept_name

To do this, we run this query:
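A join of the usual form, using the aliases `s` & `d` for the two tables:

```sql
SELECT s.id, s.name, d.dept_name
FROM student_metadata s
INNER JOIN department_details d
    ON s.dept_id = d.dept_id;   -- smallint compared with text
```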

We will then encounter this error:

ERROR: operator does not exist: smallint = text


This happens because the data types do not match when joining the two tables. Here, we can `CAST` the `dept_id` column of the `department_details` table to integer while joining the tables. Here's how to do that:
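The same join, with an explicit cast on the text-typed column:

```sql
SELECT s.id, s.name, d.dept_name
FROM student_metadata s
INNER JOIN department_details d
    ON s.dept_id = CAST(d.dept_id AS integer);  -- or: d.dept_id::integer
```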

And you get your desired report:

Let’s now discuss how strings can be present in messy forms, the problems they cause, & ways to deal with them.

Messy Strings & Cleaning Them

String values are also very common. Let’s start this section by looking at the values of a column `dept_name` (denoting department names) taken from a table named `student_details`:
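A `SELECT dept_name FROM student_details;` might return values such as these (hypothetical, but they illustrate the inconsistency):

```
dept_name
----------------------
I.T
Information Technology
i.t
C.S.E
I.T
```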

String values like the above can cause a lot of unexpected problems. `I.T`, `Information Technology` & `i.t` all mean the same department, i.e. Information Technology, & suppose the specification document requires the values to be present as `I.T` only. Now, say, we want to count the number of students belonging to the `I.T` department & we run this query:
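A naive count that matches the literal value only:

```sql
SELECT COUNT(*) AS student_count
FROM student_details
WHERE dept_name = 'I.T';
```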

And we get:

Is this an accurate report? No! So, how can we address this problem?

Let’s first identify the problem in a more detailed way:

• We have `Information Technology` as a value, which should be converted to `I.T`.
• We have `i.t` as another value, which should be converted to `I.T`.

In the first case, we can `REPLACE` the value `Information Technology` with `I.T`, & in the second case we can convert the characters to `UPPER` case. We can accomplish this in a single query, though it is advisable to address this kind of problem in a step-by-step fashion. Here's the query:
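One way to write it, applying `REPLACE()` first & then `UPPER()` before filtering (so `i.t` becomes `I.T` as well):

```sql
SELECT COUNT(*) AS student_count
FROM student_details
WHERE UPPER(REPLACE(dept_name, 'Information Technology', 'I.T')) = 'I.T';
```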

And the report:

Let’s now discuss an example where `date` values can be messy and what we can do to clean them.

Messy Dates & Cleaning Them

Consider we are working with a table named `employees` which contains a column called `birthdate`, but not in a proper `date` type. Now, we want to execute queries with dedicated `date` functions like `DATE_PART()`. We will not be able to do that unless we `CAST` the `birthdate` column to the `date` type. Let's see this in action.

Consider the `birthdate`s to be in the `YYYY-MM-DD` format.

Here’s what the `employees` table looks like:
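Assuming a minimal version of the table, with `birthdate` stored as text:

```sql
CREATE TABLE employees (
    id        smallint,
    name      text,
    birthdate text   -- e.g. '1992-07-14', stored as text rather than date
);
```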

Now, we run the following query to extract the months from the birthdates:
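Attempting `DATE_PART()` directly on the text column:

```sql
SELECT name,
       DATE_PART('month', birthdate) AS birth_month
FROM employees;
```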

And we instantly get this error:

ERROR: function date_part(unknown, text) does not exist

Along with the error, we get a very good hint also:

HINT: No function matches the given name and argument types. You might need to add explicit type casts.

Let’s follow the hint and `CAST` the birthdate to the appropriate `date` type & then apply `DATE_PART()`:
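The same query, with the cast in place:

```sql
SELECT name,
       DATE_PART('month', CAST(birthdate AS date)) AS birth_month
FROM employees;
```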

We should get the result:

Let’s now proceed to the pre-concluding section of this tutorial where we will study the effects of data duplications & how we can tackle them.

# Data Duplications: Causes, Effects & Solutions

In this section, we will study some of the most common causes of data duplication. We will also see its effects & some ways to prevent it. Consider the following two tables, `band_details` & `some_festival_record`:

The table `band_details` conveys information about musical bands: it contains their identifiers, names & the total number of shows they have delivered. The table `some_festival_record`, on the other hand, portrays a hypothetical music festival & contains records of the bands that performed there.
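Assuming schemas along these lines (column names are illustrative):

```sql
CREATE TABLE band_details (
    id          smallint,
    band_name   text,
    total_shows smallint   -- total shows the band has delivered overall
);

CREATE TABLE some_festival_record (
    id        smallint,
    band_name text,        -- band that performed at the festival
    show_date date
);
```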

Now, suppose we want to produce a report containing the band names, their show counts, & the total number of times each band has performed at the festival. An `INNER` join is needed here. We run the following query:
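A first (flawed) attempt might aggregate `total_shows` along with the appearance count:

```sql
SELECT b.band_name,
       SUM(b.total_shows) AS total_shows,   -- wrong: summed once per appearance
       COUNT(f.id)        AS festival_shows
FROM band_details b
INNER JOIN some_festival_record f
    ON b.band_name = f.band_name
GROUP BY b.band_name;
```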

And the query produces:

Don’t the `total_shows` values look erroneous here? From the `band_details` table, we know that `Band_1` has delivered a total of 36 shows. So what went wrong? Duplicates!

While joining the two tables, we mistakenly aggregated the `total_shows` column, which caused data duplication in the intermediate join results. If we remove the aggregation & modify the query accordingly, we should get the desired results:
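The corrected query takes `total_shows` as-is & only counts the festival appearances:

```sql
SELECT b.band_name,
       b.total_shows,               -- not aggregated
       COUNT(f.id) AS festival_shows
FROM band_details b
INNER JOIN some_festival_record f
    ON b.band_name = f.band_name
GROUP BY b.band_name, b.total_shows;
```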

We get our expected results now:

There is another way to prevent data duplication: join on an additional field in the `JOIN` clause so that the tables are joined under stricter conditions.

# Taking It To The Next Step

The tutorial introduced us to one of the most vital steps in the data analysis pipeline — data cleaning. We saw different forms of messy data & ways to tackle them. However, there are more advanced techniques to deal with more complex data cleaning problems.

Jason Joseph, Data Scientist & Machine Learning Engineer