Aggregate Functions in SQL

Learn how to use aggregate functions for summarizing results & gaining useful insights about data in SQL.

Building reports from a given dataset is an essential skill if you are working with data. Because ultimately, you want to be able to answer critical business questions using the data at your disposal. Many times, these answers presented in the form of report charts. But sometimes, reports in the form of tables are also needed. In both cases, you might need to summarize the data using simple calculations. In SQL, you can summarize/aggregate the data using aggregate functions. With these functions, you will be able to answer questions like:

  • What is the maximum value for the some_column_from_the_table? or
  • What are the minimum values of some_column_from_the_table with respect to another_column_from_the_table?

Let’s get started to perform some data aggregation.

Setting Up the Database

Let’s first set up a PostgreSQL database & restore this backup which contains the table which we are going to use in this tutorial.

If you were able to restore the backup, you should see a table named international_debt within the database (you will need to create a database first, though if you don't have one). Let's quickly take a look at the first few rows of the table (a simple select query will help you do this):

Image for post
Image for post

The table contains information about the debt statistics of different countries across the globe for this current year in different categories (refer to the columns indicator_name & indicator_code). The debt column shows the amount of debt (in USD) a particular country has in a specific category. The data belongs to the domain of economics & is often used to analyze the economic conditions of different countries. The data was collected from the World Bank.

Now that we have set up the database successfully let’s run some simple queries to know more about the data. Open up the pgAdmin tool & get started.

Simple Information Matters

From the above figure, we can see that there are many duplicate entries for a single country but for different categories. A question that quickly gets raised from here is: “What are the different countries that the table contains records of?”.

If we simply run a select query with the country_name column, we will not get the right answer to this question because the result of the query will contain duplicates. Let's use the DISTINCTkeyword to combat this.

Image for post
Image for post

And this should return something like this:

Image for post
Image for post

Now we have a fair answer to the above question. Now one last question before you proceed to aggregate functions: “How many different kinds of debt indicators are there in the table?”.

The query for answering should be similar to the above one. We just need to change the column name. The result should be similar to the following:

Image for post
Image for post

Aggregate Functions

Let’s start by running a query with an aggregate function & proceed accordingly. Along the way, you will learn more about the syntaxes & the kind of constructs you need to follow when applying aggregate functions in SQL.

Image for post
Image for post

And the result:

Image for post
Image for post

With the SUM() aggregate function, we can calculate the arithmetic sum across a column (that contains numeric values). With the above query, we have to know about the total debt that is pending by the countries listed in the table.

Note that SUM() does not take NULL values into consideration when calculating the sum. Now, let's find the answer to the question: “What is the maximum amount of debt?”.

Here comes the MAX() aggregate function to our rescue:

Image for post
Image for post

And the answer is:

Image for post
Image for post

Like the SUM() function, MAX() does not consider the NULL entries while doing its calculations. There’s a similar MIN() function as well. Now, it will be a good idea to actually find out if there is an invalid entry in the debt column to ensure that the results are correct until now.

Note that you use these functions in lower case as shown above.

When you execute this query:select * from international_debt where debt is null;, we should get an empty result. Let's now find out the total number of distinct countries present in the table.

Image for post
Image for post

And we see that there is a total of 124 distinct countries present in the table. Pay close attention to the succession of functions we applied in the above query. Yes, it is allowed here, i.e. to connect more than one aggregate function in a logical way.

Now, suppose, we want to see the average value of the debt column. The function is AVG(): “select avg(debt) from international_debt;

We see the value to be 1306633214.966397971 (USD). It is a good idea to present these results with suitable column names. From the above results, you can see that PostgreSQL changes the column name to aggregate function name (that is being included in the query) when returning the result. So, it is a good idea to give a proper alias to these columns. We can do this like so:

Image for post
Image for post

The result is much more interpretable:

Image for post
Image for post

Let’s now take this to a bit more complex level. To be able to answer questions like: “What are the minimum values of some_column_from_the_table with respect to another_column_from_the_table?”. We need to pair an aggregate function with theGROUP BY clause. Let's see how.

Aggregate Functions + GROUP BY + More

Let’s say we want to produce a report where we will be showing the country_name & the sum of their debts. Following is an example:

Image for post
Image for post

Reports like this are quite often used in the real world. So, what might be the query for getting a report like this? We will have to use the SUM() function on the debt. And we will also need to show the country_name with the sum of debts. Let's execute the following query:

Image for post
Image for post

It produces the following error:

ERROR: column “international_debt.country_name” must appear in the GROUP BY clause or be used in an aggregate function”.

Let’s now understand what it actually means. When we are using an aggregate function (like SUM()with an unaggregated column like country_name) we need to pass the unaggregated column to a GROUP BY clause. So, the correct query will be:

Image for post
Image for post

And the result is just right:

Image for post
Image for post

Note the use of aliasing in the query.

Now, suppose you need to sort this report in with respect to the total_debt in a descending manner. Remember ORDER BY clause? Yes, you can pair aggregate functions with ORDER BY clause as well:

Image for post
Image for post

The result should be reversely sorted now:

Image for post
Image for post

Note the column that you used in the ORDER BY clause.

Now another important question: “What is the highest amount of debt across different categories (in a reversely sorted manner)?”.

We will need to use the MAX() function here. The query to answer this question should not be a hard one to write now.

Image for post
Image for post

And we get a clean report:

Image for post
Image for post

We can also limit the number of rows in the reports like this. Say you just want to include the top five entries in the above report. We can do by using the LIMIT clause.

Image for post
Image for post

Time for the final report for this tutorial. We need to include the names of the countries to the above report. How can we do it? The following query should let you do this:

Image for post
Image for post

Another good report:

Image for post
Image for post

In the above query, we added the country_name column after the SELECT clause & also added it after GROUP BY. We can extend this format up to any number.

The order of GROUP BY, ORDER BY & LIMIT is very important while generating reports like this. If we mistakenly change the order, we will have to face errors. See this for yourself:

Image for post
Image for post

And we get:

Image for post
Image for post

In the above query, we placed the ORDER BY clause before GROUP BY which is not permissible. In fact, it is also not applicable when we are not using aggregate functions. The correct order is - GROUP BY -> ORDER BY -> LIMIT. Always remember this.

RELATED LINKS

Written by

Data Scientist & Machine Learning Engineer

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store