Learn how to use aggregate functions for summarizing results & gaining useful insights about data in SQL.
Building reports from a given dataset is an essential skill if you are working with data. Because ultimately, you want to be able to answer critical business questions using the data at your disposal. Many times, these answers presented in the form of report charts. But sometimes, reports in the form of tables are also needed. In both cases, you might need to summarize the data using simple calculations. In SQL, you can summarize/aggregate the data using aggregate functions. With these functions, you will be able to answer questions like:
- What is the maximum value for the
- What are the minimum values of
some_column_from_the_tablewith respect to
Let’s get started to perform some data aggregation.
Setting Up the Database
Let’s first set up a PostgreSQL database & restore this backup which contains the table which we are going to use in this tutorial.
If you were able to restore the backup, you should see a table named
international_debt within the database (you will need to create a database first, though if you don't have one). Let's quickly take a look at the first few rows of the table (a simple select query will help you do this):
The table contains information about the debt statistics of different countries across the globe for this current year in different categories (refer to the columns
indicator_code). The debt column shows the amount of debt (in USD) a particular country has in a specific category. The data belongs to the domain of economics & is often used to analyze the economic conditions of different countries. The data was collected from the World Bank.
Now that we have set up the database successfully let’s run some simple queries to know more about the data. Open up the
pgAdmin tool & get started.
Simple Information Matters
From the above figure, we can see that there are many duplicate entries for a single country but for different categories. A question that quickly gets raised from here is: “What are the different countries that the table contains records of?”.
If we simply run a select query with the
country_name column, we will not get the right answer to this question because the result of the query will contain duplicates. Let's use the
DISTINCTkeyword to combat this.
And this should return something like this:
Now we have a fair answer to the above question. Now one last question before you proceed to aggregate functions: “How many different kinds of debt indicators are there in the table?”.
The query for answering should be similar to the above one. We just need to change the column name. The result should be similar to the following:
Let’s start by running a query with an aggregate function & proceed accordingly. Along the way, you will learn more about the syntaxes & the kind of constructs you need to follow when applying aggregate functions in SQL.
And the result:
SUM() aggregate function, we can calculate the arithmetic sum across a column (that contains numeric values). With the above query, we have to know about the total debt that is pending by the countries listed in the table.
SUM() does not take
NULL values into consideration when calculating the sum. Now, let's find the answer to the question: “What is the maximum amount of debt?”.
Here comes the
MAX() aggregate function to our rescue:
And the answer is:
MAX() does not consider the
NULL entries while doing its calculations. There’s a similar
MIN() function as well. Now, it will be a good idea to actually find out if there is an invalid entry in the
debt column to ensure that the results are correct until now.
Note that you use these functions in lower case as shown above.
When you execute this query:
select * from international_debt where debt is null;, we should get an empty result. Let's now find out the total number of distinct countries present in the table.
And we see that there is a total of 124 distinct countries present in the table. Pay close attention to the succession of functions we applied in the above query. Yes, it is allowed here, i.e. to connect more than one aggregate function in a logical way.
Now, suppose, we want to see the average value of the
debt column. The function is
AVG(): “select avg(debt) from international_debt;
We see the value to be 1306633214.966397971 (USD). It is a good idea to present these results with suitable column names. From the above results, you can see that PostgreSQL changes the column name to aggregate function name (that is being included in the query) when returning the result. So, it is a good idea to give a proper alias to these columns. We can do this like so:
The result is much more interpretable:
Let’s now take this to a bit more complex level. To be able to answer questions like: “What are the minimum values of some_column_from_the_table with respect to another_column_from_the_table?”. We need to pair an aggregate function with the
GROUP BY clause. Let's see how.
Aggregate Functions + GROUP BY + More
Let’s say we want to produce a report where we will be showing the
country_name & the sum of their debts. Following is an example:
Reports like this are quite often used in the real world. So, what might be the query for getting a report like this? We will have to use the
SUM() function on the
debt. And we will also need to show the
country_name with the sum of debts. Let's execute the following query:
It produces the following error:
“ERROR: column “international_debt.country_name” must appear in the GROUP BY clause or be used in an aggregate function”.
Let’s now understand what it actually means. When we are using an aggregate function (like
SUM()with an unaggregated column like
country_name) we need to pass the unaggregated column to a
GROUP BY clause. So, the correct query will be:
And the result is just right:
Note the use of aliasing in the query.
Now, suppose you need to sort this report in with respect to the
total_debt in a descending manner. Remember
ORDER BY clause? Yes, you can pair aggregate functions with
ORDER BY clause as well:
The result should be reversely sorted now:
Note the column that you used in the
ORDER BY clause.
Now another important question: “What is the highest amount of debt across different categories (in a reversely sorted manner)?”.
We will need to use the
MAX() function here. The query to answer this question should not be a hard one to write now.
And we get a clean report:
We can also limit the number of rows in the reports like this. Say you just want to include the top five entries in the above report. We can do by using the
Time for the final report for this tutorial. We need to include the names of the countries to the above report. How can we do it? The following query should let you do this:
Another good report:
In the above query, we added the
country_name column after the
SELECT clause & also added it after
GROUP BY. We can extend this format up to any number.
The order of
ORDER BY &
LIMIT is very important while generating reports like this. If we mistakenly change the order, we will have to face errors. See this for yourself:
And we get:
In the above query, we placed the
ORDER BY clause before
GROUP BY which is not permissible. In fact, it is also not applicable when we are not using aggregate functions. The correct order is -
GROUP BY ->
ORDER BY ->
LIMIT. Always remember this.