Aggregate Functions in SQL

Learn how to use aggregate functions for summarizing results & gaining useful insights about data in SQL.

Building reports from a given dataset is an essential skill if you are working with data. Because ultimately, you want to be able to answer critical business questions using the data at your disposal. Many times, these answers presented in the form of report charts. But sometimes, reports in the form of tables are also needed. In both cases, you might need to summarize the data using simple calculations. In SQL, you can summarize/aggregate the data using aggregate functions. With these functions, you will be able to answer questions like:

  • What is the maximum value for the some_column_from_the_table? or
  • What are the minimum values of some_column_from_the_table with respect to another_column_from_the_table?

Let’s get started to perform some data aggregation.

Setting Up the Database

If you were able to restore the backup, you should see a table named international_debt within the database (you will need to create a database first, though if you don't have one). Let's quickly take a look at the first few rows of the table (a simple select query will help you do this):

The table contains information about the debt statistics of different countries across the globe for this current year in different categories (refer to the columns indicator_name & indicator_code). The debt column shows the amount of debt (in USD) a particular country has in a specific category. The data belongs to the domain of economics & is often used to analyze the economic conditions of different countries. The data was collected from the World Bank.

Now that we have set up the database successfully let’s run some simple queries to know more about the data. Open up the pgAdmin tool & get started.

Simple Information Matters

If we simply run a select query with the country_name column, we will not get the right answer to this question because the result of the query will contain duplicates. Let's use the DISTINCTkeyword to combat this.

And this should return something like this:

Now we have a fair answer to the above question. Now one last question before you proceed to aggregate functions: “How many different kinds of debt indicators are there in the table?”.

The query for answering should be similar to the above one. We just need to change the column name. The result should be similar to the following:

Aggregate Functions

And the result:

With the SUM() aggregate function, we can calculate the arithmetic sum across a column (that contains numeric values). With the above query, we have to know about the total debt that is pending by the countries listed in the table.

Note that SUM() does not take NULL values into consideration when calculating the sum. Now, let's find the answer to the question: “What is the maximum amount of debt?”.

Here comes the MAX() aggregate function to our rescue:

And the answer is:

Like the SUM() function, MAX() does not consider the NULL entries while doing its calculations. There’s a similar MIN() function as well. Now, it will be a good idea to actually find out if there is an invalid entry in the debt column to ensure that the results are correct until now.

Note that you use these functions in lower case as shown above.

When you execute this query:select * from international_debt where debt is null;, we should get an empty result. Let's now find out the total number of distinct countries present in the table.

And we see that there is a total of 124 distinct countries present in the table. Pay close attention to the succession of functions we applied in the above query. Yes, it is allowed here, i.e. to connect more than one aggregate function in a logical way.

Now, suppose, we want to see the average value of the debt column. The function is AVG(): “select avg(debt) from international_debt;

We see the value to be 1306633214.966397971 (USD). It is a good idea to present these results with suitable column names. From the above results, you can see that PostgreSQL changes the column name to aggregate function name (that is being included in the query) when returning the result. So, it is a good idea to give a proper alias to these columns. We can do this like so:

The result is much more interpretable:

Let’s now take this to a bit more complex level. To be able to answer questions like: “What are the minimum values of some_column_from_the_table with respect to another_column_from_the_table?”. We need to pair an aggregate function with theGROUP BY clause. Let's see how.

Aggregate Functions + GROUP BY + More

Reports like this are quite often used in the real world. So, what might be the query for getting a report like this? We will have to use the SUM() function on the debt. And we will also need to show the country_name with the sum of debts. Let's execute the following query:

It produces the following error:

ERROR: column “international_debt.country_name” must appear in the GROUP BY clause or be used in an aggregate function”.

Let’s now understand what it actually means. When we are using an aggregate function (like SUM()with an unaggregated column like country_name) we need to pass the unaggregated column to a GROUP BY clause. So, the correct query will be:

And the result is just right:

Note the use of aliasing in the query.

Now, suppose you need to sort this report in with respect to the total_debt in a descending manner. Remember ORDER BY clause? Yes, you can pair aggregate functions with ORDER BY clause as well:

The result should be reversely sorted now:

Note the column that you used in the ORDER BY clause.

Now another important question: “What is the highest amount of debt across different categories (in a reversely sorted manner)?”.

We will need to use the MAX() function here. The query to answer this question should not be a hard one to write now.

And we get a clean report:

We can also limit the number of rows in the reports like this. Say you just want to include the top five entries in the above report. We can do by using the LIMIT clause.

Time for the final report for this tutorial. We need to include the names of the countries to the above report. How can we do it? The following query should let you do this:

Another good report:

In the above query, we added the country_name column after the SELECT clause & also added it after GROUP BY. We can extend this format up to any number.

The order of GROUP BY, ORDER BY & LIMIT is very important while generating reports like this. If we mistakenly change the order, we will have to face errors. See this for yourself:

And we get:

In the above query, we placed the ORDER BY clause before GROUP BY which is not permissible. In fact, it is also not applicable when we are not using aggregate functions. The correct order is - GROUP BY -> ORDER BY -> LIMIT. Always remember this.

RELATED LINKS

Data Scientist & Machine Learning Engineer