Chapter Summary: Advanced SQL Features for Analysts

Aggregate Functions

Like Python, SQL has specific functions for calculating the total number of rows, sums, averages, and minimum and maximum values. These are referred to as aggregate functions. They collect, or aggregate, all the objects within a group to calculate a single summary value.

Here's the syntax of a statement with an aggregate function:


1SELECT
2 AGGREGATE_FUNCTION(field) AS here_you_are
3--here_you_are - name of the column where the function's output will be stored
4FROM
5 table;

When you call an aggregate function, it gives the column an unwieldy name. To avoid this, use the AS command and type a new, simpler one.

The COUNT function returns the number of rows in a table:


1SELECT
2 COUNT(*) AS cnt
3FROM
4 table

The number of rows can be calculated in various ways depending on the task:

COUNT(*) returns the total number of table rows
COUNT(column) returns the number of rows in a column
COUNT(DISTINCT column) returns the number of unique rows in a column

The SUM(column) function returns the sum of the values in a column. It ignores missing values.

The AVG(column) returns the average value of column values.

The smallest and largest values can be found using the MIN and MAX functions.

Example of using the aggregation functions:


1SELECT
2 COUNT(*) AS cnt,
3 COUNT(column) AS cnt_column, -- returns the number of rows in the column
4 COUNT(DISTINCT column) AS cnt_distinct_column,
5 /* returns the number of unique values
6 in the column */
7 SUM(column) AS sum_column, -- sum of column values
8 AVG(column) AS sum_column, -- average column value
9 MIN(column) AS sum_column, -- minimum column value
10 MAX(column) AS sum_column -- maximum column value
11FROM
12 table;

Grouping Data

When data is to be divided into groups by field values, the GROUP BY command is used:


1 SELECT
2 field_1,
3 field_2,
4 ...,
5 field_n,
6 AGGREGATE_FUNCTION(field) AS here_you_are
7FROM
8 table_name
9WHERE -- if needed
10 condition
11GROUP BY
12 field_1,
13 field_2,
14 ...,
15 field_n

Once you know which fields you'll be grouping by, make sure all those fields are listed in both the SELECT block and the GROUP BY block. The aggregate function itself shouldn't be included in the GROUP BY block; otherwise, the query won't comply. SQL's GROUP BY operates much like the groupby() method in pandas.

GROUP BY can be used with any aggregate function: COUNT, AVG, SUM, MAX, MIN. You can call several functions at a time.

Sorting Data

Analysis results are usually presented in a certain order. To sort data by a field, you use the ORDER BY command.


1SELECT
2 field_1,
3 field_2,
4 ...,
5 field_n,
6 AGGREGATE_FUNCTION(field) AS here_you_are
7FROM
8 table_name
9WHERE -- if needed
10 condition
11GROUP BY
12 field_1,
13 field_2,
14 ...,
15 field_n,
16ORDER BY -- if needed. List only those fields
17--by which the table data is to sorted
18 field_1,
19 field_2,
20 ...,
21 field_n,
22 here_you_are;

Unlike GROUP BY, with ORDER BY, only those fields by which we want to sort the data should be listed in the the command block.

Two modifiers can be used with the ORDER BY command to sort the data in columns:

ASC (the default) sorts data in ascending order.
DESC sorts data in descending order.

The ORDER BY modifiers are placed right after the field by which the data is sorted:


1ORDER BY
2 field_name DESC
3-- sorting data in descending order
4
5ORDER BY
6 field_name ASC;
7-- sorting data in ascending order

The LIMIT command sets a limit to the number of rows in the result. It always comes at the end of a statement, followed by the number of rows at which the limit is to be set ( $n$ ):


1SELECT
2 field_1,
3 field_2,
4 ...,
5 field_n,
6 AGGREGATE_FUNCTION(field) AS here_you_are
7FROM
8 table_name
9WHERE -- if needed
10 condition
11GROUP BY
12 field_1,
13 field_2,
14 ...,
15 field_n,
16ORDER BY -- if needed. List only those fields
17--by which the table data is to sorted
18 field_1,
19 field_2,
20 ...,
21 field_n,
22 here_you_are
23LIMIT -- if needed
24 n;
25-- n: the maximum number of rows to be returned

Processing Data within a Grouping

The WHERE construction is used to sort data by rows. Its parameters are, in fact, table rows. When we need to sort data by aggregate function results, we use the HAVING construction, which has a lot in common with WHERE:


1 SELECT
2 field_1,
3 field_2,
4 ...,
5 field_n,
6 AGGREGATE_FUNCTION(field) AS here_you_are
7FROM
8 TABLE
9WHERE -- if needed
10 condition
11GROUP BY
12 field_1,
13 field_2,
14 ...,
15 field_n
16HAVING
17 AGGREGATE_FUNCTION(field_for_grouping) > n
18ORDER BY -- if needed. List only those fields
19--by which the data is to be sorted
20 field_1,
21 field_2,
22 ...,
23 field_n,
24 here_you_are
25LIMIT -- if needed
26 n;
27

The resulting selection will include only those rows for which the aggregate function produces results that meet the condition indicated in the HAVING and WHERE blocks.

HAVING and WHERE have a lot in common. So why can't we pass all of our conditions to one of them? The thing is that the WHERE command is compiled before grouping and arithmetic operations are carried out. That's why it's impossible to set sorting parameters for the results of an aggregate function with WHERE. Hence the need for HAVING.

Pay special attention to the order in which the commands are introduced:

GROUP BY
HAVING
ORDER BY

This order is mandatory. Otherwise, the code won't work.

Operators and Functions for Working with Dates

We have two major functions for working with date and time values: EXTRACT and DATE_TRUNC.* Both functions are called in the SELECT block.

Here's what the EXTRACT function looks like:


1SELECT
2  EXTRACT(date_fragment FROM column_name) AS new_column_with_date
3FROM
4 Table_with_all_dates;

EXTRACT, unsurprisingly, extracts the information you need from the timestamp. You can retrieve:

century
day
doy — day of the year, from 1 to 365/366
isodow (day of the week under ISO 8601, the international date and time format); Monday is 1, Sunday is 7
hour
milliseconds
minute
second
month
quarter
week — week of the year
year

DATE_TRUNC truncates **the date when you only need a certain level of precision. (For example, if you need to know what day an order was placed but the hour doesn't matter, you can use DATE_TRUNC with the argument "day.") Unlike with EXTRACT, the resulting truncated date is given as a string. The column from which the full date is to be taken comes after a comma:


1SELECT
2  DATE_TRUNC('date_fragment_to_be_truncated_to', column_name) AS new_column_with_date
3FROM
4 Table_with_all_dates;

You can use the following arguments with the DATE_TRUNC function:

'microseconds' 'milliseconds' 'second' 'minute' 'hour' 'day' 'week' 'month' 'quarter' 'year' 'decade' 'century'

Subqueries

A subquery, or inner query, is a query inside a query. It retrieves information that will later be used in the outer query.

Subqueries can be used at various locations within a query. If a subquery is inside the FROM block, SELECT will select data from the table that gets generated by the subquery. The name of the table is indicated within the inner query, and the outer query refers to the table's columns. Subqueries are always put in parentheses:


1SELECT
2SUBQUERY_1.column_name,
3SUBQUERY_1.column_name_2
4FROM -- to make the code readable, put subqueries in new lines
5 -- indent subqueries
6 (SELECT
7  column_name,
8  column_name_2
9  FROM
10  table_name
11  WHERE
12  column_name = value) AS SUBQUERY_1;
13-- remember to name your subquery in FROM block

You may need subqueries at various places within your query. Let's put one in the WHERE block. The main query will compare the results of the subquery with values from the table in the outer FROM block. When there's a match, the data will be selected:


1SELECT
2 column_name,
3 column_name_1
4FROM
5 table_name
6WHERE
7 column_name =
8  (SELECT
9   column_1
10  FROM
11   table_name_2
12  WHERE
13   column_1  = value);

Now let's add the IN construction to our sample and collect data from several columns:


1SELECT
2 column_name,
3 column_name_1
4FROM
5 table_name
6WHERE
7 column_name IN
8   (SELECT
9    column_1
10   FROM
11    table_name_2
12   WHERE
13    column_1 = value_1 OR column_1 = value_2);

Window Functions

In SQL, a window is a sequence of rows on which the calculations are made. It can be either the entire table or, for instance, the six rows above the current one. Working with such windows is different from working with regular requests.


1SELECT
2 author_id,
3 name,
4 price/SUM(price) AS ratio OVER ()
5FROM
6 books_price;

The function preceding the OVER keyword will be executed on the data inside the window. If we don't indicate any parameters (as here), the entire result of the query will be used.

If we want to group the data, we use PARTITION BY:


1SELECT
2 author_id,
3 name,
4 price/SUM(price) AS ratio OVER (PARTITION BY
5                  author_id)
6FROM
7 books_price;

A Closer Look at Window Functions

More important keywords when using window functions:

ORDER BY — lets us define the sorting order of the rows through which the window will run

ROWS — where we indicate the window frames above which an aggregate function is to be calculated


1SELECT
2 author_id,
3 name,
4 SUM(price) OVER (ORDER BY
5           author_id
6         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
7FROM
8 books_price;

Indicating frames:

UNBOUNDED PRECEDING — all rows that are above the current one
N PRECEDING — the $n$ rows above the current one
CURRENT ROW — the current row
N FOLLOWING — the $n$ rows below the current one
UNBOUNDED FOLLOWING — all rows below the current one