Chapter Summary: Advanced SQL Features for Analysts
Aggregate Functions
Like Python, SQL has specific functions for calculating the total number of rows, sums, averages, and minimum and maximum values. These are referred to as aggregate functions. They collect, or aggregate, all the objects within a group to calculate a single summary value.
Here's the syntax of a statement with an aggregate function:
1SELECT2 AGGREGATE_FUNCTION(field) AS here_you_are3--here_you_are - name of the column where the function's output will be stored4FROM5 table;
When you call an aggregate function, it gives the column an unwieldy name. To avoid this, use the AS command and type a new, simpler one.
The COUNT function returns the number of rows in a table:
1SELECT2 COUNT(*) AS cnt3FROM4 table
The number of rows can be calculated in various ways depending on the task:
- COUNT(*) returns the total number of table rows
- COUNT(column) returns the number of rows in a
column
- COUNT(DISTINCT column) returns the number of unique rows in a
column
The SUM(column) function returns the sum of the values in a column
. It ignores missing values.
The AVG(column) returns the average value of column
values.
The smallest and largest values can be found using the MIN and MAX functions.
Example of using the aggregation functions:
1SELECT2 COUNT(*) AS cnt,3 COUNT(column) AS cnt_column, -- returns the number of rows in the column4 COUNT(DISTINCT column) AS cnt_distinct_column,5 /* returns the number of unique values6 in the column */7 SUM(column) AS sum_column, -- sum of column values8 AVG(column) AS sum_column, -- average column value9 MIN(column) AS sum_column, -- minimum column value10 MAX(column) AS sum_column -- maximum column value11FROM12 table;
Grouping Data
When data is to be divided into groups by field values, the GROUP BY command is used:
1 SELECT2 field_1,3 field_2,4 ...,5 field_n,6 AGGREGATE_FUNCTION(field) AS here_you_are7FROM8 table_name9WHERE -- if needed10 condition11GROUP BY12 field_1,13 field_2,14 ...,15 field_n
Once you know which fields you'll be grouping by, make sure all those fields are listed in both the SELECT block and the GROUP BY block. The aggregate function itself shouldn't be included in the GROUP BY block; otherwise, the query won't comply. SQL's GROUP BY operates much like the groupby()
method in pandas.
GROUP BY can be used with any aggregate function: COUNT, AVG, SUM, MAX, MIN. You can call several functions at a time.
Sorting Data
Analysis results are usually presented in a certain order. To sort data by a field, you use the ORDER BY command.
1SELECT2 field_1,3 field_2,4 ...,5 field_n,6 AGGREGATE_FUNCTION(field) AS here_you_are7FROM8 table_name9WHERE -- if needed10 condition11GROUP BY12 field_1,13 field_2,14 ...,15 field_n,16ORDER BY -- if needed. List only those fields17--by which the table data is to sorted18 field_1,19 field_2,20 ...,21 field_n,22 here_you_are;
Unlike GROUP BY, with ORDER BY, only those fields by which we want to sort the data should be listed in the the command block.
Two modifiers can be used with the ORDER BY command to sort the data in columns:
- ASC (the default) sorts data in ascending order.
- DESC sorts data in descending order.
The ORDER BY modifiers are placed right after the field by which the data is sorted:
1ORDER BY2 field_name DESC3-- sorting data in descending order45ORDER BY6 field_name ASC;7-- sorting data in ascending order
The LIMIT command sets a limit to the number of rows in the result. It always comes at the end of a statement, followed by the number of rows at which the limit is to be set ():
1SELECT2 field_1,3 field_2,4 ...,5 field_n,6 AGGREGATE_FUNCTION(field) AS here_you_are7FROM8 table_name9WHERE -- if needed10 condition11GROUP BY12 field_1,13 field_2,14 ...,15 field_n,16ORDER BY -- if needed. List only those fields17--by which the table data is to sorted18 field_1,19 field_2,20 ...,21 field_n,22 here_you_are23LIMIT -- if needed24 n;25-- n: the maximum number of rows to be returned
Processing Data within a Grouping
The WHERE construction is used to sort data by rows. Its parameters are, in fact, table rows. When we need to sort data by aggregate function results, we use the HAVING construction, which has a lot in common with WHERE:
1 SELECT2 field_1,3 field_2,4 ...,5 field_n,6 AGGREGATE_FUNCTION(field) AS here_you_are7FROM8 TABLE9WHERE -- if needed10 condition11GROUP BY12 field_1,13 field_2,14 ...,15 field_n16HAVING17 AGGREGATE_FUNCTION(field_for_grouping) > n18ORDER BY -- if needed. List only those fields19--by which the data is to be sorted20 field_1,21 field_2,22 ...,23 field_n,24 here_you_are25LIMIT -- if needed26 n;27
The resulting selection will include only those rows for which the aggregate function produces results that meet the condition indicated in the HAVING and WHERE blocks.
HAVING and WHERE have a lot in common. So why can't we pass all of our conditions to one of them? The thing is that the WHERE command is compiled before grouping and arithmetic operations are carried out. That's why it's impossible to set sorting parameters for the results of an aggregate function with WHERE. Hence the need for HAVING.
Pay special attention to the order in which the commands are introduced:
GROUP BY
HAVING
ORDER BY
This order is mandatory. Otherwise, the code won't work.
Operators and Functions for Working with Dates
We have two major functions for working with date and time values: EXTRACT and DATE_TRUNC.* Both functions are called in the SELECT block.
Here's what the EXTRACT function looks like:
1SELECT2 EXTRACT(date_fragment FROM column_name) AS new_column_with_date3FROM4 Table_with_all_dates;
EXTRACT, unsurprisingly, extracts the information you need from the timestamp. You can retrieve:
century
day
doy
— day of the year, from 1 to 365/366isodow
(day of the week under ISO 8601, the international date and time format); Monday is 1, Sunday is 7hour
milliseconds
minute
second
month
quarter
week
— week of the yearyear
DATE_TRUNC truncates **the date when you only need a certain level of precision. (For example, if you need to know what day an order was placed but the hour doesn't matter, you can use DATE_TRUNC with the argument "day.") Unlike with EXTRACT, the resulting truncated date is given as a string. The column from which the full date is to be taken comes after a comma:
1SELECT2 DATE_TRUNC('date_fragment_to_be_truncated_to', column_name) AS new_column_with_date3FROM4 Table_with_all_dates;
You can use the following arguments with the DATE_TRUNC function:
'microseconds'
'milliseconds'
'second'
'minute'
'hour'
'day'
'week'
'month'
'quarter'
'year'
'decade'
'century'
Subqueries
A subquery, or inner query, is a query inside a query. It retrieves information that will later be used in the outer query.
Subqueries can be used at various locations within a query. If a subquery is inside the FROM block, SELECT will select data from the table that gets generated by the subquery. The name of the table is indicated within the inner query, and the outer query refers to the table's columns. Subqueries are always put in parentheses:
1SELECT2SUBQUERY_1.column_name,3SUBQUERY_1.column_name_24FROM -- to make the code readable, put subqueries in new lines5 -- indent subqueries6 (SELECT7 column_name,8 column_name_29 FROM10 table_name11 WHERE12 column_name = value) AS SUBQUERY_1;13-- remember to name your subquery in FROM block
You may need subqueries at various places within your query. Let's put one in the WHERE block. The main query will compare the results of the subquery with values from the table in the outer FROM block. When there's a match, the data will be selected:
1SELECT2 column_name,3 column_name_14FROM5 table_name6WHERE7 column_name =8 (SELECT9 column_110 FROM11 table_name_212 WHERE13 column_1 = value);
Now let's add the IN construction to our sample and collect data from several columns:
1SELECT2 column_name,3 column_name_14FROM5 table_name6WHERE7 column_name IN8 (SELECT9 column_110 FROM11 table_name_212 WHERE13 column_1 = value_1 OR column_1 = value_2);
Window Functions
In SQL, a window is a sequence of rows on which the calculations are made. It can be either the entire table or, for instance, the six rows above the current one. Working with such windows is different from working with regular requests.
1SELECT2 author_id,3 name,4 price/SUM(price) AS ratio OVER ()5FROM6 books_price;
The function preceding the OVER keyword will be executed on the data inside the window. If we don't indicate any parameters (as here), the entire result of the query will be used.
If we want to group the data, we use PARTITION BY:
1SELECT2 author_id,3 name,4 price/SUM(price) AS ratio OVER (PARTITION BY5 author_id)6FROM7 books_price;
A Closer Look at Window Functions
More important keywords when using window functions:
ORDER BY — lets us define the sorting order of the rows through which the window will run
ROWS — where we indicate the window frames above which an aggregate function is to be calculated
1SELECT2 author_id,3 name,4 SUM(price) OVER (ORDER BY5 author_id6 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)7FROM8 books_price;
Indicating frames:
- UNBOUNDED PRECEDING — all rows that are above the current one
- N PRECEDING — the rows above the current one
- CURRENT ROW — the current row
- N FOLLOWING — the rows below the current one
- UNBOUNDED FOLLOWING — all rows below the current one