6 Most Useful SQL Window Functions You Should Definitely Know about!

20 July 2024

0

This article was published as a part of the Data Science Blogathon

Window Functions or Analytic Functions in SQL

A window function (or Analytic Function) in SQL uses values from multiple rows (or one row) to return values for each row. Do not confuse this with aggregate functions, which return a single value after aggregating multiple rows. There’s a simple way to identify the window functions. They have an OVER clause. Any function in SQL without an OVER clause is not a window function. It will be either an aggregate function or a scalar (single-row) function.

In this article, I am going to use this table, named population to showcase the examples and explain the codes. The first 5 rows of the table are as below. This is the data set which has a population for every state of USA for years from 1990 to 2013 and also has split over the total population and population of people above 18 years of age.

To understand the concepts, we will use the USA population dataset here. You may download the data set from my GitHub page here, and practice along with it.

Note: I am doing these operations in Jupyter Notebook(in Python Environment, on SQLite Syntax), and when you do the same in SQL editor, the table layout may look a bit different, but the content and codes are same.

SELECT * FROM population limit 5;

	state	ages	year	population
0	AL	under18	2012	1117489.0
1	AL	total	2012	4817528.0
2	AL	under18	2010	1130966.0
3	AL	total	2010	4785570.0
4	AL	under18	2011	1125763.0

Example: Here is a SQL Query which uses a windows function to compare the population of each state for every year with the average population of that state over the period of all the years available in dataset.

SELECT *, avg(population) OVER (PARTITION BY state) FROM population;

	state	ages	year	population	avg(population) OVER (PARTITION BY state)
0	AK	total	1990	553290.0	416438.895833
1	AK	under18	1990	177502.0	416438.895833
2	AK	total	1992	588736.0	416438.895833
3	AK	under18	1991	182180.0	416438.895833
4	AK	under18	1992	184878.0	416438.895833
…	…	…	…	…	…
2539	WY	under18	1993	137458.0	321286.541667
2540	WY	total	1991	459260.0	321286.541667
2541	WY	under18	1991	136720.0	321286.541667
2542	WY	under18	1990	136078.0	321286.541667
2543	WY	total	1990	453690.0	321286.541667

2544 rows × 5 columns

The PARTITION BY is grouping the rows into partitions (by year) and the function (average function here) is applied to each partition separately. If you omit the PARTITION BY clause, and the OVER() a clause is empty, then the entire data set is treated as a single partition, and instead of getting the AVG population of states over the years, you will get the average population of all the states for the full period of time.

Now let us dive in and see different Window functions and their Python Counterpart along with codes and results.

LEAD() and LAG()

lead() and LAG() are used to compare the rows with their previous(preceding) or next(following) rows. For this to work best, the data shall be in order or sorted. These functions are most useful when you want to compare one period of time with a previous period of time based on the certain metrics. Some generic examples can be as below.

Example:

The share price difference of a company from yesterday to today.
Increase or decrease in the number of visitors to your website on a daily basis
Sales of this year compared to sales of previous year etc.

Let’s see the comparison of the population for next year, as compared to the population of the previous year.

We can use LEAD() and LAG() to create new columns where we pull values from other rows. LEAD() pulls the value from the next row and LAG() pulls the value from the previous row.

SELECT *,
       LAG(population, 1) OVER
         (PARTITION BY state ORDER BY year) AS lag,
       LEAD(population, 1) OVER
         (PARTITION BY state ORDER BY year) AS lead
  FROM population

	state	ages	year	population	lag	lead
0	AK	total	1990	553290.0	NaN	177502.0
1	AK	under18	1990	177502.0	553290.0	182180.0
2	AK	under18	1991	182180.0	177502.0	570193.0
3	AK	total	1991	570193.0	182180.0	588736.0
4	AK	total	1992	588736.0	570193.0	184878.0
…	…	…	…	…	…	…
2539	WY	under18	2011	135407.0	567329.0	576626.0
2540	WY	total	2012	576626.0	135407.0	136526.0
2541	WY	under18	2012	136526.0	576626.0	582658.0
2542	WY	total	2013	582658.0	136526.0	137679.0
2543	WY	under18	2013	137679.0	582658.0	NaN

2544 rows × 6 columns

Cumulative Sum

These are also known as running totals. Doing the running totals is simple when you use a windows function like sum(). This is particularly used when you want to show the accumulation or growth of any metric over time.

Get a running total of sales from stores over days of a month.
show running total or cumulative time spent on your website per user
Get a running total of customers acquired by a sales representative

The following example will show you how to do the cumulative sum of the population for states over years.

SELECT *,
SUM(population) OVER (PARTITION BY state ORDER BY year) as Total_Population
FROM population
where ages = 'total'

	state	ages	year	population	Total_Population
0	AK	total	1990	553290.0	553290.0
1	AK	total	1991	570193.0	1123483.0
2	AK	total	1992	588736.0	1712219.0
3	AK	total	1993	599434.0	2311653.0
4	AK	total	1994	603308.0	2914961.0
…	…	…	…	…	…
1267	WY	total	2009	559851.0	9957038.0
1268	WY	total	2010	564222.0	10521260.0
1269	WY	total	2011	567329.0	11088589.0
1270	WY	total	2012	576626.0	11665215.0
1271	WY	total	2013	582658.0	12247873.0

1272 rows × 5 columns

Moving Averages

Moving Averages are very effective in forecasting values in time series. If you have ever looked at share price analysis, the analysts talk in terms of DMA. This DMA is nothing but Day Moving Average. So when they say this share is now priced above its 200 DMA, it means it is now trading at a price that is higher than the average price of this share, and this Average is the calculated average of the previous 200 Days closing prices.

Moving Averages can be a good metric of weekly sales
Moving Averages can be used to compare the price of objects with their average price in the previous month
Moving averages can also give an idea of average hits on your website over the last 90 days

Our Population data is for the years 1990 to 2013. Here, I am interested in knowing, how the population for the present year is as compared to the previous 5 years average population. This Moving Average for the last 5 years can be calculated in SQL by following the window function.

SELECT *,
round(AVG(population) OVER (ORDER BY state, year ROWS 5 PRECEDING),2) as Avg_Populationfor5Year
from population
where ages = 'total'

	state	ages	year	population	Avg_Populationfor5Year
0	AK	total	1990	553290.0	553290.00
1	AK	total	1991	570193.0	561741.50
2	AK	total	1992	588736.0	570739.67
3	AK	total	1993	599434.0	577913.25
4	AK	total	1994	603308.0	582992.20
…	…	…	…	…	…
1267	WY	total	2009	559851.0	531116.67
1268	WY	total	2010	564222.0	540302.67
1269	WY	total	2011	567329.0	549164.67
1270	WY	total	2012	576626.0	558157.83
1271	WY	total	2013	582658.0	566121.50

1272 rows × 5 columns

ROW_NUMBER()

When you are working on data that has an element of time in it, or more specifically time series data, you can conclusively say that the events are happening in a particular order. Now you need to work upon groups of data and need to use some particular events, say the latest ones, or the oldest ones. These are still simpler, provided some of the time series functions (not the scope of this article).

But what if you want the 2nd last event or the 3rd from the latest? How nice it would be if you could number the events like they are separate rows, for each group, and then use the row numbers to access them when needed. This ROW_NUMBER() the function helps you do exactly that, and much more.

Let us see an example where we will number the rows, and then use one particular row number for all groups in order to use that data.

SELECT *,
    ROW_NUMBER() OVER (PARTITION BY state ORDER BY year) as row_num
    FROM population
    WHERE ages = 'total'

	state	ages	year	population	row_num
0	AK	total	1990	553290.0	1
1	AK	total	1991	570193.0	2
2	AK	total	1992	588736.0	3
3	AK	total	1993	599434.0	4
4	AK	total	1994	603308.0	5
…	…	…	…	…	…
1267	WY	total	2009	559851.0	20
1268	WY	total	2010	564222.0	21
1269	WY	total	2011	567329.0	22
1270	WY	total	2012	576626.0	23
1271	WY	total	2013	582658.0	24

1272 rows × 5 columns

Now let us use the row numbers to meaningfully take out the third year for each state. This we can now do easily as we have numbered the rows, and 3rd row for each group is what we seek.

SELECT *     FROM population     WHERE row_num = 3

If you notice, in the previous SQL code, we did not create a table to save the row_num column. You can do so if you want to, but here to show you how the code works, I am going to combine the two SQL queries, and they work like charm.

The output table here may look long and overwhelming, but notice that its population of all States and Regions of USA for the year 1992, the third year in the list of years from 1990 to 2013.

SELECT *
    FROM(
        SELECT *,
        ROW_NUMBER() OVER (PARTITION BY state ORDER BY year) as row_num
        FROM population
        WHERE ages = 'total'
        )
    WHERE row_num = 3

	state	ages	year	population	row_num
0	AK	total	1992	588736.0	3
1	AL	total	1992	4154014.0	3
2	AR	total	1992	2415984.0	3
3	AZ	total	1992	3915740.0	3
4	CA	total	1992	30974659.0	3
5	CO	total	1992	3495939.0	3
6	CT	total	1992	3300712.0	3
7	DC	total	1992	597567.0	3
8	DE	total	1992	694927.0	3
9	FL	total	1992	13650553.0	3
10	GA	total	1992	6817203.0	3
11	HI	total	1992	1158613.0	3
12	IA	total	1992	2818401.0	3
13	ID	total	1992	1071685.0	3
14	IL	total	1992	11694184.0	3
15	IN	total	1992	5674547.0	3
16	KS	total	1992	2532395.0	3
17	KY	total	1992	3765469.0	3
18	LA	total	1992	4293003.0	3
19	MA	total	1992	6028709.0	3
20	MD	total	1992	4923369.0	3
21	ME	total	1992	1238508.0	3
22	MI	total	1992	9479065.0	3
23	MN	total	1992	4495572.0	3
24	MO	total	1992	5217101.0	3
25	MS	total	1992	2623734.0	3
26	MT	total	1992	825770.0	3
27	NC	total	1992	6897214.0	3
28	ND	total	1992	638223.0	3
29	NE	total	1992	1611687.0	3
30	NH	total	1992	1117785.0	3
31	NJ	total	1992	7880508.0	3
32	NM	total	1992	1595442.0	3
33	NV	total	1992	1351367.0	3
34	NY	total	1992	18246653.0	3
35	OH	total	1992	11029431.0	3
36	OK	total	1992	3220517.0	3
37	OR	total	1992	2991755.0	3
38	PA	total	1992	12049450.0	3
39	PR	total	1992	NaN	3
40	RI	total	1992	1012581.0	3
41	SC	total	1992	3620464.0	3
42	SD	total	1992	712801.0	3
43	TN	total	1992	5049742.0	3
44	TX	total	1992	17759738.0	3
45	USA	total	1992	256514231.0	3
46	UT	total	1992	1836799.0	3
47	VA	total	1992	6414307.0	3
48	VT	total	1992	572751.0	3
49	WA	total	1992	5160757.0	3
50	WI	total	1992	5025398.0	3
51	WV	total	1992	1806451.0	3
52	WY	total	1992	466251.0	3

RANKING in SQL

We have the following rank functions in SQL.

ROW_NUMBER()
RANK()
DENSE_RANK()
NTILE()

To give rank as per the value of records in your table, you may use RANK() function in SQL. The working of the same is similar to the ROW_NUMBER() in SQL, with a slight difference. The RANK() function gives same rank to equal values. The need for a ranking function is quite intuitive. You might want to

rank the students based on their grades
rank the countries based on their GDP or Population
rank the stores based on their profit

    SELECT *,
    RANK() OVER (PARTITION BY state ORDER BY population) as rank
    FROM population

	state	ages	year	population	rank
0	AK	under18	1990	177502.0	1
1	AK	under18	1991	182180.0	2
2	AK	under18	2008	183124.0	3
3	AK	under18	2007	184344.0	4
4	AK	under18	1992	184878.0	5
…	…	…	…	…	…
2539	WY	total	2009	559851.0	44
2540	WY	total	2010	564222.0	45
2541	WY	total	2011	567329.0	46
2542	WY	total	2012	576626.0	47
2543	WY	total	2013	582658.0	48

2544 rows × 5 columns

Use the DESC option when you want the ranking to be done in Descending order. In this case you will observe that the rank here is in the reverse order of the rank in table above, and row_num in the table before that.

    SELECT *,
    RANK() OVER (PARTITION BY state ORDER BY population DESC) as rank
    FROM population

	state	ages	year	population	rank
0	AK	total	2013	735132.0	1
1	AK	total	2012	730307.0	2
2	AK	total	2011	723375.0	3
3	AK	total	2010	713868.0	4
4	AK	total	2009	698895.0	5
…	…	…	…	…	…
2539	WY	under18	2006	125525.0	44
2540	WY	under18	2002	125495.0	45
2541	WY	under18	2003	124182.0	46
2542	WY	under18	2005	124022.0	47
2543	WY	under18	2004	123974.0	48

2544 rows × 5 columns

Conclusion:

In this article, you got introduced to the windows functions in SQL. These are the backbone of any data preparation operation using SQL. They will come in handy when you prepare data to feed into your machine learning model, Feature engineering or maybe just analyzing some pattern in the data.

In this article, you saw how to use SQL window functions work. But if you want to connect the two most powerful workhorses of the Data Science world, SQL and Python. This is not the end, but only the first step towards getting the “Best of Both Worlds”.

Now you can start using Python to work upon your data which rests in SQL Databases. In able to connect to your SQL databases, go thru my article How to Access & Use SQL Database with pyodbc in Python. Once you brought it as DataFrame, then all the operations are usual Pandas operations or SQL queries being operated on Pandas DataFrame as you saw in this article.

Apart from the function of SQL shown in this article, many other popular SQL functions are easily implementable in Python. Read 15 Pandas functions to replicate basic SQL Queries in Python for learning how to do that.

The implied learning in this article was, that you can use Python to do things that you thought were only possible using SQL. There may or may not be straight forward solution to things, but if you are inclined to find it, there are enough resources at your disposal to find a way out. You can look at the mix and match the learning from my book, PYTHON MADE EASY – Step by Step Guide to Programming and Data Analysis using Python for Beginners and Intermediate Level.

About the Author: I am Nilabh Nishchhal. I like making seemingly difficult topics easy and write about them. Check out more at https://www.authornilabh.com/. My attempt to make Python easy and Accessible to all is Python Made Easy.

Cover Photo Credit: Photo by Mari Madriz on Unsplash

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Nilabh Nishchhal

15 Jul 2021

Beginner Data Exploration Data Science SQL

6 Most Useful SQL Window Functions You Should Definitely Know about!

Window Functions or Analytic Functions in SQL

LEAD() and LAG()

Cumulative Sum

Moving Averages

ROW_NUMBER()

RANKING in SQL

Conclusion:

Run Local AWS Cloud Stack using LocalStack on Linux

Learn Terraform Automation in 3 days using Video Courses

How To Expose Ansible AWX Service using Nginx Ingress

LEAVE A REPLY Cancel reply

Most Popular

Samsung offers free screen replacements for users still suffering green line issues

7 Best Free Antiviruses for Mac in 2024: Are They Any Good? by Katarina Glamoslija

Is Microsoft Teams Secure? Use Teams Safely in 2024 by Tyler Cross

Interview With Willem Dewulf – CEO of ProBackup by Shauli Zacks

Recent Comments

EDITOR PICKS

Samsung offers free screen replacements for users still suffering green line issues

7 Best Free Antiviruses for Mac in 2024: Are They Any Good? by Katarina Glamoslija

Is Microsoft Teams Secure? Use Teams Safely in 2024 by Tyler Cross

POPULAR POSTS

Samsung offers free screen replacements for users still suffering green line issues

7 Best Free Antiviruses for Mac in 2024: Are They Any Good? by Katarina Glamoslija

Is Microsoft Teams Secure? Use Teams Safely in 2024 by Tyler Cross

POPULAR CATEGORY

ABOUT US

FOLLOW US