Saturday, November 16, 2024
Google search engine
HomeData Modelling & AI6 Most Useful SQL Window Functions You Should Definitely Know about!

6 Most Useful SQL Window Functions You Should Definitely Know about!

This article was published as a part of the Data Science Blogathon

Window Functions or Analytic Functions in SQL

A window function (or Analytic Function) in SQL uses values from multiple rows (or one row) to return values for each row. Do not confuse this with aggregate functions, which return a single value after aggregating multiple rows. There’s a simple way to identify the window functions. They have an OVER clause. Any function in SQL without an OVER clause is not a window function. It will be either an aggregate function or a scalar (single-row) function.

In this article, I am going to use this table, named population to showcase the examples and explain the codes. The first 5 rows of the table are as below. This is the data set which has a population for every state of USA for years from 1990 to 2013 and also has split over the total population and population of people above 18 years of age.

To understand the concepts, we will use the USA population dataset here. You may download the data set from my GitHub page here, and practice along with it.

Note: I am doing these operations in Jupyter Notebook(in Python Environment, on SQLite Syntax), and when you do the same in SQL editor, the table layout may look a bit different, but the content and codes are same.

SELECT * FROM population limit 5;
state ages year population
0 AL under18 2012 1117489.0
1 AL total 2012 4817528.0
2 AL under18 2010 1130966.0
3 AL total 2010 4785570.0
4 AL under18 2011 1125763.0

Example: Here is a SQL Query which uses a windows function to compare the population of each state for every year with the average population of that state over the period of all the years available in dataset.

SELECT *, avg(population) OVER (PARTITION BY state) FROM population;
state ages year population avg(population) OVER (PARTITION BY state)
0 AK total 1990 553290.0 416438.895833
1 AK under18 1990 177502.0 416438.895833
2 AK total 1992 588736.0 416438.895833
3 AK under18 1991 182180.0 416438.895833
4 AK under18 1992 184878.0 416438.895833
2539 WY under18 1993 137458.0 321286.541667
2540 WY total 1991 459260.0 321286.541667
2541 WY under18 1991 136720.0 321286.541667
2542 WY under18 1990 136078.0 321286.541667
2543 WY total 1990 453690.0 321286.541667

2544 rows × 5 columns

The PARTITION BY is grouping the rows into partitions (by year) and the function (average function here) is applied to each partition separately. If you omit the PARTITION BY clause, and the OVER() a clause is empty, then the entire data set is treated as a single partition, and instead of getting the AVG population of states over the years, you will get the average population of all the states for the full period of time.

Now let us dive in and see different Window functions and their Python Counterpart along with codes and results.

LEAD() and LAG()

lead() and LAG() are used to compare the rows with their previous(preceding) or next(following) rows. For this to work best, the data shall be in order or sorted. These functions are most useful when you want to compare one period of time with a previous period of time based on the certain metrics. Some generic examples can be as below.

Example:

  • The share price difference of a company from yesterday to today.
  • Increase or decrease in the number of visitors to your website on a daily basis
  • Sales of this year compared to sales of previous year etc.

Let’s see the comparison of the population for next year, as compared to the population of the previous year.

We can use LEAD() and LAG() to create new columns where we pull values from other rows. LEAD() pulls the value from the next row and LAG() pulls the value from the previous row.

SELECT *,
       LAG(population, 1) OVER
         (PARTITION BY state ORDER BY year) AS lag,
       LEAD(population, 1) OVER
         (PARTITION BY state ORDER BY year) AS lead
  FROM population
state ages year population lag lead
0 AK total 1990 553290.0 NaN 177502.0
1 AK under18 1990 177502.0 553290.0 182180.0
2 AK under18 1991 182180.0 177502.0 570193.0
3 AK total 1991 570193.0 182180.0 588736.0
4 AK total 1992 588736.0 570193.0 184878.0
2539 WY under18 2011 135407.0 567329.0 576626.0
2540 WY total 2012 576626.0 135407.0 136526.0
2541 WY under18 2012 136526.0 576626.0 582658.0
2542 WY total 2013 582658.0 136526.0 137679.0
2543 WY under18 2013 137679.0 582658.0 NaN

2544 rows × 6 columns

Cumulative Sum

These are also known as running totals. Doing the running totals is simple when you use a windows function like sum(). This is particularly used when you want to show the accumulation or growth of any metric over time.

  • Get a running total of sales from stores over days of a month.
  • show running total or cumulative time spent on your website per user
  • Get a running total of customers acquired by a sales representative

The following example will show you how to do the cumulative sum of the population for states over years.

SELECT *,
SUM(population) OVER (PARTITION BY state ORDER BY year) as Total_Population
FROM population
where ages = 'total'
state ages year population Total_Population
0 AK total 1990 553290.0 553290.0
1 AK total 1991 570193.0 1123483.0
2 AK total 1992 588736.0 1712219.0
3 AK total 1993 599434.0 2311653.0
4 AK total 1994 603308.0 2914961.0
1267 WY total 2009 559851.0 9957038.0
1268 WY total 2010 564222.0 10521260.0
1269 WY total 2011 567329.0 11088589.0
1270 WY total 2012 576626.0 11665215.0
1271 WY total 2013 582658.0 12247873.0

1272 rows × 5 columns

Moving Averages

Moving Averages are very effective in forecasting values in time series. If you have ever looked at share price analysis, the analysts talk in terms of DMA. This DMA is nothing but Day Moving Average. So when they say this share is now priced above its 200 DMA, it means it is now trading at a price that is higher than the average price of this share, and this Average is the calculated average of the previous 200 Days closing prices.

  • Moving Averages can be a good metric of weekly sales
  • Moving Averages can be used to compare the price of objects with their average price in the previous month
  • Moving averages can also give an idea of average hits on your website over the last 90 days

Our Population data is for the years 1990 to 2013. Here, I am interested in knowing, how the population for the present year is as compared to the previous 5 years average population. This Moving Average for the last 5 years can be calculated in SQL by following the window function.

SELECT *,
round(AVG(population) OVER (ORDER BY state, year ROWS 5 PRECEDING),2) as Avg_Populationfor5Year
from population
where ages = 'total'
state ages year population Avg_Populationfor5Year
0 AK total 1990 553290.0 553290.00
1 AK total 1991 570193.0 561741.50
2 AK total 1992 588736.0 570739.67
3 AK total 1993 599434.0 577913.25
4 AK total 1994 603308.0 582992.20
1267 WY total 2009 559851.0 531116.67
1268 WY total 2010 564222.0 540302.67
1269 WY total 2011 567329.0 549164.67
1270 WY total 2012 576626.0 558157.83
1271 WY total 2013 582658.0 566121.50

1272 rows × 5 columns

ROW_NUMBER()

When you are working on data that has an element of time in it, or more specifically time series data, you can conclusively say that the events are happening in a particular order. Now you need to work upon groups of data and need to use some particular events, say the latest ones, or the oldest ones. These are still simpler, provided some of the time series functions (not the scope of this article).

But what if you want the 2nd last event or the 3rd from the latest? How nice it would be if you could number the events like they are separate rows, for each group, and then use the row numbers to access them when needed. This ROW_NUMBER() the function helps you do exactly that, and much more.

Let us see an example where we will number the rows, and then use one particular row number for all groups in order to use that data.

SELECT *,
    ROW_NUMBER() OVER (PARTITION BY state ORDER BY year) as row_num
    FROM population
    WHERE ages = 'total'
state ages year population row_num
0 AK total 1990 553290.0 1
1 AK total 1991 570193.0 2
2 AK total 1992 588736.0 3
3 AK total 1993 599434.0 4
4 AK total 1994 603308.0 5
1267 WY total 2009 559851.0 20
1268 WY total 2010 564222.0 21
1269 WY total 2011 567329.0 22
1270 WY total 2012 576626.0 23
1271 WY total 2013 582658.0 24

1272 rows × 5 columns

Now let us use the row numbers to meaningfully take out the third year for each state. This we can now do easily as we have numbered the rows, and 3rd row for each group is what we seek.

SELECT *     FROM population     WHERE row_num = 3

If you notice, in the previous SQL code, we did not create a table to save the row_num column. You can do so if you want to, but here to show you how the code works, I am going to combine the two SQL queries, and they work like charm.

The output table here may look long and overwhelming, but notice that its population of all States and Regions of USA for the year 1992, the third year in the list of years from 1990 to 2013.

SELECT *
    FROM(
        SELECT *,
        ROW_NUMBER() OVER (PARTITION BY state ORDER BY year) as row_num
        FROM population
        WHERE ages = 'total'
        )
    WHERE row_num = 3
state ages year population row_num
0 AK total 1992 588736.0 3
1 AL total 1992 4154014.0 3
2 AR total 1992 2415984.0 3
3 AZ total 1992 3915740.0 3
4 CA total 1992 30974659.0 3
5 CO total 1992 3495939.0 3
6 CT total 1992 3300712.0 3
7 DC total 1992 597567.0 3
8 DE total 1992 694927.0 3
9 FL total 1992 13650553.0 3
10 GA total 1992 6817203.0 3
11 HI total 1992 1158613.0 3
12 IA total 1992 2818401.0 3
13 ID total 1992 1071685.0 3
14 IL total 1992 11694184.0 3
15 IN total 1992 5674547.0 3
16 KS total 1992 2532395.0 3
17 KY total 1992 3765469.0 3
18 LA total 1992 4293003.0 3
19 MA total 1992 6028709.0 3
20 MD total 1992 4923369.0 3
21 ME total 1992 1238508.0 3
22 MI total 1992 9479065.0 3
23 MN total 1992 4495572.0 3
24 MO total 1992 5217101.0 3
25 MS total 1992 2623734.0 3
26 MT total 1992 825770.0 3
27 NC total 1992 6897214.0 3
28 ND total 1992 638223.0 3
29 NE total 1992 1611687.0 3
30 NH total 1992 1117785.0 3
31 NJ total 1992 7880508.0 3
32 NM total 1992 1595442.0 3
33 NV total 1992 1351367.0 3
34 NY total 1992 18246653.0 3
35 OH total 1992 11029431.0 3
36 OK total 1992 3220517.0 3
37 OR total 1992 2991755.0 3
38 PA total 1992 12049450.0 3
39 PR total 1992 NaN 3
40 RI total 1992 1012581.0 3
41 SC total 1992 3620464.0 3
42 SD total 1992 712801.0 3
43 TN total 1992 5049742.0 3
44 TX total 1992 17759738.0 3
45 USA total 1992 256514231.0 3
46 UT total 1992 1836799.0 3
47 VA total 1992 6414307.0 3
48 VT total 1992 572751.0 3
49 WA total 1992 5160757.0 3
50 WI total 1992 5025398.0 3
51 WV total 1992 1806451.0 3
52 WY total 1992 466251.0 3

RANKING in SQL

We have the following rank functions in SQL.

  • ROW_NUMBER()
  • RANK()
  • DENSE_RANK()
  • NTILE()

To give rank as per the value of records in your table, you may use RANK() function in SQL. The working of the same is similar to the ROW_NUMBER() in SQL, with a slight difference. The RANK() function gives same rank to equal values. The need for a ranking function is quite intuitive. You might want to

  • rank the students based on their grades
  • rank the countries based on their GDP or Population
  • rank the stores based on their profit
    SELECT *,
    RANK() OVER (PARTITION BY state ORDER BY population) as rank
    FROM population
state ages year population rank
0 AK under18 1990 177502.0 1
1 AK under18 1991 182180.0 2
2 AK under18 2008 183124.0 3
3 AK under18 2007 184344.0 4
4 AK under18 1992 184878.0 5
2539 WY total 2009 559851.0 44
2540 WY total 2010 564222.0 45
2541 WY total 2011 567329.0 46
2542 WY total 2012 576626.0 47
2543 WY total 2013 582658.0 48

2544 rows × 5 columns

Use the DESC option when you want the ranking to be done in Descending order. In this case you will observe that the rank here is in the reverse order of the rank in table above, and row_num in the table before that.

    SELECT *,
    RANK() OVER (PARTITION BY state ORDER BY population DESC) as rank
    FROM population
state ages year population rank
0 AK total 2013 735132.0 1
1 AK total 2012 730307.0 2
2 AK total 2011 723375.0 3
3 AK total 2010 713868.0 4
4 AK total 2009 698895.0 5
2539 WY under18 2006 125525.0 44
2540 WY under18 2002 125495.0 45
2541 WY under18 2003 124182.0 46
2542 WY under18 2005 124022.0 47
2543 WY under18 2004 123974.0 48

2544 rows × 5 columns

Conclusion:

In this article, you got introduced to the windows functions in SQL. These are the backbone of any data preparation operation using SQL. They will come in handy when you prepare data to feed into your machine learning model, Feature engineering or maybe just analyzing some pattern in the data.

In this article, you saw how to use SQL window functions work. But if you want to connect the two most powerful workhorses of the Data Science world, SQL and Python. This is not the end, but only the first step towards getting the “Best of Both Worlds”.

Now you can start using Python to work upon your data which rests in SQL Databases. In able to connect to your SQL databases, go thru my article How to Access & Use SQL Database with pyodbc in Python. Once you brought it as DataFrame, then all the operations are usual Pandas operations or SQL queries being operated on Pandas DataFrame as you saw in this article.

Apart from the function of SQL shown in this article, many other popular SQL functions are easily implementable in Python. Read 15 Pandas functions to replicate basic SQL Queries in Python for learning how to do that.

The implied learning in this article was, that you can use Python to do things that you thought were only possible using SQL. There may or may not be straight forward solution to things, but if you are inclined to find it, there are enough resources at your disposal to find a way out. You can look at the mix and match the learning from my book, PYTHON MADE EASY – Step by Step Guide to Programming and Data Analysis using Python for Beginners and Intermediate Level.

About the Author: I am Nilabh Nishchhal. I like making seemingly difficult topics easy and write about them. Check out more at https://www.authornilabh.com/. My attempt to make Python easy and Accessible to all is Python Made Easy.

Cover Photo Credit: Photo by Mari Madriz on Unsplash

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
Nilabh Nishchhal

15 Jul 2021

Dominic Rubhabha-Wardslaus
Dominic Rubhabha-Wardslaushttp://wardslaus.com
infosec,malicious & dos attacks generator, boot rom exploit philanthropist , wild hacker , game developer,
RELATED ARTICLES

Most Popular

Recent Comments