Skip to main content

Spark Window Functions and Applications

Window Functions

Definition

     Window Function, also called Analytics Function, or OLAP function (Online Anallytical Processing, Online Analytical Processing), can perform real-time analytical processing on database data.

Functions

  1. Has both grouping (partition by) and sorting (order by) functions;

  2. Does not reduce the number of rows in the original table.

      The partition by and order by clauses in window functions are used to sort the results after grouping, similar to group by and order by in ordinary SQL query statements. The difference is that group by grouping aggregation changes the number of rows in the table, while partition by does not reduce the number of rows in the original table. Window functions operate on the results processed by where or group by clauses, so window functions can only be written in select clauses in principle, and can be used in new calculated fields in Guandata BI.

Applicable Scenarios

Scenario 1: The built-in aggregation calculations in Guandata BI, or newly created aggregation measures, are all based on grouping aggregation by fields in the dimension bar. If you need to perform grouping aggregation not according to dimension bar fields, then you need to use window functions for calculation.

Scenario 2: Data obtained by direct aggregation calculation in BI cannot be filtered and calculated again. If you need to filter and calculate aggregation results again, you need to use window functions for calculation.

Basic Syntax

 over (partition by 
order by )
-----------------------------------------------------------------------------
over (partition by xxx) Group by xxx, when no grouping, write as over (partition by 1/null) ;
over (partition by xxx order by xx) Group by xxx, and sort by xx

position can be filled with the following two types of functions:

Dedicated window functions: Including rank, dense_rank, row_number and other dedicated window functions.

Aggregation functions: Such as sum, avg, count, max, min, collect_set, etc.

partition by: Grouping clause, indicates the calculation range of the window function, different groups are independent of each other;

order by: Sorting clause, indicates the sorting method within the group after grouping, defaults to ascending order (asc);

Common Dedicated Window Functions

The following is the usage description of Spark SQL window functions, other types of databases may have different usage methods.

image.png

Case Sharing

Case 1: Grouped sum and cumulative sum

sum1: sum([Number])over(partition by [Class])  Sum by Class grouping, equivalent to subtotal
sum2: sum([Number])over(partition by [Class] order by [Date]) Cumulative sum by Class grouping;
sum3: sum([Number])over(partition by 1)/ sum([Number])over(partition by null) Calculate total without grouping, equivalent to grand total.

image.png

Case 2: Grouped sorting

dense_rank() over(partition by [Class] order by [Number] desc)
row_number() over(partition by 1 order by [Number] desc)

image.png

Case 3: Row offset

lag([Number])over(partition by [Class] order by [Date] ) 
--Group by Class, sort by date, then get previous row data, returns null if no value. Data in figure below is equivalent to month-over-month;
first([Number])over(partition by [Class] order by [Date])
--Group by Class, sort by date, then get first row data, order by [Date], equivalent to getting initial data;
last([Number])over(partition by [Class])
--Group by Class, sort by date, then get last row data;
last([Number])over(partition by [Class] order by [Date])
--Group by Class, sort by date, then get current row data;

image.png

Advanced Usage

 over (partition by 
order by
rows/range window clause)

rows/range: Window clause, is a sub-grouping (also called window) within the group after grouping (partition by).

     There are two types of windows: rows and range, mainly used to limit the number of rows and data range. Window clause must be used together with order by clause, and if order by clause is specified without window clause, it defaults to range  between unbounded preceding and current row, from the beginning of current group to current row. Row comparison analysis functions lead and lag have no window clause.

Common Window Clause Syntax

PRECEDING: Forward
FOLLOWING: Backward
CURRENT ROW: Current row
UNBOUNDED: Unbounded (start or end)
UNBOUNDED PRECEDING: Indicates from the starting point ahead
UNBOUNDED FOLLOWING: Indicates to the ending point behind

Difference between rows and range:

1. rows is a physical window, that is, according to the order by clause sorting, take the data of the first N rows and the last N rows for calculation (unrelated to the value of the current row, only related to the row number after sorting).

sum([Number]) over (partition by 1 order by [Date] rows between 2 preceding and current row)
     As in the example below, sum\_rows is to take the sum of the first 2 rows and current row data after sorting by date. Although many dates correspond to more than one row of data, there are two Classes, but by default the window will sort Class in ascending order and then select the first 2 rows and current row for cumulative sum calculation.

2. range is a logical window, which specifies the range of values corresponding to the current row, including all rows in the sub-grouping (or window), with the same order by value as the current row. If more rows have the same order by value, when using range, more rows will participate in the calculation.

sum([Number]) over (partition by 1 order by [Date] range between 2 preceding and current row)
     As in the example below, sum\_range is to take the sum of the first 2 days and current date (consecutive 3 days) data after sorting by date. The card filters out 2021-01-04 data.\
When Date=2021-01-01, there is no data for the previous two days, only reads 2 pieces of data for that day, sum=1+3=4;

When Date=2021-01-03, takes 6 pieces of data for consecutive 3 days of 2021-01-01, 2021-01-02 and 2021-01-03, sum=(1+3)+(2+3)+(2+4)=15;

When Date=2021-01-05, there is no 2021-01-04 data, only takes 4 pieces of data for 2 days of 2021-01-03 and 2021-01-05, sum=(2+4)+(4+6) =15; and so on, the result is as shown in the example below.

image.png

Case: Calculate Moving Average and Moving Cumulative

avg([Number]) over (partition by [Class] order by [Date] rows between 2 preceding and current row)  --3-day average (including current day)
sum([Number]) over (partition by [Class] order by [Date] rows between 2 preceding and current row) --3-day cumulative (including current day)

image.png

Notes

  1. Filter conditions take effect before window functions, filtered data will not participate in calculation, so even over (partition by 1/null) will not calculate filtered data;

  2. If fields in Partition by grouping clause are used for page filters, this filter must always have options selected and linked to cards. Selecting all or empty may cause incorrect calculation results, because the set grouping cannot be automatically cancelled;

  3. In cases with large data volume (approximately over 1 million rows), using window functions for calculation is resource-intensive, please use carefully as needed.