Skip to main content

Distinct Count Function Window Implementation

Problem Description

The conventional distinct count function is count(distinct ()), which must be used with group by in SQL statements to calculate results. In Guandata BI, if the distinct count result needs to participate in secondary calculations or usage, performing window calculations on count(distinct ()) will cause errors.

Common Errors

Without windowing — Grouping expressions sequence is empty

With windowing — Distinct window functions are not supported

Cause

Spark syntax does not support windowing of count(distinct())

Solutions

Solution 1: Use "Group Aggregation + Self-join Method" in ETL for distinct counting

Step 1: Add a "Group Aggregation" node, drag grouping fields into the dimension bar, drag the field to be calculated into the value bar, and select "Distinct Count" for aggregation method. Take the figure below as an example, counting the number of product SKUs for each province and each region.

image.png

Step 2: Add a "Join Data" node, left join the original table with the temporary table obtained from the previous group aggregation using dimension fields. In this case, the join fields use "Province" and "Region", rename the distinct count result field from the previous step to an appropriate name, and you get the distinct count result for the specified dimensions.

image.png

Note: The "Group Aggregation + Self-join Method" can be replaced with SQL nodes in ETL (generally not recommended); SQL statements are also applicable to view datasets, where necessary fields or filter conditions need to be replaced with global parameters. The SQL writing method roughly refers to the following statement.

SELECT input2.*, t.`Region SKU Count` 
FROM input2
LEFT JOIN
(SELECT input1.`Province`, input1.`Region`, COUNT(DISTINCT input1.`SKU`) AS `Region SKU Count`
FROM input1
GROUP BY input1.`Province`,input1.`Region`) t
ON input2.`Province`=t.`Province` AND input2.`Region`=t.`Region`

Solution 2: Create calculated field using alternative window functions

Distinct count: size(collect_set([Count Field])over(partition by [Group Field]))

Non-distinct count: count([Group Field])over(partition by [Group Field])

image.png

Note: The collect_set()over(partition by ) function is a window function used to generate arrays. It deduplicates multiple rows of data by grouping and then merges them into one row. After processing, use the size() function to count the number of elements in the array. When the data volume is large, this processing process is resource-intensive and time-consuming.

The comparison of the two solutions is as follows. Please choose the appropriate solution according to your usage scenario and data volume.

image.png