How to Do Basket Analysis in Guandata BI
What is Basket Analysis?
Basket analysis was originally used to discover associations between different products in supermarket sales databases. The classic example is the association between beer and diapers. The purpose of basket analysis is to mine and analyze user purchase records to find potential patterns in user buying habits, so as to provide users with desired combinations or packages. Increasing package sales usually raises the average transaction value, thus increasing company revenue.
How to Do Basket Analysis?
Basket analysis is usually done from the perspective of orders (receipts), i.e., analyzing the association between products A and B purchased in the same transaction. In simple terms, does buying product A affect the likelihood of buying product B?
The key metrics for basket analysis are: support, confidence, and lift. To calculate these, you need four base metrics: order count for product A, order count for product B, order count for the combination, and total order count.
The table below explains these 7 metrics:
Metric | Formula | Meaning | Example |
Order count A | num(A) | Number of unique orders containing product A | 600 |
Order count B | num(B) | Number of unique orders containing product B | 400 |
Combo order count | num(A∩B) | Number of unique orders containing both product A and product B | 150 |
Total orders | num(I) | Total number of unique orders | 10000 |
Support | num(A∩B)/num(I) | Probability that A and B are bought together; only combos with high support are valuable | 150/10000=1.5% |
Confidence | num(A∩B)/num(A) | Probability of buying B given A is bought; how much buying A influences buying B | 150/600=25% |
Lift | [num(A∩B)/num(A)]/[num(B)/num(I)] | Probability of buying B after A vs. natural probability of buying B; lift >1 means buying A increases likelihood of buying B | (150/600)/(400/10000)=6.25 |
Basket Analysis in Guandata BI Platform
In Guandata BI, you can do basket analysis by creating cards for ad-hoc analysis or by using Smart ETL to process order data before analysis. Card-based analysis is for exploratory calculations and is resource-intensive, so Smart ETL is recommended. With 40,000+ rows of data, Smart ETL completes the calculation in 3 seconds. This guide shows how to use Smart ETL for basket analysis. For card-based solutions, see the case library.
After processing order data with Smart ETL and visualizing, you can get results like:

Steps for basket analysis in Guandata BI:
First, use a typical transaction log with date, store, product, and order number.

Then, use this transaction log as the input for ETL. The ETL steps are not complex, as shown below:

Calculation Logic Derivation
Before building the ETL, let's clarify the calculation logic. Of the four base metrics for basket analysis (order count for A, B, combo, and total), only the combo order count is tricky. This is the key challenge.
Let's use a simple dataset to derive how to calculate combo order count. For example, we want to calculate the order count for each product combo:
Combo | Product 1 | Product 2 | Order Count |
---|---|---|---|
AB | A | B | 3 |
AC | A | C | 2 |
AD | A | D | 1 |
BC | B | C | 2 |
BD | B | D | 1 |
CD | C | D | 1 |
To get this, we first need to enumerate all product combos in each order:
Order | Products | Combos |
---|---|---|
1 | A | |
2 | A,B | AB |
3 | A,B,C | AB,AC,BC |
4 | A,B,C,D | AB,AC,AD,BC,BD,CD |
So, the key is to enumerate all combos in each order. This is a combinatorial problem.
For order 4 (A,B,C,D):
- Take all pairs of products: box 1 gets product 1, box 2 gets product 2 (any other product). There are 12 possible arrangements.
- Put both boxes into a big box, deduplicate, and keep only one for each combo. After deduplication, there are 6 combos.
This reduces the problem to a combinatorial math problem. The same logic can be used for 3-product, 4-product combos.
Box 1 | Box 2 | Big Box |
---|---|---|
A | B | AB |
A | C | AC |
A | D | AD |
B | A | BA |
B | C | BC |
B | D | BD |
C | A | CA |
C | B | CB |
C | D | CD |
D | A | DA |
D | B | DB |
D | C | DC |
ETL Implementation
Now, let's implement this logic in Guandata BI.
Step 1: Deduplicate and Filter Data

- Add filter to source data, e.g., date >= 2021-01-01.
- Use group aggregation to deduplicate by order number and product, drag order number, product ID, and product name into dimensions.
- Add calculated field: Product Count.
COUNT([Product ID]) OVER(PARTITION BY [Order Number])
- Add filter: product count >=2.
Step 2: Calculate Order Count for Each Product Pair

- Add calculated field: Product Serial List.
array_sort(collect_list([Product ID]) over(partition by [Order Number]))
-- list all product IDs in an order, sorted - Add calculated field: Product B List.
filter([Product Serial List], x -> x<>[Product ID])
-- filter out product A's ID to get product B list - Add calculated field: Product B Serial.
explode([Product B List])
-- expand product B serial - Add calculated field: Product Combo.
array_sort(array([Product ID],[Product B Serial]))
-- create combo array and sort - Use group aggregation to deduplicate by order number and product combo, drag order number and product combo into dimensions.
- Add calculated field: Product A Serial.
[Product Combo][0]
- Add calculated field: Product B Serial.
[Product Combo][1]
- Use group aggregation to calculate combo order count, drag product A serial, product B serial into dimensions; order number into value, aggregation: distinct count, alias: combo order count.
Step 3: Join to Get Product Names, Order Counts, and Total Orders

Branch 1:
- Use group aggregation to calculate product order count, drag product ID and name into dimensions; order number into value, aggregation: distinct count, alias: order count.
- Join: product A serial in combo to product ID. Rename to get product A name and order count.
- Join: product B serial in combo to product ID. Rename to get product B name and order count.
Branch 2:
- Use group aggregation to calculate product order count, aggregation: distinct count, alias: order count.
- Join: constant field a to constant field a. Rename to get total order count.
Finally, save and run the ETL. Now you have the four base metrics: order count for A, B, combo, and total. Use the ETL output dataset to create a card, calculate support, confidence, and lift, and get your analysis results!
Want to try it? Contact Guandata for sample data and ETL files.