Skip to main content

How to Do Basket Analysis in Guandata BI

What is Basket Analysis?

Basket analysis was originally used to discover associations between different products in supermarket sales databases. The classic example is the association between beer and diapers. The purpose of basket analysis is to mine and analyze user purchase records to find potential patterns in user buying habits, so as to provide users with desired combinations or packages. Increasing package sales usually raises the average transaction value, thus increasing company revenue.

How to Do Basket Analysis?

Basket analysis is usually done from the perspective of orders (receipts), i.e., analyzing the association between products A and B purchased in the same transaction. In simple terms, does buying product A affect the likelihood of buying product B?

The key metrics for basket analysis are: support, confidence, and lift. To calculate these, you need four base metrics: order count for product A, order count for product B, order count for the combination, and total order count.

The table below explains these 7 metrics:

MetricFormulaMeaningExample
Order count Anum(A)Number of unique orders containing product A600
Order count Bnum(B)Number of unique orders containing product B400
Combo order countnum(A∩B)Number of unique orders containing both product A and product B150
Total ordersnum(I)Total number of unique orders10000
Supportnum(A∩B)/num(I)Probability that A and B are bought together; only combos with high support are valuable150/10000=1.5%
Confidencenum(A∩B)/num(A)Probability of buying B given A is bought; how much buying A influences buying B150/600=25%
Lift[num(A∩B)/num(A)]/[num(B)/num(I)]Probability of buying B after A vs. natural probability of buying B; lift >1 means buying A increases likelihood of buying B(150/600)/(400/10000)=6.25

Basket Analysis in Guandata BI Platform

In Guandata BI, you can do basket analysis by creating cards for ad-hoc analysis or by using Smart ETL to process order data before analysis. Card-based analysis is for exploratory calculations and is resource-intensive, so Smart ETL is recommended. With 40,000+ rows of data, Smart ETL completes the calculation in 3 seconds. This guide shows how to use Smart ETL for basket analysis. For card-based solutions, see the case library.

After processing order data with Smart ETL and visualizing, you can get results like:

image.png

Steps for basket analysis in Guandata BI:

First, use a typical transaction log with date, store, product, and order number.

image.png

Then, use this transaction log as the input for ETL. The ETL steps are not complex, as shown below:

image.png

Calculation Logic Derivation

Before building the ETL, let's clarify the calculation logic. Of the four base metrics for basket analysis (order count for A, B, combo, and total), only the combo order count is tricky. This is the key challenge.

Let's use a simple dataset to derive how to calculate combo order count. For example, we want to calculate the order count for each product combo:

ComboProduct 1Product 2Order Count
ABAB3
ACAC2
ADAD1
BCBC2
BDBD1
CDCD1

To get this, we first need to enumerate all product combos in each order:

OrderProductsCombos
1A
2A,BAB
3A,B,CAB,AC,BC
4A,B,C,DAB,AC,AD,BC,BD,CD

So, the key is to enumerate all combos in each order. This is a combinatorial problem.

For order 4 (A,B,C,D):

  1. Take all pairs of products: box 1 gets product 1, box 2 gets product 2 (any other product). There are 12 possible arrangements.
  2. Put both boxes into a big box, deduplicate, and keep only one for each combo. After deduplication, there are 6 combos.

This reduces the problem to a combinatorial math problem. The same logic can be used for 3-product, 4-product combos.

Box 1Box 2Big Box
ABAB
ACAC
ADAD
BABA
BCBC
BDBD
CACA
CBCB
CDCD
DADA
DBDB
DCDC

ETL Implementation

Now, let's implement this logic in Guandata BI.

Step 1: Deduplicate and Filter Data

  

image.png

  • Add filter to source data, e.g., date >= 2021-01-01.
  • Use group aggregation to deduplicate by order number and product, drag order number, product ID, and product name into dimensions.
  • Add calculated field: Product Count. COUNT([Product ID]) OVER(PARTITION BY [Order Number])
  • Add filter: product count >=2.

Step 2: Calculate Order Count for Each Product Pair

image.png

  • Add calculated field: Product Serial List. array_sort(collect_list([Product ID]) over(partition by [Order Number])) -- list all product IDs in an order, sorted
  • Add calculated field: Product B List. filter([Product Serial List], x -> x<>[Product ID]) -- filter out product A's ID to get product B list
  • Add calculated field: Product B Serial. explode([Product B List]) -- expand product B serial
  • Add calculated field: Product Combo. array_sort(array([Product ID],[Product B Serial])) -- create combo array and sort
  • Use group aggregation to deduplicate by order number and product combo, drag order number and product combo into dimensions.
  • Add calculated field: Product A Serial. [Product Combo][0]
  • Add calculated field: Product B Serial. [Product Combo][1]
  • Use group aggregation to calculate combo order count, drag product A serial, product B serial into dimensions; order number into value, aggregation: distinct count, alias: combo order count.

Step 3: Join to Get Product Names, Order Counts, and Total Orders

image.png

Branch 1:

  • Use group aggregation to calculate product order count, drag product ID and name into dimensions; order number into value, aggregation: distinct count, alias: order count.
  • Join: product A serial in combo to product ID. Rename to get product A name and order count.                
    image.png
  • Join: product B serial in combo to product ID. Rename to get product B name and order count.

Branch 2:

  • Use group aggregation to calculate product order count, aggregation: distinct count, alias: order count.
  • Join: constant field a to constant field a. Rename to get total order count.              
    image.png

Finally, save and run the ETL. Now you have the four base metrics: order count for A, B, combo, and total. Use the ETL output dataset to create a card, calculate support, confidence, and lift, and get your analysis results!

Want to try it? Contact Guandata for sample data and ETL files.