Skip to main content

How to Build a Usable ChatBI

This article introduces how to build a usable ChatBI using the retail industry as an example.

1. Overview

1.1. Business Research

1.2. Effects Achievable Through ChatBI

  1. Obtain data results through frontend queries with instant answers
  2. When questions involve specific metrics, automatically determine whether metrics have anomalies, what the causes are, and further present the business behaviors behind the data
  3. When the system outputs data analysis conclusions, simultaneously output executable strategic recommendations

2. Topic Building

2.1. Prepare Data

Before creating a ChatBI topic, you need to prepare the datasets to be integrated. For dataset integration, refer to Data Preparation. Dataset requirements are as follows:

  • For a single topic, it's recommended to use datasets of the same type, such as all Spark, MySQL, or StarRocks datasets
  • Dataset table names and field names should avoid using English, numbers, and other descriptions that are difficult to understand business meaning. Dataset table names should avoid using spaces and special symbols, and dataset table names and field names should avoid duplication.
  • Different datasets should avoid using similar names that are difficult to distinguish
  • Time/date fields should avoid using string format as much as possible

Which datasets to integrate can be determined by working backwards from the questions that need to be satisfied.

For example: To view daily customer orders for Hangzhou store since January 2025, we need to integrate the store's daily sales table.

2.2. Enable Permissions

  1. Enable ChatBI-related permissions in the BI platform. For detailed operations, refer to Permission Configuration.

  1. In the ChatBI operations management backend, assign owner/access permissions to different users. For detailed operations, refer to Permission Assignment.
    • Owner Permissions: This user can see the current topic in the operations management backend and modify topic name, basic configuration, knowledge base configuration, and permission configuration; at the same time, the user can ask questions about this topic in the frontend.
    • User Permissions: This user can ask questions about this topic in the frontend.

2.3. Create Topics

Supports creating multiple topics. For how to create topics, refer to New Topic.

Note

When creating a topic for the first time, it's recommended to create based on a single table. After the single table Q&A accuracy reaches 80%, then expand to other tables for Q&A.

Topic creation mainly includes three parts: basic information, associated datasets, and knowledge base.

2.3.1. Topic Basic Information

Topic basic information includes topic name, topic description, Q&A avatar, and welcome message. The display effect during frontend queries is as follows.

For basic information configuration, refer to Basic Configuration.

2.3.2. Topic Data

Dataset descriptions, dataset structures, and field comments will be used as model learning knowledge. If not maintained, it will affect the accuracy of model responses.

Supports associating multiple datasets. For dataset-related operations, see Dataset Configuration.

2.3.3. Topic Knowledge Base

The topic knowledge base includes "Business Knowledge Base" and "Error Set".

  • Business Knowledge Base

    • General Knowledge: Knowledge information referenced in every question
    • Business Knowledge: Knowledge information referenced only when similar/related to question content
  • Error Set: SQL logic referenced only when similar/related to question content

  • Whether to add knowledge depends on Q&A effectiveness, not rigidly maintaining according to the following requirements.
  • Knowledge maintenance priority: "Business Knowledge Base" > "Error Set".

2.3.3.1. Business Knowledge Base

For business knowledge base related configuration, refer to Business Knowledge Base.

2.3.3.1.1. Add "General Knowledge"

  • Knowledge for Datasets

    • Explain [Primary Key]: "The primary key of is month and customer number."
    • Explain [Field Meaning]: "Product status in is used to determine whether it's on sale."
    • Explain [Association Relationship]: " uses product number to associate with , uses store code to associate with ."
    • Explain [Business Scenario]: "Data in includes customer information (account opening date, affiliated business department/branch, customer type, etc.) and customer assets and revenue generation in that month."
    • Explain [Table Selection, Query Priority]: When table names are similar and easily ambiguous, you can add knowledge: "For questions involving products, query <store_product sales daily table>, otherwise query ."
  • Knowledge for Time Conditions

    • [Field Selection]: When there are multiple time fields that easily cause ambiguity, you can add knowledge: "Month is the month when the business occurred (STRING type in YYYYMM format). Account opening date is the user's account opening time, which is the customer's user attribute, only used for counting new accounts and current month account opening metrics."

    • [Week/Month Definition]: When week statistics caliber needs unified definition, you can add knowledge: "The database's dayofweek defaults to Sunday as 1, but in China, the default start of each week is Monday. The method to calculate this week's start is to find this Monday through DATE_SUB(CURRENT_DATE, INTERVAL(DAYOFWEEK(CURRENT_DATE) + 5) % 7 DAY), and the method to calculate last week's start is to first find this week's start, then subtract 7 days."

    • [Semantic Matching]: When fuzzy time conditions need to be defined, you can add knowledge:

      • "When questions involve as a time condition, default to querying data from the last 3 months."
      • "When questions involve as a time condition, default to querying data from the past 30 days."
      • "This week: business date greater than or equal to this week's start, less than or equal to today. Last week: business date greater than or equal to last week's start, less than or equal to last weekend. Last 7 days, last week: business date less than or equal to yesterday, greater than or equal to current date minus 7 days."
    • [Business Common Sense]: When business time statistics caliber needs unified definition, you can add knowledge: "The 6th of each month is the store's member day."

    • [Default Time Condition]: When default time conditions are needed (not recommended), you can add knowledge: "If there's no clear time in the question, default to filtering business date as yesterday, don't directly count cumulative values unless specific time conditions (day/week/month) are specified in the question."

  • Knowledge for almost all questions, or dataset [global filter conditions]

    • When unusable data needs to be excluded, you can add knowledge: "All queries for must limit sales amount to not null."
    • When only data conforming to business logic is used, you can add knowledge: "All queries for must limit store status to 'normal operation'."
  • Knowledge for custom responses, add to "General Knowledge", for example: "When questions involve Meituan, human efficiency, Douyin, inventory, procurement, delivery, ordering, turnover, warehouse, shelf life, reply to users: These business data are not currently included, waiting for subsequent data updates."

2.3.3.1.2. Add "Business Knowledge"

  • Knowledge for Noun Mapping

    • [Field Selection]: When proper nouns refer to specific fields, or specific scenarios use specific fields, you can add knowledge:

      • "Product type, product business type, product source, product category are all the same concept, referring to the product business type (product source category) field (including otc, public funds, private funds)."
      • "Brand: sales brand in , brand name in ."
      • "Supervisor refers to OFC, district manager refers to DM, war zone manager refers to ZM."
      • "When questions involve specific product specifications, such as: grams/kilograms/g/kg/milliliters/liters/ml/L/pieces/strips/branches/pieces, all use product full name for filtering, only when specific specifications are not included use product name for filtering."
    • [Local Filter Conditions]: When proper nouns refer to specific filter conditions, you can add knowledge:

      • "Core members: population asset type IN ('one-order member', 'multi-order member')."
      • "On-sale products: product status = 'normal', department store products: first-level category name = 'non-leisure food', low-temperature products: first-level category name = 'frozen refrigeration'."
      • "Dormant assets/dormant user assets/dormant users/dormant population/dormant members are all the same concept: population asset type IN ('dormant member', 'lost member')."
      • "Plant mask: (sku IN ('50269','99802') OR product name LIKE '%plant%' AND product name LIKE '%mask%')."
      • "Wholesale store, wholesale supermarket, department store are all the same concept: store version LIKE '%department store%'."
  • Knowledge for Calculation Logic

    • Explain [Metric Formulas] can add knowledge:

      • "Conversion rate = converted users / followed users, follow-up rate = followed users / triggered users."
      • "Purchase quantity: SUM(sales quantity), purchase users: COUNT(DISTINCT consumer unique identifier), items per customer: purchase quantity / purchase users."
      • "Store count = COUNT(DISTINCT store ID), gross profit margin = SUM(sales gross profit)/SUM(sales amount), customer unit price = SUM(sales amount)/SUM(sales order count)."
      • ""Loss", "profit and loss" both refer to the same metric: loss amount = SUM(store total cost)-SUM(gross profit amount), net profit = SUM(gross profit amount)-SUM(store total cost)."
    • Explain [Single/Multi-step Query Logic] can add knowledge:

      • "Long-tail products: Sort products by sales amount in ascending order, calculate the cumulative sales amount of these products, when cumulative sales amount reaches 20% of total amount, the already counted products belong to long-tail products; head products: Sort products by sales amount in descending order, calculate the cumulative sales amount of these products, when cumulative sales amount reaches 80% of total amount, the already counted products belong to head products"
      • "[Year-over-year/Sequential] When questions involve week-over-week, month-over-month, week-to-week comparison, month-to-month comparison, all must limit the same statistical days, for example: asking on Friday "this week's sales and sequential performance compared to last week", calculates this Monday to Thursday's sales, compared to last Monday to Thursday's sales; for example: asking on the 20th of a month "this month's sales and sequential performance compared to last month", calculates this month's 1st to yesterday's sales, compared to last month's 1st to the same day's sales."
      • "[About Forecasting] When questions involve 'forecast', 'estimate', 'calculation', default to first calculating the average daily data of the last 7 days, then forecast future data based on the average value of the last 7 days. Forecasting this week refers to from this week's start to this week's end; forecasting this month refers to the complete month, need to count according to complete month days."
  • Knowledge for Visualization Display

    • When default display of certain fields is needed, you can add knowledge:

      • "When queries involve store details, store lists, default to display store name, but don't display store code, unless the question explicitly requires displaying store code."
      • "When questions involve any of ofc, dm, zm, supervisor, war zone manager, all queries must display corresponding major war zone and minor war zone."
    • When specific scenarios need to include specified fields, you can add knowledge:

      • "When questions involve 'loss', 'profit and loss', or 'cost', 'profit', in addition to counting related metrics, also count cost detail items for user reference: rent cost, labor cost, utilities cost, shopping bag fee cost, channel handling cost, miscellaneous cost and logistics cost, etc."
      • "When questions involve 'sales situation', calculate the following metrics: sales amount, sales quantity, gross profit, gross profit margin."
  • Knowledge for Semantic Clarification

    • When questions may refer to multiple fields or filter conditions, or have ambiguity, you can add knowledge:

      • "When users mention , please let users clarify the city name in the question, specifically whether it refers to city, warehouse, or major region."
  • Knowledge for Temporary Solutions

    • [Enumeration Value Matching]: When questions involve specific field enumeration values, but don't completely match the enumeration values, need to query through fuzzy matching, you can refer to knowledge:

      • "When involving stores, product names, according to the value examples already retrieved for corresponding fields, perform word segmentation and fuzzy matching, such as asking 'Luding Chengwu Road Store', example values have 'Luding County Chengwu Road Store', then the condition is like '%Luding%' and like '%Chengwu Road%', this ensures matching to currently retrieved values. If there are no well-matched example values, such as user querying 'Yintai City Nanhu International Store', example values only have 'Tianfu New District Nanhu International Community Store', no 'Yintai City' related stores, then directly ask users for clarification. If there are words with numerical meaning (Arabic numerals/Chinese characters), must match both forms simultaneously, such as like '%24%' or like '%twenty-four%'. Additionally, if store metrics use fuzzy matching, need to aggregate and count by store grouping"
      • "[English Case Matching] When questions involve English characters, need to first convert conditions to uppercase uniformly before matching. For example: what is the sales amount of abc, need to filter UPPER(brand name) = UPPER('abc'). What is the sales amount of Wahaha AD calcium milk, need to filter UPPER(product name) LIKE '%UPPER('Wahaha AD calcium milk')%'"
    • [Error Handling]: When mobile voice-to-text accuracy is not high, need to convert error text to understandable business nouns, you can refer to knowledge:

      • [Wrong Characters, Homophone Conversion] User questions are all related to chain snack wholesale sales scenarios, there will be many store, product, store average sales, daily average sales, related snack dimension metric questions; for input errors like 'store average sales' corresponding to 'store average sales', the character store is most likely store
2.3.3.2. Error Set

For error set related configuration, refer to Error Set.

  • For very small scope, special or unique cases, add to "Error Set"

    Cases other than those specified in the "Business Knowledge Base"

  • For logically complex, condition-heavy (business knowledge difficult to clearly maintain) question scenarios, add to "Error Set"

    • When queries involving complex logic such as window functions according to "long-tail product" definition, you can add error sets:
WITH `store_goods` AS (
SELECT `product_code`, `product_name`, SUM(`sales_amount`) AS `sales_revenue`
FROM `@CHatBI_national_store_product_sales_data_daily`
WHERE `business_date` >= '2025-04-01'
AND `business_date` <= '2025-04-30'
AND `store_name` LIKE '%Dechang%'
AND `store_name` LIKE '%Yucai%'
AND `store_name` LIKE '%Road%'
AND NOT `sales_amount` IS NULL
GROUP BY `product_code`, `product_name`
),
`total_sales` AS (
SELECT SUM(`sales_revenue`) AS `total_sales_revenue`
FROM `store_goods`
),
`ranked_goods` AS (
SELECT `product_code`, `product_name`, `sales_revenue`
, SUM(`sales_revenue`) OVER (ORDER BY `sales_revenue` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS `cumulative_sales_revenue`
FROM `store_goods`
)
SELECT `product_code`, `product_name`, `sales_revenue`
, ROUND(`cumulative_sales_revenue` / `total_sales_revenue`, 4) AS `cumulative_ratio`
FROM `ranked_goods`, `total_sales`
WHERE `cumulative_sales_revenue` / `total_sales_revenue` <= 0.2
ORDER BY `sales_revenue` ASC
    • When this week's week-over-week comparison needs to limit the same number of days as last week, you can add error sets:
WITH `date_ref` AS (
SELECT DATE_SUB(CURRENT_DATE, INTERVAL (DAYOFWEEK(CURRENT_DATE) + 5) % 7 DAY) AS `this_week_start`
, DATE_SUB(DATE_SUB(CURRENT_DATE, INTERVAL (DAYOFWEEK(CURRENT_DATE) + 5) % 7 DAY), INTERVAL '7' DAY) AS `last_week_start`
),
`cur` AS (
SELECT SUM(`sales_amount`) AS `this_week_sales`
FROM `@CHatBI_national_store_sales_data_daily`, `date_ref`
WHERE `major_war_zone` LIKE '%Chuandongnan%'
AND `business_date` >= `date_ref`.`this_week_start`
AND `business_date` <= DATE_SUB(CURRENT_DATE, INTERVAL '1' DAY)
AND NOT `sales_amount` IS NULL
),
`pre` AS (
SELECT SUM(`sales_amount`) AS `last_week_sales`
FROM `@CHatBI_national_store_sales_data_daily`, `date_ref`
WHERE `major_war_zone` LIKE '%Chuandongnan%'
AND `business_date` >= `date_ref`.`last_week_start`
AND `business_date` < DATE_ADD(`date_ref`.`last_week_start`, INTERVAL DATE_DIFF('DAY', DATE_SUB(CURRENT_DATE, INTERVAL '1' DAY), `date_ref`.`this_week_start`) DAY)
AND NOT `sales_amount` IS NULL
)
SELECT `cur`.`this_week_sales`, `pre`.`last_week_sales`, `cur`.`this_week_sales` - `pre`.`last_week_sales` AS `sequential_change`
, CASE
WHEN `pre`.`last_week_sales` = 0
OR `pre`.`last_week_sales` IS NULL
THEN NULL
ELSE ROUND((`cur`.`this_week_sales` - `pre`.`last_week_sales`) / `pre`.`last_week_sales`, 4)
END AS `sequential_change_rate`
FROM `cur`, `pre`
    • When this month's month-over-month and year-over-year, i.e., MTD business metrics month-over-month and year-over-year also need to limit the same number of days, you can add error sets:
WITH `cur_month` AS (
SELECT SUM(`sales_amount`) AS `this_month_sales`
FROM `@CHatBI_national_store_sales_data_daily`
WHERE `city` LIKE '%Chengdu%'
AND `business_date` >= DATE_FORMAT(CURRENT_DATE, '%Y-%m-01')
AND `business_date` <= DATE_SUB(CURRENT_DATE, INTERVAL '1' DAY)
AND NOT `sales_amount` IS NULL
),
`last_month` AS (
SELECT SUM(`sales_amount`) AS `last_month_same_period_sales`
FROM `@CHatBI_national_store_sales_data_daily`
WHERE `city` LIKE '%Chengdu%'
AND `business_date` >= DATE_SUB(DATE_FORMAT(CURRENT_DATE, '%Y-%m-01'), INTERVAL '1' MONTH)
AND `business_date` <= DATE_SUB(DATE_SUB(CURRENT_DATE, INTERVAL '1' DAY), INTERVAL '1' MONTH)
AND NOT `sales_amount` IS NULL
),
`last_year` AS (
SELECT SUM(`sales_amount`) AS `last_year_same_period_sales`
FROM `@CHatBI_national_store_sales_data_daily`
WHERE `city` LIKE '%Chengdu%'
AND `business_date` >= DATE_SUB(DATE_FORMAT(CURRENT_DATE, '%Y-%m-01'), INTERVAL '1' YEAR)
AND `business_date` <= DATE_SUB(DATE_SUB(CURRENT_DATE, INTERVAL '1' DAY), INTERVAL '1' YEAR)
AND NOT `sales_amount` IS NULL
)
SELECT `c`.`this_month_sales`, `l`.`last_month_same_period_sales`, `y`.`last_year_same_period_sales`, `c`.`this_month_sales` - `l`.`last_month_same_period_sales` AS `sequential_change`
, CASE
WHEN `l`.`last_month_same_period_sales` = 0
OR `l`.`last_month_same_period_sales` IS NULL
THEN NULL
ELSE ROUND((`c`.`this_month_sales` - `l`.`last_month_same_period_sales`) / `l`.`last_month_same_period_sales`, 4)
END AS `sequential_change_rate`, `c`.`this_month_sales` - `y`.`last_year_same_period_sales` AS `year_over_year_change`
, CASE
WHEN `y`.`last_year_same_period_sales` = 0
OR `y`.`last_year_same_period_sales` IS NULL
THEN NULL
ELSE ROUND((`c`.`this_month_sales` - `y`.`last_year_same_period_sales`) / `y`.`last_year_same_period_sales`, 4)
END AS `year_over_year_change_rate`
FROM `cur_month` `c`, `last_month` `l`, `last_year` `y`

Avoid adding overly simple or non-long-term effective "Error Sets"

  • [Overly Simple] Can be efficiently maintained through business knowledge base, for example: [Question] What is the sales amount of xxx product in February 2025?
  • [Non-long-term Effective] Questions are relative dates, SQL conditions are absolute dates, for example: [Question] What is yesterday's sales amount? [Error Set SQL] condition date= "2025-03-01"

2.4. Test Topics

Used for batch testing Q&A effectiveness, evaluating Q&A accuracy (accuracy calculation only targets questions in testing).

It's recommended to put into production only after topic test accuracy reaches 90% or above. If accuracy is low, you can improve Q&A accuracy by adding dataset descriptions and field comments to datasets, adding more error sets, and adding more business knowledge.

2.4.1. Testing Steps

①Add questions/batch import → ②Judge whether data results are correct → ③Judge whether SQL is correct → ④Add/modify knowledge → ⑤Retest

2.4.2. Notes

Query principle: Conform to the basic structure of effective questions ([time range] + [conditions (if any)] + [data metrics]), time/conditions/metrics are clear and unambiguous, avoid ambiguity

2.4.3. Error Causes and Handling

Refer to Troubleshooting for handling.

  • Has data, but doesn't conform to business common sense or differs from BI (SQL condition error) Add/modify knowledge and test again

  • No data currently (SQL condition error/missing data source/no data permissions) Ensure data source is not missing, data permissions are enabled, add/modify knowledge and test again

  • Data query failed (SQL syntax error/interface exception) Check whether SQL syntax is correct, including quotes for table names/field names, function usage standards, etc. If this problem occurs frequently, contact GuanYuan for troubleshooting.

  • SQL validity check failed
    It's recommended to check and retest. If this problem occurs frequently, contact GuanYuan for troubleshooting.

2.5. Launch Topics

When topic test accuracy reaches 90% or above, you can formally enable the topic. Enabled topics can be switched and selected during frontend queries.

3. Topic Operations

3.1. Usage Tracking

Used to track historical conversations, troubleshoot Q&A effectiveness, including knowledge application in each conversation.

3.1.1. Frontend Q&A

3.1.1.1. Effective Questions

Basic structure: [time range] + [conditions (if any)] + [data metrics], for example:

  • What are the net assets and total revenue of xxx company's xxx customer in 2024?
  • Which business departments do the 10 employees with the most customers in 2024 come from? What are the annual revenue and year-over-year performance of these business departments?
  • How many orders and ending inventory does xxx product have in March? How many are individual orders, and what's the individual order ratio?
  • What are the sales amount, gross profit, and gross profit margin of xxx product in xxx store in xxx region in Q3?
  • Which stores had continuously declining monthly sales in the second half of 2024?
Note

Whether questions conform to the above basic structure directly affects Q&A accuracy and the difficulty of knowledge base maintenance.

3.1.1.2. Frontend Q&A Steps
  • ①Frontend query → ②Judge whether data results are correct → ③Judge whether SQL is correct → ④View operation logs to locate cause → ⑤Add/modify knowledge → ⑥Query and test again
3.1.1.3. Notes
  • Query principle: Conform to the basic structure of effective questions ([time range] + [conditions (if any)] + [data metrics]), time/conditions/metrics are clear and unambiguous, avoid ambiguity

  • After questions go wrong and knowledge is added/modified, when querying and testing again, avoid being completely identical to the original question (Q&A SQL generation has caching mechanism)

3.1.2. Troubleshooting

3.1.2.1. Troubleshooting Process & Cases

3.1.2.2. Q&A Error Classification

Except for SQL errors, other problem troubleshooting refers to Common Problem Troubleshooting Guide.

3.1.2.3. Operation Logs

Operation logs include the following information, used to determine what affects current Q&A effectiveness

  • Related table information (dataset structure, field enumeration values Value examples)
  • Related business knowledge (entries in "Business Knowledge" related to current questions)
  • Related training samples (entries in "Error Set" related to current questions)

3.2. Knowledge Iteration

Continuously add/modify knowledge during frontend Q&A process, improving Knowledge Base Building.