How to Build a Usable ChatBI
This article introduces how to build a usable ChatBI using the retail industry as an example.
1. Overview
1.1. Business Research

1.2. Effects Achievable Through ChatBI
- Obtain data results through frontend queries with instant answers
- When questions involve specific metrics, automatically determine whether metrics have anomalies, what the causes are, and further present the business behaviors behind the data
- When the system outputs data analysis conclusions, simultaneously output executable strategic recommendations
2. Topic Building

2.1. Prepare Data
Before creating a ChatBI topic, you need to prepare the datasets to be integrated. For dataset integration, refer to Data Preparation. Dataset requirements are as follows:
- For a single topic, it's recommended to use datasets of the same type, such as all Spark, MySQL, or StarRocks datasets
- Dataset table names and field names should avoid using English, numbers, and other descriptions that are difficult to understand business meaning. Dataset table names should avoid using spaces and special symbols, and dataset table names and field names should avoid duplication.
- Different datasets should avoid using similar names that are difficult to distinguish
- Time/date fields should avoid using string format as much as possible
Which datasets to integrate can be determined by working backwards from the questions that need to be satisfied.
For example: To view daily customer orders for Hangzhou store since January 2025, we need to integrate the store's daily sales table.
2.2. Enable Permissions
- Enable ChatBI-related permissions in the BI platform. For detailed operations, refer to Permission Configuration.

- In the ChatBI operations management backend, assign owner/access permissions to different users. For detailed operations, refer to Permission Assignment.
- Owner Permissions: This user can see the current topic in the operations management backend and modify topic name, basic configuration, knowledge base configuration, and permission configuration; at the same time, the user can ask questions about this topic in the frontend.
- User Permissions: This user can ask questions about this topic in the frontend.
2.3. Create Topics
Supports creating multiple topics. For how to create topics, refer to New Topic.
When creating a topic for the first time, it's recommended to create based on a single table. After the single table Q&A accuracy reaches 80%, then expand to other tables for Q&A.
Topic creation mainly includes three parts: basic information, associated datasets, and knowledge base.
2.3.1. Topic Basic Information
Topic basic information includes topic name, topic description, Q&A avatar, and welcome message. The display effect during frontend queries is as follows.
For basic information configuration, refer to Basic Configuration.

2.3.2. Topic Data
Dataset descriptions, dataset structures, and field comments will be used as model learning knowledge. If not maintained, it will affect the accuracy of model responses.
Supports associating multiple datasets. For dataset-related operations, see Dataset Configuration.
2.3.3. Topic Knowledge Base
The topic knowledge base includes "Business Knowledge Base" and "Error Set".
-
Business Knowledge Base
- General Knowledge: Knowledge information referenced in every question
- Business Knowledge: Knowledge information referenced only when similar/related to question content
-
Error Set: SQL logic referenced only when similar/related to question content
- Whether to add knowledge depends on Q&A effectiveness, not rigidly maintaining according to the following requirements.
- Knowledge maintenance priority: "Business Knowledge Base" > "Error Set".

2.3.3.1. Business Knowledge Base
For business knowledge base related configuration, refer to Business Knowledge Base.
2.3.3.1.1. Add "General Knowledge"

-
Knowledge for Datasets
- Explain [Primary Key]: "The primary key of
is month
andcustomer number
." - Explain [Field Meaning]: "
Product status
inis used to determine whether it's on sale." - Explain [Association Relationship]: "
uses product number
to associate with, uses store code
to associate with." - Explain [Business Scenario]: "Data in
includes customer information (account opening date, affiliated business department/branch, customer type, etc.) and customer assets and revenue generation in that month." - Explain [Table Selection, Query Priority]: When table names are similar and easily ambiguous, you can add knowledge: "For questions involving
products
, query <store_product sales daily table>, otherwise query."
- Explain [Primary Key]: "The primary key of
-
Knowledge for Time Conditions
-
[Field Selection]: When there are multiple time fields that easily cause ambiguity, you can add knowledge: "
Month
is the month when the business occurred (STRING type in YYYYMM format).Account opening date
is the user's account opening time, which is the customer's user attribute, only used for counting new accounts and current month account opening metrics." -
[Week/Month Definition]: When week statistics caliber needs unified definition, you can add knowledge: "The database's dayofweek defaults to Sunday as 1, but in China, the default start of each week is Monday. The method to calculate this week's start is to find this Monday through DATE_SUB(CURRENT_DATE, INTERVAL(DAYOFWEEK(CURRENT_DATE) + 5) % 7 DAY), and the method to calculate last week's start is to first find this week's start, then subtract 7 days."
-
[Semantic Matching]: When fuzzy time conditions need to be defined, you can add knowledge:
- "When questions involve
as a time condition, default to querying data from the last 3 months." - "When questions involve
as a time condition, default to querying data from the past 30 days." - "This week:
business date
greater than or equal to this week's start, less than or equal to today. Last week:business date
greater than or equal to last week's start, less than or equal to last weekend. Last 7 days, last week:business date
less than or equal to yesterday, greater than or equal to current date minus 7 days."
- "When questions involve
-
[Business Common Sense]: When business time statistics caliber needs unified definition, you can add knowledge: "The 6th of each month is the store's member day."
-
[Default Time Condition]: When default time conditions are needed (not recommended), you can add knowledge: "If there's no clear time in the question, default to filtering
business date
as yesterday, don't directly count cumulative values unless specific time conditions (day/week/month) are specified in the question."
-
-
Knowledge for almost all questions, or dataset [global filter conditions]
- When unusable data needs to be excluded, you can add knowledge: "All queries for
must limit sales amount
to not null." - When only data conforming to business logic is used, you can add knowledge: "All queries for
must limit store status
to 'normal operation'."
- When unusable data needs to be excluded, you can add knowledge: "All queries for
-
Knowledge for custom responses, add to "General Knowledge", for example: "When questions involve Meituan, human efficiency, Douyin, inventory, procurement, delivery, ordering, turnover, warehouse, shelf life, reply to users: These business data are not currently included, waiting for subsequent data updates."
2.3.3.1.2. Add "Business Knowledge"

-
Knowledge for Noun Mapping
-
[Field Selection]: When proper nouns refer to specific fields, or specific scenarios use specific fields, you can add knowledge:
- "Product type, product business type, product source, product category are all the same concept, referring to the
product business type (product source category)
field (including otc, public funds, private funds)." - "Brand:
sales brand
in, brand name
in." - "Supervisor refers to OFC, district manager refers to DM, war zone manager refers to ZM."
- "When questions involve specific product specifications, such as: grams/kilograms/g/kg/milliliters/liters/ml/L/pieces/strips/branches/pieces, all use
product full name
for filtering, only when specific specifications are not included useproduct name
for filtering."
- "Product type, product business type, product source, product category are all the same concept, referring to the
-
[Local Filter Conditions]: When proper nouns refer to specific filter conditions, you can add knowledge:
- "Core members:
population asset type
IN ('one-order member', 'multi-order member')." - "On-sale products:
product status
= 'normal', department store products:first-level category name
= 'non-leisure food', low-temperature products:first-level category name
= 'frozen refrigeration'." - "Dormant assets/dormant user assets/dormant users/dormant population/dormant members are all the same concept:
population asset type
IN ('dormant member', 'lost member')." - "Plant mask: (
sku
IN ('50269','99802') ORproduct name
LIKE '%plant%' ANDproduct name
LIKE '%mask%')." - "Wholesale store, wholesale supermarket, department store are all the same concept:
store version
LIKE '%department store%'."
- "Core members:
-
-
Knowledge for Calculation Logic
-
Explain [Metric Formulas] can add knowledge:
- "Conversion rate = converted users / followed users, follow-up rate = followed users / triggered users."
- "Purchase quantity: SUM(
sales quantity
), purchase users: COUNT(DISTINCTconsumer unique identifier
), items per customer: purchase quantity / purchase users." - "Store count = COUNT(DISTINCT
store ID
), gross profit margin = SUM(sales gross profit)/SUM(sales amount), customer unit price = SUM(sales amount)/SUM(sales order count)." - ""Loss", "profit and loss" both refer to the same metric:
loss amount
= SUM(store total cost
)-SUM(gross profit amount
),net profit
= SUM(gross profit amount
)-SUM(store total cost
)."
-
Explain [Single/Multi-step Query Logic] can add knowledge:
- "Long-tail products: Sort products by sales amount in ascending order, calculate the cumulative sales amount of these products, when cumulative sales amount reaches 20% of total amount, the already counted products belong to long-tail products; head products: Sort products by sales amount in descending order, calculate the cumulative sales amount of these products, when cumulative sales amount reaches 80% of total amount, the already counted products belong to head products"
- "[Year-over-year/Sequential] When questions involve week-over-week, month-over-month, week-to-week comparison, month-to-month comparison, all must limit the same statistical days, for example: asking on Friday "this week's sales and sequential performance compared to last week", calculates this Monday to Thursday's sales, compared to last Monday to Thursday's sales; for example: asking on the 20th of a month "this month's sales and sequential performance compared to last month", calculates this month's 1st to yesterday's sales, compared to last month's 1st to the same day's sales."
- "[About Forecasting] When questions involve 'forecast', 'estimate', 'calculation', default to first calculating the average daily data of the last 7 days, then forecast future data based on the average value of the last 7 days. Forecasting this week refers to from this week's start to this week's end; forecasting this month refers to the complete month, need to count according to complete month days."
-
-
Knowledge for Visualization Display
-
When default display of certain fields is needed, you can add knowledge:
- "When queries involve store details, store lists, default to display
store name
, but don't displaystore code
, unless the question explicitly requires displayingstore code
." - "When questions involve any of ofc, dm, zm, supervisor, war zone manager, all queries must display corresponding
major war zone
andminor war zone
."
- "When queries involve store details, store lists, default to display
-
When specific scenarios need to include specified fields, you can add knowledge:
- "When questions involve 'loss', 'profit and loss', or 'cost', 'profit', in addition to counting related metrics, also count cost detail items for user reference:
rent cost
,labor cost
,utilities cost
,shopping bag fee cost
,channel handling cost
,miscellaneous cost
andlogistics cost
, etc." - "When questions involve 'sales situation', calculate the following metrics: sales amount, sales quantity, gross profit, gross profit margin."
- "When questions involve 'loss', 'profit and loss', or 'cost', 'profit', in addition to counting related metrics, also count cost detail items for user reference:
-
-
Knowledge for Semantic Clarification
-
When questions may refer to multiple fields or filter conditions, or have ambiguity, you can add knowledge:
- "When users mention
, please let users clarify the city name in the question, specifically whether it refers to city
,warehouse
, ormajor region
."
- "When users mention
-
-
Knowledge for Temporary Solutions
-
[Enumeration Value Matching]: When questions involve specific field enumeration values, but don't completely match the enumeration values, need to query through fuzzy matching, you can refer to knowledge:
- "When involving stores, product names, according to the value examples already retrieved for corresponding fields, perform word segmentation and fuzzy matching, such as asking 'Luding Chengwu Road Store', example values have 'Luding County Chengwu Road Store', then the condition is like '%Luding%' and like '%Chengwu Road%', this ensures matching to currently retrieved values. If there are no well-matched example values, such as user querying 'Yintai City Nanhu International Store', example values only have 'Tianfu New District Nanhu International Community Store', no 'Yintai City' related stores, then directly ask users for clarification. If there are words with numerical meaning (Arabic numerals/Chinese characters), must match both forms simultaneously, such as like '%24%' or like '%twenty-four%'. Additionally, if store metrics use fuzzy matching, need to aggregate and count by store grouping"
- "[English Case Matching] When questions involve English characters, need to first convert conditions to uppercase uniformly before matching. For example: what is the sales amount of abc, need to filter UPPER(
brand name
) = UPPER('abc'). What is the sales amount of Wahaha AD calcium milk, need to filter UPPER(product name
) LIKE '%UPPER('Wahaha AD calcium milk')%'"
-
[Error Handling]: When mobile voice-to-text accuracy is not high, need to convert error text to understandable business nouns, you can refer to knowledge:
- [Wrong Characters, Homophone Conversion] User questions are all related to chain snack wholesale sales scenarios, there will be many store, product, store average sales, daily average sales, related snack dimension metric questions; for input errors like 'store average sales' corresponding to 'store average sales', the character
store
is most likelystore
- [Wrong Characters, Homophone Conversion] User questions are all related to chain snack wholesale sales scenarios, there will be many store, product, store average sales, daily average sales, related snack dimension metric questions; for input errors like 'store average sales' corresponding to 'store average sales', the character
-
2.3.3.2. Error Set
For error set related configuration, refer to Error Set.
-
For very small scope, special or unique cases, add to "Error Set"
Cases other than those specified in the "Business Knowledge Base"
-
For logically complex, condition-heavy (business knowledge difficult to clearly maintain) question scenarios, add to "Error Set"
- When queries involving complex logic such as window functions according to "long-tail product" definition, you can add error sets:
WITH `store_goods` AS (
SELECT `product_code`, `product_name`, SUM(`sales_amount`) AS `sales_revenue`
FROM `@CHatBI_national_store_product_sales_data_daily`
WHERE `business_date` >= '2025-04-01'
AND `business_date` <= '2025-04-30'
AND `store_name` LIKE '%Dechang%'
AND `store_name` LIKE '%Yucai%'
AND `store_name` LIKE '%Road%'
AND NOT `sales_amount` IS NULL
GROUP BY `product_code`, `product_name`
),
`total_sales` AS (
SELECT SUM(`sales_revenue`) AS `total_sales_revenue`
FROM `store_goods`
),
`ranked_goods` AS (
SELECT `product_code`, `product_name`, `sales_revenue`
, SUM(`sales_revenue`) OVER (ORDER BY `sales_revenue` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS `cumulative_sales_revenue`
FROM `store_goods`
)
SELECT `product_code`, `product_name`, `sales_revenue`
, ROUND(`cumulative_sales_revenue` / `total_sales_revenue`, 4) AS `cumulative_ratio`
FROM `ranked_goods`, `total_sales`
WHERE `cumulative_sales_revenue` / `total_sales_revenue` <= 0.2
ORDER BY `sales_revenue` ASC
-
- When this week's week-over-week comparison needs to limit the same number of days as last week, you can add error sets:
WITH `date_ref` AS (
SELECT DATE_SUB(CURRENT_DATE, INTERVAL (DAYOFWEEK(CURRENT_DATE) + 5) % 7 DAY) AS `this_week_start`
, DATE_SUB(DATE_SUB(CURRENT_DATE, INTERVAL (DAYOFWEEK(CURRENT_DATE) + 5) % 7 DAY), INTERVAL '7' DAY) AS `last_week_start`
),
`cur` AS (
SELECT SUM(`sales_amount`) AS `this_week_sales`
FROM `@CHatBI_national_store_sales_data_daily`, `date_ref`
WHERE `major_war_zone` LIKE '%Chuandongnan%'
AND `business_date` >= `date_ref`.`this_week_start`
AND `business_date` <= DATE_SUB(CURRENT_DATE, INTERVAL '1' DAY)
AND NOT `sales_amount` IS NULL
),
`pre` AS (
SELECT SUM(`sales_amount`) AS `last_week_sales`
FROM `@CHatBI_national_store_sales_data_daily`, `date_ref`
WHERE `major_war_zone` LIKE '%Chuandongnan%'
AND `business_date` >= `date_ref`.`last_week_start`
AND `business_date` < DATE_ADD(`date_ref`.`last_week_start`, INTERVAL DATE_DIFF('DAY', DATE_SUB(CURRENT_DATE, INTERVAL '1' DAY), `date_ref`.`this_week_start`) DAY)
AND NOT `sales_amount` IS NULL
)
SELECT `cur`.`this_week_sales`, `pre`.`last_week_sales`, `cur`.`this_week_sales` - `pre`.`last_week_sales` AS `sequential_change`
, CASE
WHEN `pre`.`last_week_sales` = 0
OR `pre`.`last_week_sales` IS NULL
THEN NULL
ELSE ROUND((`cur`.`this_week_sales` - `pre`.`last_week_sales`) / `pre`.`last_week_sales`, 4)
END AS `sequential_change_rate`
FROM `cur`, `pre`
-
- When this month's month-over-month and year-over-year, i.e., MTD business metrics month-over-month and year-over-year also need to limit the same number of days, you can add error sets:
WITH `cur_month` AS (
SELECT SUM(`sales_amount`) AS `this_month_sales`
FROM `@CHatBI_national_store_sales_data_daily`
WHERE `city` LIKE '%Chengdu%'
AND `business_date` >= DATE_FORMAT(CURRENT_DATE, '%Y-%m-01')
AND `business_date` <= DATE_SUB(CURRENT_DATE, INTERVAL '1' DAY)
AND NOT `sales_amount` IS NULL
),
`last_month` AS (
SELECT SUM(`sales_amount`) AS `last_month_same_period_sales`
FROM `@CHatBI_national_store_sales_data_daily`
WHERE `city` LIKE '%Chengdu%'
AND `business_date` >= DATE_SUB(DATE_FORMAT(CURRENT_DATE, '%Y-%m-01'), INTERVAL '1' MONTH)
AND `business_date` <= DATE_SUB(DATE_SUB(CURRENT_DATE, INTERVAL '1' DAY), INTERVAL '1' MONTH)
AND NOT `sales_amount` IS NULL
),
`last_year` AS (
SELECT SUM(`sales_amount`) AS `last_year_same_period_sales`
FROM `@CHatBI_national_store_sales_data_daily`
WHERE `city` LIKE '%Chengdu%'
AND `business_date` >= DATE_SUB(DATE_FORMAT(CURRENT_DATE, '%Y-%m-01'), INTERVAL '1' YEAR)
AND `business_date` <= DATE_SUB(DATE_SUB(CURRENT_DATE, INTERVAL '1' DAY), INTERVAL '1' YEAR)
AND NOT `sales_amount` IS NULL
)
SELECT `c`.`this_month_sales`, `l`.`last_month_same_period_sales`, `y`.`last_year_same_period_sales`, `c`.`this_month_sales` - `l`.`last_month_same_period_sales` AS `sequential_change`
, CASE
WHEN `l`.`last_month_same_period_sales` = 0
OR `l`.`last_month_same_period_sales` IS NULL
THEN NULL
ELSE ROUND((`c`.`this_month_sales` - `l`.`last_month_same_period_sales`) / `l`.`last_month_same_period_sales`, 4)
END AS `sequential_change_rate`, `c`.`this_month_sales` - `y`.`last_year_same_period_sales` AS `year_over_year_change`
, CASE
WHEN `y`.`last_year_same_period_sales` = 0
OR `y`.`last_year_same_period_sales` IS NULL
THEN NULL
ELSE ROUND((`c`.`this_month_sales` - `y`.`last_year_same_period_sales`) / `y`.`last_year_same_period_sales`, 4)
END AS `year_over_year_change_rate`
FROM `cur_month` `c`, `last_month` `l`, `last_year` `y`
Avoid adding overly simple or non-long-term effective "Error Sets"
- [Overly Simple] Can be efficiently maintained through business knowledge base, for example: [Question] What is the sales amount of xxx product in February 2025?
- [Non-long-term Effective] Questions are relative dates, SQL conditions are absolute dates, for example: [Question] What is yesterday's sales amount? [Error Set SQL] condition
date
= "2025-03-01"
2.4. Test Topics
Used for batch testing Q&A effectiveness, evaluating Q&A accuracy (accuracy calculation only targets questions in testing).
It's recommended to put into production only after topic test accuracy reaches 90% or above. If accuracy is low, you can improve Q&A accuracy by adding dataset descriptions and field comments to datasets, adding more error sets, and adding more business knowledge.
2.4.1. Testing Steps
①Add questions/batch import → ②Judge whether data results are correct → ③Judge whether SQL is correct → ④Add/modify knowledge → ⑤Retest
2.4.2. Notes
Query principle: Conform to the basic structure of effective questions ([time range] + [conditions (if any)] + [data metrics]), time/conditions/metrics are clear and unambiguous, avoid ambiguity
2.4.3. Error Causes and Handling
Refer to Troubleshooting for handling.
-
Has data, but doesn't conform to business common sense or differs from BI (SQL condition error) Add/modify knowledge and test again
-
No data currently (SQL condition error/missing data source/no data permissions) Ensure data source is not missing, data permissions are enabled, add/modify knowledge and test again
-
Data query failed (SQL syntax error/interface exception) Check whether SQL syntax is correct, including quotes for table names/field names, function usage standards, etc. If this problem occurs frequently, contact GuanYuan for troubleshooting.
-
SQL validity check failed
It's recommended to check and retest. If this problem occurs frequently, contact GuanYuan for troubleshooting.
2.5. Launch Topics
When topic test accuracy reaches 90% or above, you can formally enable the topic. Enabled topics can be switched and selected during frontend queries.
3. Topic Operations
3.1. Usage Tracking
Used to track historical conversations, troubleshoot Q&A effectiveness, including knowledge application in each conversation.
3.1.1. Frontend Q&A
3.1.1.1. Effective Questions
Basic structure: [time range] + [conditions (if any)] + [data metrics], for example:
- What are the net assets and total revenue of xxx company's xxx customer in 2024?
- Which business departments do the 10 employees with the most customers in 2024 come from? What are the annual revenue and year-over-year performance of these business departments?
- How many orders and ending inventory does xxx product have in March? How many are individual orders, and what's the individual order ratio?
- What are the sales amount, gross profit, and gross profit margin of xxx product in xxx store in xxx region in Q3?
- Which stores had continuously declining monthly sales in the second half of 2024?
Whether questions conform to the above basic structure directly affects Q&A accuracy and the difficulty of knowledge base maintenance.
3.1.1.2. Frontend Q&A Steps
- ①Frontend query → ②Judge whether data results are correct → ③Judge whether SQL is correct → ④View operation logs to locate cause → ⑤Add/modify knowledge → ⑥Query and test again
3.1.1.3. Notes
-
Query principle: Conform to the basic structure of effective questions ([time range] + [conditions (if any)] + [data metrics]), time/conditions/metrics are clear and unambiguous, avoid ambiguity
-
After questions go wrong and knowledge is added/modified, when querying and testing again, avoid being completely identical to the original question (Q&A SQL generation has caching mechanism)
3.1.2. Troubleshooting
3.1.2.1. Troubleshooting Process & Cases

3.1.2.2. Q&A Error Classification

Except for SQL errors, other problem troubleshooting refers to Common Problem Troubleshooting Guide.
3.1.2.3. Operation Logs
Operation logs include the following information, used to determine what affects current Q&A effectiveness
- Related table information (dataset structure, field enumeration values Value examples)
- Related business knowledge (entries in "Business Knowledge" related to current questions)
- Related training samples (entries in "Error Set" related to current questions)




3.2. Knowledge Iteration
Continuously add/modify knowledge during frontend Q&A process, improving Knowledge Base Building.