High-Performance Query Table (Advanced Option)

1. Overview

1.1. Function Description

High-performance query table is an acceleration service for data calculation and storage provided by Guandata BI. Converting Guan-Index datasets, file datasets, and ETL output datasets to high-performance query tables can achieve data analysis and query acceleration, reaching the effect of billion-level data second-level calculation. High-performance query tables have the following advantages:

Ultimate Query Performance: High-performance query tables use distributed computing to achieve high scalability, especially suitable for OLAP queries under massive data, suitable for data aggregation and slicing (filtering) on any dimension on large wide tables, and can also perform detailed data queries. Compared to directly using Spark as the calculation engine, it can provide better ad-hoc query experience.

Simple Usage Method: Through interface-based methods, simple point selection can configure partition fields and perform mode switching. The overall usage process is zero-code, low-threshold, without requiring technical background. Supports complete SQL with simple functions and flexible and powerful operations.

Efficient Storage Solution: Efficient data compression technology provides 10x compression ratio, improving single-machine storage capacity and computing power, meeting high-concurrency, high-throughput data query and analysis scenarios.

Note: High-performance datasets are paid value-added services. For specific usage details, please contact your Guandata consultant.

1.2. Application Scenarios

When "extraction and ETL datasets" face the following problems, it is recommended that users use "Guandata Speed Engine" to achieve one-click switching of datasets to "high-performance query tables".

Transaction Reports: For retail orders and financial data, queries require multi-table joins and large amounts of precise calculations. Large wide table or join table query performance is concerning, massive data reports are difficult to be real-time, and will face complex query situations;

Promotion Analysis: Data screens during promotions need real-time calculation and presentation of various core indicators. SQL has high concurrency, and data availability needs to be verified;

User Analysis: Analysis scenarios involve complex data calculation models such as retention rates and conversion funnels. Analysis dimensions are not fixed, so pre-calculation cannot be done. Analysis covers long data cycles, and there are many large queries for detailed data.

1.3. Prerequisites

Since high-performance query tables utilize ClickHouse's OLAP analysis capabilities, in addition to deploying the BI system main service, you also need to purchase the Guandata Speed Engine module and deploy ClickHouse independently for users.

Configuration requirements: It is generally recommended to deploy ClickHouse independently with configuration no less than 8 cores 64G memory and disk space no less than 300G. Specific deployment and configuration will be completed by Guandata staff for you. For details, please consult Guandata staff.

2. Usage Guide

2.1. Publish Dataset as High-Performance Query Table

Datasets that use ClickHouse for query acceleration are all called "high-performance query tables". All Guan-Index datasets, file datasets, and ETL output datasets can support the acceleration function of "high-performance query tables".

Enter the dataset details page, select "Advanced Options", and click the "Edit" button in the upper right corner. Users can switch ordinary datasets to "high-performance query tables". In this way, data will be imported into ClickHouse, and when the card side performs data query calculations, it will directly connect to ClickHouse for accelerated queries. The introduction of ordinary datasets and high-performance query tables is as follows:

Ordinary Dataset: Suitable for datasets with data volume less than 10 million rows, recommended update frequency not exceeding 4 times per day.
High-Performance Query Table: Suitable for datasets with data volume greater than or equal to 10 million rows, which can greatly accelerate the efficiency of data queries on the card side, but there will be some limitations at the functional level, such as not being able to use window functions. High-performance query table generation and updates take more time, recommended update frequency not exceeding once per day. You need to set appropriate acceleration fields for high-performance query tables according to specific usage scenarios to further optimize query efficiency. Generally, we recommend using date fields as acceleration fields.

When configuring "high-performance query table" mode, users need to set acceleration fields: Setting acceleration fields is to better partition data when storing in ClickHouse. Reasonable partitioning can greatly reduce full table scans during data queries. It is generally recommended to use date fields for partitioning, and the partitioning method is recommended to be set to "month" or "day". Using date fields for partitioning can effectively control the number of partitions without making partitions too coarse or too fine. If there are no date fields, you can also carefully choose other fields for partitioning. At this time, please be sure to control the enumeration quantity of partition fields, and do not choose serial numbers like order IDs or numerical fields as acceleration fields.

After configuring the acceleration fields, click "Confirm" to start mode switching. When the dataset data volume is large, importing data into ClickHouse may take some time, please wait patiently. According to Guandata BI internal testing, importing a dataset of 10 million rows * 20 columns into ClickHouse takes about 2 minutes. Dataset updates will also trigger re-import of data in ClickHouse. It is recommended that the update frequency of "high-performance query tables" not exceed once per day.

The following is a Hive database dataset configured with query acceleration function. From the surface, it seems no different from general Hive database datasets. But when we use it to create cards, we use ClickHouse as the query engine, which can provide lightning-fast experience.

2.2. Use High-Performance Query Tables to Build Visualization Analysis

When creating cards using high-performance query tables, you need to comply with ClickHouse's SQL syntax. Currently, ClickHouse already supports window functions.

At the same time, if you need to apply row and column permissions on high-performance query tables, the representation of row and column permissions also needs to be written using ClickHouse's SQL syntax. For example, if the store content in user attributes is the store number managed by the current user, separated by commas, then the row and column permissions can be set as:

has(splitByChar(',',[CURRENT_USER.门店]),[店铺编号])

3. Precautions

(1) "High-performance query tables" are generally suitable for data volumes of 10 million rows or more per table, which can greatly accelerate the efficiency of data queries on the card side. Datasets with data volumes less than 10 million rows can already provide good response experience based on Spark calculations, and generally it is not recommended to use "high-performance query tables".

(2) It is recommended to avoid aggregation operations on fine-grained dimensions as much as possible. If you clearly know that what you are querying is detailed data, then directly using "detail tables" for queries will provide better performance experience.

(3) It is recommended to include filter conditions related to partition fields when querying, which will greatly improve query efficiency. At the same time, partition fields should also be designed based on commonly used query filter conditions.

(4) It is recommended to run the speed engine under the following environmental conditions:

Software environment for operation:

Under Windows operating system, it is recommended to use Chrome or Firefox browser;
Under iOS operating system, it is recommended to use Safari browser;

Hardware environment for operation:

CPU: Intel processor eight-core 1.9GHz or equivalent and above processing capability;
Memory: RAM 64GB or above;
Hard disk: 300GB or above.

1. Overview​

1.1. Function Description​

1.2. Application Scenarios​

1.3. Prerequisites​

2. Usage Guide​

2.1. Publish Dataset as High-Performance Query Table​

2.2. Use High-Performance Query Tables to Build Visualization Analysis​

3. Precautions​