Skip to main content

How ETL Achieves Incremental Update

Background

Currently, BI's Smart-ETL cannot directly achieve incremental updates for datasets. Each ETL run is a full data overwrite. As the data volume grows, each ETL's performance degrades, and the overall system ETL runtime increases. If we can use BI's Smart-ETL to achieve incremental updates, this problem can be solved.

Solution

In this solution, historical data is updated monthly, with the stock data ETL updating all data before the current month at 6 PM on the 2nd of each month. The incremental data ETL updates the current month's data daily at midnight. On the 1st and 2nd, both the current and previous month's data are updated; after the 3rd, only the current month's data is updated.

Overall ETL Layer Design

  1. The DW layer ETL is divided into DW-Sample Table (before this month) and DW-Sample Table (this month). Since the DW layer is not used by cards, there is no need to merge datasets.

  2. The DM layer ETL is divided into DM-Sample Table (before this month) and DM-Sample Table (note: this ETL merges incremental DM and DM data before this month).

Detailed ETL Design

1. DW-Sample Table (before this month)

-- Ensure the data is before this month
SELECT * FROM input1 WHERE `Creation Time` < DATE_TRUNC('month',CURRENT_DATE())

2. DW-Sample Table (this month)

-- '-2' ensures that on the 1st and 2nd, last month's data is still in the incremental dataset, as the historical dataset is updated on the evening of the 2nd.
SELECT * FROM input1 WHERE `Creation Time` >= DATE_TRUNC('month',CURRENT_DATE())

3. DM-Sample Table (before this month)

Directly use DW-Sample Table (before this month) as the DW layer dataset to generate the corresponding DM.

image.png

4. DM-Sample Table

The DM for the current month's incremental update and the DM before this month are concatenated to form a complete DM dataset, which can be used directly by the business.

image.png

Business Value

Through incremental updates, ETL performance is improved by 3~5 times compared to full updates. The larger the stock data and the more complex the ETL, the greater the performance improvement. If BI optimizes row concatenation performance in the future, performance can be further improved.