Skip to main content

DB Dataflow

Overview

ETL operators in a Dataflow Node typically run on Spark compute. DB Dataflow converts ETL logic into SQL supported by the database and pushes the computation down to the database so that warehouse compute resources can be used directly.

Supported Data Warehouses

Supported databases include StarRocks, GaussDB, Doris, SelectDB, MySQL, and PostgreSQL.

Procedure

Entry Points

You can either drag a DB Dataflow node directly onto the canvas or convert a regular Dataflow Node into a DB Dataflow.

  • Drag DB Dataflow directly

  • Convert a regular Dataflow Node into DB Dataflow

Develop Data in DB Dataflow

DB Dataflow includes the following operators. Refer to the corresponding documents as needed.

Note

To ensure that the task can be pushed down to the database successfully, data sources in DB Dataflow only support database types. In addition, Database Input and Database Output must use the same data account. Data sources such as datasets, files, and ERP sources should be configured in a regular Dataflow Node.

When writing SQL in DB Dataflow, use functions and statements that conform to the target database syntax. Do not use Spark functions or Spark-specific syntax, or the pushdown execution will fail.

Input and Output

OperatorReference
Database InputDatabase Input
Database OutputDatabase Output

Column Editing Operators

OperatorReference
Add Calculated ColumnAdd Calculated Column
Merge ColumnsMerge Columns
Group AggregateGroup Aggregate
Select ColumnsSelect Columns
Pivot Rows to ColumnsPivot Rows to Columns
Unpivot Columns to RowsUnpivot Columns to Rows

Data Editing Operators

OperatorReference
Filter Data RowsFilter Data Rows
DeduplicateDeduplicate
Replace ValuesReplace Values
Replace Null ValuesReplace Null Values

Dataset Combination

OperatorReference
Append RowsAppend Rows
Join DataJoin Data

Advanced Calculation

SQL operator: SQL Input

Split DB Dataflow Calculation Logic

Use Case

When the logic in DB Dataflow is complex, and ETL splitting is not enabled, the platform converts the entire flow into one complex SQL statement and pushes it directly to the database. If the SQL becomes too complex, the database may fail to execute it efficiently.

Steps

In this case, enable ETL Split. The system automatically divides the logic in DB Dataflow into multiple SQL statements and pushes them to the database for execution, improving task reliability.


ETL Split can also be configured globally in Management Center:

Advanced Operations

Manually Configure Split Points

Use Case

For complex DB Dataflow logic or large data volumes, database memory overflow may occur during execution and cause task failures.

Solution

You can manually configure split points so the platform parses and executes SQL in segments, helping avoid failures caused by complex SQL or large volumes of data.

Steps

  • Edit DB Dataflow -> Split -> Select a node and click + to use it as a split point.

  • Run the DB Dataflow, open the instance details, and enter the DB Dataflow instance view to see the SQL generated for each split point, along with execution status and error messages from the database.

Pre-SQL

Use Case

If a Join Data node or Group Aggregate node processes too much data, even splitting that node separately may still cause database memory overflow.

Solution

Configure Pre-SQL in the DB Dataflow node to limit concurrency or force data to disk, reducing memory usage.

Steps

In DB Dataflow -> Pre-SQL, configure the SET parameters for SQL execution. Consult your DBA for recommended settings.