Skip to main content

ETL Advanced Settings

Overview

This article explains the feature description and operations of Smart ETL advanced settings. The ETL advanced settings support four parameters: ETL Intermediate Result Cache, BroadcastJoin, Spark Single-Job Timeout (Minutes), and Custom Spark Parameters.

Entry Point: Data Preparation > Smart ETL > ETL Details Page > Advanced Settings

ETL Intermediate Result Cache

When an ETL is relatively complex, meaning it contains at least two Output Datasets and exceeds the configured complexity threshold, enabling this setting allows the system to automatically cache intermediate computation results and improve overall ETL performance. In special cases, this feature can be disabled at the ETL level.

The Follow Global Setting path is: Admin Center > System Settings > General Settings > Runtime Parameters, where administrators can configure the ETL Intermediate Result Cache parameter. The recommended default system-wide complexity threshold is 100, and the adjustable range is 50-999. If the threshold is set too low, more ETLs may automatically cache intermediate results during execution.

For details about ETL complexity calculation, see ETL Intermediate Result Cache.

BroadcastJoin

Spark is the default engine used by Guandata ETL. When both joined tables are large, Spark's BroadcastJoin is often not suitable. In this case, you can disable it through the BroadcastJoin option to reduce risk. For special cases, you can contact Guandata technical support to confirm whether it should be enabled.

Spark Single-Job Timeout (Minutes)

By setting a Spark runtime timeout, you can limit task execution time and ensure jobs complete within a reasonable period, avoiding delays and resource waste caused by excessive runtime and improving analysis and processing efficiency. Users can choose either Follow Global Setting or Custom. The default is Follow Global Setting.

  • Follow Global Setting: Administrators can modify Spark Single-Job Timeout (Minutes) on Admin Center > System Settings > General Settings > Runtime Parameters.

  • Custom: ETL owners can set a timeout between 1 and 300 minutes.

Custom Spark Runtime Parameters

Notes

Takes effect starting from version 6.5.

The ETL compute engine has some built-in Spark runtime rules. In extreme cases, some of these rules may cause certain ETLs to run abnormally. In such cases, you can configure parameters and rules specifically for those ETLs so that they run with dedicated settings. Multiple rules are supported, and different rules should be separated by line breaks.

The following parameters are currently supported:

Configuration ItemParameterUnit
shuffleBytesguandata.jobLimit.shuffleBytes=200G
Output Dataset Row Countguandata.jobLimit.numOfOutputRows=1100 Million Rows
shuffleDisk Thresholdguandata.shuffleDisk.threshold=0.85
tableExpansionRateguandata.jobLimit.tableExpansionRate=100
Broadcast Join Thresholdspark.sql.autoBroadcastJoinThreshold=10MbMb

Configuration example:

If you are not familiar with specific parameter settings, do not configure them on your own. Contact your Guandata representative in advance if needed.

Disable Specific Spark Optimization Rules

Note

Effective before version 6.5. No longer effective in version 6.5 and later.

The ETL compute engine includes built-in Spark optimization rules such as OptimizeRepartition and ColumnPruning. In rare cases, some of these rules may cause certain ETLs to run abnormally. In such cases, you can configure the ETL so that these built-in optimization rules are not applied during execution.

Multiple rules can be excluded or disabled at the same time. Separate different rules with English commas, as shown below:

CombineUnions,ConstantPropagation

If you encounter issues during configuration, contact your Guandata representative.