ETL Operation

Scenario 1: High CPU Usage Time for a Single ETL Update Task

Cause of the Problem

Check the metric "Top 20 ETLs by CPU Usage Time in the Last 30 Days (CPU Usage Time ≥10s)". If the CPU usage time of a single ETL exceeds 60 minutes, the ETL task is occupying the CPU for too long. Possible causes include:

(1) The ETL update task is blocked and cannot run normally

(2) The ETL update task has complex logic (for example, multiple output datasets), so the update takes a long time (>60min)

(3) Many other tasks run in parallel during the ETL update, which lengthens the queue time or slows down the ETL execution
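
The 60-minute check described above can be sketched in a few lines of Python. This is only an illustration: it assumes the metric has been exported into records with a name and a CPU usage time in minutes, and the class `EtlCpuRecord` and function `long_running_etls` are hypothetical names, not part of the product.

```python
from dataclasses import dataclass

CPU_TIME_LIMIT_MIN = 60  # threshold from the text: more than 60 minutes of CPU time


@dataclass
class EtlCpuRecord:
    name: str           # ETL name (hypothetical field)
    cpu_minutes: float  # CPU usage time in minutes (hypothetical field)


def long_running_etls(records, limit_min=CPU_TIME_LIMIT_MIN):
    """Return the names of ETLs whose CPU usage time exceeds the limit, longest first."""
    flagged = [r for r in records if r.cpu_minutes > limit_min]
    flagged.sort(key=lambda r: r.cpu_minutes, reverse=True)
    return [r.name for r in flagged]


records = [
    EtlCpuRecord("sales_daily", 95.0),
    EtlCpuRecord("inventory_sync", 12.5),
    EtlCpuRecord("finance_rollup", 61.0),
]
print(long_running_etls(records))  # ['sales_daily', 'finance_rollup']
```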

Troubleshooting Ideas

We recommend troubleshooting step by step as follows:

(1) Check whether the runtime of such ETL update tasks is abnormal.

Click the ETL name to open its run records and check whether the historical durations are consistently similar.
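
One way to make "similar to the historical duration" concrete is to compare the latest run against the median of earlier runs. The sketch below assumes the run durations have been copied out of the run records as minutes; the function name `is_duration_abnormal` and the factor of 2 are illustrative assumptions, not values defined by the product.

```python
from statistics import median


def is_duration_abnormal(history_min, latest_min, factor=2.0):
    """Return True if the latest run took more than `factor` times the historical median.

    `factor` is an assumed tolerance; tune it to your own workload.
    """
    if not history_min:
        return False  # no history to compare against
    return latest_min > factor * median(history_min)


# The ETL usually takes about 20 minutes, so a 75-minute run stands out.
print(is_duration_abnormal([20, 22, 19, 21], 75))  # True
# The ETL always takes about 70 minutes, so a 71-minute run is slow but normal for it.
print(is_duration_abnormal([70, 72, 68], 71))      # False
```

A run that is consistently long points toward optimizing the ETL itself, while a sudden outlier points toward checking the logs.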

Optimization Measures

If the job duration is unexpected, check the logs to find out what is holding the resources for so long. You can also configure the maximum task runtime in "Admin Settings - Operation & Maintenance Management - Parameter Configuration" to prevent abnormal tasks from occupying resources.

(2) Check whether there is room for optimization in the update time of such ETLs.

Optimization Measures

For cases with multiple output dataset nodes:

a. If the ETL has many shared nodes, move the shared part into a standalone ETL and use the output dataset of that ETL as the input dataset of the original ETL, so that the same data is not calculated repeatedly;

b. If the ETL has few shared nodes, split the different processing flows into independent ETLs so that they can update in parallel and improve computing efficiency.

(3) Check whether the time at which such ETL update tasks run can be optimized.

Optimization Measures

a. If it does not affect business, check whether such tasks have upstream dependencies. Schedule tasks without upstream dependencies during off-peak periods;

b. Refer to the metrics "Node CPU Usage Trend Chart" and "Server CPU Load (System Load) Trend Chart" to identify the peaks in CPU usage and CPU load, and avoid running such tasks during these periods.
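
Choosing an off-peak window from the trend charts can be sketched as follows. The example assumes the chart values have been read off as an average CPU usage percentage per hour; the function `off_peak_hours` and the 70% cut-off are hypothetical and only serve the illustration.

```python
def off_peak_hours(hourly_cpu_percent, peak_threshold=70.0):
    """Return the sorted hours (0-23) whose average CPU usage is below the threshold.

    `peak_threshold` is an assumed cut-off, not a product default.
    """
    return sorted(
        hour for hour, usage in hourly_cpu_percent.items() if usage < peak_threshold
    )


# Hypothetical readings: hour of day -> average CPU usage in percent.
usage = {0: 15.0, 1: 12.0, 2: 85.0, 9: 92.0, 14: 40.0}
print(off_peak_hours(usage))  # [0, 1, 14]
```

Tasks without upstream dependencies are the natural candidates to move into the returned hours.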

Scenario 2: Many Unexpected ETL Jobs

Cause of the Problem

Check the metric "ETL Runtime Distribution Yesterday". If the actual number of ETL runs exceeds twice the planned number, there were many unexpected ETL jobs yesterday. Possible causes include:

(1) Many ETL tasks were manually triggered during the problematic period

(2) Scheduled tasks from earlier periods were delayed into this period because of a task backlog

(3) ETLs are set to cascade update, so the completion of one ETL/dataset update triggers multiple ETL updates, and the cascading updates aggravate the backlog
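
The "more than twice the planned number" rule above can be expressed as a small check. The sketch assumes planned and actual run counts are available per period (for example per hour, which is an assumption about the chart's granularity); the function name `periods_with_unexpected_runs` is hypothetical.

```python
def periods_with_unexpected_runs(planned, actual, ratio=2.0):
    """Return the periods whose actual run count exceeds `ratio` times the planned count.

    `planned` and `actual` map a period label to a number of ETL runs.
    A period with runs but nothing planned is also flagged.
    """
    flagged = []
    for period, actual_count in actual.items():
        planned_count = planned.get(period, 0)
        if actual_count > ratio * planned_count:
            flagged.append(period)
    return flagged


planned = {"01:00": 10, "02:00": 8, "03:00": 0}
actual = {"01:00": 12, "02:00": 20, "03:00": 3}
print(periods_with_unexpected_runs(planned, actual))  # ['02:00', '03:00']
```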

Troubleshooting Ideas

We recommend troubleshooting step by step as follows:

(1) Check whether there are many manually triggered ETL tasks during the problematic period.

Observe the "ETL Runtime Distribution Chart Yesterday". Click the bar with a warning sign to drill down and check whether the "User" column shows auto-update; runs that do not show auto-update were triggered manually.

(2) Check whether there are many cascading update ETL tasks during the problematic period.

Observe the "ETL Runtime Distribution Chart Yesterday". Click the bar with a warning sign to drill down, select the operation object to jump to it, and check its resource lineage. The resource lineage shows whether the ETL run was triggered by an upstream task.

(3) Check whether there are large tasks blocking scheduled tasks, causing delays during the problematic period.

In the "ETL Runtime Distribution Chart Yesterday", look at the bar that precedes the one with the warning sign. Click it to drill down and sort by runtime.

Optimization Measures

If any task ran for longer than 60min, adjust it using the methods described in the "High CPU Usage Time for a Single ETL Update Task" scenario.

(4) Check whether the ETL task queue time is long (queue time >30min) during the problematic period.

In the "ETL Runtime Distribution Chart Yesterday", look at the bar that precedes the one with the warning sign. Click it to drill down and check the task queue time.

Optimization Measures

If the queue time is long (queue time >30min), check the memory usage during this period. If memory usage does not reach the 95% warning line, consult Guandata to confirm whether the ETL scheduling task concurrency can be increased in "Admin Settings - Operation & Maintenance Management - Parameter Configuration". If increasing concurrency does not solve the problem, consider capacity expansion. For specific expansion plans, contact Guandata for evaluation.
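
The decision path above can be summarized as a small function. This is a sketch under stated assumptions: the inputs are read manually from the charts, the function `queue_recommendation` and its return labels are hypothetical, and the branch for memory at or above 95% is an inference, since the text only describes the case below the warning line.

```python
QUEUE_LIMIT_MIN = 30         # from the text: queue time >30min counts as long
MEMORY_WARNING_PERCENT = 95  # from the text: 95% memory warning line


def queue_recommendation(queue_minutes, peak_memory_percent):
    """Map a queue time and peak memory usage to the next step suggested in the text."""
    if queue_minutes <= QUEUE_LIMIT_MIN:
        return "no_action"
    if peak_memory_percent < MEMORY_WARNING_PERCENT:
        # Memory has headroom: ask Guandata whether concurrency can be raised.
        return "consult_concurrency"
    # Assumption: with memory at the warning line, raising concurrency is unlikely
    # to help, so expansion is the option left to evaluate with Guandata.
    return "evaluate_expansion"


print(queue_recommendation(45, 80))  # consult_concurrency
print(queue_recommendation(45, 96))  # evaluate_expansion
print(queue_recommendation(10, 99))  # no_action
```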

Other Suggestions

Refer to the metrics "Node CPU Usage Trend Chart" and "Server CPU Load (System Load) Trend Chart" to identify the peaks in CPU usage and CPU load. Avoid running manually triggered ETLs, complex ETLs, or tasks with many downstream cascades during these periods.