Memory Load

Scenario 1: High Memory Usage

Cause of the Problem

Check the "Node Memory Usage Trend Chart" metric. If memory usage stays above the 95% warning line for 30 consecutive minutes, memory usage is considered high. Possible causes include:

(1) Task runtimes are too concentrated, resulting in resource queuing

(2) ETL operations and card export tasks occupy a large amount of memory
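The 95% / 30-minute rule above can be sketched in code. This is a minimal illustration, assuming the chart data has been exported as time-ordered (timestamp, usage percent) samples; the function name `has_sustained_breach` and the sample format are hypothetical and not part of the product.

```python
from datetime import datetime, timedelta


def has_sustained_breach(samples, threshold=95.0, window=timedelta(minutes=30)):
    """Return True if usage stays above `threshold` for at least `window`.

    `samples` is a list of (timestamp, usage_percent) tuples sorted by time.
    """
    breach_start = None
    for timestamp, usage in samples:
        if usage > threshold:
            # Remember when the current run of high readings began.
            if breach_start is None:
                breach_start = timestamp
            if timestamp - breach_start >= window:
                return True
        else:
            # Any reading at or below the line resets the run.
            breach_start = None
    return False


# Example: seven readings, five minutes apart, all above the warning line.
start = datetime(2024, 1, 1, 9, 0)
readings = [(start + timedelta(minutes=5 * i), 97.0) for i in range(7)]
print(has_sustained_breach(readings))
```

A single reading at or below the warning line restarts the 30-minute count, which matches the "continuously exceeds" wording above.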

Troubleshooting Ideas

We recommend troubleshooting as follows:

(1) Refer to the metric "Node Memory Usage Trend Chart" to observe whether memory usage is persistently high

Persistently high memory usage indicates insufficient memory resources.

Optimization Measures

To keep business tasks running efficiently and avoid impact from insufficient memory, consider the following solutions:

a. For a single Job-engine, it is recommended to set resource isolation in Control Tower to ensure the operation of important tasks;

b. If budget allows, consider expanding capacity. Contact Guandata to evaluate a specific expansion plan.

(2) Check whether tasks are concentrated (a large number of tasks running during memory usage peaks) in the following charts: "Average Queue and Runtime Distribution of Guan-Index Cards in the Last 30 Days", "Dataset Update Task Distribution Yesterday", and "ETL Task Distribution Yesterday"

In the "Average Queue and Runtime Distribution of Guan-Index Cards in the Last 30 Days" chart, find a date with a significant dark area in the queue time distribution (proportion above 20%). Click the bar to drill down and find the time periods on that day that also show a significant dark area. If there are more than 5 such periods, tasks were queuing for most of the system's running time, which results in high memory usage.
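The 20% and 5-period thresholds above can be expressed as a small check. This is a sketch under assumptions: it treats the "proportion" as queue time divided by queue time plus runtime, and it assumes the drill-down data is available as (label, queue minutes, run minutes) tuples. Both function names are hypothetical.

```python
def congested_periods(periods, share_threshold=0.20):
    """Return labels of periods whose queue-time share exceeds the threshold.

    `periods` is a list of (label, queue_minutes, run_minutes) tuples.
    """
    flagged = []
    for label, queue_minutes, run_minutes in periods:
        total = queue_minutes + run_minutes
        # Skip idle periods to avoid dividing by zero.
        if total > 0 and queue_minutes / total > share_threshold:
            flagged.append(label)
    return flagged


def is_mostly_queuing(periods, max_periods=5):
    """True when more than `max_periods` periods are congested."""
    return len(congested_periods(periods)) > max_periods


day = [("09:00", 30, 70), ("10:00", 10, 90), ("11:00", 0, 0)]
print(congested_periods(day))
```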

Check the "Dataset Update Task Distribution Yesterday" and "ETL Task Distribution Yesterday" charts to see if the peak periods coincide with memory usage peaks.
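One way to compare the task peaks with the memory peaks is to intersect the busiest hours with the hours above the warning line. The sketch below assumes hourly task counts and hourly memory usage have been read off the charts into dictionaries keyed by hour; the function name and the `top_n` cutoff are illustrative choices, not product settings.

```python
def overlapping_peak_hours(task_counts, memory_usage, top_n=3, memory_threshold=95.0):
    """Return the hours that are both task peaks and memory peaks.

    `task_counts` maps hour -> number of tasks started in that hour.
    `memory_usage` maps hour -> memory usage percent for that hour.
    """
    # The busiest hours by task count.
    busiest = sorted(task_counts, key=task_counts.get, reverse=True)[:top_n]
    # Hours in which memory usage is above the warning line.
    memory_peaks = {h for h, usage in memory_usage.items() if usage > memory_threshold}
    return sorted(set(busiest) & memory_peaks)


tasks = {8: 40, 9: 55, 10: 12, 14: 30}
memory = {8: 96.5, 9: 97.0, 10: 70.0, 14: 60.0}
print(overlapping_peak_hours(tasks, memory))
```

A non-empty result suggests that task concentration is contributing to the memory peaks.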

Optimization Measures

Where business requirements allow, stagger jobs away from memory peaks to reduce the number of tasks running at peak times. In the distribution chart, click a bar with a warning sign or a dark bar to drill down, then sort tasks by runtime. Click the operation object to jump to the resource lineage of the ETL or dataset and check whether it has upstream dependencies. If possible, schedule update tasks to run during off-peak periods.
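Choosing an off-peak slot while respecting upstream dependencies can be sketched as follows. This assumes hourly task counts taken from the distribution charts and an `earliest_hour` that you determine yourself from the resource lineage (the hour by which upstream tasks have finished). The function name is hypothetical.

```python
def pick_off_peak_hour(task_counts, earliest_hour=0):
    """Return the least busy hour at or after `earliest_hour`, or None.

    `task_counts` maps hour (0-23) -> number of tasks running in that hour.
    """
    candidates = {h: c for h, c in task_counts.items() if h >= earliest_hour}
    if not candidates:
        return None
    # Prefer the lowest task count; break ties by choosing the earlier hour.
    return min(candidates, key=lambda h: (candidates[h], h))


counts = {0: 3, 2: 1, 9: 55, 22: 4}
print(pick_off_peak_hour(counts, earliest_hour=3))
```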

Scenario 2: High CPU Usage

Cause of the Problem

Check the "Node CPU Usage Trend Chart" metric. If CPU usage stays above the 95% warning line for 30 consecutive minutes, CPU usage is considered high. Possible causes include:

(1) Tasks run for too long, causing resource blocking

(2) Task runtimes are too concentrated, resulting in uneven CPU usage distribution

Troubleshooting Ideas

We recommend troubleshooting as follows:

(1) Refer to the metric "Node CPU Usage Trend Chart" to observe whether CPU usage is persistently high

Persistently high CPU usage indicates insufficient CPU resources.

Optimization Measures

To keep business tasks running efficiently and avoid impact from high CPU usage, consider the following solutions:

a. For a single Job-engine, it is recommended to set resource isolation in Control Tower to ensure the operation of important tasks;

b. If budget allows, consider expanding capacity. Contact Guandata to evaluate a specific expansion plan.

(2) Check the running tasks during peak periods

Click the "Node CPU Usage Trend Chart" to drill down, sort by runtime, and find tasks with long runtimes (over 60 minutes).
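If the drill-down list is exported, the same filter and sort can be reproduced offline. A minimal sketch, assuming each task is a dictionary with hypothetical `name` and `runtime_min` fields:

```python
def long_running_tasks(tasks, min_minutes=60):
    """Return tasks that ran longer than `min_minutes`, longest first."""
    slow = [t for t in tasks if t["runtime_min"] > min_minutes]
    return sorted(slow, key=lambda t: t["runtime_min"], reverse=True)


tasks = [
    {"name": "etl_a", "runtime_min": 45},
    {"name": "etl_b", "runtime_min": 130},
    {"name": "dataset_c", "runtime_min": 75},
]
for task in long_running_tasks(tasks):
    print(task["name"], task["runtime_min"])
```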

Optimization Measures

(1) For card tasks and dataset update tasks, go to "Admin Settings - Operation & Maintenance Management - Task Management" and search with the "Task Name" as the "Operation Object" to review the historical runtimes of the task and judge whether the current runtime is abnormal. If it is not, check the size of the corresponding dataset (number of rows and columns) to see whether the dataset update method can be optimized (prefer incremental updates).

(2) For ETL tasks, in cases with multiple output dataset nodes:

a. If there are many shared nodes in the ETL, separate the shared part into a standalone ETL, and use the output dataset of this ETL as the input dataset for the original ETL to avoid repeated calculations of the same data;

b. If there are few shared nodes in the ETL, split different processing flows into independent ETLs to allow multiple ETLs to update in parallel and improve computing efficiency.

(3) After completing the above optimizations, if you still need to spread resource usage more evenly over time, click the operation object to jump to the resource lineage of the ETL or dataset and check whether it has upstream dependencies. If possible, schedule update tasks to run during off-peak periods.
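Two of the judgments in the measures above can be illustrated in code: deciding whether a runtime is abnormal, and finding the nodes shared by several ETL outputs. Both parts are sketches under assumptions. The "more than twice the historical median" rule is an illustrative threshold, not one stated by the product, and the ETL is modeled as a plain dictionary mapping each node to its direct upstream nodes. All names are hypothetical.

```python
from statistics import median


def is_runtime_abnormal(history_minutes, latest_minutes, factor=2.0):
    """Flag a runtime that exceeds `factor` times the historical median."""
    if not history_minutes:
        return False  # No history to compare against.
    return latest_minutes > factor * median(history_minutes)


def upstream_nodes(graph, node):
    """Return every node that `node` depends on, directly or indirectly."""
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def shared_nodes(graph, outputs):
    """Return the nodes that are upstream of every output dataset."""
    ancestor_sets = [upstream_nodes(graph, out) for out in outputs]
    return set.intersection(*ancestor_sets) if ancestor_sets else set()


etl = {
    "out_a": ["join"],
    "out_b": ["agg"],
    "join": ["clean"],
    "agg": ["clean"],
    "clean": ["source"],
}
print(sorted(shared_nodes(etl, ["out_a", "out_b"])))
```

A large shared set points toward option a (extract the shared part into its own ETL), while a small one points toward option b (split the flows into independent ETLs).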

(3) Check whether tasks are concentrated (a large number of tasks running during CPU usage peaks) in the following charts: "Dataset Update Task Distribution Yesterday" and "ETL Task Distribution Yesterday"

You can click the bar with a warning sign in the distribution chart to drill down and adjust according to the methods provided above.

Scenario 3: High CPU Load

Cause of the Problem

Check the "Server CPU Load (System Load) Trend Chart" metric. If CPU load stays above the 95% warning line for 30 consecutive minutes, CPU load is considered high, which causes tasks to queue. Possible causes include:

(1) Tasks run for too long, causing resource blocking

(2) Task runtimes are too concentrated, resulting in uneven CPU usage distribution
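This page does not state how the chart converts system load into a percentage. A common convention, assumed here, is to divide the load average by the number of CPU cores, so that a load equal to the core count reads as 100%. The sketch below shows that conversion; the function name is hypothetical, and on Unix-like systems the inputs could come from Python's `os.getloadavg()` and `os.cpu_count()`.

```python
def load_percent(load_average, cpu_cores):
    """Express a system load average as a percentage of available cores.

    Assumption: 100% means the load average equals the number of cores.
    """
    if cpu_cores <= 0:
        raise ValueError("cpu_cores must be positive")
    return load_average / cpu_cores * 100.0


# Example: a load average of 6.0 on an 8-core server.
print(load_percent(6.0, 8))
```

Under this assumption, a reading above 95% means that nearly every core has work waiting, which is consistent with the task queuing described above.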

Troubleshooting Ideas

We recommend troubleshooting as follows:

(1) Refer to the metric "Server CPU Load (System Load) Trend Chart" to observe whether CPU load is persistently high

Persistently high CPU load indicates insufficient CPU resources.

Optimization Measures

To keep business tasks running efficiently and avoid impact from high CPU load, consider the following solutions:

a. For a single Job-engine, it is recommended to set resource isolation in Control Tower to ensure the operation of important tasks;

b. If budget allows, consider expanding capacity. Contact Guandata to evaluate a specific expansion plan.

(2) Check the "Node CPU Usage Trend Chart"

Find the tasks running during peak periods: click the "Node CPU Usage Trend Chart" to drill down, sort by runtime, and look for tasks with long runtimes (over 60 minutes).

Optimization Measures

You can click the operation object to jump and view the resource lineage of the ETL/dataset to see if there are upstream dependencies. If possible, set scheduled update tasks to run during off-peak periods.

(3) Check whether tasks are concentrated (a large number of tasks running during CPU load peaks) in the following charts: "Dataset Update Task Distribution Yesterday" and "ETL Task Distribution Yesterday"

Where business requirements allow, stagger jobs away from CPU load peaks to reduce the number of tasks running at peak times. In the distribution chart, click a bar with a warning sign to drill down and sort by runtime.

Optimization Measures

For tasks with long runtimes (over 60min), if possible, consider moving them to off-peak periods.