Memory Load
Scenario 1: High Memory Usage
Cause of the Problem
It is recommended to check the metric "Node Memory Usage Trend Chart". If the memory usage continuously exceeds the 95% warning line for 30 minutes, it indicates high memory usage. Possible causes include:
(1) Task runtimes are too concentrated, resulting in resource queuing
(2) ETL operations and card export tasks occupy a large amount of memory
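As a rough illustration of the "above the 95% line for 30 minutes" check, the Python sketch below scans metric samples for sustained breaches. The sample points and the assumption that the trend chart data can be exported as (timestamp, percent) pairs are ours for illustration, not a documented product interface.

```python
from datetime import datetime, timedelta

# Hypothetical samples from the "Node Memory Usage Trend Chart":
# (timestamp, memory usage in percent). Replace with your own export.
samples = [
    (datetime(2024, 5, 1, 9, 0), 91.0),
    (datetime(2024, 5, 1, 9, 10), 96.2),
    (datetime(2024, 5, 1, 9, 20), 97.8),
    (datetime(2024, 5, 1, 9, 30), 96.5),
    (datetime(2024, 5, 1, 9, 40), 95.4),
    (datetime(2024, 5, 1, 9, 50), 93.1),
]

THRESHOLD = 95.0                      # the 95% warning line
MIN_DURATION = timedelta(minutes=30)  # sustained for 30 minutes or longer


def sustained_breaches(points, threshold, min_duration):
    """Return (start, end) windows where the metric stayed above the threshold
    for at least min_duration."""
    windows, start, end = [], None, None
    for ts, value in sorted(points):
        if value > threshold:
            start = start or ts
            end = ts
        else:
            if start is not None and end - start >= min_duration:
                windows.append((start, end))
            start, end = None, None
    if start is not None and end - start >= min_duration:
        windows.append((start, end))
    return windows


for begin, finish in sustained_breaches(samples, THRESHOLD, MIN_DURATION):
    print(f"High memory usage from {begin:%H:%M} to {finish:%H:%M}")
```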
Troubleshooting Ideas
We recommend troubleshooting as follows:
(1) Refer to the metric "Node Memory Usage Trend Chart" to observe whether memory usage is persistently high
Persistently high memory usage indicates insufficient memory resources.
Optimization Measures
To keep the business running efficiently and avoid impact from insufficient memory, consider the following solutions:
a. For a single Job-engine, it is recommended to set resource isolation in Control Tower to ensure that important tasks can run;
b. If budget allows, consider expansion. For specific expansion plans, please contact Guandata for evaluation.
(2) Check whether tasks are concentrated (a large number of tasks running during memory usage peaks) in the following charts: "Average Queue and Runtime Distribution of Guan-Index Cards in the Last 30 Days", "Dataset Update Task Distribution Yesterday", and "ETL Task Distribution Yesterday"
Check the "Average Queue and Runtime Distribution of Guan-Index Cards in the Last 30 Days" chart, find the date with a significant dark area in the queue time distribution (proportion exceeds 20%), click the bar to drill down, and find the time period with a significant dark area on that day. If there are more than 5 such periods, it means that most of the system runtime has queuing, resulting in high memory usage.
Check the "Dataset Update Task Distribution Yesterday" and "ETL Task Distribution Yesterday" charts to see if the peak periods coincide with memory usage peaks.
Optimization Measures
It is recommended to stagger jobs away from memory peaks as much as possible, so that fewer tasks run during peak periods while business needs are still met. In the distribution chart, click a bar with a warning sign or a dark bar to drill down and sort tasks by runtime. Click the operation object to jump to the resource lineage of the ETL/dataset and check for upstream dependencies. Where possible, schedule update tasks to run during off-peak periods.
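If it helps to shortlist candidate off-peak slots, a small sketch like the one below ranks hours by task count; the hourly counts and the 70%/30% cutoffs are illustrative assumptions, not product behavior.

```python
# Hourly task counts read (hypothetically) from the "Dataset Update Task
# Distribution Yesterday" / "ETL Task Distribution Yesterday" charts.
tasks_per_hour = {
    0: 2, 1: 1, 2: 0, 3: 1, 4: 2, 5: 3,
    6: 5, 7: 9, 8: 14, 9: 18, 10: 16, 11: 12,
    12: 8, 13: 10, 14: 15, 15: 13, 16: 11, 17: 9,
    18: 6, 19: 4, 20: 3, 21: 2, 22: 1, 23: 1,
}

max_count = max(tasks_per_hour.values())
peak_hours = sorted(h for h, n in tasks_per_hour.items() if n >= 0.7 * max_count)
off_peak_hours = sorted(h for h, n in tasks_per_hour.items() if n <= 0.3 * max_count)

print("Peak hours (avoid new schedules):", peak_hours)
print("Off-peak hours (candidate slots):", off_peak_hours)
```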
Scenario 2: High CPU Usage
Cause of the Problem
It is recommended to check the metric "Node CPU Usage Trend Chart". If the CPU usage continuously exceeds the 95% warning line for 30 minutes, it indicates high CPU usage. Possible causes include:
(1) Tasks run for too long, causing resource blocking
(2) Task runtimes are too concentrated, resulting in uneven CPU usage distribution
Troubleshooting Ideas
We recommend troubleshooting as follows:
(1) Refer to the metric "Node CPU Usage Trend Chart" to observe whether CPU usage is persistently high
Persistently high CPU usage indicates insufficient CPU resources.
Optimization Measures
To keep the business running efficiently and avoid impact from high CPU usage, consider the following solutions:
a. For a single Job-engine, it is recommended to set resource isolation in Control Tower to ensure that important tasks can run;
b. If budget allows, consider expansion. For specific expansion plans, please contact Guandata for evaluation.
(2) Check the running tasks during peak periods
You can click the "Node CPU Usage Trend Chart" to drill down, sort by runtime, and find tasks with long runtimes (runtime >60min).
Optimization Measures
(1) For card tasks and dataset update tasks, you can search by "Task Name" as the "Operation Object" in "Admin Settings - Operation & Maintenance Management - Task Management" to review the historical runtime of the task and judge whether the current runtime is "abnormal" (a sketch of this comparison follows the list below). If the runtime is not abnormal, check the size of the corresponding dataset and its number of rows and columns to see whether the update method leaves room for optimization (prefer incremental updates).
(2) For ETL tasks with multiple output dataset nodes:
a. If there are many shared nodes in the ETL, separate the shared part into a standalone ETL, and use the output dataset of this ETL as the input dataset for the original ETL to avoid repeated calculations of the same data;
b. If there are few shared nodes in the ETL, split different processing flows into independent ETLs to allow multiple ETLs to update in parallel and improve computing efficiency.
(3) After completing the above optimizations, if you need to balance resource usage time, click the operation object to jump and view the resource lineage of the ETL/dataset to see if there are upstream dependencies. If possible, set scheduled update tasks to run during off-peak periods.
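For the "is the runtime abnormal" judgement in step (1), one possible approach is to compare the latest runtime against the task's own history, for example flagging runs above 1.5x the historical median. Both the data and the multiplier below are illustrative assumptions.

```python
from statistics import median

# Hypothetical runtime history (minutes) pulled from Task Management
# for one dataset update task, oldest to newest.
history_min = [42, 45, 40, 47, 44, 43, 46, 41, 45, 88]

latest = history_min[-1]
baseline = median(history_min[:-1])
ABNORMAL_FACTOR = 1.5  # assumed rule of thumb, tune to your environment

if latest > ABNORMAL_FACTOR * baseline:
    print(f"Latest run ({latest} min) is abnormal vs. median {baseline} min;"
          " investigate the task itself (data volume spike, lock, bad SQL).")
else:
    print(f"Latest run ({latest} min) is in line with history;"
          " look at dataset size and switch to incremental updates if possible.")
```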
(3) Check whether tasks are concentrated (a large number of tasks running during CPU usage peaks) in the following charts: "Dataset Update Task Distribution Yesterday" and "ETL Task Distribution Yesterday"
You can click the bar with a warning sign in the distribution chart to drill down and adjust according to the methods provided above.
Scenario 3: High CPU Load
Cause of the Problem
It is recommended to check the metric "Server CPU Load (System Load) Trend Chart". If the CPU load continuously exceeds the 95% warning line for 30 minutes, it indicates high CPU load, which will cause task queuing. Possible causes include:
(1) Tasks run for too long, causing resource blocking
(2) Task runtimes are too concentrated, resulting in uneven CPU usage distribution
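System load is normally reported as a load average rather than a percentage. Assuming the 95% warning line corresponds to the load average approaching the node's core count, the check could be sketched as follows; the core count, the samples, and the normalization are our assumptions.

```python
# Hypothetical values: a 32-core node sampled every 10 minutes (1-minute load average).
CORES = 32
load_samples = [30.8, 31.2, 31.6, 30.9, 24.0]

for load in load_samples:
    utilization = load / CORES        # normalize load against available cores
    status = "HIGH" if utilization > 0.95 else "ok"
    print(f"load={load:5.1f}  load/cores={utilization:6.1%}  {status}")

# If load/cores stays above 95% for 30 minutes, tasks queue for CPU time:
# shorten long tasks first, then spread the remaining ones across off-peak hours.
```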
Troubleshooting Ideas
We recommend troubleshooting as follows:
(1) Refer to the metric "Server CPU Load (System Load) Trend Chart" to observe whether CPU load is persistently high
Persistently high CPU load indicates insufficient CPU resources.
Optimization Measures
To keep the business running efficiently and avoid impact from high CPU load, consider the following solutions:
a. For a single Job-engine, it is recommended to set resource isolation in Control Tower to ensure that important tasks can run;
b. If budget allows, consider expansion. For specific expansion plans, please contact Guandata for evaluation.
(2) Check the "Node CPU Usage Trend Chart"
Find the running tasks during peak periods. You can click the "Node CPU Usage Trend Chart" to drill down, sort by runtime, and find tasks with long runtimes (runtime > 60 min).
Optimization Measures
You can click the operation object to jump and view the resource lineage of the ETL/dataset to see if there are upstream dependencies. If possible, set scheduled update tasks to run during off-peak periods.
(3) Check whether tasks are concentrated (a large number of tasks running during CPU load peaks) in the following charts: "Dataset Update Task Distribution Yesterday" and "ETL Task Distribution Yesterday"
It is recommended to stagger jobs during CPU load peaks as much as possible to reduce the number of tasks running during peak periods while ensuring business operation. You can click the bar with a warning sign in the distribution chart to drill down and sort by runtime.
Optimization Measures
For tasks with long runtimes (over 60 min), move them to off-peak periods where possible.