Memory Load
Scenario 1: Memory Usage Is Too High
Cause
Check the metric Node Memory Usage Trend by Time Period. If memory usage remains above the 95% warning line for 30 consecutive minutes, memory usage is too high. Possible causes include:
(1) Task running times are too concentrated, so tasks queue for resources
(2) ETL runs and card export tasks consume a large amount of memory
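If the usage trend is exported as a series of samples, the 30-minute rule above can be checked with a short script. This is a minimal sketch, not a product feature: the function name `sustained_above` and the one-sample-per-minute export format are assumptions made for illustration.

```python
from typing import Sequence


def sustained_above(samples: Sequence[float], threshold: float = 95.0,
                    window: int = 30) -> bool:
    """Return True if `window` consecutive samples all exceed `threshold`.

    Assumes one usage sample (in percent) per minute; this export format
    is hypothetical and not defined by the product.
    """
    run = 0  # length of the current streak of samples above the threshold
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= window:
            return True
    return False
```

A single sample at or below the threshold resets the streak, so short dips break the 30-minute condition.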
Troubleshooting Approach
We recommend troubleshooting as follows:
(1) Refer to Node Memory Usage Trend by Time Period to see whether memory usage stays high for a long time
Long-term high memory usage indicates insufficient memory resources.
Optimization Measures
To ensure business operation efficiency and prevent insufficient memory from affecting business, consider the following options:
a. When there is a single Job-engine, configure resource isolation in Control Tower to ensure important tasks can run.
b. If budget allows, expand capacity. Contact Guandata to evaluate the specific expansion plan.
(2) Use Average Queue and Run Duration Distribution of Guan-Index Cards in the Last 30 Days, Yesterday Dataset Update Task Distribution, and Yesterday ETL Run Task Distribution to check whether tasks are concentrated in time, especially whether many tasks run during memory usage peaks
In Average Queue and Run Duration Distribution of Guan-Index Cards in the Last 30 Days, find dates on which the dark area, which represents queue time, is prominent (more than 20%). Click the bar to drill down and find the time periods on that day with a prominent dark area. If there are more than five such periods, tasks are queueing for most of the system's running time, which leads to high memory usage.
Check Yesterday Dataset Update Task Distribution and Yesterday ETL Run Task Distribution to see whether bar chart peaks overlap with memory usage peaks.
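The queue-time check can also be approximated on exported figures. The sketch below assumes each time period is exported as a pair of queue seconds and run seconds, and that the queue share is queue time divided by queue plus run time. Both the data shape and the formula are assumptions, as is applying the 20% limit to individual periods; the names `queue_share` and `congested_periods` are hypothetical.

```python
from typing import Dict, List, Tuple


def queue_share(queue_s: float, run_s: float) -> float:
    """Fraction of total time spent queueing (assumed definition)."""
    total = queue_s + run_s
    return queue_s / total if total > 0 else 0.0


def congested_periods(periods: Dict[str, Tuple[float, float]],
                      share_limit: float = 0.20) -> List[str]:
    """Return labels of periods whose queue share exceeds `share_limit`.

    `periods` maps a period label to (queue_seconds, run_seconds).
    """
    return [label for label, (queue_s, run_s) in periods.items()
            if queue_share(queue_s, run_s) > share_limit]
```

With this helper, `len(congested_periods(day)) > 5` corresponds to the "more than five such periods" condition for one day.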
Optimization Measures
While ensuring business operation, spread out jobs during memory peak periods and reduce tasks running during peak periods. Click bars with warning signs or dark bars in the distribution chart to drill down, then sort tasks by run duration. Click the operation object to jump to it and view ETL/dataset lineage to determine whether upstream dependencies exist. Configure scheduled update tasks to run off-peak when business is not affected.
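To shortlist candidates for rescheduling, it can help to list the tasks whose run interval overlaps a memory peak window. The following is a minimal sketch under the assumption that task start and end times have been exported; the names `overlaps` and `tasks_in_peak` and the input format are hypothetical.

```python
from datetime import datetime
from typing import Dict, List, Tuple

Interval = Tuple[datetime, datetime]  # (start, end)


def overlaps(task: Interval, peak: Interval) -> bool:
    """True if the two half-open intervals share any time."""
    return task[0] < peak[1] and peak[0] < task[1]


def tasks_in_peak(tasks: Dict[str, Interval], peak: Interval) -> List[str]:
    """Return names of tasks that run at any point during the peak window."""
    return [name for name, span in tasks.items() if overlaps(span, peak)]
```

Tasks returned by `tasks_in_peak` still need a lineage check before they are moved, because upstream dependencies may constrain when they can run.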
Scenario 2: CPU Usage Is Too High
Cause
Check the metric Node CPU Usage Trend by Time Period. If CPU usage remains above the 95% warning line for 30 consecutive minutes, CPU usage is too high. Possible causes include:
(1) Tasks run too long and block resources
(2) Task running times are too concentrated, causing uneven CPU usage distribution
Troubleshooting Approach
We recommend troubleshooting as follows:
(1) Refer to Node CPU Usage Trend by Time Period to see whether CPU usage stays high for a long time
Long-term high CPU usage indicates insufficient CPU resources.
Optimization Measures
To ensure business operation efficiency and prevent high CPU usage from affecting business, consider the following options:
a. When there is a single Job-engine, configure resource isolation in Control Tower to ensure important tasks can run.
b. If budget allows, expand capacity. Contact Guandata to evaluate the specific expansion plan.
(2) Check running tasks during peak periods
Click Node CPU Usage Trend by Time Period to drill down, sort by run duration, and find tasks that run longer than 60 minutes.
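The same filter and sort can be applied to an exported task list. This is a minimal sketch that assumes each task run is exported with a name and a run duration in minutes; `TaskRun` and `long_running` are hypothetical names.

```python
from typing import List, NamedTuple


class TaskRun(NamedTuple):
    name: str
    duration_min: float  # run duration in minutes


def long_running(tasks: List[TaskRun], limit_min: float = 60.0) -> List[TaskRun]:
    """Return tasks running longer than `limit_min`, longest first."""
    slow = [t for t in tasks if t.duration_min > limit_min]
    return sorted(slow, key=lambda t: t.duration_min, reverse=True)
```

For example, runs of 45, 130, and 61 minutes yield the 130-minute and 61-minute tasks, in that order.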
Optimization Measures
(1) For card tasks and dataset update tasks, go to Admin Settings > Operations Management > Task Management and search for the operation object by Task Name to review its historical run durations and determine whether the current run duration is abnormal. If it is not abnormal, check the row and column counts of the dataset (for a card, its underlying dataset) to determine whether the dataset update method can be optimized, preferably by using incremental updates.
(2) For ETL run tasks, when there are multiple output dataset nodes:
a. If many nodes are shared in the ETL, process the shared part in a separate ETL and use that ETL output dataset as the input dataset of the original ETL to avoid repeated calculation of the same data.
b. If few nodes are shared in the ETL, split different processing steps into independent ETLs so multiple ETLs can update in parallel and improve calculation efficiency.
(3) After the above optimization, if resource usage needs to be balanced over time, click the operation object to jump to it and view ETL/dataset lineage to understand whether upstream dependencies exist. Configure scheduled update tasks to run off-peak when business is not affected.
(3) Use Yesterday Dataset Update Task Distribution and Yesterday ETL Run Task Distribution to check whether tasks are concentrated in time, especially whether many tasks run during CPU usage peaks
Click bars with warning signs in the distribution chart to drill down, then adjust according to the handling method in the previous step.
Scenario 3: CPU Load Is Too High
Cause
Check the metric Server CPU Load (System Load) Trend. If CPU load remains above the 95% warning line for 30 consecutive minutes, CPU load is too high and task queues will occur. Possible causes include:
(1) Tasks run too long and block resources
(2) Task running times are too concentrated, causing uneven CPU usage distribution
Troubleshooting Approach
We recommend troubleshooting as follows:
(1) Refer to Server CPU Load (System Load) Trend to see whether CPU load stays high for a long time
Long-term high CPU load indicates insufficient CPU resources.
Optimization Measures
To ensure business operation efficiency and prevent high CPU load from affecting business, consider the following options:
a. When there is a single Job-engine, configure resource isolation in Control Tower to ensure important tasks can run.
b. If budget allows, expand capacity. Contact Guandata to evaluate the specific expansion plan.
(2) Check Node CPU Usage Trend by Time Period
Find the tasks running during peak periods. Click Node CPU Usage Trend by Time Period to drill down, sort by run duration, and find tasks that run longer than 60 minutes.
Optimization Measures
Click the operation object to jump to it and view the resource lineage of the ETL/dataset to understand whether upstream dependencies exist. Configure scheduled update tasks to run off-peak when business is not affected.
(3) Use Yesterday Dataset Update Task Distribution and Yesterday ETL Run Task Distribution to check whether tasks are concentrated in time, especially whether many tasks run during CPU load peaks
Spread out jobs during CPU load peak periods and reduce tasks running during peak periods when business operation is not affected. Click bars with warning signs in the distribution chart to drill down and sort by run duration.
Optimization Measures
Consider moving tasks that run longer than 60 minutes to off-peak periods when business is not affected.
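When choosing an off-peak slot, one simple approach is to rank the hours of the day by how many tasks already run in them. The sketch below assumes hourly task counts have been read off the distribution charts; `quietest_hours` is a hypothetical helper, and task count is only a rough proxy for load.

```python
from typing import List, Sequence


def quietest_hours(tasks_per_hour: Sequence[int], count: int = 3) -> List[int]:
    """Return the `count` hour indexes with the fewest tasks.

    Ties are broken by the earlier hour. `tasks_per_hour[h]` is the number
    of tasks observed in hour `h` (assumed input, e.g. 24 values).
    """
    ranked = sorted(range(len(tasks_per_hour)),
                    key=lambda hour: (tasks_per_hour[hour], hour))
    return ranked[:count]
```

Candidate hours still need to be checked against business requirements and upstream dependencies before a task is rescheduled.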