This topic describes how to use the job analysis feature in the MaxCompute console to analyze job-level resources and learn about the resource consumption of jobs in typical scenarios. This topic also provides suggestions on how to handle job runtime issues if a job runs slowly.
Background information
If a job runs for a long time and you cannot identify the cause by using LogView, or if a completed job took longer than expected, check whether the issue is caused by insufficient resources.
MaxCompute provides the job analysis feature. Data developers and administrators can view the resource consumption information of historical jobs and running jobs on the job analysis page of the MaxCompute console.
Precautions
The following typical scenarios use a simplified evaluation method. In your own environment, we recommend that you adjust job attributes based on your actual workload and monitor the effect of each adjustment.
Scenario 1: A job runs slowly due to insufficient reserved subscription resources
A company purchased 50 subscription compute units (CUs) to run its jobs. More than 10 batches of jobs (more than 1,000 jobs in total) run on these resources on a regular schedule every day.
A data development engineer finds that the job whose instance ID is 20240717020015831xxxxxxxxxxxx runs for an excessively long period of time and affects subsequent processing. The engineer performs the following operations to view the details of the job: Log on to the MaxCompute console. In the left-side navigation pane, choose Workspace > Jobs. On the Jobs page, search for the job and click Analyze in the Actions column.
The resource consumption chart for the job shows that all 50 CUs are occupied and most of the computing resources are used by the current job. However, the number of CUs that jobs are waiting for at the quota level remains high due to limited resources. This indicates that the reserved computing resources are insufficient to process the requests of all jobs in a timely manner. The computing resources that are allocated to the current job are also insufficient. As a result, the job runs slowly.
To handle the preceding issue, use one of the following methods:
Evaluate whether the job initiation time can be changed to avoid insufficient reserved resources when a large number of requests require resources.
If the job schedule cannot be adjusted, increase the number of subscription resources. You can go to the Cost Optimization page, specify the expected job completion time, and then view the recommended resource allocation plan.
Scenario 2: A job waits for resources for a long period of time due to resource competition
A company purchased 50 subscription CUs to run its jobs. More than 10 batches of jobs (more than 1,000 jobs in total) run on these resources on a regular schedule every day.
A data development engineer finds that the job whose instance ID is 20240717020020365xxxxxxxxxxxx runs for an excessively long period of time and affects subsequent processing. The engineer performs the following operations to identify the cause of the issue: Log on to the MaxCompute console. In the left-side navigation pane, choose Workspace > Jobs. On the Jobs page, search for the job and click Analyze in the Actions column. The job information shows that the job runs for 21 minutes and 17 seconds but more than half of the time is spent waiting for resources.
The resource consumption chart for the job shows that the job waits for resources during the first 13 minutes after it is submitted, while the resource usage at the quota level stays at the upper limit. This indicates that other jobs occupy all resources, so the job cannot obtain computing resources to run. After 13 minutes, the job gradually obtains resources, and the resource usage at the quota level no longer reaches the upper limit.
You can click a time point on the x-axis of the resource consumption chart to view the quota-level resource allocation at that time point, including the resource allocation for all running jobs and for all waiting jobs. The following figure shows that, at 10:04, the current job has not obtained resources, three jobs with a priority of 9 are consuming computing resources, and five jobs are waiting for resources.
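The "more than half of the time" observation can be checked with a quick calculation. The following is a minimal sketch that uses only the two figures quoted above (a total runtime of 21 minutes 17 seconds and roughly 13 minutes of waiting). The variable names are illustrative and do not correspond to any MaxCompute API.

```python
# Figures quoted in the scenario; the 13-minute wait is read off the chart,
# so treat it as approximate.
total_runtime_s = 21 * 60 + 17   # 21 minutes 17 seconds
wait_time_s = 13 * 60            # about 13 minutes spent waiting for resources

# Fraction of the total runtime during which the job held no computing resources.
wait_share = wait_time_s / total_runtime_s

print(f"Share of runtime spent waiting: {wait_share:.1%}")  # 61.1%
```

A wait share well above 50% suggests that the job is slow mainly because it could not obtain resources, not because its computation is heavy.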
Click the color bar for Resource Allocation for Wait Jobs to view the list of jobs that are waiting for resources. The following figure shows that the job whose instance ID is 20240717020015831gza7jdf21uv3 occupies a large number of resources at the time point.
The resource consumption of the job 20240717020015831gza7jdf21uv3 shows that the job occupies a large number of computing resources at the time point.
To handle the preceding issue, use one of the following methods:
Evaluate whether the job initiation time can be changed to avoid insufficient resources caused by resource competition.
Evaluate whether job priorities can be adjusted. If multiple jobs must be initiated at the same time, you can increase the priorities of important jobs. This way, when multiple jobs request resources at the same time, resources are preferentially allocated to the jobs with higher priorities.
Increase the number of subscription resources. You can go to the Cost Optimization page, specify the expected job completion time, and then view the recommended resource allocation plan.
In this example, the data development engineer changes the task priority of the node to 0 (in MaxCompute, a smaller value indicates a higher priority). After the adjustment, the job spends less time waiting for resources and quickly obtains 50% of the reserved computing resources. The total runtime of the job drops from about 21 minutes to about 6 minutes.
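The effect of the priority change can be illustrated with a toy model. The following sketch is not the MaxCompute scheduler. It assumes a strict priority-ordered allocation in which a smaller priority value is served first (consistent with the change to 0 described above), and the job names and CU demands are hypothetical values chosen to mirror the 50-CU quota in this scenario.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int  # assumption: 0-9, smaller value is served first
    demand: int    # CUs the job requests (hypothetical numbers)


def allocate(jobs, total_cus):
    """Grant CUs to jobs in priority order until the quota is exhausted."""
    remaining = total_cus
    grants = {}
    # sorted() is stable, so jobs with equal priority keep submission order.
    for job in sorted(jobs, key=lambda j: j.priority):
        granted = min(job.demand, remaining)
        grants[job.name] = granted
        remaining -= granted
    return grants


# Before: both jobs have priority 9, and the earlier job takes the whole quota.
before = allocate([Job("big_job", 9, 50), Job("slow_job", 9, 25)], total_cus=50)
# After: the slow job is raised to priority 0 and is served first.
after = allocate([Job("big_job", 9, 50), Job("slow_job", 0, 25)], total_cus=50)

print(before)  # {'big_job': 50, 'slow_job': 0}
print(after)   # {'slow_job': 25, 'big_job': 25}
```

In this simplified model, the higher-priority job obtains half of the quota as soon as it requests resources. A real scheduler also accounts for factors such as resources already held by running jobs, so actual behavior can differ.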