Intelligent Diagnostics

This topic describes the intelligent diagnostics feature for MaxCompute SQL jobs, which provides diagnostic results and suggestions to help you resolve job errors or improve query performance. It explains how to view diagnostic results and how to interpret them. Because query efficiency depends on many factors, intelligent diagnostics identifies only certain anomalies and provides only some of the possible suggestions for overall query performance.

For more information about job diagnostics and optimization, see Diagnostic cases of Logview and Optimize SQL statements.

Limits

Intelligent diagnostics are available exclusively for SQL jobs.

View intelligent diagnostic results and suggestions

Log on to the MaxCompute console and select a region in the upper-left corner.
In the left-side navigation pane, choose Workspace > Jobs to access the Jobs page.
Note
The default time range for querying jobs is one hour. You can adjust the time range to match when the jobs in your project were run.
In the Intelligent Diagnostics column of the desired job, click the diagnostic result tag to go to the Job Insights page. On the Job Summary tab, you can find detailed explanations of the diagnostic results and optimization recommendations.

Diagnostic result descriptions

An empty intelligent diagnostics column may indicate one of the following:
- The job has run successfully without any diagnosed anomalies.
- The job has only recently completed. Intelligent diagnostic results are generated the day after a job is completed.
- The job was executed before November 1, 2023.
- The SQL job was executed in one of certain regions, including China (Hong Kong), China East 2 Finance, China North 2 Finance (Preview), China North 2 Ali Gov 1, China South 1 Finance, Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Germany (Frankfurt), UK (London), US (Silicon Valley), US (Virginia), UAE (Dubai), and SAU (Riyadh - Partner Region).
To manually initiate diagnostics and view detailed results, click Insights in the Actions column of the target job on the Jobs page.
Red tags represent job error message diagnostics, while orange tags indicate job performance diagnostics.

Interpret intelligent diagnostic results

The following sections explain the meanings of SQL job intelligent diagnostic results and their potential solutions.

Insufficient resources

A job is considered to have insufficient resources if it uses less than 95% of the requested computing resources for over five minutes continuously.
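
The rule above can be sketched in Python as a check over resource samples. This is only an illustration under stated assumptions: the sample format `(timestamp_seconds, used, requested)` and the function name are hypothetical, and the way MaxCompute actually samples and evaluates resource usage is not described in this topic.

```python
from typing import Iterable, Tuple

THRESHOLD = 0.95         # fraction of requested resources, taken from the rule above
MIN_DURATION_S = 5 * 60  # "over five minutes", taken from the rule above


def has_insufficient_resources(
    samples: Iterable[Tuple[float, float, float]]
) -> bool:
    """Return True if usage stays below 95% of the request for over 5 minutes.

    samples: hypothetical (timestamp_seconds, used, requested) tuples,
    sorted by timestamp.
    """
    streak_start = None  # timestamp at which the current low-usage streak began
    for ts, used, requested in samples:
        if requested > 0 and used < THRESHOLD * requested:
            if streak_start is None:
                streak_start = ts
            if ts - streak_start > MIN_DURATION_S:
                return True
        else:
            # Usage recovered, so the streak is no longer continuous.
            streak_start = None
    return False


# One sample per minute, each at 50% of the requested resources.
low_usage = [(t, 50.0, 100.0) for t in range(0, 361, 60)]
print(has_insufficient_resources(low_usage))
```

The streak resets whenever usage reaches the threshold again, which reflects the "continuously" condition in the rule.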

Jobs that use pay-as-you-go computing resources run in a shared resource pool, so they cannot reserve resources and must compete for them. When many jobs run concurrently, resources may be preempted by other users, which leaves a job with insufficient resources.
Jobs that use subscription computing resources may face resource shortages owing to factors such as large data volumes, high resource demands, and low job priority, which cause jobs to queue for resources.

To address this, go to the Job Insights page of the job in question. On the Resource Consumption tab, view the resource consumption of the job and the resource allocation of computing quotas at a specific point in time to identify the specific cause of the resource shortfall. Then, you can optimize task execution based on your business needs, adjust the job priority, or manage computing resources accordingly.

Data skew

Data skew is a common issue in big data computing, often manifested by the job execution progress stalling at 99%, giving the impression that the job is stuck. This phenomenon stems from an uneven distribution of data, causing some workers to complete their computations quickly while others take much longer. In the era of explosive growth in data volume, data skew can seriously affect the execution efficiency of distributed programs. Therefore, it is crucial to identify data skew issues as early as possible, analyze their causes, and address them promptly.

MaxCompute identifies data skew based on the following criteria:

The time of the longest-running worker is at least three times the average time of all workers, and the average time exceeds 30 seconds.
The input record count of any worker is at least three times the average of all workers.
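
As a rough illustration, the two criteria above can be evaluated over per-worker metrics as follows. The function name and input shapes are hypothetical, and the sketch reports each criterion separately because this topic does not state how MaxCompute combines them.

```python
from statistics import mean
from typing import Dict, Sequence


def skew_signals(
    durations_s: Sequence[float], input_records: Sequence[int]
) -> Dict[str, bool]:
    """Evaluate the two data skew criteria for the workers of one task.

    durations_s: run time of each worker, in seconds (hypothetical input).
    input_records: input record count of each worker (hypothetical input).
    """
    avg_time = mean(durations_s)
    avg_records = mean(input_records)
    return {
        # Longest worker runs at least 3x the average, and the average exceeds 30 s.
        "time_skew": avg_time > 30 and max(durations_s) >= 3 * avg_time,
        # Some worker reads at least 3x the average number of input records.
        "record_skew": avg_records > 0 and max(input_records) >= 3 * avg_records,
    }


# Four balanced workers and one straggler that receives most of the data.
print(skew_signals([40, 40, 40, 40, 400], [100, 100, 100, 100, 1000]))
# {'time_skew': True, 'record_skew': True}
```

The 30-second floor on the average keeps very short tasks from being flagged, since small absolute differences between workers can easily exceed a factor of three.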

MaxCompute provides the node name of workers experiencing data skew, allowing for troubleshooting and optimization through LogView. For more information, see Use LogView to view job information.

For more scenarios and solutions related to data skew, see Data skew tuning.

Data inflation

When the number of output records of a job exceeds ten times the number of input records, the Fuxi Task is determined to have a data inflation issue.
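
The threshold above amounts to a simple ratio check. The following sketch applies it to a set of tasks. The task names, the input format, and the handling of tasks with no input records are assumptions made for illustration, not behavior documented for MaxCompute.

```python
from typing import Dict, List, Tuple

INFLATION_FACTOR = 10  # "exceeds ten times", taken from the rule above


def inflated_tasks(records: Dict[str, Tuple[int, int]]) -> List[str]:
    """Return the names of tasks whose output exceeds 10x their input.

    records: hypothetical mapping of task name -> (input_records, output_records).
    Tasks with no input records are skipped, which is an assumption of this sketch.
    """
    return [
        name
        for name, (n_in, n_out) in records.items()
        if n_in > 0 and n_out > INFLATION_FACTOR * n_in
    ]


# Hypothetical task names; a join that multiplies rows is a typical cause.
print(inflated_tasks({"M1": (1000, 1200), "J2_1": (1000, 50000)}))
# ['J2_1']
```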

MaxCompute provides the names of the Fuxi tasks that have data inflation, allowing for troubleshooting and optimization through LogView. For more information, see Use LogView to view job information.

For more information on the causes and solutions of data inflation, see Handle data expansion.

Mode fallback

MaxCompute jobs can be executed in either MaxCompute Query Acceleration mode or normal mode.

Jobs that process large data volumes and do not need to return query results can run only in normal mode. Therefore, when resources and jobs are in a normal state, the runtime of such a job typically does not fluctuate significantly.
Interactive query jobs with smaller data volumes typically trigger query acceleration mode, in which jobs run faster than in normal mode. However, MaxCompute does not guarantee that every job enters query acceleration mode. As a result, some query acceleration jobs may fall back to normal mode, and their runtime may not meet expectations.

Mode fallback is identified based on the Task Rerun sub-status. To avoid the uncertainty of query acceleration mode, you can run a job in normal mode by adding set odps.service.mode=off; at the start of the job script. This prevents the failures and lost time caused by mode fallback.
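
If job scripts are assembled programmatically, the flag can be prepended before submission. The helper below is a minimal sketch: the function name is hypothetical, it only manipulates the script text, and it does not submit anything to MaxCompute.

```python
NORMAL_MODE_FLAG = "set odps.service.mode=off;"  # flag quoted from the text above


def force_normal_mode(script: str) -> str:
    """Return the script with the normal-mode flag as its first statement."""
    if script.lstrip().lower().startswith(NORMAL_MODE_FLAG):
        return script  # flag already present, leave the script unchanged
    return f"{NORMAL_MODE_FLAG}\n{script}"


print(force_normal_mode("SELECT COUNT(*) FROM sale_detail;"))
```

The table name `sale_detail` is only a placeholder. Checking for an existing flag keeps the helper safe to apply more than once to the same script.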

Job error message diagnostics

For failed jobs, MaxCompute correlates the error message with the type of error and provides relevant descriptions and solutions. Only some SQL error types are covered. For failed jobs without diagnostic results, see Error code overview to identify and address the issue.

If you have any questions or require assistance, please join the MaxCompute developer community (DingTalk group number: 11782920) or contact us through your exclusive DingTalk group.

References

For more information about long-period metrics, see Optimize the calculation for long-period metrics.
For more information about how to use the Job Insights feature in the MaxCompute console to analyze job-level resource consumption and optimize runtime, see Best practices for job-level resource analysis.