When a MaxCompute SQL job runs slower than expected, the cause is often a mismatch between its concurrency and your available resources. For example, generating too many concurrent tasks can overwhelm your resource quota, causing tasks to queue and increasing scheduling overhead. Conversely, generating too few tasks fails to fully use available resources, leaving them idle while your job takes longer to complete. The split_size hint lets you control the data split size for a table, which in turn determines the number of map tasks. The value is specified in MB, with a default of 256 MB. This lets you fine-tune concurrency: increase it to maximize resource use or decrease it to alleviate contention, ultimately shortening the job's execution time. To adjust the concurrency, add the hint /*+split_size(value_in_mb)*/ after the table name in your query.
Usage notes
The query optimizer ignores the
split_sizehint for a clustered table if it uses the table's bucketing properties for optimization.For optimal performance, it is recommended to set the
split_sizevalue in multiples of 256 MB (e.g., 512 MB). The value is specified in megabytes.If a query reads the same table multiple times with different
split_sizehints, MaxCompute uses the smallest value specified.If one hint specifies
1and another specifies10, MaxCompute uses1.If one hint specifies
1and another is omitted, MaxCompute still uses1.
Use cases
To reduce task queuing when resources are limited: If a job has many subtasks waiting for resources, increase the split size. This reduces the total number of subtasks (concurrency), which lowers scheduling overhead and can shorten the overall job runtime.
To increase parallelism with idle resources: If a job has low concurrency while cluster resources are idle, decrease the split size. This creates more subtasks that can run in parallel, improving resource utilization and shortening the total job runtime.
Example
-- Set the split size to 1 MB. This setting indicates that a job is split into subtasks based on a size of 1 MB when data in the src table is read.
SELECT a.key FROM src a /*+split_size(1)*/ JOIN src2 b ON a.key=b.key;