Split Size hint - MaxCompute - Alibaba Cloud Documentation Center

When a MaxCompute SQL job runs slower than expected, the cause is often a mismatch between its concurrency and your available resources. For example, generating too many concurrent tasks can overwhelm your resource quota, causing tasks to queue and increasing scheduling overhead. Conversely, generating too few tasks fails to fully use available resources, leaving them idle while your job takes longer to complete. The split_size hint lets you control the data split size for a table, which in turn determines the number of map tasks. The value is specified in MB, with a default of 256 MB. This lets you fine-tune concurrency: increase it to maximize resource use or decrease it to alleviate contention, ultimately shortening the job's execution time. To adjust the concurrency, add the hint /*+split_size(value_in_mb)*/ after the table name in your query.

Usage notes

The query optimizer ignores the split_size hint for a clustered table if it uses the table's bucketing properties for optimization.
For optimal performance, it is recommended to set the split_size value in multiples of 256 MB (e.g., 512 MB). The value is specified in megabytes.
If a query reads the same table multiple times with different split_size hints, MaxCompute uses the smallest value specified.
- If one hint specifies 1 and another specifies 10, MaxCompute uses 1.
- If one hint specifies 1 and another is omitted, MaxCompute still uses 1.

Use cases

To reduce task queuing when resources are limited: If a job has many subtasks waiting for resources, increase the split size. This reduces the total number of subtasks (concurrency), which lowers scheduling overhead and can shorten the overall job runtime.
To increase parallelism with idle resources: If a job has low concurrency while cluster resources are idle, decrease the split size. This creates more subtasks that can run in parallel, improving resource utilization and shortening the total job runtime.

Example

-- Set the split size to 1 MB. This setting indicates that a job is split into subtasks based on a size of 1 MB when data in the src table is read.
SELECT a.key FROM src a /*+split_size(1)*/ JOIN src2 b ON a.key=b.key;