Since the launch of Remote Shuffle Service (RSS) in 2020, Alibaba Cloud EMR has helped many customers address the performance and stability problems of Spark jobs and implement a storage-compute separation architecture. Alibaba Cloud open-sourced RSS in early 2022 to make it easier to use and extend, and all developers are welcome to help build it. Please refer to [1] for the overall architecture of RSS. This article introduces the two latest important features of RSS: support for Adaptive Query Execution (AQE) and throttling.
Adaptive Query Execution (AQE) is an important feature of Spark 3 [2]. It collects runtime statistics and dynamically adjusts the subsequent execution plan, addressing the problem that the optimizer generates poor plans when it cannot estimate statistics accurately. AQE covers three main optimization scenarios: partition coalescing, Join strategy switching, and skew Join optimization. All three impose new requirements on the capabilities of the shuffle framework.
The purpose of partition coalescing is to make the amount of data processed by each reducer moderate and as even as possible. First, the mappers perform the shuffle write with a relatively large number of partitions. The AQE framework then counts the size of each partition, and if several adjacent partitions are small, they are merged into one and handed to a single reducer. Here is the procedure:
According to the figure above, the optimized Reducer 2 needs to read the data that originally belonged to Reducers 2-4. The requirement for the shuffle framework is that ShuffleReader supports reading a range of partitions:
def getReader[K, C](
    handle: ShuffleHandle,
    startPartition: Int,
    endPartition: Int,
    context: TaskContext): ShuffleReader[K, C]
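To make the coalescing rule concrete, here is a minimal, self-contained Scala sketch; it is not RSS or Spark code, and the partition sizes and target size are hypothetical. It groups adjacent small partitions until the accumulated size reaches the target, and each resulting [start, end) range would then be handed to one reducer via a range read through an interface like the one above:

object CoalesceSketch {
  // Hypothetical post-shuffle partition sizes in bytes, as counted by the AQE framework.
  val partitionSizes: Array[Long] = Array(64L << 20, 2L << 20, 3L << 20, 1L << 20, 80L << 20)
  val targetSize: Long = 64L << 20 // hypothetical advisory target size (64 MB)

  // Returns [start, end) partition ranges; each range is read by a single reducer.
  def coalesce(sizes: Array[Long], target: Long): Seq[(Int, Int)] = {
    val ranges = Seq.newBuilder[(Int, Int)]
    var start = 0
    var acc = 0L
    for (i <- sizes.indices) {
      acc += sizes(i)
      if (acc >= target) {                       // close the current group once it is large enough
        ranges += ((start, i + 1))
        start = i + 1
        acc = 0L
      }
    }
    if (start < sizes.length) ranges += ((start, sizes.length)) // leftover tail group
    ranges.result()
  }

  def main(args: Array[String]): Unit =
    // Prints List((0,1), (1,5)): one reducer reads partition 0, another reads partitions 1-4.
    println(coalesce(partitionSizes, targetSize))
}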
The purpose of Join strategy switching is to correct cases where, due to inaccurate statistics estimation, the optimizer chooses SortMerge Join or ShuffleHash Join when Broadcast Join should have been used. Specifically, after the two joined tables finish their shuffle writes, the AQE framework counts their actual sizes. If the small table meets the Broadcast Join criteria, it is broadcast out and joined with the local shuffle data of the large table. Here are the steps:
Switching the Join strategy brings two optimizations:
For the second optimization, the new requirement for the shuffle framework is support for local reads.
The purpose of skew Join optimization is to let skewed partitions be handled by more reducers to avoid long tails. Specifically, after the shuffle write ends, the AQE framework counts the size of each partition and determines whether it is skewed according to specific rules. If it is, the partition is divided into multiple splits, and each split is joined with the corresponding partition of the other table (as shown in the following figure):
Partitions are split by accumulating the sizes of the shuffle outputs in MapId order; a split is cut whenever the accumulated value exceeds a threshold. The new requirement for the shuffle framework is that ShuffleReader supports reading a range of MapIds. Combined with the range-partition requirement of partition coalescing, the ShuffleReader interface evolves to:
def getReader[K, C](
    handle: ShuffleHandle,
    startMapIndex: Int,
    endMapIndex: Int,
    startPartition: Int,
    endPartition: Int,
    context: TaskContext,
    metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C]
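To illustrate the splitting rule described above, here is a minimal, self-contained Scala sketch; it is not RSS or Spark code, and the per-map output sizes and the threshold are hypothetical. It accumulates the sizes of the skewed partition's map outputs in MapId order and cuts a new split whenever the running total exceeds the threshold; each resulting [startMapIndex, endMapIndex) range would be read by one sub-reducer through the evolved interface above:

object SkewSplitSketch {
  // Hypothetical per-map output sizes (bytes) of one skewed partition, indexed by MapId.
  val mapOutputSizes: Array[Long] = Array(40L << 20, 10L << 20, 90L << 20, 30L << 20, 70L << 20)
  val splitThreshold: Long = 64L << 20 // hypothetical split threshold (64 MB)

  // Returns [startMapIndex, endMapIndex) ranges, one per split.
  def split(sizes: Array[Long], threshold: Long): Seq[(Int, Int)] = {
    val splits = Seq.newBuilder[(Int, Int)]
    var start = 0
    var acc = 0L
    for (mapId <- sizes.indices) {
      acc += sizes(mapId)
      if (acc > threshold) {                     // running total exceeds the threshold: cut a split
        splits += ((start, mapId + 1))
        start = mapId + 1
        acc = 0L
      }
    }
    if (start < sizes.length) splits += ((start, sizes.length)) // remaining maps form the last split
    splits.result()
  }

  def main(args: Array[String]): Unit =
    // Prints List((0,3), (3,5)): two sub-reducers, each reading its own MapId range.
    println(split(mapOutputSizes, splitThreshold))
}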
The core design of RSS is push-based shuffle and partition data aggregation: different mappers push data belonging to the same partition to the same worker for aggregation, and reducers read the aggregated files directly (as shown in the following figure):
In addition to the core design, RSS implements multiple replicas, full-link fault tolerance, Primary HA, disk fault tolerance, an adaptive Pusher, rolling upgrades, and other features. Please see [1] for details.
The requirement that partition coalescing places on the shuffle framework is support for range partitions. Since each partition corresponds to one file in RSS, this is naturally supported (as shown in the following figure):
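A tiny sketch may help show why this comes for free. Assuming a hypothetical one-file-per-partition layout (the path scheme below is invented for illustration only), a range read of partitions [start, end) is simply a sequential scan of the corresponding files:

object RangePartitionSketch {
  // Hypothetical layout: one aggregated file per partition on the RSS worker.
  def partitionFile(shuffleId: Int, partitionId: Int): String =
    s"/rss-data/shuffle-$shuffleId/partition-$partitionId" // invented path scheme

  // Files a coalesced reducer reading partitions [start, end) would scan in order.
  def filesForRange(shuffleId: Int, startPartition: Int, endPartition: Int): Seq[String] =
    (startPartition until endPartition).map(partitionFile(shuffleId, _))

  def main(args: Array[String]): Unit =
    // Reducer 2 from the coalescing example reads the files of partitions 2, 3, and 4.
    filesForRange(shuffleId = 0, startPartition = 2, endPartition = 5).foreach(println)
}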
The requirement that Join strategy switching places on the shuffle framework is support for a LocalShuffleReader. Because RSS is remote by nature, shuffle data is stored in the RSS cluster and only exists locally when the RSS and compute clusters are co-located, so local reads are not supported for now; they will be optimized and supported for co-located deployments in the future. Note that although local reads are not supported, the Join rewriting itself is not affected. The following figure shows that RSS supports the Join rewriting optimization:
Among the three AQE scenarios, Join skew optimization is the most difficult for RSS to support. The core design of RSS is partition data aggregation, whose purpose is to turn the random reads of shuffle read into sequential reads, thereby improving performance and stability. Multiple mappers push data to an RSS worker concurrently, and RSS flushes the data to disk after aggregating it in memory, so the data from different mappers within a partition file is unordered (as shown in the following figure):
Join skew optimization requires reading a range of map outputs, for example the data of Map 1-2. There are two common approaches:
The problems with these two methods are clear-cut. Method 1 results in a large number of redundant disk reads, while Method 2 essentially falls back to random reads, losing the core advantage of RSS. In addition, because it is difficult to accurately predict during shuffle write whether skew will occur, the index file becomes a general overhead paid even for non-skewed data.
We proposed a new design to solve the two problems above: Active Split and Sort on Read.
A skewed partition is very likely to be large, and in extreme cases it can fill up the disk. Even in non-skewed scenarios, large partitions are still fairly common. Therefore, from the perspective of disk load balancing, it is necessary to monitor the size of partition files and split them actively. The default threshold is 256 MB.
When a split occurs, RSS reassigns a pair of workers (a primary and a secondary replica) for the current partition, and subsequent data is pushed to the new workers. To avoid the split affecting running mappers, we proposed a soft split method: when a split is triggered, RSS asynchronously prepares the new workers and, once they are ready, hot-updates the mappers' PartitionLocation information. As a result, the mappers' PushData is never interrupted. The following figure shows the whole process:
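In addition to the figure, here is a minimal, self-contained Scala sketch of the soft split idea; it is not actual RSS code, and all names are hypothetical. Crossing the 256 MB threshold asynchronously prepares a new pair of workers and hot-updates the PartitionLocation once they are ready, so in-flight PushData calls are never blocked:

import java.util.concurrent.atomic.{AtomicBoolean, AtomicLong, AtomicReference}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

final case class PartitionLocation(primaryWorker: String, replicaWorker: String)

final class SoftSplitSketch(initial: PartitionLocation) {
  private val splitThreshold: Long = 256L << 20        // 256 MB default, as in the article
  private val written   = new AtomicLong(0L)           // bytes written to the current partition file
  private val splitting = new AtomicBoolean(false)     // at most one split in flight
  private val location  = new AtomicReference(initial) // where mappers currently push

  def currentLocation: PartitionLocation = location.get() // mappers read this before each push

  // Called on the write path; it never blocks the mapper's PushData.
  def recordPush(pushedBytes: Int): Unit =
    if (written.addAndGet(pushedBytes) > splitThreshold && splitting.compareAndSet(false, true)) {
      Future {                                           // prepare the new workers asynchronously
        val next = PartitionLocation("worker-new-a", "worker-new-b") // hypothetical reassignment
        location.set(next)                               // hot-update; later pushes go to the new pair
        written.set(0L)
        splitting.set(false)
      }
    }
}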
RSS adopts a Sort on Read strategy to avoid random reads. Specifically, the first range read of a file split triggers sorting (non-range reads do not), and the sorted file is written back to disk together with its position index. This ensures that subsequent range reads are sequential reads (as shown in the following figure):
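In addition to the figure, the following minimal, self-contained Scala sketch shows the Sort on Read idea; it is not actual RSS code, blocks are modeled in memory, and their contents are hypothetical. The first access sorts the blocks by MapId and builds a MapId-to-offset index, after which any MapId range maps to one contiguous region that can be read sequentially:

object SortOnReadSketch {
  final case class Block(mapId: Int, bytes: Array[Byte]) // one pushed block (about 256 KB in RSS)

  // Assumes one block per MapId for simplicity.
  final class SortedSplit(blocks: Seq[Block]) {
    // "Sort on read": order blocks by MapId and lay them out contiguously.
    private val sorted: Array[Byte] = blocks.sortBy(_.mapId).flatMap(_.bytes.toSeq).toArray
    // Position index: MapId -> (offset, length) inside the sorted data.
    private val index: Map[Int, (Int, Int)] = {
      var offset = 0
      blocks.sortBy(_.mapId).map { b =>
        val entry = b.mapId -> (offset, b.bytes.length)
        offset += b.bytes.length
        entry
      }.toMap
    }

    // A range read of [startMapIndex, endMapIndex) is one contiguous, sequential slice.
    def readRange(startMapIndex: Int, endMapIndex: Int): Array[Byte] = {
      val entries = (startMapIndex until endMapIndex).flatMap(index.get)
      if (entries.isEmpty) Array.emptyByteArray
      else {
        val begin = entries.map(_._1).min
        val end   = entries.map { case (off, len) => off + len }.max
        java.util.Arrays.copyOfRange(sorted, begin, end) // sequential access, no random reads
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Blocks arrive unordered (Map 2, Map 0, Map 1), as in the figure above.
    val split = new SortedSplit(Seq(Block(2, Array[Byte](7, 8)), Block(0, Array[Byte](1)), Block(1, Array[Byte](4, 5, 6))))
    println(split.readRange(0, 2).toSeq) // the data of Map 0-1: 1, 4, 5, 6
  }
}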
To avoid multiple sub-reducers waiting for the sorting of the same file split, we shuffle the order in which each sub-reducer reads the splits (as shown in the following figure):
Thanks to Sort on Read, redundant reads and random reads are effectively avoided, but each split file (256 MB) needs to be sorted. This section discusses the implementation and overhead of this sorting. File sorting involves three steps: reading the file, sorting by MapId, and writing the file. Since the default RSS block size is 256 KB, a split contains roughly 1,000 blocks, so the in-memory sorting is very fast and the main overhead comes from reading and writing the file. There are three possible schemes for the whole sorting process:
From an I/O perspective, at first glance, Scheme 1 involves only sequential reads and writes (given sufficient memory), Scheme 2 involves random reads and writes, and Scheme 3 involves random writes, so Scheme 1 should intuitively perform best. However, thanks to the PageCache, the original file is likely still cached in the PageCache when Scheme 3 writes its output, so Scheme 3 performed better in our tests (as shown in the following figure):
At the same time, Scheme 3 does not occupy any additional process memory, so RSS adopts Scheme 3. We also tested and compared Sort on Read against the random-read approach described above, which does not sort but only builds an index (as shown in the following figure):
The following figure shows the overall process of RSS support for Join skew optimization:
The main purpose of throttling is to prevent the memory of RSS workers from being exhausted. There are usually two ways to implement throttling:
PushData is a very high-frequency, performance-critical operation, so performing an additional RPC interaction for every push would be too expensive. We therefore adopted a backpressure strategy. From a worker's perspective, incoming data has two sources:
As shown in the following figure, Worker 2 receives both the data of Partition 3 pushed by mappers and the replica data of Partition 1 sent by Worker 1, and it forwards the data of Partition 3 to the corresponding secondary replica.
Data pushed from mappers is released only when all of the following conditions are met:
Data pushed from the primary replica is only released if the following condition is met:
When designing the throttling strategy, we need to consider both throttling (reducing data inflow) and draining (releasing memory in time). Specifically, we defined two high-level memory thresholds, corresponding to 85% and 95% memory usage, and one low-level threshold, corresponding to 50% memory usage. When the first high-level threshold is reached, throttling is triggered: the worker suspends receiving data pushed by mappers and, at the same time, forces a flush to disk to drain memory. However, limiting only the inflow from mappers does not control the traffic from the primary replica, so we defined the second high-level threshold: when it is reached, the worker also suspends receiving data sent by the primary replica. When memory usage drops below the low-level threshold, the normal state is restored. The following figure shows the whole process:
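In addition to the figure, here is a minimal, self-contained Scala sketch of the two-tier backpressure logic; it is not actual RSS code, and apart from the 85%/95%/50% thresholds taken from the description above, all names and hooks are hypothetical:

final class BackpressureSketch(capacityBytes: Long, forceFlush: () => Unit) {
  private val pauseMapperPush  = 0.85 * capacityBytes // first high-level threshold
  private val pauseReplicaPush = 0.95 * capacityBytes // second high-level threshold
  private val resumeThreshold  = 0.50 * capacityBytes // low-level threshold

  @volatile private var mapperPaused  = false
  @volatile private var replicaPaused = false

  def acceptFromMapper: Boolean  = !mapperPaused  // gate for data pushed by mappers
  def acceptFromPrimary: Boolean = !replicaPaused // gate for data replicated from the primary

  // Called whenever the worker's in-memory shuffle data size changes.
  def onMemoryUsage(usedBytes: Long): Unit = {
    if (usedBytes >= pauseMapperPush && !mapperPaused) {
      mapperPaused = true
      forceFlush()                                // force a flush to disk to drain memory
    }
    if (usedBytes >= pauseReplicaPush) replicaPaused = true
    if (usedBytes < resumeThreshold) {            // back below the low level: restore normal state
      mapperPaused  = false
      replicaPaused = false
    }
  }
}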
We compared the AQE performance of RSS with that of the native External Shuffle Service (ESS) on Spark 3.2.0. RSS was deployed in mixed (co-located) mode and did not occupy any additional machine resources; it used only 8 GB of memory, which is just 2.3% of each machine's 352 GB. The test environment is described below:
Hardware:
Header machine group: 1x ecs.g5.4xlarge
Worker machine group: 8x ecs.d2c.24xlarge (96 CPU, 352 GB memory, 12x 3,700 GB HDD)
Spark AQE-Related Configurations:
spark.sql.adaptive.enabled true
spark.sql.adaptive.coalescePartitions.enabled true
spark.sql.adaptive.coalescePartitions.initialPartitionNum 1000
spark.sql.adaptive.skewJoin.enabled true
spark.sql.adaptive.localShuffleReader.enabled false
RSS-Related Configurations:
RSS_PRIMARY_MEMORY=2g
RSS_WORKER_MEMORY=1g
RSS_WORKER_OFFHEAP_MEMORY=7g
We ran the 10 TB TPC-DS benchmark. End to end, ESS took 11,734 s, while RSS with a single replica and with two replicas took 8,971 s and 10,110 s, respectively, which is 23.5% and 13.8% faster than ESS (as shown in the following figure). With two replicas enabled, the network bandwidth reached its upper limit, which is the main reason the two-replica result is slower than the single-replica one.
The time of each query is compared below:
All developers are welcome to join the discussion and development of RSS.
GitHub: https://github.com/alibaba/RemoteShuffleService
Adaptive Query Execution: Speeding Up Spark SQL at Runtime: https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html