Remote Shuffle Service (RSS) is an extension provided by Alibaba Cloud E-MapReduce (EMR) to improve the stability and performance of Spark Shuffle. This topic describes how to associate a Spark cluster with a Shuffle Service cluster on the EMR on ACK page.
Background information
Spark Shuffle in Container Service for Kubernetes (ACK) clusters encounters the following issues:
- Spark Shuffle requires local storage. If a node uses compute-storage separation or has no local disks, such as an elastic container instance, you must purchase and attach disks. This increases costs and reduces efficiency.
- Spark 2 does not support dynamic allocation. Spark 3 supports dynamic allocation based on shuffle tracking, but idle executors are released inefficiently.
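For reference, shuffle-tracking-based dynamic allocation in Spark 3 is controlled by standard Spark properties. The following is a minimal sketch that renders those properties as `spark-submit` options. The executor limits are illustrative assumptions, not recommendations, and `to_submit_args` is a hypothetical helper that is not part of Spark or EMR.

```python
# Standard Spark 3 properties that enable dynamic allocation without an
# external shuffle service. The min/max executor values are assumptions
# chosen only for illustration.
DYNAMIC_ALLOCATION_CONF = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "10",
}


def to_submit_args(conf):
    """Render a property dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(DYNAMIC_ALLOCATION_CONF)))
```

With shuffle tracking enabled, Spark tries to keep alive the executors that still hold shuffle data for active jobs, which is why executors are released slowly in this mode.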
Spark Shuffle also has the following disadvantages:
- If a shuffle write task processes a large amount of data, the data spills to disk. This causes write amplification.
- A shuffle read task generates a large number of small network packets. This can cause connection resets.
- A shuffle read task issues a large number of small I/O requests and random reads. This causes high disk and CPU loads.
- If a job uses thousands of mappers (M) and reducers (N), a large number of connections are generated, which makes it difficult for the job to run. The number of connections is M × N.
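To give a sense of scale for the M × N formula, the following minimal sketch computes the connection count for a few job sizes. The function name `shuffle_connections` and the sample sizes are assumptions made for illustration.

```python
def shuffle_connections(mappers: int, reducers: int) -> int:
    """Return the number of shuffle connections for M mappers and N reducers (M x N)."""
    if mappers < 0 or reducers < 0:
        raise ValueError("mappers and reducers must be non-negative")
    return mappers * reducers


if __name__ == "__main__":
    # Sample job sizes chosen only to illustrate how quickly M x N grows.
    for m, n in [(100, 100), (1000, 2000), (5000, 5000)]:
        print(f"M={m}, N={n}: {shuffle_connections(m, n):,} connections")
```

For example, a job with 1,000 mappers and 2,000 reducers generates 2,000,000 connections.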
RSS provided by EMR overcomes the disadvantages of Spark Shuffle and supports dynamic allocation in ACK clusters. For more information, see RSS.
Prerequisites
A Spark cluster is created on the EMR on ACK page. For more information, see Step 2: Create a cluster.
A Shuffle Service cluster is created on the EMR on ACK page. For more information, see Step 2: Create a cluster.
Limits
A Spark cluster can be associated only with a Shuffle Service cluster that is deployed in the same ACK cluster.
If you want to associate a Spark cluster with a Shuffle Service cluster, make sure that the major EMR versions of the clusters are the same to prevent compatibility issues. You can view the versions of the clusters on the Cluster Details tab.
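The major-version check can also be scripted once you have copied both version labels from the Cluster Details tab. The following is a minimal sketch. It assumes that a label contains a dotted version number whose first component is the major version. The sample labels and the helper names `major_version` and `same_major_version` are hypothetical.

```python
import re


def major_version(label: str) -> int:
    """Extract the first component of the first dotted version number in a label."""
    match = re.search(r"(\d+)\.\d+", label)
    if match is None:
        raise ValueError(f"no version number found in {label!r}")
    return int(match.group(1))


def same_major_version(spark_label: str, shuffle_label: str) -> bool:
    """Return True if both cluster version labels share the same major version."""
    return major_version(spark_label) == major_version(shuffle_label)


if __name__ == "__main__":
    # Hypothetical labels; use the values shown on the Cluster Details tab.
    print(same_major_version("EMR-5.6.0-ack", "EMR-5.8.0-ack"))
    print(same_major_version("EMR-5.6.0-ack", "EMR-3.40.0-ack"))
```

Adjust the regular expression if the labels in your console use a different format.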
Procedure
Log on to the EMR console. In the left-side navigation pane, click EMR on ACK.
Associate a Spark cluster with a Shuffle Service cluster.
On the EMR on ACK page, find the Spark cluster that you created and click the name of the Spark cluster in the Cluster ID/Name column.
In the Basic Information section of the Cluster Details tab, click Associate Now to the right of Associate RSS Cluster.
In the Associated Cluster section of the Service Details tab, click Add.
In the Associated Cluster dialog box, select the Shuffle Service cluster that you created from the Cluster drop-down list and click Associate.
Optional. Configure the parameters of RSS. For more information, see Parameters.