Disclaimer: This is a translated work of Qinxia's 漫谈分布式系统. All rights reserved to the original author.
In the previous article, we discussed the poor performance of MapReduce caused by its large number of I/O operations, and introduced Spark, a memory-based distributed computing framework, and Alluxio, a memory-based distributed storage middleware, as ways to reduce those I/O operations.
These two frameworks significantly reduce I/O pressure and improve performance. However, the previous article looked at the problem at a fairly coarse granularity. In this article, we go deeper into the execution process to see whether there is further room for optimization.
The previous article also showed this diagram of the MapReduce execution process. Many of the I/O operations are concentrated in the shuffle process.
Today, let's take a look at the shuffle process in the two typical distributed computing frameworks: MapReduce and Spark.
To make the description easier to follow, we will use Spark terminology and the typical map → reduce process throughout.
We can roughly divide the execution process of the program into the following three phases:
Since we want to optimize I/O, we need to pay attention to the files and focus on the shuffle phase, so we can draw a diagram like this:
The entire shuffle phase is split into two steps: shuffle write, where the upstream map tasks write out their results, and shuffle read, where the downstream reduce tasks pull them.
Since shuffle read comes after shuffle write, it is relatively passive: no matter how many files the upstream writes, or how it writes them, the downstream has to deal with them accordingly. So let's focus on shuffle write.
Shuffle, as the name implies, is to disperse data and then distribute it to the corresponding downstream.
Therefore, the key is to prepare data independently for each downstream task, which means grouping together all the data destined for the same reducer.
The most straightforward way is for each mapper to generate one file per reducer and write the corresponding data into it. This is known as Hash Shuffle, because a hash of the key decides which reducer, and therefore which file, a record goes to.
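To make the routing rule concrete, here is a minimal sketch in Java; the class and method names are made up for illustration, but Spark's built-in HashPartitioner follows the same non-negative-modulo idea.

class HashShuffleRouting {
    // Decide which reducer (and therefore which output file) a record belongs to.
    static int reducerFor(Object key, int numReducers) {
        // floorMod keeps the result non-negative even when hashCode() is negative
        return Math.floorMod(key.hashCode(), numReducers);
    }
}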
Spark 0.8 introduced Hash Shuffle and used it as the default shuffle method until version 1.2.
Since shuffle read pulls data by partition, a more thorough approach would be for each mapper to generate one file per partition. However, that would produce even more files, so instead all partitions handled by the same reducer are written into one file.
Since we only focus on the shuffle write step, we can zoom in (as if looking at a map) to see how the mappers on a single machine write their output to files.
As shown in the preceding figure, two Executors are allocated to a machine, and each Executor executes two map tasks successively. All maps write the processing results to the local hard disk.
This way, by the end of the map phase, there will be M*R files in total on the local disks, where M is the number of mappers and R is the number of reducers.
On the whole, a large job with 40K mappers and 20K reducers would generate 800 million files.
From the perspective of a cluster, running many such tasks at the same time will leave unbearable pressure on the disk and network of all machines.
Therefore, optimization is required.
As mentioned in the fifth article of the series, one of the two purposes of building distributed systems is to improve the parallelism of computing for faster calculation.
However, computing resources are limited, so we usually cannot run all mappers and reducers at once. Instead, we run one batch at a time with limited concurrency, then the next batch, until all tasks have finished.
As shown in the preceding figure, each Executor runs two maps, but at any given moment only one of them is running. Therefore, it is entirely possible for these maps to share output files. Since the downstream pulls and processes data by partition, the files should be shared per partition.
As shown above, we can call it Consolidated Hash Shuffle.
The Executor maintains a file pool in which one file is opened for each reducer. When a map task runs, it takes these files from the pool and writes its data into them; when it finishes, it returns them to the pool. Then the next map task starts and repeats the process.
In this case, the number of files on the local hard disk is reduced to P*R, where P is the number of concurrent tasks. For a single machine, P can be estimated as the number of CPU cores (C) divided by the number of cores each map task needs (T), that is, P = C/T.
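To make the formula concrete with made-up numbers: on a machine with C = 16 cores where each map task needs T = 2 cores, P = 16 / 2 = 8 map tasks run at a time. With R = 20,000 reducers, that machine holds 8 * 20,000 = 160,000 shuffle files, no matter how many thousands of map tasks eventually run on it.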
Spark added this consolidation mechanism to Hash Shuffle in version 0.8.1, so that Hash Shuffle would not be written off because of this performance issue.
Consolidated Hash Shuffle does improve performance: the M in the original M*R is replaced by the number of concurrent tasks, so the number of files no longer depends on the number of maps.
However, when the number of reducers is large, the number of files is still large. Is it possible to decouple the number of files from the number of reducers?
As the figure above shows, the number of files is tied to the number of reducers because we create one file per reducer. To break this one-to-one relationship, we can write the data of multiple reducers, or even of all reducers, into the same file.
However, this mixes all the data together, so in the shuffle read phase a reducer can no longer directly locate its own data the way it can with Hash Shuffle.
Having every shuffle read scan through entire files would clearly be too inefficient.
The natural solution, borrowed from databases, is to build an index so that we know where the desired data is.
However, shuffle read does not query data by value the way a database does; it pulls data by partition, and each reducer may pull many partitions. Therefore, it is better to use the partition, rather than the reducer, as the unit of the index (in Hash Shuffle the reducer was used as the unit in order to reduce the number of files).
Since the index is organized by partition, the data layout must match it, which means the records belonging to the same partition must be stored together.
Therefore, a sort with the partition as the key is required: all data is sorted by partitionId and then written to a single file.
The figure below shows Sort Shuffle:
At this point, the map phase outputs 2M files: M data files plus M index files.
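As a rough illustration of how the pair of files fits together, here is a hedged sketch (class and method names are invented, not Spark source): the data file holds records grouped by partition, and the index file stores the cumulative byte offset at which each partition starts.

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

class IndexedShuffleOutput {
    // partitionLengths[i] = number of bytes partition i occupies in the data file
    static void writeIndex(File indexFile, long[] partitionLengths) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(indexFile)))) {
            long offset = 0L;
            out.writeLong(offset);        // partition 0 starts at byte 0
            for (long length : partitionLengths) {
                offset += length;
                out.writeLong(offset);    // cumulative offset = start of the next partition
            }
        }
    }
    // A reducer that wants partition p later reads offsets[p] and offsets[p + 1]
    // from the index file and fetches exactly that byte range of the data file.
}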
Since sorting is involved, the data cannot simply be appended to the file, and memory may not hold all of it. Therefore, the data is sorted in batches and spilled to the hard disk. This produces many spill files, which then have to be merged and sorted into one file.
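The merge step can be pictured as a k-way merge over spill files that are each already sorted by partitionId. Below is a simplified, in-memory sketch with invented names; the real implementation streams from disk rather than holding everything in lists.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class SpillMerger {
    // A shuffle record reduced to (partitionId, payload) for illustration.
    static class Rec {
        final int partitionId;
        final String payload;
        Rec(int partitionId, String payload) { this.partitionId = partitionId; this.payload = payload; }
    }

    // k-way merge of spill files that are each already sorted by partitionId
    static List<Rec> merge(List<List<Rec>> spills) {
        // heap entries are {spillIndex, positionInSpill}, ordered by the record's partitionId
        Comparator<int[]> byPartition =
                Comparator.comparingInt(e -> spills.get(e[0]).get(e[1]).partitionId);
        PriorityQueue<int[]> heap = new PriorityQueue<>(byPartition);
        for (int i = 0; i < spills.size(); i++) {
            if (!spills.get(i).isEmpty()) heap.add(new int[]{i, 0});
        }
        List<Rec> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            merged.add(spills.get(top[0]).get(top[1]));
            if (top[1] + 1 < spills.get(top[0]).size()) {
                heap.add(new int[]{top[0], top[1] + 1});  // advance within that spill file
            }
        }
        return merged;
    }
}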
In addition, consider the processing in the reduce phase, where data with the same key needs to be aggregated together. If records with the same key are scattered everywhere, they must be cached in memory until all of them have been collected before the calculation can finish, which matters especially for operations such as avg(). Even with incremental aggregation (avg() can be computed incrementally by keeping a (sum, count) pair), the intermediate results for all keys quickly occupy a large amount of memory, and OOM becomes likely.
However, if records with the same key are already placed together, we can finish the calculation for one key at a time, append the result to the file, and clear the cache. To decide whether a key has ended, we only need to check whether the current key differs from the one being processed. And, as noted above, putting the same keys together requires sorting; the sketch below illustrates this.
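Here is a small, self-contained sketch (invented names, plain Java rather than Spark code) of the one-pass aggregation that sorted keys make possible: a single running (sum, count) pair is enough, and a key can be emitted and forgotten as soon as the next key differs.

import java.util.List;
import java.util.Map;

class SortedAvg {
    // Input records are assumed to arrive already sorted (grouped) by key.
    static void averages(List<Map.Entry<String, Double>> sortedByKey) {
        String current = null;
        double sum = 0;
        long count = 0;
        for (Map.Entry<String, Double> e : sortedByKey) {
            if (!e.getKey().equals(current)) {
                if (current != null) System.out.println(current + " -> " + sum / count);
                current = e.getKey();   // key changed: the previous key is finished
                sum = 0;
                count = 0;
            }
            sum += e.getValue();
            count++;
        }
        if (current != null) System.out.println(current + " -> " + sum / count);
    }

    public static void main(String[] args) {
        averages(List.of(Map.entry("a", 1.0), Map.entry("a", 3.0), Map.entry("b", 2.0)));
        // prints: a -> 2.0 then b -> 2.0
    }
}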
In addition, many application scenarios do require sorting anyway. Therefore, sorting by both partitionId and data key becomes the default behavior.
In the shuffle read phase, a reduce task has to read data from many maps and may not be able to hold it all in memory, so it can only perform an external sort (such as a multi-way merge sort). This is why local sorting is done in the shuffle write phase: it reduces the sorting pressure in the shuffle read phase.
The Sort Shuffle diagram for the sort, spill, and merge processes is listed below:
Sort Shuffle is the default shuffle in MapReduce. After its Hash Shuffle attempt, Spark implemented Sort Shuffle in version 1.1, modeled on MapReduce, made it the default in version 1.2, and officially removed Hash Shuffle in version 2.0.
Readers following the reasoning closely may already see a further optimization: combine the decoupling from the number of maps in Consolidated Hash Shuffle with the decoupling from the number of reduces in Sort Shuffle.
This way, the number of files is related only to the degree of parallelism and is further reduced to 2P, free of both the number of maps and the number of reduces. We can call it Consolidated Sort Shuffle.
The trade-off is that data belonging to the same partition may now appear in multiple places in the data file and index file, which has to be handled in the shuffle read phase.
Since the 2M files of Sort Shuffle are acceptable in most scenarios, the extra complexity of Consolidated Sort Shuffle may not pay off, so it has not been widely adopted. It is still worth understanding for the continuity of the idea, though.
In addition, we have to admit that the multiple rounds of sorting in Sort Shuffle slow things down, especially when the numbers of maps and reduces are small and sorting takes up a large share of the execution time. Sorting is also simply unnecessary for applications that do not need it.
Therefore, a judgment can be added to Sort Shuffle: when the number of partitions exceeds a certain threshold, the sort-based path is used; otherwise, a hash-based write is used, followed by a non-sorting merge, so the output is still a single data file plus its index file.
Spark implements this as different ShuffleWriters: SortShuffleWriter and BypassMergeSortShuffleWriter.
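For reference, the threshold that drives this choice is exposed through Spark configuration. The sketch below sets it explicitly via SparkConf; 200 is the commonly documented default, but verify against your Spark version, and note that the bypass path also requires that no map-side aggregation is needed.

import org.apache.spark.SparkConf;

public class ShuffleTuning {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("shuffle-tuning-demo")
                // with at most this many reduce partitions (and no map-side aggregation),
                // the bypass writer hash-writes per-partition files and simply concatenates them
                .set("spark.shuffle.sort.bypassMergeThreshold", "200");
    }
}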
Sort Shuffle has been working well, but some assumptions we have long taken for granted have gradually changed:
This makes the CPU gradually become the bottleneck of some big data applications.
Spark started the Tungsten project to solve this problem.
Tungsten is a very important milestone for Spark. However, I will only cover the parts related to the topic here. Interested readers can research to learn more about it.
Tungsten implements a new memory management method that stores data in off-heap memory. It has the following two general benefits:
Shuffle involves caching, sorting, and merging large amounts of data, so it can also benefit from Tungsten. Since the implementation relies on the Java Unsafe package, it is also called Unsafe Sort Shuffle.
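As a tiny, standalone illustration (not Spark code) of what "using the Java Unsafe package" means in practice, the snippet below allocates a few bytes of off-heap memory, writes and reads a value, and frees it manually; this is the kind of raw, GC-invisible memory Tungsten manages.

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class OffHeapDemo {
    public static void main(String[] args) throws Exception {
        // sun.misc.Unsafe cannot be obtained directly; grab it via reflection (works on JDK 8)
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        long addr = unsafe.allocateMemory(16);    // raw off-heap allocation, invisible to the GC
        unsafe.putLong(addr, 42L);                // write a long at that raw address
        System.out.println(unsafe.getLong(addr)); // read it back: prints 42
        unsafe.freeMemory(addr);                  // must be freed manually; no GC will do it
    }
}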
In addition to these two general benefits, Tungsten has the following advantages in the shuffle process (the overall process of Tungsten Sort Shuffle and Sort Shuffle is similar):
Tungsten uses a two-level addressing scheme similar to an operating system's memory management: it first locates the page and then finds the specific record via an offset within the page.
As mentioned earlier, data in the shuffle process is written through a ShuffleWriter class.
We trace the call chain from UnsafeShuffleWriter to ShuffleExternalSorter to ShuffleInMemorySorter, and finally to PackedRecordPointer, to find the code that packs the pointer:
public static long packPointer(long recordPointer, int partitionId) {
    assert (partitionId <= MAXIMUM_PARTITION_ID);
    // Note that without word alignment we can address 2^27 bytes = 128 megabytes per page.
    // Also note that this relies on some internals of how TaskMemoryManager encodes its addresses.
    final long pageNumber = (recordPointer & MASK_LONG_UPPER_13_BITS) >>> 24;
    final long compressedAddress = pageNumber | (recordPointer & MASK_LONG_LOWER_27_BITS);
    return (((long) partitionId) << 40) | compressedAddress;
}
This code is not hard to follow: since a lot of information has to fit into a single 64-bit value, Tungsten compresses the pointer. The resulting layout is 24 bits of partitionId, then 13 bits of page number, then 27 bits of offset within the page, as the following figure shows more intuitively:
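As a companion sketch of my own (not Spark source), the three fields can be pulled back out of the packed 64-bit value according to that layout:

class PackedPointerLayout {
    // layout: | 24 bits partitionId | 13 bits page number | 27 bits offset in page |
    static int partitionId(long packed) {
        return (int) (packed >>> 40);             // top 24 bits
    }
    static int pageNumber(long packed) {
        return (int) ((packed >>> 27) & 0x1FFF);  // next 13 bits -> at most 8192 pages
    }
    static long offsetInPage(long packed) {
        return packed & 0x7FFFFFF;                // low 27 bits -> up to 2^27 bytes = 128 MB per page
    }
}

Reading the field widths off this layout also explains the limits usually quoted for this writer, such as the cap of 2^24 partitions and the 128 MB addressable per page.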
We can also summarize some limitations of Tungsten Sort Shuffle from this structure:
Spark introduced Tungsten Sort Shuffle in version 1.4 and merged it into Sort Shuffle in version 1.6.
After several rounds of evolution, Spark Shuffle has gradually stabilized. From the ShuffleWriter perspective, the selection order is BypassMergeSortShuffleWriter > UnsafeShuffleWriter > SortShuffleWriter. Each has its own applicable scenarios and is selected automatically when its conditions are met.
Another point worth mentioning: Spark supports dynamic allocation to improve resource utilization, which means executors can be reclaimed. When that happens, the files written in the shuffle write phase would be lost along with them.
To solve this problem, Spark can start an independent, long-running service on each machine so that the downstream can still perform shuffle read after the executor is reclaimed.
In terms of configuration, after we set spark.dynamicAllocation.enabled=true, we also need to set spark.shuffle.service.enabled=true to start this external shuffle service.
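A minimal sketch of the two settings, written programmatically via SparkConf (they can equally go into spark-defaults.conf or be passed as --conf flags to spark-submit). Note that the external shuffle service process itself must also be running on each node, for example as a YARN auxiliary service; that part is not shown here.

import org.apache.spark.SparkConf;

public class DynamicAllocationConf {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("dynamic-allocation-demo")
                .set("spark.dynamicAllocation.enabled", "true")   // allow executors to be added and removed
                .set("spark.shuffle.service.enabled", "true");    // keep shuffle files readable after removal
    }
}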
This article introduced several possible shuffle methods and their adoption in MapReduce and Spark.
Shuffle is a complex process with a major impact on performance. There are many further ways to optimize it, such as using a combiner, preferring reduceByKey over groupByKey, and setting parameters such as spark.shuffle.compress sensibly. Due to the topic and limited space, we will not discuss them further here.
The main points of this article are listed below:
The performance problems of shuffle have been addressed well, and various parameters and methods are provided so that applications can flexibly choose how to tune it.
Performance optimization is endless, so let's stop for a while and discuss it later. The next article will continue along the lines of this series.
As discussed in the last few articles, we have solved the performance problems of masters, slaves, and workers in distributed systems.
However, in terms of efficiency, there are execution efficiency and development efficiency. Performance involves execution efficiency, but what about development efficiency?
In the next article, we will learn how to improve the application development efficiency on distributed systems.
This is a carefully conceived series of 20-30 articles. I hope to give everyone a core grasp of distributed systems in a storytelling way. Stay tuned for the next one!