Cloud-Native Compute Engine: Challenges and Solutions

Cloud-native: Background and Considerations

Figure 1 shows the E-MapReduce (EMR) architecture based on Elastic Compute Service (ECS). Built on the open-source big data ecosystem, the EMR architecture has been an indispensable big data solution for digital enterprises over the last decade. It has the following layers:

ECS physical resource layer, also known as the IaaS layer.
Data access layer, such as real-time Kafka and offline Sqoop.
Storage layer, such as Hadoop Distributed File System (HDFS), Object Storage Service (OSS), and the cache-accelerated JindoFS developed on EMR.
Engine layer, including popular compute engines such as Spark, Presto, and Flink.
Data application layer, such as Alibaba's self-developed DataWorks and PAI, and the open-source Zeppelin and Jupyter.

Each layer has many open-source components, which together constitute the classic big data solution: the EMR architecture. We have the following further considerations about it:

Can we achieve better elasticity? This means letting customers scale dynamically with actual peaks and troughs, which requires fast scaling and timely availability of resources.
Setting the current situation aside and looking at the trend for the next few years, is it still necessary to support every computing and storage engine? This practical question has two facets. The first is whether customers can maintain so many engines. The second is whether a general-purpose engine can solve all the problems in certain scenarios. For example, Spark is a good substitute for MapReduce, even though some users still run Hive on MapReduce at large scale.
Separating storage from computing is the generally accepted direction because it lets each scale independently. Customers can synchronize data into a data lake and scale the compute engines as needed, which makes the decoupled design cost-effective.

Figure 1: Open-source big data solution based on ECS

Based on the considerations above, let's focus on the concept of cloud-native. Kubernetes is currently the most promising implementation of cloud-native, so when we mention cloud-native in this article, we actually mean Kubernetes. As Kubernetes becomes increasingly popular, many customers are showing interest in this technology. Most of them have already moved their online business to Kubernetes and, along the same migration path, want to build a unified and complete big data architecture on a similar operating-system-like platform. We summarize the expected benefits as follows:

Build an integrated O&M system across online services, big data, AI, and other architectures through Kubernetes.
Separate storage from computing. This is the key to deploying big data engines on Kubernetes and follows the general trend.
Reduce costs and improve efficiency by using the inherent isolation of Kubernetes to better co-locate offline and online workloads.
Spend less time on basic O&M and more on the business by using the tools in the Kubernetes ecosystem.

EMR Compute Engine on ACK

Figure 2 shows the architecture of the EMR compute engine on Container Service for Kubernetes (ACK). ACK is Alibaba Cloud's Kubernetes offering and is compatible with the APIs of the community version of Kubernetes. This article does not distinguish between ACK and Kubernetes and treats them as the same concept.

Based on the discussion above, we believe that Spark and Presto are the most promising big data batch processing engines. We will gradually add other promising engines as the ACK version iterates.

The EMR compute engine is delivered as a set of products based on Kubernetes. Each product is essentially a combination of a Custom Resource Definition (CRD) and an Operator, which is also the basic cloud-native philosophy. We classify components into service components and batch components. Under this classification, Hive metastore is a service component, and Spark is a batch component.
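
As a rough illustration of the CRD-plus-Operator pattern, the sketch below builds a custom resource for a Spark batch job as a plain Python dictionary and serializes it to JSON. The field layout follows the `SparkApplication` resource of the open-source Spark Operator (`sparkoperator.k8s.io/v1beta2`). The image name, namespace, service account, file path, and resource sizes are placeholder assumptions, and the resource that EMR on ACK actually accepts may differ.

```python
import json


def spark_application(name, namespace, image, main_class, app_file, executors):
    """Build a SparkApplication custom resource as a plain dict.

    An operator watching this resource type is expected to create the
    driver and executor pods that match the declared spec.
    """
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": image,
            "mainClass": main_class,
            "mainApplicationFile": app_file,
            "sparkVersion": "3.1.1",
            # Resource sizes and service account are illustrative only.
            "driver": {"cores": 1, "memory": "2g", "serviceAccount": "spark"},
            "executor": {"cores": 2, "memory": "4g", "instances": executors},
        },
    }


manifest = spark_application(
    name="spark-pi",
    namespace="default",
    image="example.com/spark:3.1.1",  # placeholder, not a real registry path
    main_class="org.apache.spark.examples.SparkPi",
    app_file="local:///opt/spark/examples/jars/spark-examples.jar",
    executors=4,
)
print(json.dumps(manifest, indent=2))
```

The user only declares the desired job; reconciling that declaration into running pods is the Operator's responsibility, which is what makes the same control pattern reusable for both service and batch components.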

The green parts in Figure 2 are operators, which improve on the open-source implementations in many aspects. They are also compatible with the basic modules of ACK at the product layer, which makes control operations in the cluster convenient. The right section of the figure contains the logging, monitoring, data analytics, and ECM control components, which are infrastructure components of Alibaba Cloud. The bottom part of the figure has the following features:

JindoFS is Alibaba's cache acceleration layer for OSS, used to achieve computing and storage separation.
The HDFS of an existing EMR cluster is opened up so that customers can use the data already in that cluster.
A Shuffle service is introduced to decouple Shuffle data. This is the key difference between EMR containerization and open-source solutions, and we explain it later in the article.

It is relatively easy to deploy Presto on ACK because it has a stateless MPP architecture. Therefore, this article mainly discusses Spark on ACK.

Figure 2: Kubernetes-based compute engine

Spark on Kubernetes: Challenges

Generally, Spark on Kubernetes faces the following challenges:

We believe the most important challenge is the Shuffle process. The current Shuffle method cannot support dynamic resources. Moreover, a cloud disk needs to be attached for Shuffle, which raises a sizing problem: a large cloud disk may be wasteful, and a small one cannot support shuffle-heavy tasks.
Scheduling and queue management are also issues. Scheduling must not become a performance bottleneck when a large number of jobs start at the same time, and a resource management view is needed to control and share resources among queues.
Compared with HDFS, the read/write performance of the data lake is lower in Rename and List scenarios. In addition, OSS bandwidth is an unavoidable constraint.

Let's look at the solutions that address these problems:

Spark on Kubernetes: Solutions

Remote Shuffle Service Architecture

To solve the Spark Shuffle problem, we designed a Shuffle read/write separation architecture called Remote Shuffle Service. First, let's look at why the existing ways of providing disks for Shuffle in Kubernetes are unsatisfactory:

When the Docker file system is used, its slow performance and limited capacity constrain the Shuffle process.
When hostPath is used, dynamic resources are not supported, which wastes a lot of resources. Moreover, hostPath is not supported once the business migrates to Serverless Kubernetes.
When a cloud disk is attached to each executor, dynamic resources are not supported either. In addition, the amount of Shuffle data for each executor needs to be known in advance. A large cloud disk wastes space, while a small one may fail to hold the Shuffle data.

Therefore, the remote Shuffle architecture significantly improves on the existing Shuffle mechanism to solve this problem. Figure 3 shows many control flows, which we do not discuss in detail here. For more information about the architecture, see the article EMR Shuffle Service - a Powerful Elastic Tool of Serverless Spark. The focus here is the data flow. All Mappers and Reducers of the executors, marked in blue, run in Kubernetes containers. The middle of the figure is the Remote Shuffle Service. The Mappers write Shuffle data remotely to the service, which removes the dependency of executor tasks on local disks. The Shuffle service merges data of the same partition from different Mappers and then writes it to the distributed file system. In the Reduce stage, Reducers read files sequentially, which improves performance. The major implementation difficulties of this system are the control flow design, fault tolerance in all aspects, data deduplication, and metadata management.

In short, the design can be summarized in the following three points:

Shuffle data is written over the network, so intermediate data is also separated between computing and storage.
The DFS keeps two replicas, which eliminates the recomputation caused by Fetch Failed errors and makes shuffle-heavy jobs more stable.
At the Reduce stage, data is read from disk sequentially, which avoids the random I/O of the existing Shuffle and improves performance significantly.

Figure 3: Architecture of Remote Shuffle Service
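
The data flow described above can be sketched with a toy, in-memory model. This is only an illustration under simplifying assumptions, not the EMR implementation: the class and function names are hypothetical, a Python list stands in for a merged file on the distributed file system, and the control flow, fault tolerance, deduplication, and metadata management mentioned earlier are left out entirely.

```python
from collections import defaultdict


class RemoteShuffleService:
    """Toy in-memory stand-in for a remote shuffle service.

    Mappers push records here instead of writing to a local disk. The
    service appends every record for a partition to that partition's
    single buffer, which stands in for one merged file on a DFS.
    """

    def __init__(self):
        self._partitions = defaultdict(list)

    def push(self, partition_id, mapper_id, records):
        # Merge data of the same partition coming from different mappers.
        self._partitions[partition_id].extend(
            (mapper_id, record) for record in records
        )

    def read_partition(self, partition_id):
        # One sequential pass over the merged data of a partition.
        for _mapper_id, record in self._partitions[partition_id]:
            yield record


def run_mapper(service, mapper_id, records, num_partitions):
    """Partition records by key hash and push them to the service."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[hash(key) % num_partitions].append((key, value))
    for partition_id, bucket in buckets.items():
        service.push(partition_id, mapper_id, bucket)


def run_reducer(service, partition_id):
    """Sum values per key from a single sequential read."""
    totals = defaultdict(int)
    for key, value in service.read_partition(partition_id):
        totals[key] += value
    return dict(totals)


service = RemoteShuffleService()
run_mapper(service, 0, [(1, 10), (2, 20), (3, 30)], num_partitions=2)
run_mapper(service, 1, [(1, 1), (2, 2), (4, 4)], num_partitions=2)

results = {}
for pid in range(2):
    results.update(run_reducer(service, pid))
print(results)
```

Note that each reducer touches exactly one merged partition rather than one block per mapper, and that no mapper keeps any state once its data has been pushed.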

Remote Shuffle Service Performance

Regarding performance, Figure 4 shows the TeraSort benchmark results. We chose the TeraSort workload because it is a large Shuffle task with only three stages, which makes changes in Shuffle performance easy to observe. In the left part of the figure, the blue bars show the runtime with the Shuffle service, and the orange bars show the runtime with the original Shuffle. Across data volumes of 2 TB, 4 TB, and 10 TB, the larger the data volume, the more obvious the advantage of the Shuffle service.

The right part of the figure shows that the improvement comes from the Reduce stage. With 10 TB of data, the duration of the Reduce stage drops from 1.6 hours to 1 hour. The reason was explained above: those familiar with the Spark Shuffle mechanism know that the original sort Shuffle performs M*N random I/O operations, where M is the number of Mappers and N is the number of Reducers. In this example, M is 12000 and N is 8000. Remote Shuffle performs only N sequential I/O operations, so replacing M*N random reads with 8000 sequential reads is the fundamental reason for the performance improvement.

Figure 4: Benchmark of Remote Shuffle Service Performance
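
To make the M*N versus N comparison concrete, the short calculation below plugs in the figures quoted in the text. It counts read operations only and says nothing about bytes transferred or the size of each read.

```python
mappers = 12_000   # M, from the TeraSort example in the text
reducers = 8_000   # N, from the TeraSort example in the text

# Sort shuffle: every reducer fetches one block from every mapper output.
random_reads = mappers * reducers
# Remote shuffle: every reducer reads one merged file for its partition.
sequential_reads = reducers

print(f"sort shuffle random reads:       {random_reads:,}")
print(f"remote shuffle sequential reads: {sequential_reads:,}")
print(f"ratio: {random_reads // sequential_reads:,}x fewer read operations")

# Reduce stage duration at 10 TB, from the text.
before_hours, after_hours = 1.6, 1.0
reduction = (before_hours - after_hours) / before_hours
print(f"Reduce stage time saved: {reduction:.1%}")
```

With these numbers, the sort Shuffle issues 96,000,000 small random reads against 8,000 sequential ones, and the reported Reduce stage duration drops by 37.5%.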

Other Optimization

Other aspects of EMR optimization are as follows:

Scheduling performance optimization: We have addressed some shortcomings of the open-source Spark Operator. The open-source Spark Operator applies many executor pod configurations through a Webhook, which is unfriendly to scheduling. In our performance tests, this detour through the API Server caused a significant performance loss. We avoid the bottleneck by moving this step into the Spark kernel. At the scheduler layer, we offer customers two options: the Scheduler Framework V2 and the YuniKorn provided by ACK.
R/W performance optimization of OSS: We have introduced JindoFS as a cache to relieve bandwidth pressure. In addition, EMR provides Jindo Job Committer for OSS scenarios to optimize the job commit process, which greatly reduces time-consuming operations such as Rename and List.
Spark optimization: EMR has accumulated many years of technical expertise in Spark and has achieved good results in the official TPC-DS testing in both performance and stability. It also integrates Delta Lake optimizations.
All-in-one management and control: We provide Notebook job development, monitoring, alerting, logging, and other services. Resource control inherits the ResourceQuota of the Namespace.
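
The cost of Rename and List on object storage, mentioned in the OSS item above, can be illustrated with a toy model. Object stores generally keep a flat key space, so renaming a "directory" is emulated by listing the keys under a prefix and then copying and deleting each object. The sketch below counts the requests for such a rename-based commit. The class is hypothetical and does not describe how Jindo Job Committer works internally; it only shows why a commit path that avoids renames is attractive.

```python
class ToyObjectStore:
    """Flat key-value store that counts the requests it serves."""

    def __init__(self):
        self.objects = {}
        self.requests = 0

    def put(self, key, data):
        self.requests += 1
        self.objects[key] = data

    def list(self, prefix):
        self.requests += 1
        return sorted(k for k in self.objects if k.startswith(prefix))

    def copy(self, src, dst):
        self.requests += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.requests += 1
        del self.objects[key]

    def rename_prefix(self, src_prefix, dst_prefix):
        """Emulate a directory rename: list, then copy and delete each object."""
        for key in self.list(src_prefix):
            self.copy(key, dst_prefix + key[len(src_prefix):])
            self.delete(key)


store = ToyObjectStore()
for i in range(100):
    store.put(f"job/_temporary/part-{i:05d}", b"rows")

store.requests = 0  # count only the commit step
store.rename_prefix("job/_temporary/", "job/")
print(store.requests)
```

With 100 task output files, the emulated rename costs 201 requests (one list, 100 copies, and 100 deletes), whereas a file system with real directories can typically do the same rename as a single metadata operation.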

In general, the EMR version of Spark on ACK has greatly improved in architecture, performance, stability, and usability.

Prospects for Cloud-Native Spark

From our perspective, the goal of cloud-native containerization of Spark is unified O&M at lower cost. We summarize our outlook as follows:

Kubernetes computing resources can be organized as a hybrid architecture of fixed clusters and serverless clusters. Fixed clusters are subscription clusters with high resource usage, whereas serverless clusters provide on-demand elasticity.
Customers can use scheduling algorithms to flexibly place tasks with lower SLA levels on Spot instances, that is, preemptible ECI containers, which further reduces costs.
With a remote Shuffle service cluster, Spark no longer depends on local disks, so an executor can be released as soon as it is idle. Combined with the storage-computing separation provided by OSS, we expect this to become the mainstream architecture.
Scheduling capability needs to be enhanced so that customers can schedule a large number of jobs in a short time without performance bottlenecks. We also hope to implement more powerful resource control and resource sharing in job queue management.

Figure 5: Hybrid Architecture of Spark on Kubernetes

Implementing cloud-native big data involves many challenges. To address them, the EMR team collaborates with the community and its partners to develop better technologies and ecosystems.

Our vision is as follows:

Storage-computing separation with on-demand expansion
Ultimate elasticity with high availability
Closed-loop O&M with cost-effectiveness

Alibaba EMR