The serverless Spark engine of Data Lake Analytics (DLA) uses a cloud-native architecture to provide data analysis and computing services for data lake scenarios. After you activate DLA, you can submit Spark jobs by completing simple configurations. This frees you from the complex deployment of Spark virtual clusters (VCs).
DLA is discontinued. AnalyticDB for MySQL supports the features of DLA and provides additional features and enhanced performance. For more information about how to use AnalyticDB for MySQL, see Spark application development.
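The following is a minimal sketch of what such a job configuration could look like, expressed in Python for readability. The field names (name, file, className, args, conf) and the spark.executor.instances setting follow common DLA Spark job conventions, but treat them as assumptions and check the product documentation for the exact schema.

```python
import json

# A hypothetical DLA serverless Spark job configuration. Field names and
# the OSS path are illustrative assumptions, not an authoritative schema.
job_config = {
    "name": "oss-etl-demo",                          # job name shown in the console
    "file": "oss://examplebucket/jars/etl-job.jar",  # main application resource (hypothetical path)
    "className": "com.example.EtlJob",               # entry class of the Spark application
    "args": ["2024-01-01"],                          # arguments passed to the application
    "conf": {
        "spark.executor.instances": "4"              # number of executors to request
    },
}

# Serialize to the JSON document that would be pasted into the console
# or passed to the job-submission API.
print(json.dumps(job_config, indent=2))
```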
Challenges facing Apache Spark
Apache Spark is a prevailing engine in the big data field. It is well suited to data lake scenarios, provides built-in connectors to access data sources, and makes it easy to extend these connectors through its API. Apache Spark supports SQL and allows you to write DataFrame code in multiple programming languages, which makes it both easy to use and flexible. Apache Spark also serves as an end-to-end engine that supports SQL, streaming, machine learning, and graph computing.
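For example, the same aggregation can be written with either the DataFrame API or SQL. The short PySpark sketch below uses made-up data purely for illustration:

```python
from pyspark.sql import SparkSession

# Build a local SparkSession; on a cluster the session would already be provided.
spark = SparkSession.builder.appName("dataframe-and-sql-demo").getOrCreate()

# A tiny in-memory DataFrame (illustrative data).
orders = spark.createDataFrame(
    [("2024-01-01", "books", 42.0), ("2024-01-01", "games", 13.5), ("2024-01-02", "books", 18.0)],
    ["order_date", "category", "amount"],
)

# DataFrame API style.
orders.groupBy("category").sum("amount").show()

# Equivalent SQL style on the same data.
orders.createOrReplaceTempView("orders")
spark.sql("SELECT category, SUM(amount) AS total FROM orders GROUP BY category").show()

spark.stop()
```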
Before you use Apache Spark, you must deploy a set of basic open source big data components, including YARN, Hadoop Distributed File System (HDFS), and ZooKeeper. After you deploy these components, the following issues may occur:
Complex development and O&M operations: To complete the development and perform O&M operations, developers must be familiar with a variety of big data components. If they encounter issues, they must conduct in-depth research on the source code provided by the Apache Spark community.
High O&M costs: Enterprises require an O&M team to maintain open source components. The O&M team needs to configure resource nodes, configure and deploy open source software, monitor and update open source components, and scale clusters. Customized development is also required to meet enterprise-level requirements, such as permission isolation and monitoring and alerting.
High resource costs: The loads of Spark jobs fluctuate significantly over time. During off-peak hours, large amounts of resources in Apache Spark clusters sit idle. Cluster management and control components, such as master nodes, ZooKeeper, and Hadoop, still consume resources during off-peak hours but do not bring business value to customers.
Lack of elasticity: During peak hours, enterprises must accurately estimate resource requirements and add machines in time. If too many machines are added, some of them remain unused. If too few are added, business may be affected by insufficient resources. In addition, the cluster scale-out process is complex and time-consuming, so resources may not be available when they are needed.
Solution
The serverless Spark engine of DLA is a big data analysis and computing service. This engine is developed based on Apache Spark and uses a service-oriented architecture (SOA). The following figure shows the architecture of the serverless Spark engine of DLA.
The serverless Spark engine deeply integrates Spark with serverless and cloud-native technologies. Compared with Apache Spark, the serverless Spark engine of DLA provides the following benefits:
Easy to use: provides simple API operations and scripts, without requiring developers to understand the underlying components. The serverless Spark engine can also be operated directly in the DLA console, which allows developers with only a basic knowledge of Apache Spark to develop big data services.
Zero O&M: provides product interfaces for you to manage Spark jobs. You do not need to configure servers or Hadoop clusters, or perform O&M operations such as scaling.
Job-based scalability: allocates resources per job, at the granularity of the driver and executors. Compared with fixed compute units (CUs) in Apache Spark clusters, this reduces the probability of insufficient resources. The engine can start 500 to 1,000 CUs within a minute to meet business resource requirements.
Low costs: uses the pay-as-you-go billing method. You are charged only for the jobs that you run, not for resource management and control. You also do not need to pay for idle computing resources during off-peak hours.
Superior performance: delivers three to five times better performance in typical scenarios that involve Alibaba Cloud services, such as Object Storage Service (OSS). To achieve this, the DLA development team customized and optimized the serverless Spark engine based on Apache Spark.
Note: For more information about the performance comparison results, see Test results.
Enterprise-level capabilities: shares metadata with the serverless Presto engine of DLA. You can execute the GRANT and REVOKE statements to manage the permissions granted to Resource Access Management (RAM) users. The serverless Spark engine also provides a user-friendly web UI. Unlike the Apache Spark history server, the web UI opens within a few seconds no matter how complex a job is.
Terms
Virtual cluster
The serverless Spark engine of DLA uses a multitenancy architecture, and Spark processes run in an isolated environment. A VC is the unit of resource and security isolation. A VC does not have fixed computing resources. Therefore, you only need to allocate a resource quota based on your business requirements and configure the network in which the data that you want to access resides. You do not need to provision or maintain compute resources. You can also configure default parameters for the Spark jobs of a VC, which facilitates unified management of Spark jobs.
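As a rough illustration of this idea, the snippet below shows VC-level defaults as Spark configuration key-value pairs. The key names follow common Spark and DLA naming conventions but are assumptions, and the merge shown is only a sketch of how defaults and per-job settings could combine, not the documented precedence rules.

```python
# Hypothetical VC-level default Spark parameters (illustrative keys).
vc_default_conf = {
    "spark.driver.resourceSpec": "small",
    "spark.executor.resourceSpec": "small",
    "spark.executor.instances": "2",
}

# A job submitted to the VC that only overrides the executor count.
job_conf = {"spark.executor.instances": "8"}

# Sketch of the effective configuration: job settings take precedence
# over VC defaults (assumed behavior, not the documented rule).
effective_conf = {**vc_default_conf, **job_conf}
print(effective_conf)
```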
Compute unit
A CU is the basic resource unit of the serverless Spark engine of DLA. One CU equals 1 CPU core and 4 GB of memory. After a job is complete, DLA calculates the resources consumed by the driver and executors by using the following formula: Total number of CUs used by the driver and executors × Number of hours during which the CUs are used. For more information, see Billing overview.
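As a worked example of this formula (the job shape and runtime are made up; the CU weights per resource specification come from the table in the next section):

```python
# Worked example of the CU consumption formula (illustrative job shape).
driver_cus = 2        # one driver with the "medium" specification (2 CUs)
executor_cus = 4 * 1  # four executors with the "small" specification (1 CU each)
hours = 1.5           # the job ran for 1.5 hours

cu_hours = (driver_cus + executor_cus) * hours
print(cu_hours)  # (2 + 4) * 1.5 = 9.0 CU-hours
```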
Resource specifications
At the underlying layer, the serverless Spark engine runs on elastic container instances. Similar to Elastic Compute Service (ECS) instances, elastic container instances have specifications. However, you do not need to configure the detailed specifications of elastic container instances. Instead, you only need to set the resource specification to small, medium, or large. By default, the serverless Spark engine preferentially uses elastic container instances with higher specifications. The following table lists the available resource specifications; a configuration sketch follows the table.
| Resource specification | Computing resource specification | Number of CUs consumed |
| --- | --- | --- |
| c.small | 1 CPU core, 2 GB of memory | 0.8 |
| small | 1 CPU core, 4 GB of memory | 1 |
| m.small | 1 CPU core, 8 GB of memory | 1.5 |
| c.medium | 2 CPU cores, 4 GB of memory | 1.6 |
| medium | 2 CPU cores, 8 GB of memory | 2 |
| m.medium | 2 CPU cores, 16 GB of memory | 3 |
| c.large | 4 CPU cores, 8 GB of memory | 3.2 |
| large | 4 CPU cores, 16 GB of memory | 4 |
| m.large | 4 CPU cores, 32 GB of memory | 6 |
| c.xlarge | 8 CPU cores, 16 GB of memory | 6.4 |
| xlarge | 8 CPU cores, 32 GB of memory | 8 |
| m.xlarge | 8 CPU cores, 64 GB of memory | 12 |
| c.2xlarge | 16 CPU cores, 32 GB of memory | 12.8 |
| 2xlarge | 16 CPU cores, 64 GB of memory | 16 |
| m.2xlarge | 16 CPU cores, 128 GB of memory | 24 |
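The following sketch shows how rows from the table are typically mapped to a job configuration. The spark.driver.resourceSpec and spark.executor.resourceSpec keys follow DLA conventions but should be treated as assumptions and verified against the documentation.

```python
import json

# Hypothetical configuration fragment that selects specifications from the
# table above: a "medium" driver (2 CUs) and "large" executors (4 CUs each).
conf = {
    "spark.driver.resourceSpec": "medium",
    "spark.executor.resourceSpec": "large",
    "spark.executor.instances": "4",
}
print(json.dumps(conf, indent=2))
```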
References
For more information about how to submit a Spark job, see Quick start of the serverless Spark engine.
For more information about how to access a data source in AnalyticDB for PostgreSQL, see AnalyticDB for PostgreSQL.
For more information about how to compute spatio-temporal data, see Product introduction.
For more information about how to obtain technical support, see Expert service.