This topic describes the features and scenarios of Data Lake Analytics (DLA).
The following table describes the features of DLA.
Feature | Description | Reference |
---|---|---|
Account management | Accounts are classified into DLA sub-accounts and RAM users. You can create a DLA sub-account and bind it to a RAM user. Then, you can submit Spark jobs as that RAM user. | Account types |
Virtual cluster (VC) management | If you use DLA CU Edition, you must create a VC. This edition is suitable for scenarios in which data is queried frequently and in large volumes, and it makes your DLA costs predictable. | Create a virtual cluster |
Metadata management | This feature provides a visualized, global management view for metadata operations. For example, you can create a schema, view database and table information, and query data (see the schema DDL sketch after this table). | Query details about a schema |
Metadata discovery | When this feature is enabled, DLA automatically creates and updates data lake metadata for data files on OSS, which facilitates data analysis and computing. DLA can automatically detect the fields and data types in files, identify new columns and partitions, map directories to partitions, and group files into tables. | Crawl metadata |
One-click data ingestion into data lakes | This feature allows you to configure a data source (such as ApsaraDB RDS or a self-managed database hosted on ECS instances) and OSS as the destination data warehouse in the DLA console. DLA automatically and seamlessly synchronizes data from the data source to OSS at a specified time, creates a schema in DLA that matches the table schema of the data source, and lets you analyze the data in OSS. The synchronization does not affect the business of the data source. | Overview |
Real-time data lake | This feature builds a real-time data lake based on Spark Streaming in the DLA serverless Spark engine and Apache Hudi. After incremental OSS data is written to a Hudi table, DLA automatically creates metadata for it in its metadata system (see the Hudi write sketch after this table). | Build a real-time data lake by using DLA and DTS to synchronize data from ApsaraDB RDS |
DLA Serverless Presto | The serverless Presto engine of DLA is a Presto-based engine for interactive analysis. Presto was designed to replace the time-consuming, Hive-based approach to online analysis. It uses a streamlined in-memory execution engine, so compared with engines that write intermediate data to disk, it computes and executes faster. Presto is especially suitable for data analysis workloads such as ad hoc queries, business intelligence (BI) analysis, and lightweight extract, transform, and load (ETL) jobs (see the query sketch after this table). | Overview |
DLA Serverless Spark | The cloud-native serverless Spark engine of DLA provides data analytics and computing in data lake scenarios. After you activate DLA, you can submit Spark jobs with a simple configuration (see the job configuration sketch after this table), which frees you from the complex deployment of Spark clusters. | Overview of Serverless Spark |
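
The following is a minimal sketch of the schema operations described above, run through the DLA SQL endpoint, which is compatible with the MySQL protocol. The endpoint host, port, credentials, OSS bucket, and schema name are placeholders, and the exact DDL properties should be verified against the DLA documentation:

```python
# Hypothetical connection details; copy the real endpoint from the DLA console.
import pymysql

conn = pymysql.connect(
    host="service.cn-hangzhou.datalakeanalytics.aliyuncs.com",  # placeholder endpoint
    port=10000,
    user="your_dla_sub_account",  # a DLA sub-account bound to a RAM user
    password="your_password",
)
try:
    with conn.cursor() as cursor:
        # Create a schema whose data files live in an OSS directory.
        cursor.execute(
            "CREATE SCHEMA my_schema WITH DBPROPERTIES ("
            "CATALOG = 'oss', LOCATION = 'oss://your-bucket/data/')"
        )
        # View the tables that DLA manages in the new schema.
        cursor.execute("SHOW TABLES IN my_schema")
        print(cursor.fetchall())
finally:
    conn.close()
```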
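For the real-time data lake scenario, the sketch below shows an incremental upsert into a Hudi table on OSS from a serverless Spark job. It assumes the Hudi bundle is available to the job; the bucket, table, and field names are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()

# Incremental records to upsert; in production these typically come from a
# streaming source such as DTS change logs.
df = spark.createDataFrame(
    [(1, "alice", "2024-01-01 00:00:00"), (2, "bob", "2024-01-01 00:05:00")],
    ["id", "name", "ts"],
)

hudi_options = {
    "hoodie.table.name": "user_events",
    "hoodie.datasource.write.recordkey.field": "id",   # unique record key
    "hoodie.datasource.write.precombine.field": "ts",  # latest record wins
    "hoodie.datasource.write.operation": "upsert",
}

# After the write, DLA creates metadata for the Hudi table automatically
# (per the table above), so no explicit CREATE TABLE statement is needed.
(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("oss://your-bucket/hudi/user_events"))
```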
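An interactive query against the serverless Presto engine can be issued over the same MySQL-compatible endpoint. This sketch assumes a table named orders already exists in my_schema; all identifiers are placeholders:

```python
import pymysql

conn = pymysql.connect(
    host="service.cn-hangzhou.datalakeanalytics.aliyuncs.com",  # placeholder endpoint
    port=10000,
    user="your_dla_sub_account",
    password="your_password",
    database="my_schema",
)
try:
    with conn.cursor() as cursor:
        # An ad hoc aggregation over data files stored on OSS.
        cursor.execute(
            "SELECT category, COUNT(*) AS cnt FROM orders "
            "GROUP BY category ORDER BY cnt DESC LIMIT 10"
        )
        for row in cursor.fetchall():
            print(row)
finally:
    conn.close()
```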
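Finally, a serverless Spark job is described by a small configuration rather than a cluster deployment. The sketch below builds such a configuration as a Python dict and prints the JSON you would submit in the DLA console; the JAR path, class name, and resource specifications are placeholders, and the exact field names should be checked against the Serverless Spark documentation:

```python
import json

# Assumed shape of a DLA serverless Spark job configuration; verify field
# names against the Serverless Spark documentation before use.
job_conf = {
    "name": "SparkPi",
    "file": "oss://your-bucket/jars/spark-examples.jar",  # job artifact on OSS
    "className": "org.apache.spark.examples.SparkPi",
    "args": ["100"],
    "conf": {
        # Resource-spec settings that size the driver and executors; you do
        # not provision or manage any Spark cluster yourself.
        "spark.driver.resourceSpec": "medium",
        "spark.executor.resourceSpec": "medium",
        "spark.executor.instances": "2",
    },
}
print(json.dumps(job_conf, indent=2))
```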