DataWorks allows you to create nodes such as Hive, MR, Presto, and Spark SQL nodes based on an E-MapReduce (EMR) compute engine. In the DataWorks console, you can configure EMR nodes, enable periodic scheduling of tasks on the nodes, and manage the metadata of the nodes to ensure that data is generated and managed in an efficient and stable manner. This topic describes the usage notes for the development of EMR tasks in DataWorks. The usage notes cover the basic development process, fee description, environment preparation, and permission management.
Background information
EMR is a big data processing solution provided by Alibaba Cloud.
EMR is developed based on open source Apache Hadoop and Apache Spark. EMR allows you to use peripheral systems in the Hadoop and Spark ecosystems to analyze and process data with ease. Alibaba Cloud provides EMR on ECS, EMR on ACK, and EMR Serverless StarRocks to meet the business requirements of different users. For more information, see Product Overview.
Supported EMR cluster types
You must create the required EMR clusters and register them to DataWorks before you can use the clusters in the DataWorks console to run tasks. You can register the following types of EMR clusters to DataWorks:
If your cluster cannot be registered to DataWorks, submit a ticket to contact technical support.
Limits
Task type: You cannot run EMR Flink tasks in the DataWorks console.
Task running: You can use a serverless resource group (recommended) or an old-version exclusive resource group for scheduling to run an EMR task.
Task governance:
Only SQL tasks in EMR Hive, EMR Spark, and EMR Spark SQL nodes can be used to generate data lineages. If your EMR cluster is of V3.43.1, V5.9.1, or a minor version later than V3.43.1 or V5.9.1, you can view the table-level lineages and field-level lineages of the preceding nodes that are created based on the cluster.
Note: For Spark-based EMR nodes, if the EMR cluster is of V5.8.0, V3.42.0, or a minor version later than V5.8.0 or V3.42.0, you can view table-level and field-level lineages. If the EMR cluster is of a minor version earlier than V5.8.0 or V3.42.0, you can view only table-level lineages, and only for Spark-based EMR nodes that use Spark 2.x.
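The version rule in the preceding note can be expressed as a small check. The following Python sketch is only an illustration and is not part of DataWorks or EMR. The function names and the accepted version-string format are hypothetical. The sketch also assumes that "a minor version later than" means a higher version within the same major series (3.x or 5.x), and that Spark 3.x nodes on earlier cluster versions produce no lineage, which the note implies but does not state outright.

```python
import re

# Minimum EMR versions for table-level and field-level lineage on Spark-based
# nodes, keyed by major series. The values come from the note; the lookup by
# major series is an assumption about how "later minor version" is meant.
SPARK_FIELD_LINEAGE_MIN = {3: (3, 42, 0), 5: (5, 8, 0)}


def parse_emr_version(version: str) -> tuple:
    """Parse a string such as 'V5.9.1' into a (major, minor, patch) tuple."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized EMR version: {version!r}")
    return tuple(int(part) for part in match.groups())


def spark_lineage_levels(emr_version: str, spark_major: int) -> set:
    """Return the lineage levels expected for a Spark-based EMR node."""
    version = parse_emr_version(emr_version)
    minimum = SPARK_FIELD_LINEAGE_MIN.get(version[0])
    if minimum is None:
        # The note only covers the 3.x and 5.x series.
        raise ValueError(f"EMR series {version[0]}.x is not covered by the note")
    if version >= minimum:
        return {"table", "field"}
    # Earlier versions: only Spark 2.x nodes, and only table-level lineage.
    if spark_major == 2:
        return {"table"}
    return set()
```

For example, under these assumptions a V3.41.0 cluster that runs Spark 2.x yields only table-level lineage, whereas a V5.9.1 cluster yields both levels regardless of the Spark version.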
If you want to manage metadata for a DataLake or custom cluster in DataWorks, you must configure EMR-HOOK in your cluster first. If you do not configure EMR-HOOK in the desired cluster, metadata cannot be displayed in real time, audit logs cannot be generated, and data lineages cannot be displayed in DataWorks. In addition, EMR governance tasks cannot be run. EMR-HOOK can be configured for EMR Hive and EMR Spark SQL services. For more information, see Use the Hive extension feature to record data lineage and historical access information and Use the Spark SQL extension feature to record data lineage and historical access information.
Supported regions: EMR Serverless Spark is available only in the China (Zhangjiakou) region.
Prerequisites
DataWorks is activated and a workspace is created. For more information, see Activate DataWorks and Create and manage workspaces.
An EMR cluster is created. For more information, see Create a cluster.
Note: You can use different EMR services to run EMR tasks in DataWorks. The optimal configurations of the EMR services vary. When you create an EMR cluster, you can refer to the Appendix: Suggestions for EMR cluster configuration section in this topic to select EMR services based on your business requirements.
A DataWorks serverless resource group is purchased.
By default, DataWorks resource groups are not connected to the networks of other cloud services after the resource groups are purchased. An EMR cluster must be connected to a specific resource group before you can use the EMR cluster.
Note: DataWorks provides general-purpose serverless resource groups, and we recommend that you purchase this type of resource group. Serverless resource groups are suitable for scenarios in which different task types are used, such as data synchronization and task scheduling. For information about how to purchase a serverless resource group, see Create and use a serverless resource group. New users can purchase only serverless resource groups.
If you have purchased an old-version exclusive resource group, you can also use the resource group to run EMR tasks. The type of old-version exclusive resource group that you must use varies based on the type of the task that you want to run. For example, to run a data synchronization task, you must use an exclusive resource group for Data Integration. To run a data scheduling task, you must use an exclusive resource group for scheduling. For more information, see Use old-version resource groups.
Usage notes
The following table describes the usage notes for the development of EMR tasks in DataWorks.
No. | Description |
1 | When you develop EMR tasks in DataWorks, you are charged for not only DataWorks resources but also the resources of other Alibaba Cloud services. |
2 | Before you develop EMR tasks in DataWorks, you must purchase DataWorks of a specific edition and a resource group based on your business requirements, register an EMR cluster, and prepare the development environment. |
3 | DataWorks provides a comprehensive permission management system for you to manage product-level permissions and module-level permissions. You can grant different permissions to different users based on your business requirements to implement fine-grained permission management. |
4 | DataWorks Data Integration allows you to read data from and write data to EMR Hive. DataWorks provides a variety of data synchronization scenarios, such as batch synchronization and full and incremental synchronization. |
5 | DataWorks provides the Data Modeling service that is used to structure and manage large volumes of unordered and complex data. DataWorks also provides the DataStudio service for development of tasks that are scheduled to run. After the tasks are developed, you can go to Operation Center to monitor and perform O&M operations on the tasks. |
6 | DataWorks DataAnalysis provides the EMR data analysis and service sharing capabilities. |
7 | DataWorks allows you to manage EMR metadata and govern EMR data. |
8 | DataWorks provides the DataService Studio service to help you manage API services for internal and external systems in a centralized manner. |
9 | DataWorks provides openness capabilities that allow your application systems to quickly integrate with DataWorks. You can use DataWorks to manage data-related processes, govern data, perform O&M operations on data, and quickly respond to changes to the business status in the application systems. |
Billing
1. Fees for DataWorks resources
This section describes the fees that are included in your DataWorks bill. For information about the billable items of DataWorks, see Billing overview.
Fee | Description |
Fees for the DataWorks edition that you use | You must activate DataWorks before you can develop tasks in DataWorks. If you activate DataWorks Standard Edition, DataWorks Professional Edition, or DataWorks Enterprise Edition, you are charged the fees for the edition when you purchase the edition. |
Fees for the scheduling resources that you use to schedule tasks | After tasks are developed, scheduling resources are required to schedule the tasks. You can purchase a serverless resource group or an old-version exclusive resource group for scheduling, and pay for the resource group. We recommend that you purchase a serverless resource group. Note: A purchased serverless resource group can be used for task scheduling and data synchronization. |
Fees for the resources that you use to synchronize data | A data synchronization task consumes scheduling resources and synchronization resources. You can purchase a serverless resource group or an old-version exclusive resource group for Data Integration, and pay for the resource group. We recommend that you purchase a serverless resource group. |
2. Fees for the resources of other Alibaba Cloud services
This section describes the fees that are not included in your DataWorks bill.
You are charged for the resources of other Alibaba Cloud services based on the billing logic of the Alibaba Cloud services. For more information, see the billing documentation of the Alibaba Cloud services. For information about the billing details of an EMR compute engine, see Billing overview.
Fee | Description |
Database fees | When you run data synchronization tasks to read data from and write data to databases, database fees may be generated. |
Computing and storage fees | When you run tasks of a specific compute engine type, computing and storage fees of this type of compute engine may be generated. |
Network service fees | When you establish network connections between DataWorks and other related services, network service fees may be generated. For example, if you use services, such as Express Connect, Elastic IP Address (EIP), and Internet Shared Bandwidth, to establish network connections between DataWorks and other related services, you may be charged network service fees. |
Environment preparation
1. Resource preparation
Item | Description | References |
Select a DataWorks edition | DataWorks Basic Edition allows you to perform the following basic operations during the development of EMR data: migrate data to the cloud, develop data, schedule EMR tasks, and govern data. If you want to use more advanced data governance and data security solutions, you can purchase DataWorks of an advanced edition, such as DataWorks Standard Edition, DataWorks Professional Edition, or DataWorks Enterprise Edition. | |
Select a resource group | You can use only serverless resource groups or old-version exclusive resource groups to run EMR tasks. We recommend that you use serverless resource groups. |
2. Development environment preparation
You must register an EMR cluster with a DataWorks workspace before you can develop EMR tasks in DataStudio. To enable collaborative data development, you must also add users to the workspace as members.
Item | Description | References |
Prepare a data synchronization environment | Before you run a data synchronization task based on an EMR service, you must add the EMR service to DataWorks as a data source. | |
Prepare an environment for data development and analysis | Before you enable DataWorks to periodically schedule EMR tasks, you must add an EMR cluster to DataWorks as a data source. Then, you can use the data source to perform operations, such as data development, data analysis, and periodic task scheduling. | |
Prepare a collaborative development environment | To ensure that RAM users can collaborate with each other to develop data in a workspace, you must perform the following operations:
|
Permission management
DataWorks provides a comprehensive permission management system for you to manage product-level permissions and module-level permissions. You can grant different permissions to different users based on your business requirements. Details of permission management:
1. Management of data access permissions
You can configure mappings between EMR cluster accounts and the RAM users that are added to a DataWorks workspace as members to develop EMR tasks. The mappings grant the RAM users the permissions of the mapped EMR cluster accounts. For more information, see Configure mappings between tenant member accounts and EMR cluster accounts.
DataWorks allows you to manage permissions on Data Lake Formation (DLF) in a visualized manner. For example, you can request permissions, process permission requests, and audit permissions. This helps you manage permissions on fully managed data lakes in a centralized manner. If DLF is specified as the metadata storage service for an EMR data source that is added to your workspace, you can apply for and manage permissions in DataWorks Security Center. For more information, see Manage permissions on DLF.
2. Management of permissions on services and features
Before you develop data in DataWorks as a RAM user, you must assign a workspace-level role to the RAM user to grant the RAM user specific permissions. For more information, see Best practices for managing permissions of RAM users.
You can refer to Manage permissions on global-level services to manage permissions on DataWorks service modules, such as prohibiting users from accessing Data Map, and to manage permissions of performing operations in the DataWorks console, such as allowing users to delete a workspace.
You can refer to Manage permissions on workspace-level services to manage permissions on DataWorks workspace-level service modules, such as allowing users to access DataStudio to perform development operations, and to manage permissions on DataWorks global-level service modules, such as prohibiting users from accessing Data Security Guard.
Getting started
DataWorks provides multiple services. You can develop tasks that are scheduled to run in DataStudio. After the tasks are developed, you can go to Operation Center in the production environment to monitor and perform O&M operations on the tasks. DataWorks also provides process control for task development and deployment to standardize data development operations and ensure security of data development.
1. Data integration
DataWorks Data Integration allows you to read data from and write data to EMR Hive. You must add the Hive service to DataWorks as a data source before you can synchronize data from another type of data source to a Hive data source or synchronize data from a Hive data source to another type of data source. In addition, DataWorks provides a variety of data synchronization scenarios, such as batch synchronization, full synchronization, and incremental synchronization. You can select a scenario based on your business requirements. For more information, see Overview.
2. Data modeling and development
Module | Description | References |
Data Modeling | Data Modeling is the first step in end-to-end data governance. Data Modeling applies the modeling methodology of the Alibaba data mid-end and interprets the business data of an enterprise from a business perspective by using the data warehouse planning, data standard, dimensional modeling, and data metric modules. This allows personnel inside the enterprise to quickly understand and share a common way of measuring and interpreting business data in compliance with data warehousing specifications. | |
DataStudio | DataWorks encapsulates the capabilities of an EMR compute engine. This way, you can use the EMR compute engine to run EMR data synchronization and development tasks.
| |
You can use general nodes and nodes of a specific type of compute engine in DataWorks to process complex logic. DataWorks supports the following types of general nodes:
| ||
After tasks on nodes are developed, you can perform the following operations based on your business requirements:
| ||
Operation Center | Operation Center is an end-to-end big data O&M and monitoring platform. Operation Center allows you to view the status of tasks and perform O&M operations on tasks on which exceptions occur. For example, you can perform intelligent diagnostics and rerun tasks in Operation Center. Operation Center provides the intelligent baseline feature that you can use to resolve issues such as uncontrollable output time of important tasks and difficulties in monitoring of massive tasks. This feature helps you ensure the timeliness of task output. | |
Data Quality | Data Quality ensures data availability for the end-to-end data R&D process and provides reliable data for your business in an efficient manner. Data Quality helps you identify data quality issues at the earliest opportunity and prevents the issues from escalating by running quality checks based on monitoring rules and by associating monitoring rules with task scheduling processes. |
2. Data analysis
The DataAnalysis service module of DataWorks helps you perform SQL-based analysis online, gain insight into business requirements, and edit and share data. You can also save query results as chart cards and quickly generate visualized data reports based on the chart cards for daily reporting. For more information, see DataAnalysis overview.
3. Data governance
After you register an EMR cluster to DataWorks, DataWorks automatically collects metadata from your EMR compute engine. You can refer to Data Map overview to view metadata. In addition, you can refer to Data Governance Center overview to view the issues that are detected by DataWorks and perform related data governance operations.
Module | Description | References |
Data Map | Data Map is an enterprise-grade data management platform that provides management, sorting, quick search, and in-depth understanding capabilities for data objects based on the underlying unified metadata services. | |
Security Center, Data Security Guard, Approval Center | Security Center is an end-to-end data security governance platform that covers classification of data assets, sensitive data identification, management of data-related authorization, masking of sensitive data, audit of access to sensitive data, and risk identification and response. Security Center helps you identify data security governance issues. | |
Data Governance Center | Data Governance Center automatically identifies items to be governed for multiple governance fields based on rules that come from experience in data-related fields, and provides governance and optimization solutions covering pre-event issue prevention and post-event issue resolution. Data Governance Center can help you actively and systematically complete data governance. |
4. Data service
DataService Studio is designed to provide comprehensive data service and sharing capabilities for enterprises and helps enterprises manage API services for internal and external systems in a centralized manner. For more information, see DataService Studio overview.
5. Open Platform
DataWorks provides openness capabilities that allow your application systems to quickly integrate with DataWorks. You can use DataWorks to manage data-related processes, govern data, perform O&M operations on data, and quickly respond to changes to the business status in the application systems.
Item | Description | References |
OpenAPI | The OpenAPI module allows you to call DataWorks API operations so that you can integrate your applications with DataWorks. This can help facilitate big data processing, decrease manual operations and O&M operations, minimize data risks, and reduce costs for enterprises. | |
OpenEvent | The OpenEvent module allows you to subscribe to DataWorks change events related to your applications so that you can detect and respond to the changes at the earliest opportunity. | |
Extensions | You can use the OpenEvent module to subscribe to event messages that are generated in your DataWorks workspace. You can use the Extensions module to register your local program as an extension to manage extension point events and processes. |
Appendix: Suggestions for EMR cluster configuration
You can use different EMR services to run EMR tasks in DataWorks. The optimal configurations of the EMR services vary. When you create an EMR cluster, you can select EMR services based on your business requirements.
Kyuubi
When you configure Kyuubi in the production environment, we recommend that you set the kyuubi_java_opts parameter to 10g or a larger value, and set the kyuubi_beeline_opts parameter to 2g or a larger value.
Spark
The default memory size of Spark is small. To change the default memory size, you can add memory configuration options to the spark-submit command.
You can modify the following parameters that are configured for Spark based on the scale of the EMR cluster that you use: spark.driver.memory, spark.driver.memoryOverhead, and spark.executor.memory.
Important: Only EMR Hive nodes, EMR Spark nodes, and EMR Spark SQL nodes in DataWorks can be used to generate lineages. EMR Hive nodes can be used to generate table-level and column-level lineages. Spark-based EMR nodes can be used to generate only table-level lineages.
For more information about how to configure Spark, see Spark memory management.
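As an illustration of passing the three Spark memory parameters on the command line, the following Python sketch assembles a spark-submit invocation that sets each parameter through the standard --conf key=value option. The helper name build_spark_submit, the application file name, and the memory values are hypothetical placeholders, not recommendations. Choose values based on the scale of your EMR cluster.

```python
import shlex


def build_spark_submit(app, driver_memory, driver_memory_overhead,
                       executor_memory, app_args=()):
    """Return a spark-submit argument list that overrides Spark memory defaults.

    Each setting is passed with spark-submit's --conf key=value option.
    """
    return [
        "spark-submit",
        "--conf", f"spark.driver.memory={driver_memory}",
        "--conf", f"spark.driver.memoryOverhead={driver_memory_overhead}",
        "--conf", f"spark.executor.memory={executor_memory}",
        app,
        *app_args,
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    command = build_spark_submit("job.py", "4g", "1g", "8g")
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(command))
```

Returning a list instead of a single string keeps each argument separate, so the result can be passed directly to subprocess.run without shell quoting issues.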
HDFS
You can modify the following parameters that are configured for HDFS based on the scale of the EMR cluster that you use: hadoop_namenode_heapsize, hadoop_datanode_heapsize, hadoop_secondary_namenode_heapsize, and hadoop_namenode_opts.