After you create a workspace, add a data source or register a cluster to the workspace based on the database, data warehouse, or cluster that you want to use. You can then use the data source or cluster for operations such as data synchronization, data analysis and development, and task scheduling. This topic describes the environment preparations that you must complete in a workspace before you can develop data: adding a data source or registering a cluster, and associating the data source with DataStudio for scheduling. In this topic, a formal development environment is used.
Background information
In a DataWorks workspace, you can synchronize data and develop data based on data sources or clusters.
Data sources
DataWorks allows you to add various types of data sources. After you add a data source to a DataWorks workspace, you can use the data source to synchronize data in the workspace. For more information about the data sources that you can use to synchronize data, see Data source list.
You can use only the following types of data sources for data development: MaxCompute, Hologres, AnalyticDB for PostgreSQL, AnalyticDB for MySQL 3.0, and ClickHouse. If you want to use one of the preceding types of data sources for data development, task scheduling, and data analysis, you must associate the data source with DataStudio after you add the data source to DataWorks.
Clusters
DataWorks allows you to register an E-MapReduce (EMR) cluster, a Cloudera's Distribution Including Apache Hadoop (CDH) cluster, or a Cloudera Data Platform (CDP) cluster to DataWorks. After the cluster is registered, you can perform operations such as data development, task scheduling, and data analysis in the current workspace based on the cluster. If you want to run a data synchronization task based on a component of a cluster, you must add the component to DataWorks as a data source. For more information, see Supported data source types and synchronization operations.
For more information about data sources or clusters, see Add and manage data sources.
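The rules in this section can be summarized as a small lookup. The following Python sketch is only an illustration of the text above, not a DataWorks API: the function name `supported_operations` and the constants are hypothetical, and the sketch assumes that any data source type outside the five listed types appears in the Data source list and can therefore be used for synchronization only.

```python
from __future__ import annotations

# Hypothetical names for illustration; none of this is a DataWorks API.
# Data source types that this topic lists as usable for data development.
DEVELOPMENT_DATA_SOURCES = frozenset({
    "MaxCompute",
    "Hologres",
    "AnalyticDB for PostgreSQL",
    "AnalyticDB for MySQL 3.0",
    "ClickHouse",
})
# Cluster types that this topic lists as registrable.
CLUSTER_TYPES = frozenset({"EMR", "CDH", "CDP"})

DEVELOPMENT_OPERATIONS = frozenset({
    "data development",
    "task scheduling",
    "data analysis",
})


def supported_operations(resource_type: str) -> frozenset[str]:
    """Return the operations this topic describes for a data source or cluster type."""
    if resource_type in CLUSTER_TYPES:
        # Synchronization on a cluster additionally requires adding a
        # cluster component to DataWorks as a data source.
        return DEVELOPMENT_OPERATIONS
    if resource_type in DEVELOPMENT_DATA_SOURCES:
        return DEVELOPMENT_OPERATIONS | {"data synchronization"}
    # Assumption: every other type is a synchronization-only data source.
    return frozenset({"data synchronization"})
```

For example, `supported_operations("Hologres")` includes data development, whereas a type outside both sets yields only data synchronization.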
Prerequisites
A workspace is created. For more information, see Create a workspace.
The Alibaba Cloud services to which the required compute engines belong are activated. For more information, see the documentation on the official website of each Alibaba Cloud service.
Permissions
You can add a data source or register a cluster to DataWorks only if you have the required permissions. If you do not have the required permissions, an error message appears when you add a data source or register a cluster. You must first apply for the permissions that are specified in the error message. The required permissions vary based on the type of the compute engine.
The following figure shows the permissions that are required for adding a MaxCompute data source to DataWorks.
Step 1: Add a data source or register a cluster
After you create a workspace, you must add a data source of a required engine type or register a cluster to the current workspace for subsequent development operations.
Add a data source
Go to the Management Center page.
Log on to the DataWorks console. In the left-side navigation pane, click Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, go to the Data Sources page. On the Data Sources page, click Add Data Source to add a data source based on your business requirements.
You can use only the following types of data sources to develop data and schedule tasks: MaxCompute, Hologres, AnalyticDB for PostgreSQL, AnalyticDB for MySQL 3.0, and ClickHouse. To add these types of data sources, refer to the topics that are listed in the following table.
Data source type
References
MaxCompute
Hologres
AnalyticDB for PostgreSQL
AnalyticDB for MySQL 3.0
ClickHouse
After you add a data source, you can use the data source to synchronize data. For more information, see Overview.
If you want to use a data source for data development, data analysis, or periodic task scheduling, proceed to Step 2: Associate the data source with DataStudio.
Register a cluster
Go to the Management Center page.
Log on to the DataWorks console. In the left-side navigation pane, click Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, click Open Source Clusters. On the Open Source Clusters page, click Register Cluster to register a cluster based on your business requirements. To register a cluster, refer to the topics that are listed in the following table.
Cluster type
References
EMR
CDH or CDP
After you register a cluster to DataWorks, you can use the cluster to perform operations such as data development, periodic task scheduling, and data analysis.
If you want to run a data synchronization task based on a component of a cluster, you must add the component to DataWorks as a data source. For more information, see Supported data source types and synchronization operations.
Step 2: Associate the data source with DataStudio
After you add a data source to a DataWorks workspace, you must associate the data source with DataStudio in that workspace before you can use the data source for data development, data analysis, or periodic task scheduling in Operation Center. For more information, see Preparations before data development: Associate a data source or cluster.
After you register an EMR, CDH, or CDP cluster to DataWorks, DataWorks automatically associates the cluster with DataStudio. You can therefore use the cluster to develop tasks in the current workspace without manual association.
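The two steps in this topic can be read as a short checklist that depends on the resource type. The following Python sketch is a hedged illustration only: `preparation_steps` and its parameters are hypothetical names, the steps are performed in the DataWorks console rather than through this code, and the sketch assumes that a data source type outside the five listed types cannot be associated with DataStudio.

```python
from __future__ import annotations

# Hypothetical names for illustration; none of this is a DataWorks API.
DEVELOPMENT_DATA_SOURCES = frozenset({
    "MaxCompute",
    "Hologres",
    "AnalyticDB for PostgreSQL",
    "AnalyticDB for MySQL 3.0",
    "ClickHouse",
})
CLUSTER_TYPES = frozenset({"EMR", "CDH", "CDP"})


def preparation_steps(resource_type: str, for_development: bool = True) -> list[str]:
    """List the console steps this topic describes for one resource type."""
    if resource_type in CLUSTER_TYPES:
        # A registered cluster is associated with DataStudio automatically,
        # so Step 2 is not needed.
        return [f"Register the {resource_type} cluster"]

    steps = [f"Add the {resource_type} data source"]
    if for_development:
        if resource_type not in DEVELOPMENT_DATA_SOURCES:
            # Assumption: other types are usable for synchronization only.
            raise ValueError(
                f"{resource_type} data sources cannot be used for data development"
            )
        steps.append("Associate the data source with DataStudio")
    return steps
```

Calling `preparation_steps("MaxCompute")` returns both steps, whereas `preparation_steps("EMR")` returns only the registration step because the association is automatic.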