To work with data in a database or data warehouse in DataWorks, you must first add the database or data warehouse to DataWorks as a data source on the Data Source page in Management Center in the DataWorks console, and then associate the data source with the DataWorks service in which you want to use it. For example, to synchronize data from a MaxCompute project, you must add the MaxCompute project to DataWorks as a data source. Then, when you configure a synchronization task in Data Integration, you can select the data source as the source or destination of the task.
Background information
Since October 20, 2023, the MaxCompute, Hologres, AnalyticDB for PostgreSQL, AnalyticDB for MySQL, and ClickHouse compute engines are gradually being managed as data sources, and the E-MapReduce (EMR) and Cloudera's Distribution including Apache Hadoop (CDH) or CDP compute engines are gradually being managed as open source clusters. This change is intended to improve user experience. After the change takes effect, you must perform operations that are related to compute engines, such as creating and modifying compute engines, on the Data Source page or Open Source Clusters page in the DataWorks console. For more information, see Notice for a new version of DataWorks data sources.
Permission management
Only the following users can add data sources: workspace members to which the O&M or Workspace Administrator role is assigned, and RAM users to which the AliyunDataWorksFullAccess or AdministratorAccess policy is attached. For information about the authorization, see Manage permissions on workspace-level services and Grant permissions to a RAM user.
In addition to the preceding permissions, other permissions may also be required for adding specific types of data sources. You can perform the authorization based on the instructions displayed in the DataWorks console.
Data source isolation
A workspace in standard mode supports the data source isolation feature. You can add a data source separately in the development environment and production environment. This way, the data source used for testing and the data source used for task scheduling in the production environment are isolated. This can ensure data security in the production environment. For more information, see Appendix: Environments of data sources.
Data sources in the development environment: You can select such a data source when you create a synchronization task. Then, you can run the synchronization task in the development environment. You cannot commit the synchronization task to the production environment or run the synchronization task in the production environment.
Data sources in the production environment: You can use such a data source only in the production environment. You cannot select such a data source when you configure a synchronization task.
Supported data source types
For information about the data source types that are supported by DataWorks, see Supported data source types and synchronization operations. The following types of data sources are mainly used for scheduling tasks: MaxCompute, Hologres, AnalyticDB for PostgreSQL, AnalyticDB for MySQL V3.0, ClickHouse, EMR, and CDH/CDP.
Take note of the following items about CDH/CDP and EMR clusters:
If you want to use a component, such as Hive, of a cluster in DataWorks, you can add the component to DataWorks as a data source on the Data Source page.
If you want to schedule tasks based on a cluster in DataWorks, you must register the cluster to DataWorks. For more information, see Register an EMR cluster to DataWorks or Preparations: Obtain configuration information about a CDH or CDP cluster and configure network connectivity.
The data sources that can be used for different modules of DataWorks vary.
Add a data source
Go to the Management Center page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, choose Data Source > Data Sources to go to the Data Source page in Management Center.
On the Data Sources page, you can click Add Data Source or Batch Add Data Sources based on your business requirements.
Note: For information about the data source types that are supported by DataWorks, see the Supported data source types section in this topic.
Add a data source
Click Add Data Source. In the Add Data Source dialog box, click the desired data source type. On the page that appears, configure the parameters to add a data source of the selected type. The parameters that you must configure vary based on the data source type. You can view the infotip of each parameter on the configuration page of the related data source.
Optional. Test the connectivity of the resource group.
In the Connection Configuration section of the Add Data Source dialog box, find the resource group that is associated with the workspace and click Test Network Connectivity in the Connection Status column.
Note: For more information about resource groups, see Overview.
If Connected is displayed in the Connection Status column, click Complete Creation.
If Connection failed is displayed in the Connection Status column, the resource group cannot be connected to the data source. In this case, tasks that use the data source cannot be run.
You can click Self-service Troubleshoot to troubleshoot connectivity issues in the Network Connectivity Diagnostic Tool panel. If the connectivity diagnostics tool does not provide a solution, check the parameters that you configure, such as the account, password, and connection address, and make sure that the IP address of the resource group is added to the IP address whitelist of the data source. For more information, see Network connectivity.
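When you check the connection address by hand, a quick TCP probe can help rule out a wrong host or port. The following is a minimal Python sketch and is not part of DataWorks: can_reach is a hypothetical helper, and the probe runs from the machine on which you execute it, not from the resource group. A successful result therefore does not prove that the resource group can connect or that the IP address whitelist is configured correctly.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host name and opens a TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example call (the host name is a placeholder, not a real endpoint):
# can_reach("db.example.internal", 3306)
```

If the probe fails from a machine that is in the same network as the resource group, the address or port in the data source configuration is a likely cause. If it succeeds there but the test in the console still fails, check the account, the password, and the whitelist.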
Add multiple data sources at a time
Click Batch Add Data Sources and perform the following operations. Batch adding is supported only for MySQL, PolarDB, SQL Server, and Oracle data sources.
In the Batch Add Data Sources dialog box, select the desired data source type and download the configuration template for this data source type.
The information that you must configure in the template varies based on the value of the Data Source Type parameter. You can set the Data Source Type parameter to Connection Mode or Instance Mode. You can view the information that you must configure in the DataWorks console.
Configure data source information in the template.
After the data source information is configured, upload the template. Then, the system adds the data sources to DataWorks in a batch based on the information in the template.
While the system adds the data sources, you can view the progress and details in the Batch Add Data Sources dialog box. If specific data sources fail to be added, you can troubleshoot the issue based on the error message.
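Checking the filled-in template before you upload it can reduce the number of data sources that fail to be added. The following Python sketch shows one possible pre-check. It assumes that the template is saved as CSV text and uses hypothetical column names (name, host, port, database, username). The template that you download from the console defines its own format and fields, so adjust the sketch accordingly.

```python
import csv
import io

# Hypothetical column names; the real template defines its own fields.
REQUIRED = ("name", "host", "port", "database", "username")


def check_rows(csv_text):
    """Return a list of human-readable problems found in the template rows."""
    problems = []
    seen_names = set()
    reader = csv.DictReader(io.StringIO(csv_text))
    for lineno, row in enumerate(reader, start=2):  # line 1 is the header
        # Every required field must be present and non-empty.
        for col in REQUIRED:
            if not (row.get(col) or "").strip():
                problems.append(f"line {lineno}: missing {col}")
        # A port, if given, must consist of digits only.
        port = (row.get("port") or "").strip()
        if port and not port.isdigit():
            problems.append(f"line {lineno}: port is not a number")
        # Data source names must not repeat within the file.
        name = (row.get("name") or "").strip()
        if name in seen_names:
            problems.append(f"line {lineno}: duplicate name {name}")
        elif name:
            seen_names.add(name)
    return problems
```

An empty result means that none of these basic checks failed. It does not guarantee that the upload succeeds.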
DataWorks allows you to add a data source in connection string mode or Alibaba Cloud instance mode. You can select a mode based on your business requirements. The parameters that you must configure vary based on the mode that you select.
If you add a data source in connection string mode, DataWorks parses the JDBC URL of the data source. If the JDBC URL contains parameters that are not supported by DataWorks, DataWorks automatically removes the parameters. If you want to retain the unsupported parameters in the JDBC URL, submit a ticket to contact technical personnel.
You can configure different data source information for the development environment and production environment by using the same data source name. Data source configurations in different environments are independent of each other.
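To see in advance which parameters might be removed from a JDBC URL in connection string mode, you can compare the URL against a list of parameters that you expect to be supported. The following Python sketch illustrates the idea only. The allowlist is hypothetical because this topic does not list the parameters that DataWorks supports, the parsing logic of DataWorks itself is not documented here, and the sketch assumes a MySQL-style URL in which parameters follow a question mark and are separated by ampersands.

```python
# Hypothetical allowlist; replace it with the parameters that apply to your data source.
SUPPORTED_PARAMS = {"useSSL", "characterEncoding", "serverTimezone"}


def strip_unsupported(jdbc_url, supported=SUPPORTED_PARAMS):
    """Return (cleaned_url, removed_names) for a MySQL-style JDBC URL."""
    base, sep, query = jdbc_url.partition("?")
    if not sep:
        # No query string, so there is nothing to remove.
        return jdbc_url, []
    kept, removed = [], []
    for item in query.split("&"):
        if not item:
            continue
        name = item.split("=", 1)[0]
        if name in supported:
            kept.append(item)      # keep the original "name=value" text
        else:
            removed.append(name)   # record only the parameter name
    cleaned = base + ("?" + "&".join(kept) if kept else "")
    return cleaned, removed


url = ("jdbc:mysql://db.example.internal:3306/sales"
       "?useSSL=false&allowLoadLocalInfile=true&characterEncoding=utf8")
cleaned_url, removed_names = strip_unsupported(url)
print(cleaned_url)
print(removed_names)
```

If the list of removed names contains a parameter that your tasks depend on, that is the situation in which you submit a ticket.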
Manage data sources
On the Data Source page, you can specify Data Source Type and Data Source Name to search for the data source that you want to manage. You can also perform the following operations on a data source.
Modify Data Source: You can modify the configuration information of a data source based on your business requirements. You cannot change the name or environment of a data source.
Delete Data Source: You can delete a data source that is no longer required. The following tables describe the impacts of deleting data sources in different environments.
Note: If you authorize a member in Workspace A to use a data source in Workspace B and you delete the data source, tasks that use the data source across the workspaces fail.
Impacts of data source deletion on Data Integration
Environment of the data source to be deleted: Development environment and production environment
Operation and impact: You must check whether the data source is being used by synchronization tasks in the production environment. The deletion operation is irreversible. If synchronization tasks configured for the data source are used in the production environment and you delete the data source, the following issues occur:
The synchronization tasks in the production environment cannot be run as expected. We recommend that you delete the data source only after the synchronization tasks are deleted.
The data source is not available when you configure a synchronization task in the development environment.
Solution that can be applied before data source deletion: Go to the Batch Operation-Data Development tab on the DataStudio page, change the data source used by the synchronization tasks in a batch, and then commit and deploy the synchronization tasks.
Environment of the data source to be deleted: Development environment
Operation and impact: You must check whether the data source is being used by synchronization tasks in the production environment. The deletion operation is irreversible. If synchronization tasks configured for the data source are used in the production environment and you delete the data source, the following issues occur:
The synchronization tasks in the production environment can be run as expected. However, you cannot obtain metadata information when you modify the synchronization tasks.
The data source is not available when you configure a synchronization task in the development environment.
Environment of the data source to be deleted: Production environment
Operation and impact: You must check whether the data source is being used by synchronization tasks in the production environment. If synchronization tasks configured for the data source are used in the production environment and you delete the data source, the following issues occur:
The synchronization tasks in the production environment cannot be run as expected. We recommend that you delete the data source only after the synchronization tasks are deleted.
If you configure a synchronization task for the data source in the development environment, you cannot commit or deploy the synchronization task to the production environment.
Impacts of data source deletion on other modules
Module: Operation Center
Risk level of the deletion operation: High
Impact: The running of related tasks fails.
Solution that can be applied before data source deletion: Go to the Batch Operation-Data Development tab on the DataStudio page, change the data source used by the synchronization tasks in a batch, and then commit and deploy the synchronization tasks.
Module: DataService Studio
Risk level of the deletion operation: High
Impact: Related tasks fail to call DataService Studio APIs.
Solution that can be applied before data source deletion: Change the data source of DataService Studio APIs.
Module: DataAnalysis
Risk level of the deletion operation: Medium
Impact: The running of related query tasks fails.
Affected object: Query tasks that are run in DataAnalysis.
Solution that can be applied before data source deletion: Change the data source for SQL queries.
Module: Data Quality
Risk level of the deletion operation: Medium
Impact: Errors occur when related tasks are checked.
Affected object: Tasks for which Data Quality monitoring rules are configured. For more information, see View monitoring results.
Solution that can be applied before data source deletion: Go to Operation Center and disassociate Data Quality monitoring rules from tasks. For more information, see View and manage auto triggered tasks.
Clone Data Source: You can use the cloning feature to quickly generate a new data source whose configuration information is the same as that of an existing data source.
Note: The name of the new data source must be different from that of the existing data source.
Permission Management: You can use the permission management feature to grant permissions on a data source in the current workspace to a member in another workspace. After the permissions are granted to the member, the member can view and use the data source but cannot modify the data source. For more information, see Manage permissions on data sources.
Note: If you grant permissions on a data source to a workspace, all members in the workspace can view and use the data source.
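Because deleting a data source is irreversible, it helps to list the tasks that still reference the data source before you use the Delete Data Source operation. The following Python sketch shows the check against a hand-maintained task inventory. The inventory structure (name, environment, data_sources) and the function name tasks_using are hypothetical and are not a DataWorks API. In practice, you would collect this information from the console.

```python
def tasks_using(data_source, tasks):
    """Group the names of tasks that reference data_source by environment."""
    usage = {"dev": [], "prod": []}
    for task in tasks:
        if data_source in task["data_sources"]:
            usage[task["environment"]].append(task["name"])
    return usage


# Hypothetical inventory of synchronization tasks.
inventory = [
    {"name": "sync_orders", "environment": "prod", "data_sources": ["orders_db", "dw"]},
    {"name": "sync_orders_test", "environment": "dev", "data_sources": ["orders_db"]},
    {"name": "sync_users", "environment": "prod", "data_sources": ["users_db"]},
]

usage = tasks_using("orders_db", inventory)
if usage["prod"]:
    print("Still used in production:", usage["prod"])
```

If the list for the production environment is not empty, change the data source of those tasks or delete the tasks before you delete the data source.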
Appendix: Environments of data sources
In a workspace in standard mode, the same data source has two different sets of configurations in the development environment and production environment. The configurations correspond to two databases or data warehouses at the underlying layer. You can configure different data source information for different environments. This way, the data source that is used for testing and the data source that is used for task scheduling in the production environment can be isolated, and data security in the production environment can be ensured. For example, if you specify different databases for the development environment and production environment when you add a data source, a batch synchronization task that uses the data source accesses different databases when you run the task. This way, the data in the development environment and the data in the production environment are isolated.
A workspace in basic mode provides only one environment and cannot isolate data. For more information about workspace modes, see Differences between workspaces in basic mode and workspaces in standard mode.
If you upgrade a workspace from the basic mode to the standard mode, the original data source is split into two, one for the development environment, and the other for the production environment. For more information, see Scenario: Upgrade a workspace from the basic mode to the standard mode.
In a workspace in standard mode, a task accesses different data sources when it is run in different environments:
When the task is run in DataStudio or in Operation Center in the development environment, the task accesses the data source in the development environment by default.
When the task is run in Operation Center in the production environment, the task accesses the data source in the production environment by default.
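The default behavior described above can be modeled as a lookup keyed by the data source name and the environment. The following Python sketch is only a conceptual model and does not show how DataWorks is implemented. The data source name orders_db, the connection values, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSourceConfig:
    jdbc_url: str
    username: str


# One data source name with two independent configurations (hypothetical values).
DATA_SOURCES = {
    ("orders_db", "dev"): DataSourceConfig("jdbc:mysql://dev-host:3306/orders_dev", "dev_user"),
    ("orders_db", "prod"): DataSourceConfig("jdbc:mysql://prod-host:3306/orders", "etl_user"),
}

# Where a task is run determines the environment that is used by default.
DEFAULT_ENVIRONMENT = {
    "DataStudio": "dev",
    "Operation Center (development)": "dev",
    "Operation Center (production)": "prod",
}


def resolve(name, run_location):
    """Return the configuration that a task uses by default at run_location."""
    return DATA_SOURCES[(name, DEFAULT_ENVIRONMENT[run_location])]


print(resolve("orders_db", "DataStudio").jdbc_url)
print(resolve("orders_db", "Operation Center (production)").jdbc_url)
```

The same task and the same data source name lead to different databases, which is why a task can succeed in DataStudio and still behave differently after deployment.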
When you add a data source, you must check whether the database or data warehouse to which the data source in the development environment or production environment corresponds meets your business requirements. If the configurations of the data source in the development environment and those of the data source in the production environment are different, such as different database usernames and passwords, the following issues may occur:
The related task is successfully run in DataStudio but fails to be scheduled in the production environment.
The volume of data that is generated when the related task is run in DataStudio differs from the volume of data that is generated when the task is scheduled to run in the production environment.
You can compare the operational logs generated in the development environment and production environment for the task to troubleshoot the issue.
If the configurations of a data source in the development environment and those of the data source in the production environment are different, you must make sure that your resource group can separately connect to the data source in the development environment and the data source in the production environment.
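When a task behaves differently in the two environments, listing the fields in which the two configurations differ is a useful first step alongside the log comparison. The following Python sketch assumes that you have copied both configurations into plain dictionaries by hand. The field names and the function name diff_configs are hypothetical. Secret values are masked so that the output can be shared.

```python
def diff_configs(dev, prod, secret_keys=("password",)):
    """Return {field: (dev_value, prod_value)} for every field whose values differ."""
    diffs = {}
    for key in sorted(set(dev) | set(prod)):
        dev_value, prod_value = dev.get(key), prod.get(key)
        if dev_value != prod_value:
            if key in secret_keys:
                # Report that the secrets differ without revealing them.
                dev_value, prod_value = "***", "***"
            diffs[key] = (dev_value, prod_value)
    return diffs


# Hypothetical configurations of the same data source in two environments.
dev_config = {
    "jdbc_url": "jdbc:mysql://dev-host:3306/orders_dev",
    "username": "etl_user",
    "password": "dev-secret",
}
prod_config = {
    "jdbc_url": "jdbc:mysql://prod-host:3306/orders",
    "username": "etl_user",
    "password": "prod-secret",
}

for field, (dev_value, prod_value) in diff_configs(dev_config, prod_config).items():
    print(f"{field}: dev={dev_value} prod={prod_value}")
```

Each field that differs, especially the connection address, is a candidate for a separate connectivity test from the resource group.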