Before you can develop and manage MaxCompute tasks in DataWorks, you must add a MaxCompute project to the desired DataWorks workspace as a data source. This way, you can use the MaxCompute data source in different services of DataWorks and perform operations such as data synchronization, data development, and data analysis based on the MaxCompute data source.
Prerequisites
MaxCompute is activated. For more information, see Activate MaxCompute.
Note: We recommend that you create a MaxCompute project in the same region as the workspace to which you want to add a MaxCompute data source. If the regions are different, you can add only a cross-region data source to the workspace. A cross-region data source cannot be associated with DataStudio for data development or periodic task scheduling and can be used only for data synchronization.
The required resource group is purchased and configured.
After the MaxCompute data source is added, you can use the data source in scenarios such as data synchronization, development and scheduling of computing tasks, and generation of DataService Studio APIs. In these scenarios, a resource group for Data Integration, a resource group for scheduling, and a resource group for DataService Studio of DataWorks are separately required.
You must purchase and configure the required resource group based on the use scenario of the MaxCompute data source and establish a network connection between the data source and resource group in advance. For information about resource groups provided by DataWorks and how to select a resource group, see Overview.
A DataWorks workspace is created, or the account that you use is added to the desired workspace as a member.
You must add the desired MaxCompute project to the workspace as a data source. This way, you can use the data source to perform data development operations in the workspace. In addition, you must associate the purchased resource group with the workspace and establish a network connection between the resource group and data source. For information about how to create and manage a workspace, see Create and manage workspaces.
Note: You can add the same MaxCompute project to multiple workspaces as a data source.
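The pairing between use scenarios and resource group types described in the prerequisites can be sketched as a small lookup. This is only an illustration: the scenario keys and the function name `required_resource_groups` are hypothetical and are not part of any DataWorks API. Only the three resource group types come from the text above.

```python
# Hypothetical lookup table. The scenario keys are illustrative names;
# the resource group types are the ones named in the prerequisites.
RESOURCE_GROUP_FOR_SCENARIO = {
    "data_synchronization": "resource group for Data Integration",
    "task_development_and_scheduling": "resource group for scheduling",
    "dataservice_studio_api": "resource group for DataService Studio",
}


def required_resource_groups(scenarios):
    """Return the sorted, de-duplicated resource group types for the scenarios."""
    unknown = [s for s in scenarios if s not in RESOURCE_GROUP_FOR_SCENARIO]
    if unknown:
        raise ValueError(f"unknown scenario(s): {unknown}")
    return sorted({RESOURCE_GROUP_FOR_SCENARIO[s] for s in scenarios})


if __name__ == "__main__":
    # A workspace that synchronizes data and schedules tasks needs two types.
    for group in required_resource_groups(
        ["data_synchronization", "task_development_and_scheduling"]
    ):
        print(group)
```

A checklist like this can help you confirm which resource groups to purchase before you add the data source.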
Limits
A MaxCompute data source can be associated with DataStudio, and therefore used for data development and periodic task scheduling, only if the underlying MaxCompute project resides in the same region and belongs to the same Alibaba Cloud account as the workspace.
You can add a MaxCompute project that does not belong to the current Alibaba Cloud account to a workspace within the current Alibaba Cloud account as a data source. After the data source is added, you can use only a RAM role to access the related MaxCompute project. MaxCompute data sources that are added across accounts cannot be used for data development or periodic task scheduling. For more information, see Scenario: Add a data source across accounts.
Only workspace members that are assigned the Deploy or Workspace Administrator role can add data sources. For information about how to assign the roles to a member, see Add a RAM user to a workspace as a member and assign roles to the member.
Note: In addition to the preceding workspace-level roles, you must also have the required permissions on the MaxCompute side to add a MaxCompute data source. You can grant these permissions by following the instructions shown in the DataWorks console. For more information, see the Permission description section in this topic.
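The region and account limits above can be summarized as a small decision function. The following is a minimal sketch for illustration only: the `Placement` type and the `usable_services` function are hypothetical names, not part of DataWorks, and the logic only restates the limits described in this topic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where a MaxCompute project or a workspace lives (illustrative type)."""
    region: str
    account_uid: str


def usable_services(project: Placement, workspace: Placement) -> set:
    """Return the DataWorks services in which the data source can be used."""
    same_region = project.region == workspace.region
    same_account = project.account_uid == workspace.account_uid
    # Data synchronization is available in every case covered by this topic.
    services = {"Data Integration"}
    # Association with DataStudio requires the same region AND the same account.
    if same_region and same_account:
        services |= {"DataStudio", "Operation Center"}
    return services


if __name__ == "__main__":
    workspace = Placement(region="cn-hangzhou", account_uid="1000000000000001")
    cross_region = Placement(region="cn-shanghai", account_uid="1000000000000001")
    print(sorted(usable_services(cross_region, workspace)))
```

The region IDs and UIDs in the usage example are placeholder values.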
Permission description
Use a RAM user or RAM role to add a MaxCompute data source. If you want to use a RAM user or RAM role to add a MaxCompute data source, you must make sure that the RAM user or RAM role is granted the odps:ListProjects permission of MaxCompute and the permissions of the Super_Administrator role of the MaxCompute project.
Specify a RAM user or RAM role as the default access identity of a MaxCompute data source in the production environment.
If you want to set the default access identity of a MaxCompute data source in the production environment to an identity that is not the current logon account, such as another Alibaba Cloud account or another role, the account or role must be granted the permissions of the admin or Super_Administrator role. After the data source is added, the account or role is granted the permissions of the Role_Project_Scheduler role of the related MaxCompute project in the production environment. For information about how to configure the default access identity, see the "Add a data source" section in this topic.
The data of the MaxCompute data source added to the workspace in the production environment belongs to the default access identity that you specify for the MaxCompute data source when you add the data source in the production environment. If you want to use another account to access or perform operations on tables in the MaxCompute data source in the production environment, you must request the required permissions in Security Center. For more information, see Manage permissions on MaxCompute and Overview.
Note: You cannot perform fine-grained permission management on a workspace that is in basic mode. In this example, a MaxCompute data source is added to a workspace in standard mode.
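As an illustration of the odps:ListProjects permission mentioned in this section, the following sketch builds a custom RAM policy document as JSON. It assumes the general RAM policy layout (`Version`, `Statement`, `Effect`, `Action`, `Resource`). The `"*"` resource scope and the function name `list_projects_policy` are assumptions made for this example, so review the policy against your own security requirements before you use it. The Super_Administrator role is assigned inside the MaxCompute project and is not covered by this policy.

```python
import json


def list_projects_policy() -> str:
    """Build a RAM policy document that allows odps:ListProjects."""
    policy = {
        "Version": "1",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["odps:ListProjects"],
                # Assumption: the widest scope is used here for simplicity.
                "Resource": "*",
            }
        ],
    }
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    print(list_projects_policy())
```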
Entry points for adding a data source
Go to the Data Sources page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, go to the Data Sources page.
On the Data Sources page, click Add Data Source. In the Add data source dialog box, click MaxCompute. On the Add MaxCompute Data Source page, configure the parameters to add a MaxCompute data source.
You can also go to the Data Sources page in Data Integration to add a MaxCompute data source. On the Data Sources page in Data Integration, you can add a data source only to the production environment. After the data source is added, you must manage the data source on the Data Sources page in SettingCenter. You can go to Data Integration to view the types of data sources that you can add in this service.
Add a MaxCompute data source
If you have a MaxCompute project, you can add a MaxCompute data source to the current workspace based on the MaxCompute project. If you do not have a MaxCompute project, you must create a MaxCompute project in the MaxCompute console. For more information, see Create a MaxCompute project.
If you use a workspace in standard mode, you must add a data source separately in the development environment and production environment. For information about the workspace modes, see Differences between workspaces in basic mode and workspaces in standard mode.
If you add a MaxCompute data source to DataWorks based on an existing MaxCompute project, you must make sure that the account you use is granted the odps:ListProjects permission and assigned the Super_Administrator role of the MaxCompute project.
Perform the following steps to add a MaxCompute data source by using this method.
Configure the parameters in the Basic Information section.
Parameter
Description
Data Source Name
The name of the data source in DataWorks. The name must be unique within the current tenant.
Authentication Method
For a new data source, the value of this parameter is fixed as Alibaba Cloud Account and Alibaba Cloud RAM Role.
Note: For an existing data source that was added by using an AccessKey pair, we recommend that you change the value of this parameter to Alibaba Cloud Account and Alibaba Cloud RAM Role.
Alibaba Cloud Account
Specifies whether the MaxCompute project you want to use belongs to the current Alibaba Cloud account or another Alibaba Cloud account. Valid values:
Current Alibaba Cloud Account: The MaxCompute project belongs to the current Alibaba Cloud account.
Another Alibaba Cloud Account: The MaxCompute project belongs to another Alibaba Cloud account.
The other parameters that you must configure vary based on the value of the Alibaba Cloud Account parameter. For more information about how to configure these parameters, see the following descriptions for Other items.
Region
The region in which the MaxCompute project that you want to use resides.
Note: If the region that you selected is different from the region in which the workspace resides, you cannot associate the MaxCompute data source with DataStudio in the workspace after you add the MaxCompute project as a data source. As a result, the data source cannot be used in DataStudio or Operation Center and can be used only in Data Integration for data synchronization.
Other items (Set the Alibaba Cloud Account parameter to Current Alibaba Cloud Account)
If you set the Alibaba Cloud Account parameter to Current Alibaba Cloud Account, you must configure the following parameters:
MaxCompute Project Name: The name of the MaxCompute project that you want to add as a data source in the selected region.
Note: If you cannot select the desired MaxCompute project, assign the Super_Administrator role of the project to the current logon account. For more information, see the Permission description section in this topic.
Default Access Identity: The default access identity that is used to access the data source in the current workspace.
Development environment: The value of this parameter is fixed as Executor.
Production environment: The value of this parameter can be Alibaba Cloud Account, Alibaba Cloud RAM User, or Alibaba Cloud RAM Role.
Note: Only an Alibaba Cloud account, or a RAM user or RAM role to which the AdministratorAccess policy is attached, can select any access identity in the development environment and production environment.
Specify a RAM user or RAM role as the default access identity of a MaxCompute data source in the production environment.
If you want to set the default access identity of a MaxCompute data source in the production environment to an identity that is not the current logon account, such as another Alibaba Cloud account or another role, the account or role must be granted the permissions of the admin or Super_Administrator role. After the data source is added, the account or role is granted the permissions of the Role_Project_Scheduler role of the related MaxCompute project in the production environment.
The data of the MaxCompute data source added to the workspace in the production environment belongs to the default access identity that you specify for the MaxCompute data source when you add the data source in the production environment. If you want to use another account to access or perform operations on tables in the MaxCompute data source in the production environment, you must request the required permissions in Security Center. For more information, see Manage permissions on MaxCompute and Overview.
Other items (Set the Alibaba Cloud Account parameter to Another Alibaba Cloud Account)
If you set the Alibaba Cloud Account parameter to Another Alibaba Cloud Account, you must configure the following parameters:
UID of Alibaba Cloud Account: The UID of the Alibaba Cloud account to which the MaxCompute project you want to add as a data source belongs.
MaxCompute Project Name: The name of the MaxCompute project that you want to add to the current workspace as a data source.
RAM Role: The RAM role that you want to use to access the MaxCompute project. The RAM role that you select must meet the following conditions:
The RAM role is created within the Alibaba Cloud account that you selected.
The RAM role is assigned to the current logon account to allow DataWorks to access the MaxCompute project.
The RAM role is added to the MaxCompute project that you selected.
Note: For information about how to add a MaxCompute data source across accounts, see Scenario: Add a data source across accounts.
If the MaxCompute project that you selected and the workspace belong to different Alibaba Cloud accounts, you cannot associate the MaxCompute data source with DataStudio in the workspace after you add the MaxCompute project as a data source. As a result, the data source cannot be used in DataStudio or Operation Center and can be used only in Data Integration for data synchronization.
Endpoint
The configuration method for the endpoints that DataWorks uses to access the MaxCompute project that you want to add as a MaxCompute data source. The endpoints include the endpoint of the MaxCompute service and the endpoint of the Tunnel service that you can use to upload and download local data or data of cloud data sources. The following configuration methods are supported:
Auto Fit: DataWorks automatically matches endpoints based on actual situations. We recommend that you select this option.
Note: If the MaxCompute project that you selected and the workspace reside in different regions and you set the Endpoint parameter to Auto Fit, DataWorks reads and downloads data over the public endpoint of the MaxCompute service by default.
Custom Configuration: If you select this option, you must manually configure the endpoint of the MaxCompute service and the endpoint of the Tunnel service. The endpoints vary based on the region that you selected. For more information, see Endpoints.
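The two configuration methods for the Endpoint parameter can be sketched as a simple pre-check that you might run on your own notes before you fill in the form. The function name `validate_endpoint_config` and the example URLs are hypothetical. The sketch does not call DataWorks or MaxCompute and does not know the real endpoints for any region, so take the actual values from the Endpoints topic.

```python
from urllib.parse import urlparse


def validate_endpoint_config(mode, maxcompute_endpoint=None, tunnel_endpoint=None):
    """Return a list of problems found in an Endpoint configuration."""
    if mode == "Auto Fit":
        # DataWorks matches the endpoints automatically; nothing to check.
        return []
    if mode != "Custom Configuration":
        return [f"unknown configuration method: {mode!r}"]
    problems = []
    # Custom Configuration requires both endpoints to be entered manually.
    for label, value in (
        ("MaxCompute endpoint", maxcompute_endpoint),
        ("Tunnel endpoint", tunnel_endpoint),
    ):
        parsed = urlparse(value or "")
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"{label} is missing or is not an http(s) URL")
    return problems


if __name__ == "__main__":
    # Placeholder URLs only; these are not real MaxCompute endpoints.
    print(validate_endpoint_config(
        "Custom Configuration",
        maxcompute_endpoint="https://maxcompute.example.invalid/api",
        tunnel_endpoint=None,
    ))
```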
Test the network connectivity between the data source and a resource group.
Resource groups are classified into resource groups for Data Integration, resource groups for scheduling, and resource groups for DataService Studio based on the use scenarios. For more information about different types of resource groups, see Overview.
You can find the resource group that you want to use in the Connection Configuration section and test the network connectivity between the data source and resource group. If the network connectivity test fails, tasks that use the data source cannot be run.
Note: After the data source is added to DataWorks, DataWorks adds the default access identity that you selected to the underlying MaxCompute project and grants the identity the related permissions on the project. Until this authorization is complete, the network connectivity test may report a permission error. If this occurs, wait a moment after you save the data source and then run the test again.
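If you script checks around the data source, the "wait a moment and test again" advice in the preceding note can be sketched as a simple retry loop. This is an assumption-laden illustration: `test_connectivity` stands for any callable of your own that returns True when the test passes, and it is not a DataWorks API. The number of attempts and the delay are arbitrary values.

```python
import time


def wait_for_connectivity(test_connectivity, attempts=5, delay_seconds=10.0,
                          sleep=time.sleep):
    """Call test_connectivity() until it returns True or attempts run out.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if test_connectivity():
            return attempt
        # Pause between attempts, but not after the final one.
        if attempt < attempts:
            sleep(delay_seconds)
    return None


if __name__ == "__main__":
    # Simulated test that fails twice while authorization is still pending.
    outcomes = iter([False, False, True])
    result = wait_for_connectivity(lambda: next(outcomes), delay_seconds=0.1)
    print(f"succeeded on attempt {result}")
```

The `sleep` parameter is injectable so that the loop can be exercised without real delays.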
What to do next
Before you start data development, we recommend that you read Usage notes for development of MaxCompute tasks in DataWorks. The topic describes the procedure for using MaxCompute in DataWorks, the fees for data development with MaxCompute, environment preparation, and permission management.
After the data source is added, you can perform the following operations based on your business requirements:
Develop and schedule computing tasks:
DataWorks DataStudio and Operation Center provide the capabilities of developing and scheduling MaxCompute tasks. If you want to develop MaxCompute tasks based on the MaxCompute data source or periodically schedule MaxCompute tasks, you must go to the DataStudio page in the DataWorks console and associate the MaxCompute data source with DataStudio.
Note: You can associate a MaxCompute data source with DataStudio only if the underlying MaxCompute project resides in the same region and belongs to the same Alibaba Cloud account as the workspace to which the data source is added.
Synchronize data: DataWorks Data Integration provides MaxCompute Reader and MaxCompute Writer for you to read data from and write data to the MaxCompute data source. Based on your business requirements, you can configure a batch or real-time synchronization task for the MaxCompute data source in DataStudio, or configure a synchronization task for the MaxCompute data source in Data Integration.
Manage the data source: You can go to the Data Sources page in SettingCenter to manage the data source. For example, you can edit or delete the data source.