DataWorks: Data development and execution

Last Updated: Jan 27, 2026

This topic provides answers to frequently asked questions about data development.

Call third-party packages in PyODPS

You must use an exclusive resource group for scheduling to perform this operation. For details, see Use third-party packages and custom Python scripts in PyODPS nodes.
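
The sketch below is a minimal, hedged example of what a PyODPS node might look like once a third-party package is available on the exclusive resource group; the package name (requests) and the table name (my_table) are placeholders, not part of the referenced topic.

    # Minimal sketch for a PyODPS node. It assumes the third-party package
    # "requests" has already been installed on the exclusive resource group
    # for scheduling (for example, through a pip command run on that group).
    import requests

    # "o" is the MaxCompute entry object that DataWorks provides in PyODPS nodes.
    print(requests.__version__)          # confirms the package can be imported
    print(o.exist_table('my_table'))     # hypothetical table name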

Control table data download

To download data in DataWorks, the download feature must be enabled. If the download option is not visible, the feature is disabled for the current workspace. Contact the workspace owner or administrator to enable it in the workspace management settings.

After a query is executed, a download button appears in the bottom-right corner of the result section.

Due to engine limitations, the DataWorks console supports downloading a maximum of 10,000 records.

Download more than 10,000 records

Use the MaxCompute Tunnel command. See Use SQLTask and Tunnel to export a large amount of data.
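
If you prefer the PyODPS SDK to the command-line Tunnel client, a roughly equivalent approach is to run the query as an instance and read the full result through the instance tunnel. The following is a hedged sketch; the credentials, endpoint, and table name are placeholders.

    # Hedged PyODPS sketch: export a query result that exceeds 10,000 records
    # by reading it through the instance tunnel instead of the console download.
    from odps import ODPS

    o = ODPS('<access_key_id>', '<access_key_secret>',
             project='<project_name>', endpoint='<endpoint>')  # placeholders

    instance = o.execute_sql('SELECT * FROM my_table')  # hypothetical table
    # limit=False lifts the record limit and requires download permission.
    with instance.open_reader(tunnel=True, limit=False) as reader:
        for record in reader:
            print(record.values)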

EMR table creation error

  • Possible cause: The ECS cluster hosting EMR lacks the required security group rules. You must add security group policies when registering an EMR cluster; otherwise, table creation may fail. Add the following policies:

    • Authorization Policy: Allow

    • Protocol Type: Custom TCP

    • Port Range: 8898/8898

    • Authorization Object: 100.104.0.0/16

  • Solution: Check the security group configuration of the ECS cluster where EMR is located and add the required policies.

Use resources within a node

Right-click the target resource and select Reference Resource.

Download resources uploaded to DataWorks

Right-click the target resource and select View Version History.

Upload resources over 30 MB

Resources larger than 30 MB must be uploaded using the Tunnel command. After uploading, add them to DataWorks via the MaxCompute resource feature. See Use Tunnel to upload and download data and Use resources uploaded via odpscmd.
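
If you work with the PyODPS SDK rather than the command-line client, a similar effect can be achieved by creating the resource programmatically. The sketch below is illustrative; the credentials and file name are placeholders.

    # Hedged PyODPS sketch: upload a large local JAR as a MaxCompute resource,
    # then add it in DataWorks through the MaxCompute resource feature.
    from odps import ODPS

    o = ODPS('<access_key_id>', '<access_key_secret>',
             project='<project_name>', endpoint='<endpoint>')  # placeholders

    with open('udf-deps.jar', 'rb') as f:  # hypothetical local file
        o.create_resource('udf-deps.jar', 'jar', file_obj=f)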

Use resources uploaded via odpscmd

To use resources uploaded via odpscmd in DataWorks, add the resources to the MaxCompute resources section in DataStudio.

Upload and execute local JAR

You must upload the JAR file as a resource in DataStudio. To use the resource in a node, right-click the resource and select Reference Resource. A comment will automatically appear at the top of the node code. You can then execute the resource by its name.

Example: In a Shell node:

    ##@resource_reference{"test.jar"}
    java -jar test.jar

Use MaxCompute table resources

DataWorks does not support uploading MaxCompute table resources directly in the console. For an example of referencing a table resource, see Example: Reference a table resource. To use MaxCompute table resources in DataWorks, follow these steps:

  1. Add the table as a resource in MaxCompute by executing the following SQL statement. For details, see Add resources.

    add table <table_name> [partition (<spec>)] [as <alias>] [comment '<comment>'] [-f];
  2. Create a Python resource in DataStudio. In this example, the resource `get_cache_table.py` iterates over the rows of the MaxCompute table resource. For the Python code, see Development Code; a hedged sketch is also provided after this list.

  3. Create a function named table_udf in DataStudio and configure it as follows:

    • Class Name: get_cache_table.DistCacheTableExample

    • Resource List: Select the get_cache_table.py file. Add table resources in Script Mode.

  4. After registering the function, you can construct test data and call the function. For details, see Usage Example.
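
The following is a hedged sketch of what the get_cache_table.py resource in step 2 might look like; it is not the official Development Code, and the table resource name (table_resource_1) and the string-to-string signature are assumptions.

    # Hypothetical sketch of get_cache_table.py. It assumes the table resource
    # added in step 1 is named table_resource_1.
    from odps.udf import annotate
    from odps.distcache import get_cache_table

    @annotate('string->string')
    class DistCacheTableExample(object):
        def __init__(self):
            # get_cache_table yields one tuple per row of the table resource.
            self.rows = list(get_cache_table('table_resource_1'))

        def evaluate(self, key):
            # Return the first row whose first column matches the input, if any.
            for row in self.rows:
                if str(row[0]) == key:
                    return str(row)
            return None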

Resource group configuration error

To resolve this issue, go to the Scheduling tab on the right side of the node configuration page. Locate Resource Group and select a resource group from the drop-down list. If no resource group is available, bind one to your workspace:

  1. Log in to the DataWorks console. Select the desired region and click Resource Group in the left-side navigation pane.

  2. Find the target resource group and click Associate Workspace in the Actions column.

  3. Find the created DataWorks workspace and click Associate in the Actions column.

After completing the above steps, go to Scheduling on the right side of the node editing page and select the scheduling resource group you want to use from the drop-down list.

Call between Python resources

Yes. A Python resource can call another Python resource provided both reside in the same workspace.
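
As an illustration, when both .py files are included in the same resource list (for example, when registering a function), one can usually be imported by the other by module name. The sketch below is assumption-laden; helper.py and its normalize function are hypothetical.

    # helper.py -- a Python resource (hypothetical)
    def normalize(value):
        return value.strip().lower() if value else value

    # main.py -- another Python resource that calls helper.py, assuming both
    # files are added to the function's resource list.
    from odps.udf import annotate
    import helper

    @annotate('string->string')
    class Normalize(object):
        def evaluate(self, value):
            return helper.normalize(value)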

Call custom functions

Yes. In addition to testing a function with the DataFrame map method, PyODPS supports calling custom functions that import third-party packages. For details, see Reference a third-party package in a PyODPS node.
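
As a hedged illustration, the sketch below calls a custom function through the DataFrame map method while making an uploaded third-party package available via the libraries argument; the resource name python-dateutil.zip, the table my_table, and its dt column are placeholders.

    # Hedged PyODPS sketch: a custom function used with DataFrame map that
    # imports a third-party package shipped as a MaxCompute archive resource.
    from odps.df import DataFrame

    # "o" is the MaxCompute entry object (provided in PyODPS nodes).
    df = DataFrame(o.get_table('my_table'))  # hypothetical table with a string column "dt"

    def parse_year(value):
        # Import inside the function so it is resolved where the function runs.
        from dateutil.parser import parse
        return parse(value).year

    # libraries= points to the uploaded third-party package resource.
    df.dt.map(parse_year).execute(libraries=['python-dateutil.zip'])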

PyODPS 3 Pickle error

If your code contains special characters, compress the code into a ZIP file before uploading it. Then, unzip and use the code within your script.
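
The sketch below shows one way the ZIP workaround could look inside a PyODPS node; the resource name my_code.zip, the module my_module, and its run function are placeholders, and it assumes the ZIP was uploaded as a file or archive resource.

    # Hedged sketch of the ZIP workaround in a PyODPS node.
    import os
    import sys
    import zipfile

    # "o" is the MaxCompute entry object available in PyODPS nodes.
    with o.open_resource('my_code.zip', mode='rb') as src, open('my_code.zip', 'wb') as dst:
        dst.write(src.read())        # copy the resource to the local working directory

    with zipfile.ZipFile('my_code.zip') as zf:
        zf.extractall('my_code')     # unzip the code that contains special characters

    sys.path.insert(0, os.path.abspath('my_code'))
    import my_module                 # placeholder module inside the ZIP
    my_module.run()                  # hypothetical entry point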

Delete MaxCompute resources

To delete a resource in Basic Mode, simply right-click the resource and select Delete. In Standard Mode, you must delete the resource in the development environment first, and then delete it in the production environment. The following example demonstrates how to delete a resource in the production environment.

Note

In Standard Mode, deleting a resource in DataStudio only removes it from the development environment. You must publish the deletion operation to the production environment to remove the resource from production.

  1. Delete the resource in the development environment. In the workflow, go to MaxCompute > Resource, right-click the resource, and select Delete. Click Confirm.

  2. Delete the resource in the production environment. Go to the deploy tasks page in DataStudio, set the change type filter to Offline, locate the deletion record, and click Deploy in the Actions column. Once published, the resource is removed from the production environment.

Spark-submit failure with Kerberos

  • Error Details: Class com.aliyun.datalake.metastore.hive2.DlfMetaStoreClientFactory not found

  • Cause: After Kerberos is enabled in the EMR cluster, the Driver classpath does not automatically include JAR files from the specified directory when running in YARN-Cluster mode.

  • Solution: Manually specify DLF-related packages.

    • Add the --jars parameter when submitting a task using spark-submit in YARN-Cluster mode. You must include the JAR packages required by your program as well as all JAR packages located in /opt/apps/METASTORE/metastore-current/hive2.

      To manually specify DLF-related packages in EMR Spark Node Yarn Cluster mode, refer to the following code.

      Important

      In YARN-Cluster mode, all dependencies in the --jars parameter must be separated by commas (,). Specifying directories is not supported.

      spark-submit --deploy-mode cluster --class org.apache.spark.examples.SparkPi --master yarn \
        --jars /opt/apps/METASTORE/metastore-current/hive2/aliyun-java-sdk-dlf-shaded-0.2.9.jar,/opt/apps/METASTORE/metastore-current/hive2/metastore-client-common-0.2.22.jar,/opt/apps/METASTORE/metastore-current/hive2/metastore-client-hive2-0.2.22.jar,/opt/apps/METASTORE/metastore-current/hive2/metastore-client-hive-common-0.2.22.jar,/opt/apps/METASTORE/metastore-current/hive2/shims-package-0.2.22.jar \
        /opt/apps/SPARK3/spark3-current/examples/jars/spark-examples_2.12-3.4.2.jar
    • When specifying DLF packages, configure an AccessKey pair with permissions to access DLF and OSS to prevent STS authentication errors. You may encounter one of the following error messages:

      • Process Output>>> java.io.IOException: Response{protocol=http/1.1, code=403, message=Forbidden, url=http://xxx.xxx.xxx.xxx/latest/meta-data/Ram/security-credentials/}

      • at com.aliyun.datalake.metastore.common.STSHelper.getEMRSTSToken(STSHelper.java:82)

      Task level: Add the following parameters in the advanced configuration section on the right side of the node. (To apply this globally, configure the Spark global parameters in the cluster service configuration.)
      "spark.hadoop.dlf.catalog.akMode":"MANUAL",
      "spark.hadoop.dlf.catalog.accessKeyId":"xxxxxxx",
      "spark.hadoop.dlf.catalog.accessKeySecret":"xxxxxxxxx"

Restore a deleted node

You can restore deleted nodes from the Recycle Bin.

View node version

Open the node configuration tab to view the node version.

Important

Versions are generated only after the node is submitted.

Clone a workflow

You can use the Node Group feature. See Use node groups.

Export workspace node code

You can use the Migration Assistant feature. See Migration Assistant.

View node submission status

To view the submission status, go to DataStudio > Workflow and expand the target workflow. If the submission icon appears to the left of the node name, the node has been submitted; if no icon appears, the node has not been submitted.

Batch configure scheduling

DataWorks does not support batch scheduling configuration for workflows or nodes. You cannot configure scheduling parameters for multiple nodes at once; you must configure them for each node individually.

Instance impact of deleted nodes

The scheduling system generates instances based on the configured schedule. Deleting a task does not delete its historical instances. However, if an instance of a deleted task is triggered to run, it will fail because the associated code cannot be found.

Overwrite logic for modified nodes

Submitting a modified node does not overwrite previously generated instances. Pending instances will run using the latest code. However, if scheduling parameters are changed, you must regenerate the instance for the changes to take effect.

Visually create a table

You can create tables visually in DataStudio, Table Management, or within the table folder of a workflow.

Add fields to production table

The workspace owner can add fields to a production table on the Table Management page and submit the changes to the production environment.

RAM users must hold the O&M or Project Administrator role to add fields to a production table on the Table Management page and submit the changes to the production environment.

Delete a table

Delete a development table: Delete the table directly in DataStudio.

Delete a production table:

  • Delete the table from My Data in Data Map.

  • Alternatively, create an ODPS SQL node and execute a DROP statement. For details on creating an ODPS SQL node, see Develop an ODPS SQL task. For the syntax used to delete a table, see Table operations.

Upload local data to MaxCompute

In DataStudio, use the Import Table feature to upload local data.

Access production data in dev

In Standard Mode, use the format project_name.table_name to query production data within DataStudio.

If you upgraded from Basic Mode to Standard Mode, you must first apply for Producer role permissions before using project_name.table_name. For details on applying for permissions, see Request permissions on tables.

Obtain historical execution logs

In DataStudio, you can view historical logs in the Operation History pane on the left side.

Data development history retention

The execution history in DataStudio is retained for 3 days.

Note

For the retention period of logs and instances in the Production O&M Center, please refer to: How long are logs and instances retained?

Batch modify attributes

Use the Batch Operation tool in the left-side navigation pane of DataStudio to modify nodes, resources, and functions in bulk. After modification, submit and publish the changes to the production environment.

Batch modify scheduling resource groups

Click Resource Group Orchestration next to the workflow name in DataStudio to batch modify scheduling resource groups. Submit and publish the changes to the production environment.

Power BI connection errors

MaxCompute does not support direct connections to Power BI. We recommend using Interactive Analysis (Hologres) instead. For details, see Endpoints.

OpenAPI access forbidden error

OpenAPI availability depends on the DataWorks edition. You must activate DataWorks Enterprise Edition or Flagship Edition. For details, see Overview.

Obtain Python SDK examples

Click Debug on the API page to view Python SDK examples.

Disable ODPS acceleration mode

To obtain an instance ID, you must disable the acceleration mode.

Note

DataWorks allows downloading up to 10,000 records. To download more records using Tunnel, an instance ID is required.

To disable acceleration mode, add set odps.mcqa.disable=true; in the ODPS SQL node editor and execute it together with the SELECT statement.

Exception: [202:ERROR_GROUP_NOT_ENABLE]:group is not available

The error Job Submit Failed! submit job failed directly! Caused by: execute task failed, exception: [202:ERROR_GROUP_NOT_ENABLE]:group is not available occurs during task execution.

Possible Cause: The bound resource group is unavailable.

Solution: Log in to the DataWorks console and go to Resource Groups. Ensure the resource group status is Running. If it is not, restart the resource group or switch to a different one.