All Products
Search
Document Center

DataWorks:Create and manage MaxCompute tables

Last Updated:Jul 12, 2024

You can create and manage MaxCompute tables by executing SQL statements in the MaxCompute client or by performing operations in the DataWorks console. It is more convenient to create and manage MaxCompute tables in the DataWorks console. This topic describes how to create and manage MaxCompute tables in the DataWorks console.

Prerequisites

A MaxCompute data source is added to DataWorks and is associated with DataStudio. For more information, see Add a MaxCompute data source and Preparations before data development: Associate a data source or a cluster with DataStudio.

Background information

  • MaxCompute table operation principles

    When you create MaxCompute tables or perform operations on MaxCompute tables in DataWorks, you must comply with the basic requirements for operations performed on MaxCompute tables. For example, you cannot delete fields in a MaxCompute table after you create the table. For more information, see Limits on MaxCompute table operations.

  • Changing of physical properties of MaxCompute tables

    You can change the physical properties of MaxCompute tables by executing MaxCompute SQL statements. For more information, see Table operations. Permission management is implemented when you access MaxCompute tables from DataWorks. If you access MaxCompute tables from DataWorks as a RAM user, an error may be reported in some scenarios. For more information, see Manage permissions on data in a MaxCompute compute engine instance.

  • Metadata update latency

    Latency may exist when you perform operations on metadata. After you create or update tables in DataWorks, you may not find the tables on the UI. After you remove tables from DataWorks, you may still find the tables on the UI. In these cases, you can use a manual synchronization tool in Data Map to manually synchronize the operation results. For more information, see Manual synchronization tool.

Go to the DataStudio page

Log on to the DataWorks console. In the top navigation bar, select the desired region. Then, choose Data Modeling and Development > DataStudio in the left-side navigation pane. On the page that appears, select the desired workspace from the drop-down list and click Go to DataStudio.

Create a MaxCompute table

You can create MaxCompute tables in the Workspace Tables pane of the DataStudio page in the DataWorks console. You can also create MaxCompute tables in a specific workflow in DataStudio. DataWorks allows you to separately create tables in the development environment and in the production environment by executing DDL statements or by using the codeless UI.

Entry points of creating tables

  • Create a MaxCompute table in the Workspace Tables pane

    In the left-side navigation pane of the DataStudio page, click Workspace Tables. In the Workspace Tables pane, click the 新建 icon to create a MaxCompute table. For more information about how to manage tables, see Manage tables. If the Workspace Tables module is not displayed in the left-side navigation pane, you can add the module. For more information about how to add a module to the DataStudio page, see Scenario: adjust the displayed DataStudio modules.

  • Create a MaxCompute table in a specific workflow in DataStudio

    DataWorks organizes and manages code by workflow. You can create tables related to the business in a specific workflow. If no workflow is available, you can create one. For information about how to create a workflow, see Create a workflow. If workflows are available, right-click the name of the desired workflow and choose Create Table > MaxCompute > Table. In the dialog box that appears, configure the required information and click Create.

Configure the basic information about the MaxCompute table

After you perform the preceding operations, you can configure the information about the table on the table configuration tab. On the tab, you can configure the information about the table by executing a DDL statement or by using the codeless UI in DataWorks.

  • Configure the table by executing a DDL statement

    You can click the ddl icon on the table configuration tab and execute a standard table creation statement of MaxCompute to generate the table schema. For information about the standard table creation statements of MaxCompute, see Table operations. After the table schema is generated, DataWorks automatically fills out the configurations for the table on the table configuration tab. You can use this method if you are accustomed to writing code to develop tables.

    Note
    • The name of the table is configured before you access the table configuration tab. You cannot change the table name in the DDL statement. If you change the table name in the DDL statement, an error is reported.

    • You can define only the physical properties of the table by executing the DDL statement. After the table is created and the configurations of the table are automatically filled out on the table configuration tab, you can go to the table configuration tab again to configure the business-related properties of the table.

  • Configure the table by using the codeless UI in DataWorks

    You can configure related properties of the table on the table configuration tab based on the configuration guide. You can use this method if you want to use the codeless UI. The following content shows how to configure the related properties of the table by using the codeless UI.

    1. Configure the parameters in the General section. 配置表基本属性The following table describes the parameters.

      Parameter

      Description

      Display Name

      The display name of the table.

      Theme

      The folders that are used to store and manage the table. You can specify the level-1 and level-2 folders to store the table. The Level-1 Theme and Level-2 Theme parameters can be used to categorize tables based on business categories. You can store tables of the same business category in the same folder.

      Note

      The level-1 and level-2 themes in the Workspace Tables pane of the DataStudio page help you better manage tables in folders. You can quickly find the current table in the Workspace Tables pane by theme. If no theme is available, you can create one. For information about how to create a theme, see the Create or manage folders for tables section of the "Manage settings for tables" topic.

    2. Configure the parameters in the Physical Model section. 物理模型设计The following table describes the parameters.

      Parameter

      Description

      Level

      The physical data layer to which the table belongs. By default, a data warehouse in DataWorks is divided into the following data layers: operational data store (ODS), dimension (DIM), data warehouse detail (DWD), data warehouse summary (DWS), and application data service (ADS). You can store the table at a specific data layer based on the business category.

      Note

      You can also create a data layer based on your business requirements. For more information, see Manage settings for tables.

      Category

      The business category to which the table belongs. Categories help you classify tables in a finer-grained manner based on your business requirements. For example, you can use a basic business category, an advanced business category, or a category for another purpose to classify your tables in most cases.

      Note

      You can also create custom categories based on your business requirements. For more information, see Manage settings for tables.

      TTL

      The time-to-live (TTL) period of the table.

      Partition Type

      Specifies whether the table is a partitioned or non-partitioned table. For more information about partitioned and non-partitioned MaxCompute tables, see Partition.

      Table Type

      Specifies whether the table is an internal or external table. If you query data in an internal table, the query speed is fast because real data is imported to the internal table. If you query data in a table by using an external table, you can query data without the need to import data to the external table. Less memory is used when you query data by using an external table. For more information about internal and external MaxCompute tables, see Table.

    3. Configure the parameters in the Schema section. 表结构The following table describes the parameters.

      Parameter

      Description

      Data Type

      The data type of a field. You can select only data types that are supported by MaxCompute from the Data Type drop-down list. For more information about the data types supported by MaxCompute, see Data type editions.

      Field Security Level

      The security level of a field. This parameter is required only if you enable the label-based access control in a MaxCompute project. For more information about the security levels of fields in a MaxCompute table, see Label-based access control.

      Primary Key Field

      The primary key of the table. MaxCompute tables do not support primary keys. In this case, the primary key you specify in this parameter is used to manage the table only from the business perspective.

Commit and deploy the MaxCompute table

After you define the schema of the table, you must commit the table to the development and production environments.

Note

Items that you need to take note of when you commit and deploy MaxCompute tables in workspaces in basic mode or workspaces in standard mode:

  • For workspaces in basic mode: You need to only commit MaxCompute tables to the production environment.

  • For workspaces in standard mode: Only users who are assigned the Workspace Administrator or O&M role can deploy MaxCompute tables to the production environment. If you want to deploy MaxCompute tables to the production environment, you must be granted the required permissions. For information about authorization, see Add a RAM user to a workspace as a member and assign roles to the member.

For more information about workspace modes, see Differences between workspaces in basic mode and workspaces in standard mode.

Operation

Description

Load from Development Environment

Load the table information from the development environment and present the table information on the current page.

Note

You can perform this operation only after the table is committed to the development environment. After you perform this operation, the table information in the development environment overwrites the table information on the current page.

Commit to Development Environment

Commit the table to the development environment of the workspace. This indicates that you must create the current table in the MaxCompute project that is added to DataWorks as a MaxCompute data source in the development environment. After the table is committed, you can perform the following operations:

  • View the schema of the table

    Develop a task on an ODPS SQL node in DataStudio, and run the desc tablename command to view the schema of the table.

  • View information about the MaxCompute project

    View information about the MaxCompute project that is added to DataWorks as a MaxCompute data source in the development environment and information about the created table. For more information, see Add a MaxCompute data source.

Load from Production Environment

Load the table information from the production environment and present the table information on the current page.

Note

You can perform this operation only after the table is committed to the production environment. After you perform this operation, the table information in the production environment overwrites the table information on the current page.

Commit to Production Environment

Commit the table to the production environment of the workspace. This indicates that you must create the current table in the MaxCompute project that is added to DataWorks as a MaxCompute data source in the production environment. After the table is committed, you can perform the following operations:

  • View the schema of the table

    Develop a task on an ODPS SQL node in DataStudio, and run the desc projectname.tablename command to view the schema of the table.

  • View information about the MaxCompute project

    View information about the MaxCompute project that is added to DataWorks as a MaxCompute data source in the production environment and information about the created table. For more information, see Add a MaxCompute data source.

Write data to and export data from the MaxCompute table

This section describes how to write data to and export data from the MaxCompute table.

Write data to the MaxCompute table

You can use a data synchronization task and a task on a MaxCompute node to write data to the MaxCompute table. DataWorks also allows you to use the data upload feature to import data from your on-premises machine to the MaxCompute table.

Export data from the MaxCompute table

After you query data in the MaxCompute table in DataStudio, you can use the download feature on the query result page to download the data to your on-premises machine. By default, you can download a maximum of 10,000 data records at a time. DataWorks allows tenant administrators and tenant security administrators to configure the maximum number of data records that can be downloaded at a time on the Data query and analysis control page in Security Center.

Note

If the number of data records that you want to export exceeds 10,000, you can run Tunnel commands on the MaxCompute client to export the data.

Query data in the MaxCompute table

You can use an ODPS SQL node or the ad hoc query feature to query data in the MaxCompute table by executing SQL statements.

Preset data access permissions

If you use a workspace in basic mode, fine-grained permission management and isolation of data between development and production environments are not supported. In this example, a workspace in standard mode is used.

The following table describes the preset data access permissions for RAM users that are added to a workspace as members.

Permission type

Description

Permissions on a MaxCompute project in the development environment

After you assign a RAM user a built-in workspace-level role in your workspace and associate a MaxCompute project with the workspace in the development environment, the RAM user is automatically granted the permissions of the mapped role of the MaxCompute project. By default, the RAM user has the permissions of the MaxCompute project in the development environment. However, the RAM user does not have the permissions of the MaxCompute project that is associated with the workspace in the production environment.

Permissions on a MaxCompute project in the production environment

The RAM user that is used as a scheduling access identity has high permissions on a MaxCompute project in the production environment. Other RAM users do not have permissions on the MaxCompute project in the production environment. To perform operations on MaxCompute tables in the production environment, you must go to Security Center to request the required permissions.

DataWorks provides a default request processing procedure. DataWorks also allows users that are granted administrative permissions to customize request processing procedures.

For more information about permission management for MaxCompute, see Manage permissions on data in a MaxCompute compute engine instance.

Data access behaviors

MaxCompute allows you to query tables across projects. You can query data in a MaxCompute project that is associated with a workspace in the production environment by specifying the project name on the DataStudio page. The following table describes the methods to query tables across projects and the accounts that can be used to access the tables in different environments.

Note
  • On the Data Source page in DataStudio, you can view the MaxCompute projects that are added to DataWorks as MaxCompute data sources in the development and production environments and the accounts that are used to configure environments for the MaxCompute projects. For more information, see Preparations before data development: Associate a data source or a cluster with DataStudio.

  • For a workspace in standard mode, the personal identity of a task executor is used to run MaxCompute tasks in the development environment by default, and an Alibaba Cloud account is used as the scheduling access identity to run MaxCompute tasks in the production environment. For more information, see Add a MaxCompute data source.

Sample code

Execution account in the development environment (DataStudio and Operation Center in the development environment)

Execution account in the production environment (Operation Center in the production environment)

Access tables in the MaxCompute project in the development environment:

select col1 from projectname_dev.tablename;

The personal Alibaba Cloud account of a task executor is used to access tables in the MaxCompute project in the development environment.

  • If a RAM user runs a task, the personal Alibaba Cloud account of the RAM user is used to access tables in the MaxCompute project in the development environment.

  • If an Alibaba Cloud account is used to run a task, the Alibaba Cloud account is used to access tables in the MaxCompute project in the development environment.

The scheduling access identity is used to access tables in the MaxCompute project in the development environment.

Access tables in the MaxCompute project in the production environment:

select col1 from projectname.tablename;

The personal Alibaba Cloud account of a task executor is used to access tables in the MaxCompute project in the production environment.

Note

Due to security control on data in the production environment, a personal Alibaba Cloud account cannot be used to access tables in the MaxCompute project in the production environment. To use a personal Alibaba Cloud account to access tables in the MaxCompute project in the production environment, go to Security Center to request the permissions. DataWorks provides a default request processing procedure. DataWorks also allows users that are granted management permissions to customize request processing procedures.

The scheduling access identity is used to access tables in the MaxCompute project in the production environment.

Execute the following statement in the MaxCompute project in the desired environment such as the development environment to access tables in the MaxCompute project:

select col1 from tablename;

If the statement is executed in the MaxCompute project in the development environment, you can use the personal Alibaba Cloud account of a node executor to access tables in the MaxCompute project in the development environment.

If the statement is executed in the MaxCompute project in the production environment, you can use the scheduling access identity to access tables in the MaxCompute project in the production environment.

View MaxCompute data assets

This section describes how to view MaxCompute data assets.

View tables in the production environment within a tenant

In the Tenant Tables pane of the DataStudio page in the DataWorks console, you can view all tables in the production environment of your Alibaba Cloud account in the current region.

View metadata

You can go to Data Map to view the details and metadata of a MaxCompute table.

View table lineages

You can view the ancestor and descendant lineages for a table on the details page of the table in Data Map.

Note

To locate the tasks in which the table is used, you can use the code search feature to search for the tasks.

Manage multiple MaxCompute tables at the same time

This section describes how to manage multiple MaxCompute tables at the same time.

Delete multiple MaxCompute tables at the same time

You can go to the My Data page in Data Map and delete multiple MaxCompute tables of which you are the owner at the same time.

Note

Due to security control for data in the production environment, you cannot directly run commands to delete tables in the production environment. If you want to delete a table that belongs to another user, you must go to Security Center to request the required permissions.

Change the owners of multiple tables at the same time

You can go to the My Data page in Data Map and transfer the ownership of multiple tables to another user at the same time.

Change the TTL periods of multiple tables at the same time

You can go to the My Data page in Data Map and change the TTL periods of multiple MaxCompute tables of which you are the owner at the same time.

Note

You can go to the Workspace Tables pane to modify the TTL period of a single table. For more information, see Create a MaxCompute table.

Manage tables in a specific workflow

You can add tables that are created in the Workspace Tables pane to a specific workflow in DataStudio by importing the tables. You can manage the tables by workflow and view the tables related to the business in the workflow. The following figure shows the steps to add tables of a MaxCompute compute engine to a specific workflow.从引擎添加

  • Referenced: indicates that a specific table is queried by using a SELECT statement in a specific workflow.

  • Written: indicates that data is written to a specific table by using an INSERT OVERWRITE statement in a specific workflow.