All Products
Search
Document Center

MaxCompute:Migrate tasks from external projects in data lakehouse solution 1.0 to external schemas in data lakehouse solution 2.0

Last Updated:Oct 31, 2024

Alibaba Cloud no longer develops the features and usage of external projects in data lakehouse solution 1.0. Data lakehouse solution 1.0 will be discontinued. If you want to use MaxCompute to access federated data sources, you must upgrade the federated solution to data lakehouse solution 2.0. This topic describes the compatibility status of existing tasks after you enable the schema mode supported by project-level metadata and the schema mode supported by SQL syntax in data lakehouse solution 2.0. This topic also describes the modifications you must make in different scenarios when you migrate existing tasks from external projects in data lakehouse solution 1.0 to external schemas in data lakehouse solution 2.0. This helps you better evaluate and complete the migration.

Background information

External projects in data lakehouse solution 1.0 support federated data sources that integrate Data Lake Formation (DLF) and Object Storage Service (OSS) and federated data sources that integrate Hive Metastore Service (HMS) and Hadoop Distributed File System (HDFS). Data lakehouse solution 1.0 uses a two-layer model in which tables are stored in projects. Projects in data lakehouse solution 1.0 map to databases in DLF or Hive.

In data lakehouse solution 1.0, connection information is included in external projects that are sourced from the DLF and OSS federated data lakes. This allows MaxCompute to access DLF metadata and OSS data by using properties of the external projects. For external projects that are sourced from the HMS and HDFS federated data lakes, MaxCompute abstracts foreign servers for storing connection information and uses the foreign servers to access the Hive metadata and data. The following figure shows the mappings.

image.png

The federated data sources in data lakehouse solution 1.0 have the following disadvantages:

  • Federated data sources differ in different scenarios. No foreign server is abstracted for the DLF and OSS federated data sources. This makes it difficult for tenant-level resource sharing and permission management.

  • In most cases, external data sources are specified in the Project.Schema.Table format, which is inconsistent with the two-layer model used by external projects in data lakehouse solution 1.0. A large number of external projects need to be created to map to the upper-layer objects of tables, such as databases in DLF and schemas in Hologres.

  • External projects in data lakehouse solution 1.0 require data warehouse projects to process computing tasks. This increases the complexity of permission authorization in cross-project data access scenarios.

Data lakehouse solution 2.0 is introduced to provide the following improvements:

  • Foreign servers are abstracted to allow permission isolation between the tenant level and data level. This facilitates data sharing and permission management.

  • MaxCompute projects are upgraded to support schemas. You can create MaxCompute projects that use the Project.Schema.Table three-layer model or change existing MaxCompute projects that use the Project.Table two-layer model to three-layer model projects. This allows MaxCompute to support more databases and data sources in the big data ecosystem.

  • External schemas are provided in data lakehouse solution 2.0 to map to the upper-layer objects of tables. Data computing on an external schema is performed by using computing resources of the data warehouse project to which the external schema belongs.

The following figure shows the mappings in data lakehouse solution 2.0.

image.png

For more information about how to create an external data source and an external project, see Use the data lakehouse solution 2.0.

Schema modes and compatibility status

MaxCompute provides two schema modes.

  • Schema mode supported by project-level metadata

    To enable the schema mode supported by project-level metadata, log on to the MaxCompute console. In the left-side navigation pane, choose Workspace > Projects. On the page that appears, find your project and click Supported Schema Levels in the Actions column. The Supported Schema Levels option is available for projects that do not support schemas.

    image

    After you enable the schema mode supported by project-level metadata for a project, take note of the following items:

    • The project uses the Project.Schema.Table three-layer model.

      Note

      Projects for which the schema mode is disabled use the Project.Table two-layer model.

    • If you want to access data in a custom schema, you must enable the schema mode supported by SQL syntax. Otherwise, you can only access data in the Default schema.

    • The operation of enabling the schema mode cannot be rolled back. This operation may produce incompatibility issues that affect other tasks in your project and access to your project. Proceed with caution.

    • A schema named Default is created in each project for which the schema mode is enabled. This schema cannot be modified or deleted.

  • Schema mode supported by SQL syntax

    • Session-level settings: The settings affect only the semantics of the current session and take precedence over tenant-level settings. You can execute the set odps.namespace.schema=true | false; statement to enable or disable the schema mode supported by SQL syntax.

    • Tenant-level settings: You can log on to the MaxCompute console and choose Tenants > Tenant Property to view tenant-level settings. If no project is available in the tenant, you can enable Tenant-level Information Schema Syntax on the Tenants page. If projects are available in the tenant, the schema mode cannot be enabled, and we also recommend that you do not enable the schema mode. For more information about how to enable the schema mode at the tenant level, see Tenant properties.

      After you enable the schema mode supported by SQL syntax at the tenant level, take note of the following items:

      • You do not need to configure the schema mode at the session level for other tasks of the tenant. The schema mode supported by project-level metadata is automatically enabled for new projects.

      • All projects support the schema mode supported by SQL syntax. This mode cannot be disabled after you enable it.

        If existing tasks in your project are incompatible with the schema mode supported by SQL syntax, and you enable the schema mode at the tenant level, the existing tasks will fail. We recommend that you first enable the schema mode supported by SQL syntax at the session level. After verification succeeds, enable the schema mode at the tenant level.

Note

If the schema feature is disabled, the a.b.c model is not supported. The system parses the a.b model as the project.table model. If the schema feature is enabled, the system parses the a.b.c model as the project.schema.table model and the a.b model as the schema.table model.

In data lakehouse solution 2.0, if the schema mode supported by project-level metadata and the schema mode supported by SQL syntax are not both enabled or disabled, incompatibility issues occur. For example, existing tasks in a data warehouse project that uses the project(p).table(t) model can run as expected, the schema mode supported by project-level metadata is enabled for the project, and the schema mode supported by SQL syntax is enabled for query tasks in the project. If the SQL syntax of the query tasks remains unchanged, incompatibility errors may be reported when you run the query tasks. The following table describes the compatibility details.

Schema mode status

Task type

Compatibility status

Suggestion

  • The schema mode supported by project-level metadata is disabled.

  • The schema mode supported by SQL syntax is enabled.

Existing task

  • If the existing task uses the ...FROM t syntax, the system parses t by using the current_project(p).table(t) model. No incompatibility issues occur.

  • If the existing task uses the ...FROM p.t syntax, the system parses p.t by using the current_project.schema(p).table(t) model, which is incompatible with the project(p).table(t) model. As a result, an error message is reported.

  • No modification is required.

  • Recommended syntax: ...FROM p.default.t.

New task

  • If you specify a table in a new task in the p.default.t format, the system parses the table by using the project(p).schema(default).table(t) model.

  • If you specify a table in a new task in the default.t format, the system parses the table by using the current_project.schema(default).table(t) model.

  • If you specify a table in a new task in the t format, the system parses the table by using the current_project.current_schema.table(t) model.

  • If you specify a table in a new task in the p.s.t format, the system parses the table by using the project(p).schema(s).table(t) model. If no schema exists in the project, an error message is reported.

None.

  • The schema mode supported by project-level metadata is enabled.

  • The schema mode supported by SQL syntax is disabled.

Existing task

  • If the existing task uses ...FROM t syntax, the system parses t by using the current_project(p).schema(default).table(t) model. No incompatibility issues occur.

  • If the existing task uses the ...FROM p.t syntax, the system parses p.t by using the project(p).schema(default).table(t) model. The system automatically supplements the default schema level to ensure syntax compatibility.

None.

New task

  • If you specify a table in a new task in the p.t format, the system parses the table by using the project(p).schema(default).table(t) model. The system automatically supplements the default schema level.

  • If you specify a table in a new task in the t format, the system parses the table by using the project.schema(default).table(t) model. The system automatically supplements the default schema level.

Note

You cannot access custom schemas or use USE SCHEMA <schema_name>.

None.

  • The schema mode supported by project-level metadata is enabled.

  • The schema mode supported by SQL syntax is enabled.

Existing task

  • If the existing task uses ...FROM t syntax, the system parses t by using the current_project(p).schema(default).table(t) model. No incompatibility issues occur.

  • If the existing task uses the ...FROM p.t syntax, the system parses p.t by using the current_project.schema(p).table(t) model, which is incompatible with the project(p).table(t) model. As a result, an error message is reported.

  • Scenario 1: No modification is required.

  • Scenario 2: We recommend that you use the ...FROM p.default.t syntax.

New task

Rules of the enabled schema modes and the syntax used apply.

None.

Migration description

  1. Enable the schema mode supported by project-level metadata for an existing or new project.

    After you enable the schema mode, a default schema is automatically generated. All internal tables in the project are stored in the default schema. For more information about the schema feature, see Schema and Schema-related operations.

    Note

    The operation of enabling the schema mode supported by project-level metadata cannot be rolled back. This operation may produce incompatibility issues that affect other tasks in your project and access to your project. Proceed with caution. For more information about schema compatibility, see Schema modes and incompatibility issues in this topic.

  2. Create an external schema in the project for which you enable the schema mode supported by project-level metadata. For more information about how to create external schemas that use different data sources, see Use the data lakehouse solution 2.0.

    Note

    We recommend that you name the external schema the same as the name of the external project.

  3. Reconstruct the existing project.

    • Scenario 1: Create an external schema in the existing project and access data in the external schema.

      In this example, the existing project is named p1, and the table in the external project is specified as external_project(e).table(t). After the reconstruction, the table that maps to the federated data source is specified as project (p1).external_schema(e).table(t)

      • If the existing task executes the SELECT * FROM e.t; statement, and the schema mode supported by SQL syntax is enabled, the system parses e.t as external_schema(e).table(t). The existing task can run as expected in the project p1.

        Note

        If the schema mode supported by SQL syntax is disabled, the system parses e.t in the SELECT * FROM e.t; statement as project (e).table(t). This parsing result is inconsistent with the mapping structure. As a result, the existing task fails, and an error message is reported.

      • If the project p1 is associated with an external project for a join query SELECT * FROM e.t1 JOIN p1.t2;, and the schema mode supported by SQL syntax is enabled, the system parses p1 in p1.t2 as a schema. In this case, you must change the SQL statement to SELECT * FROM e.t1 JOIN p1.default.t2;.

    • Scenario 2: Create an external schema in another project and access data in the external schema.

      In this example, the existing project is named p1, and the table in the external project is specified as external_project(e).table(t). The schema mode supported by project-level metadata is disabled for the project p1. Another project named p2 is used for reconstruction. After the reconstruction, the table that maps to the federated data source is specified as project (p2).external_schema(e).table(t).

      • If you enable the schema mode supported by SQL syntax for existing tasks in the project p1, you must modify the SQL statement. For example, change SELECT * FROM e.t; to SELECT * FROM p2.e.t;.

        Note

        If you run the existing tasks in the project p2, no modification to the SQL statement is required.

      • If you execute the SELECT * FROM e.t1 JOIN p1.t2; statement in the project p1 when the schema mode supported by project-level metadata is disabled for the project p1 and the schema mode supported by SQL syntax is enabled for the project p1, the system parses p1.t2 as p1.p1.t2. If you want to perform a cross-project join query in the project p1 by using the external schema, you must change the statement to SELECT * FROM p2.e.t1 JOIN p1.default.t2;.