When you build a data lakehouse, you must create an external project and associate it with either a Hadoop cluster or with Data Lake Formation (DLF) and Object Storage Service (OSS). This topic describes how to use SQL statements to manage an external project and the tables in the external project in different scenarios. For example, you can query or update table data.
Background information
After you create an external project on the Lake and Warehouse Integration (Data Lakehouse) page in the DataWorks console, you can manage the external project by using SQL statements in the following scenarios:
Manage an external project based on DLF and OSS
After you create an external project, connect to MaxCompute by using the MaxCompute client and switch to the MaxCompute project to which the external project is mapped. You can then manage the data and tables of the external project from the mapped project.
When you reference a table in an external project, you must specify the table in the external_project_name.table_name format in your SQL statement. You are not allowed to create a table in an external project.
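As a minimal illustration of this format, the following sketch uses hypothetical names: an external project named ext_demo that is mapped to a MaxCompute project named demo_prod.
-- Switch to the mapped MaxCompute project (hypothetical name).
use demo_prod;
-- Valid: qualify the table with the external project name.
select * from ext_demo.sales limit 10;
-- Invalid: a statement such as create table ext_demo.t1 (id bigint);
-- fails because tables cannot be created in an external project.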
Insert data into a table.
Insert data into a non-partitioned table.
-- Go to the MaxCompute project to which the external project is mapped, not the external project itself.
use doc_test_prod;
-- Insert data into a non-partitioned table.
insert into table ext_mcdlf_ddl.ddl_test values(1,"ddl_test_1");
Insert data into a partitioned table.
-- Go to the MaxCompute project to which the external project is mapped.
use doc_test_prod;
-- Insert data into a partitioned table.
insert overwrite table ext_mcdlf_ddl.range_ptstring_ptint partition (pt1='ds1', pt2=2) values (4, 'val4'), (5, 'val5');
View the information about the tables in the external project.
-- Go to the MaxCompute project to which the external project is mapped.
use doc_test_prod;
-- View the tables in the external project.
show tables in ext_mcdlf_ddl;
-- View the schema of the ext_mcdlf_ddl.ddl_test table.
desc extended ext_mcdlf_ddl.ddl_test;
Query the table data in the external project.
-- Go to the MaxCompute project to which the external project is mapped.
use doc_test_prod;
-- Query the data in the non-partitioned table named ext_mcdlf_ddl.ddl_test.
select * from ext_mcdlf_ddl.ddl_test limit 10;
-- Query the data in the partitioned table named ext_mcdlf_ddl.range_ptstring_ptint.
select * from ext_mcdlf_ddl.range_ptstring_ptint where pt1='ds1';
In some scenarios, you must set additional flags when you execute SQL statements. For example:
-- If the Optimized Row Columnar (ORC) version of the data files is later than ORC-135, add the following configuration:
set odps.ext.oss.orc.native=true;
-- Enable the Hive-compatible data type edition. MaxCompute supports Hive syntax only after you enable the Hive-compatible data type edition.
set odps.sql.hive.compatible=true;
-- Write data to OSS by using the multipart upload feature.
set odps.sql.unstructured.oss.commit.mode=true;
-- Run the following command to enable the multipart upload feature for the entire project:
setproject odps.sql.unstructured.oss.commit.mode=true;
For more information about common MaxCompute SQL syntax, see Table operations and Insert or update data into a table or a static partition (INSERT INTO and INSERT OVERWRITE).
You can add the SET commands to the odps_config.ini file. This way, you do not need to enter the commands each time you manage the external project.
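For example, a minimal sketch of such entries, using the set.<flag>=<value> format that is also shown in the Hadoop section below; keep the existing connection settings in the file unchanged:
set.odps.sql.hive.compatible=true
set.odps.sql.unstructured.oss.commit.mode=true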
For more information about the multipart upload feature, see Multipart upload.
Manage an external project based on an external Hadoop data source
Execute the following statements to query the data of a Hive table from an external Hadoop data source.
-- Go to the MaxCompute project to which the external project is mapped. You cannot run jobs in an external project.
use <main_project_name>;
-- Set odps.sql.hive.compatible to true. If you want to access Alibaba Cloud E-MapReduce (EMR) tables, you must configure this property. If you want to access MaxCompute tables, you do not need to configure this property.
set odps.sql.hive.compatible=true;
-- Optional. Specify the username that is used to access the Hadoop data source.
set odps.external.hadoop.user=<hadoop_name>;
-- Read data from a Hive table.
select * from <my_hive_table> limit 10;
You cannot execute DDL statements on the tables or data of the external project by using the MaxCompute project to which the external project is mapped. The only exceptions are the DDL statements that add partitions to or drop partitions from a table, as shown in the sketch below. For more information about SQL DDL statements, see Overview of MaxCompute SQL.
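The following sketch shows the permitted partition DDL, assuming a hypothetical partitioned table my_hive_table with a string partition column pt in the external project:
-- Go to the MaxCompute project to which the external project is mapped.
use <main_project_name>;
-- Add a partition to a table in the external project (hypothetical table and partition value).
alter table <external_project_name>.my_hive_table add if not exists partition (pt='20240101');
-- Drop the same partition.
alter table <external_project_name>.my_hive_table drop if exists partition (pt='20240101');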
You can add the SET commands set.odps.sql.hive.compatible=true and set.odps.external.hadoop.user=<hadoop_name> to the odps_config.ini file. This way, you do not need to enter the commands each time you manage the external project.
Create a table with the same schema as a table in the external data source
If you want to create a table that has the same schema as a table in the external data source, you can run the create table...like command to replicate the schema of the table in the external data source. For more information about the create table...like command, see Table operations. Sample statements:
-- Create a table that has the same schema as the external_project.testtbl table.
create table from_exe_tbl like external_project.testtbl;
-- Query the table data.
select * from from_exe_tbl;
-- View the table schema.
desc from_exe_tbl;
The from_exe_tbl table that you created has the same schema as the external_project.testtbl table. The two tables have the same attributes, such as column names, column comments, and table comments, aside from the lifecycle. However, the data in the external_project.testtbl table is not replicated to the from_exe_tbl table. If you also need the data, you can copy it separately, as shown in the sketch below.
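Because create table...like replicates only the schema, a separate INSERT statement is needed to copy the data. A minimal sketch, assuming the tables from the preceding sample:
-- Copy the data from the table in the external data source into the new table.
insert overwrite table from_exe_tbl
select * from external_project.testtbl;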
References
For more information about how to grant other users the permissions on an external project or a table in the external project, see Grant other users the permissions on an external project.