After you create an Object Storage Service (OSS) external table, you can use the external table to access and query the data files that are stored in the specified OSS directory for efficient data reading.
Background information
After you create an OSS external table, you can use one of the following methods to read OSS data based on your business requirements:
Method 1 (recommended): Import OSS data in an open source format from the OSS external table to a MaxCompute internal table before you read the data.
If you directly read data from the OSS external table, each read operation involves OSS I/O operations and high-performance optimizations on MaxCompute storage cannot be implemented. As a result, the data reading performance is not high. If you require repeated data computing or high computing efficiency, we recommend that you perform the following operations in sequence to improve the computing performance of MaxCompute: 1. Create a MaxCompute internal table that has the same schema as the OSS external table. 2. Import OSS data from the OSS external table to the MaxCompute internal table. 3. Execute complex query statements for the MaxCompute internal table.
Sample statements:
CREATE TABLE <table_internal> LIKE <mc_oss_extable_name>; INSERT OVERWRITE TABLE <table_internal> SELECT * FROM <mc_oss_extable_name>;
Method 2: Directly read OSS data from the OSS external table. The operations are the same as those for reading data from a MaxCompute internal table.
If you use this method, the OSS external table is used in the same way as a MaxCompute internal table, but data is directly read from an OSS directory.
Precautions
When you use an external table to read OSS data, take note of the following items:
You must create an OSS external table before you read OSS data from the external table. For more information about the OSS data file formats that are supported by MaxCompute and the commands that are used to create OSS external tables, see Syntax used to create an OSS external table.
If complex data types are involved in the SQL statement that you want to execute, you must add
set odps.sql.type.system.odps2=true;
before the SQL statement and submit this command with the statement. For more information about data types, see Data type editions.If you use an OSS external table whose data is mapped to open source data, you must set
odps.sql.hive.compatible
to true at the session level before you use MaxCompute to read data from the OSS external table. Otherwise, an error is returned.The bandwidth provided by OSS is limited. If the traffic generated in a short period of time exceeds the upper limit of the bandwidth allowed by the OSS instance due to frequent data reading from or writing to OSS, the bandwidth is used up. In this case, the speed at which data is read from or written to the OSS external table is decreased. For more information about the bandwidth occupied by OSS, see Does OSS throttle bandwidth and QPS?
Example 1: Read OSS data from a non-partitioned OSS external table that is created by using a built-in text extractor
Read OSS data from the OSS external table mc_oss_csv_external1 that is created in Example: Create a non-partitioned table as an OSS external table by using a built-in text extractor.
Use the MaxCompute client to read data from the OSS external table. Sample statement:
SELECT recordId, patientId, direction FROM mc_oss_csv_external1 WHERE patientId > 25;
The following result is returned:
+------------+------------+------------+
| recordid | patientid | direction |
+------------+------------+------------+
| 1 | 51 | S |
| 3 | 48 | NE |
| 4 | 30 | W |
| 5 | 47 | S |
| 7 | 53 | N |
| 8 | 63 | SW |
| 10 | 31 | N |
+------------+------------+------------+
Example 2: Read OSS data from a partitioned OSS external table that is created by using a built-in text extractor
Read OSS data from the OSS external table mc_oss_csv_external2 that is created in Example: Create a partitioned table as an OSS external table by using a built-in text extractor.
Use the MaxCompute client to read data from the OSS external table. Sample statement:
SELECT recordId, patientId, direction FROM mc_oss_csv_external2 WHERE direction = 'NE';
The following result is returned:
+------------+------------+------------+
| recordid | patientid | direction |
+------------+------------+------------+
| 2 | 13 | NE |
| 3 | 48 | NE |
| 9 | 4 | NE |
+------------+------------+------------+
Example 3: Read compressed OSS data from an OSS external table that is created by using a built-in text extractor
Read OSS data from the OSS external table mc_oss_csv_external3 that is created in Example: Create an OSS external table that is used to read or write compressed data by using a built-in text extractor.
Use the MaxCompute client to read data from the OSS external table. Sample statement:
SELECT recordId, patientId, direction FROM mc_oss_csv_external3 WHERE patientId > 25;
If compressed OSS data is in an open source format, you must add the set odps.sql.hive.compatible=true;
command before the SQL statement and submit this command with the statement.
The following result is returned:
+------------+------------+------------+
| recordid | patientid | direction |
+------------+------------+------------+
| 1 | 51 | S |
| 3 | 48 | NE |
| 4 | 30 | W |
| 5 | 47 | S |
| 7 | 53 | N |
| 8 | 63 | SW |
| 10 | 31 | N |
+------------+------------+------------+
Example 4: Read OSS text data from an OSS external table that is created by using a custom extractor
Read OSS data from the OSS external table ambulance_data_txt_external that is created in Example: Create an OSS external table by using a custom extractor.
Use the MaxCompute client to read data from the OSS external table. Sample statement:
SELECT recordId, patientId, direction FROM ambulance_data_txt_external WHERE patientId > 25;
The following result is returned:
+----------+-----------+-----------+
| recordid | patientid | direction |
+----------+-----------+-----------+
| 1 | 51 | S |
| 3 | 48 | NE |
| 4 | 30 | W |
| 5 | 47 | S |
| 7 | 53 | N |
| 8 | 63 | SW |
| 10 | 31 | N |
+----------+-----------+-----------+
Example 5: Read OSS non-text data from an OSS external table that is created by using a custom extractor
Read OSS data from the OSS external table speech_sentence_snr_external that is created in Example: Create an OSS external table that is used to read or write non-text data by using a custom extractor.
Use the MaxCompute client to read data from the OSS external table and process the data. Sample statement:
SELECT sentence_snr, id FROM speech_sentence_snr_external WHERE sentence_snr > 10.0;
The following result is returned:
--------------------------------------------------------------
| sentence_snr | id |
--------------------------------------------------------------
| 34.4703 | J310209090013_H02_K03_042 |
--------------------------------------------------------------
| 31.3905 | tsh148_seg_2_3013_3_6_48_80bd359827e24dd7_0 |
--------------------------------------------------------------
| 35.4774 | tsh148_seg_3013_1_31_11_9d7c87aef9f3e559_0 |
--------------------------------------------------------------
| 16.0462 | tsh148_seg_3013_2_29_49_f4cb0990a6b4060c_0 |
--------------------------------------------------------------
| 14.5568 | tsh_148_3013_5_13_47_3d5008d792408f81_0 |
--------------------------------------------------------------
Example 6: Read OSS data from an OSS external table if the schemas of data in the external table are different
If the schemas of OSS data are different and you want to query data based on the column names, you must enable the following flag when you use MaxCompute to read an OSS external table. Sample statement:
-- Query data from the external table.
SET odps.ext.oss.orc.native=true;
SELECT * FROM <tablename>;
Example 7: Optimize query plan
Due to the data being stored in an external data lake and lacking pre-aggregated statistics, the query optimizer adopts a conservative strategy, leading to low query efficiency. By enabling the following flag, the optimizer can discover small tables by temporarily collecting statistics on the tables during query execution. It then proactively uses Hash Join, optimizes the Join Order, reduces extensive shuffle operations, and shortens the execution pipeline, ultimately optimizing the query plan.
SET odps.meta.exttable.stats.onlinecollect=true;
SELECT * FROM <tablename>;