MaxCompute allows you to use external tables to query and analyze data that is stored in external storage systems, such as Object Storage Service (OSS). This way, you can manage external data without the need to import data to MaxCompute internal storage. This improves data processing flexibility.
Background information
MaxCompute SQL is the entry point for distributed data processing and lets you process and store exabytes of offline data. The computing framework of MaxCompute continues to evolve to meet the requirements that arise from growing big data workloads and new use scenarios. Early versions of MaxCompute provided powerful computing capabilities for internal data stored in special formats. MaxCompute now also supports the processing of external data.
MaxCompute SQL was originally used to process structured data that is stored in MaxCompute internal tables in the CFile column store format. To run computations on external user data, such as text and unstructured data, you first had to import the data into MaxCompute tables by using separate tools. For example, to process OSS data in MaxCompute, you could use one of the following methods:
Use OSS SDK or other tools to download data from OSS. Then, use MaxCompute Tunnel to import the downloaded data to a MaxCompute table.
Write a user-defined function (UDF) to call OSS SDK and access OSS data.
However, both methods have drawbacks.
The first method requires data transfer operations outside the MaxCompute system. If a large amount of OSS data needs to be processed, you must parallelize the transfer yourself to speed it up, and you cannot fully utilize the large-scale computing capabilities of MaxCompute.
The second method requires UDF-based access permissions. It also requires that developers control the number of parallel jobs and handle issues related to data partitioning.
MaxCompute provides external tables to address these issues. External tables are used to process data that is stored outside MaxCompute internal tables. You can execute a simple DDL statement to create an external table in MaxCompute and associate it with an external data source. The external table then serves as the interface for reading and writing data in various formats. In most cases, external tables can be accessed like standard MaxCompute tables, so you can fully utilize the computing capabilities of MaxCompute SQL to process external data.
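As a rough illustration of the DDL statement mentioned above, the following Python sketch assembles a CREATE EXTERNAL TABLE statement for CSV data in OSS. The helper name `build_oss_external_table_ddl`, the table name, the columns, the bucket path, and the role ARN are all hypothetical placeholders. The clause layout (STORED BY, WITH SERDEPROPERTIES, LOCATION) and the built-in CSV handler name reflect the commonly documented OSS external table syntax, but treat them as assumptions and confirm the exact form in Access OSS data before you run the statement.

```python
def build_oss_external_table_ddl(table, columns, location, role_arn,
                                 handler="com.aliyun.odps.CsvStorageHandler"):
    """Assemble a CREATE EXTERNAL TABLE statement for CSV files in OSS.

    All argument values are supplied by the caller; nothing here contacts
    MaxCompute or OSS. The statement layout is an assumption based on the
    commonly documented OSS external table syntax.
    """
    column_sql = ",\n".join(f"    {name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table}\n"
        f"(\n{column_sql}\n)\n"
        f"STORED BY '{handler}'\n"
        "WITH SERDEPROPERTIES (\n"
        f"    'odps.properties.rolearn'='{role_arn}'\n"
        ")\n"
        f"LOCATION '{location}';"
    )


# Hypothetical table, bucket, and role values used only for illustration.
ddl = build_oss_external_table_ddl(
    table="sales_csv_external",
    columns=[("order_id", "BIGINT"), ("amount", "DOUBLE"), ("region", "STRING")],
    location="oss://oss-cn-hangzhou-internal.aliyuncs.com/example-bucket/sales/",
    role_arn="acs:ram::<account-id>:role/aliyunodpsdefaultrole",
)
print(ddl)
```

After the table exists, a query such as `SELECT region, SUM(amount) FROM sales_csv_external GROUP BY region;` would read the OSS files directly, which is the sense in which an external table can be accessed like a standard MaxCompute table.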
If you use an external table, the data in this table is not stored in MaxCompute, and you are not charged for the storage of the table data.
Full search is supported for external tables.
Tunnel commands and Tunnel SDK cannot be used for external tables. You can use Tunnel to upload data to MaxCompute internal tables. You can also use OSS SDK for Python to upload data to OSS and map the data to external tables in MaxCompute.
You can create, search for, configure, and process external tables in the DataWorks console. You can also query and analyze data by using the external table feature. For more information, see External table.
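Because Tunnel cannot write to external tables, one of the notes above suggests uploading files to OSS with OSS SDK for Python and then mapping them to an external table. The following sketch shows one possible shape for that upload step. The helper names, object key, and sample rows are hypothetical. The code only assumes that the bucket object exposes a `put_object(key, data)` method, which is how `oss2.Bucket` in OSS SDK for Python behaves, and it uses an in-memory stand-in so that the example runs without credentials.

```python
import csv
import io


def rows_to_csv_bytes(rows):
    """Serialize rows to UTF-8 CSV bytes with newline-terminated records."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue().encode("utf-8")


def upload_rows(bucket, object_key, rows):
    """Upload rows as one CSV object.

    `bucket` is any object with put_object(key, data), for example an
    oss2.Bucket instance (assumption: the real SDK object is passed in
    by the caller).
    """
    return bucket.put_object(object_key, rows_to_csv_bytes(rows))


class InMemoryBucket:
    """Hypothetical stand-in for oss2.Bucket, used for a dry run only."""

    def __init__(self):
        self.objects = {}

    def put_object(self, key, data):
        self.objects[key] = data
        return key


bucket = InMemoryBucket()
upload_rows(bucket, "sales/part-0001.csv",
            [(1, 19.9, "hangzhou"), (2, 5.0, "beijing")])
print(bucket.objects["sales/part-0001.csv"].decode("utf-8"))
```

With the real SDK, the stand-in would be replaced by something like `oss2.Bucket(oss2.Auth(access_key_id, access_key_secret), endpoint, bucket_name)`, and the object key prefix would need to match the directory that the external table points to.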
If you use external tables, MaxCompute charges you only for computing, based on the billing rules for MaxCompute computing resources. Data in external tables is not stored in MaxCompute, so no MaxCompute storage fees are generated. For more information about storage fees, see the billing rules of the storage service that holds the data. If you use a public endpoint of MaxCompute to access an external table, you are charged for Internet traffic and data downloads. For more information about MaxCompute fees, see Overview.
If you use MaxCompute external tables to access external data sources, these sources may incur costs for computation, access, and data transfer, which are subject to the billing method of each external data source. For details, see the documentation of the relevant product.
Examples
This section describes how to use MaxCompute external tables to process unstructured data:
To access unstructured data in OSS and Tablestore, see Access OSS data and Access Tablestore data.
To use external tables to access OSS data, you must authorize MaxCompute to access OSS. The authorization is performed in the Resource Access Management (RAM) console. For more information, see STS authorization.
The unstructured data processing framework of MaxCompute allows you to export MaxCompute data to OSS by using the INSERT statement. For more information, see Write data to OSS.
For more information about how to process data in various open source formats, see Open source data formats supported by OSS external tables.
References
MaxCompute supports various external tables, such as OSS, Hologres, and ApsaraDB RDS external tables. For more information, see OSS external tables, Create a Tablestore external table, Hologres external tables, and Apache Paimon external tables.