MaxCompute Streaming Tunnel allows you to write data to MaxCompute in streaming mode and provides a set of APIs and backend services that are different from the APIs and backend services of MaxCompute Tunnel. The APIs of MaxCompute Streaming Tunnel significantly reduce the development costs of distributed services and remove the performance bottlenecks of MaxCompute Tunnel in scenarios that involve high concurrency and high queries per second (QPS). This topic describes how to use MaxCompute Streaming Tunnel.
Description
MaxCompute Streaming Tunnel was made available for public preview on January 1, 2021 and is now commercially released. You can use this service free of charge. You can follow Service notices to obtain updates about this service.
The following services allow you to write data to MaxCompute by using MaxCompute Streaming Tunnel:
Realtime Compute for Apache Flink: By default, MaxCompute Tunnel is used. To use MaxCompute Streaming Tunnel, you can use the built-in plug-in of Realtime Compute for Apache Flink.
DataWorks: By default, MaxCompute Tunnel is used. To use MaxCompute Streaming Tunnel, contact the DataWorks engineer on duty to enable this channel in the background.
ApsaraMQ for Kafka: By default, MaxCompute Tunnel is used. To use MaxCompute Streaming Tunnel, contact the Kafka engineer on duty to enable this channel in the background.
MaxCompute Streaming Tunnel provides the following features:
Streaming semantic APIs: simplify the development of distributed data synchronization services.
Automatic partitioning: prevents concurrent partition locking if multiple data synchronization jobs are concurrently run to create partitions.
Asynchronous merging of incremental data: improves data storage efficiency.
MaxCompute Streaming Tunnel can resolve various issues that may occur when you use MaxCompute Tunnel to write streaming data. MaxCompute Streaming Tunnel provides the following benefits:
Optimizes the data storage structure to prevent file fragmentation that is caused by data write operations with high QPS.
Provides a mechanism for asynchronously processing incremental data without service interruption. The mechanism supports data merging, which improves storage efficiency.
Scenarios
The following table describes the scenarios in which MaxCompute Streaming Tunnel is used.
Scenario | Description | Benefit |
A large number of event logs are written to MaxCompute in real time. | Log data is directly written to MaxCompute for batch processing. | Intermediate storage services are not required. This reduces costs. |
Stream computing results are written to MaxCompute in real time. | The limits on the concurrency and | MaxCompute Streaming Tunnel ensures the availability of streaming services in scenarios that involve high-concurrency locking and prevents a large number of small files from being generated on MaxCompute due to small |
Data of streaming storage services such as DataHub and ApsaraMQ for Kafka is synchronized to MaxCompute in real time. | The limits on concurrency and | Problems with real-time synchronization from Simple Message Queue to MaxCompute are resolved. High concurrency and batch synchronization of large amounts of data are supported. |
Limits
MaxCompute Streaming Tunnel has the following limits:
Table or partition locking: When streaming data is written to a MaxCompute table or partition, MaxCompute Streaming Tunnel locks the table or partition. During this period, all DML operations that involve data modifications to the table or partition are not allowed. The DML operations include INSERT INTO and INSERT OVERWRITE. After the data write operation is complete, MaxCompute unlocks the table or partition, and all operations are allowed. If the schema of the table to which you want to write data is modified, streaming data cannot be written to the table.
Increased storage volume of hot data: If data merging or ZORDER BY is performed for asynchronous processing, MaxCompute Streaming Tunnel saves two copies of the data that is written within the previous hour. One copy is the original data and the other copy is the data that is asynchronously merged. This causes redundant data to be stored. By default, the retention period of redundant data is 1 hour.
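The increase in hot data storage described above can be approximated with simple arithmetic. The following Python sketch is illustrative only and is not part of any MaxCompute SDK. It assumes a steady write rate and assumes that the redundant copy is as large as the data written during the retention period, so the result is a rough upper bound. The function name and parameters are hypothetical.

```python
def estimate_redundant_hot_data_gb(write_rate_mb_per_s: float,
                                   retention_hours: float = 1.0) -> float:
    """Estimate the extra storage (in GB) held as a redundant copy.

    Assumptions (not stated in the product documentation):
    - data is written at a constant rate, given in MB per second;
    - the redundant copy is as large as the data written during the
      retention period (merging may make the actual size smaller).
    """
    if write_rate_mb_per_s < 0 or retention_hours < 0:
        raise ValueError("rate and retention period must be non-negative")
    seconds = retention_hours * 3600          # retention period in seconds
    redundant_mb = write_rate_mb_per_s * seconds
    return redundant_mb / 1024                # convert MB to GB


if __name__ == "__main__":
    # A job that writes 10 MB/s with the default 1-hour retention period.
    print(f"{estimate_redundant_hot_data_gb(10):.2f} GB of redundant hot data")
```

Because the default retention period of redundant data is 1 hour, the overhead under these assumptions scales with the write rate rather than with the total table size.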