By Long Zhiqiang, Alibaba Cloud Intelligent Senior Product Expert
Most popular real-time data warehouses today are built on Flink. This article does not try to define MaxCompute as a real-time data warehouse. Instead, it looks at how MaxCompute supports real-time data flows: how real-time data is written into, queried in, and applied from MaxCompute. Flink is essentially a real-time computing engine that also integrates stream and batch computing, so open-source real-time scenarios are typically built on Flink + Kafka + storage. This article is not about computing. It explains how to write real-time streaming data from Binlog, Flink, and Spark Streaming into MaxCompute.
Data is written to MaxCompute in real time through the Streaming Tunnel and is visible as soon as the write completes. This is a product feature of MaxCompute. Most data warehouse products on the market today have a delay between writing data and being able to query it. MaxCompute supports real-time writes at high QPS, and the data can be queried once it is written. You can use MaxCompute Query Acceleration (MCQA) to query data written in real time, and after integration with BI tools, ad hoc queries can access that data as well.
Binlog data is written to MaxCompute through DataX, which supports insert, delete, update, and query operations. In later product iterations, MaxCompute will support upsert, covering inserts, updates, and deletes of business database data. Data computed in Flink is written to MaxCompute through the Streaming Tunnel plug-in, which requires no code development. A plug-in is also available for Kafka.
At present, real-time writing performs no computation on the data as it is written. It simply takes the current streaming data, including data from message services, and writes it to MaxCompute through the Streaming Tunnel service. Streaming Tunnel currently provides plug-in support for mainstream services such as Kafka and Flink. The Streaming Tunnel SDK is currently available only for Java. An application can read the data, apply its own logic, and then call the Streaming Tunnel SDK to write the results to MaxCompute.
Once the data is in MaxCompute, the main processing path is to query the written data directly. You can also join it with offline data already in MaxCompute for combined query and analysis. If you access MaxCompute through the SDK or JDBC, you can enable the query acceleration (MCQA) feature for these queries. If you use the web console or DataWorks, MCQA is enabled by default. BI analysis tools and third-party application-layer analysis tools that connect to MaxCompute through the SDK or JDBC can likewise turn on MCQA, so that data written in real time can be queried within seconds.
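The write path described above (read from a message source, apply application logic, then write through the tunnel) can be sketched as follows. This is a minimal illustration in Python, not the Streaming Tunnel SDK itself, which the article says is Java-only. `InMemoryStreamSession`, `stream_write`, and `batch_size` are hypothetical names introduced here so the pattern can be run locally; a real application would replace the session with calls to the actual SDK.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


class InMemoryStreamSession:
    """Hypothetical stand-in for a streaming upload session.

    It only remembers what was flushed, so the pattern can run locally.
    """

    def __init__(self) -> None:
        self.flushed_batches: List[List[Record]] = []

    def flush(self, batch: List[Record]) -> None:
        # A real session would send the batch to the Tunnel service here.
        self.flushed_batches.append(list(batch))


def stream_write(
    messages: Iterable[Record],
    transform: Callable[[Record], Record],
    session: InMemoryStreamSession,
    batch_size: int = 3,
) -> int:
    """Read messages, apply application logic, and flush in small batches."""
    batch: List[Record] = []
    written = 0
    for message in messages:
        batch.append(transform(message))
        if len(batch) >= batch_size:
            session.flush(batch)
            written += len(batch)
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        session.flush(batch)
        written += len(batch)
    return written


if __name__ == "__main__":
    session = InMemoryStreamSession()
    events = [{"user": f"u{i}", "amount": i} for i in range(7)]
    total = stream_write(
        events,
        lambda e: {**e, "amount_cents": e["amount"] * 100},
        session,
    )
    print(total, [len(b) for b in session.flushed_batches])
```

Batching before each flush is a design choice of this sketch, not something the article prescribes. It is a common way to keep request counts down when writing at high QPS.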
Overall, the current scenario is mainly about writing streaming data in real time. Once written, the data can be combined with offline data for joint analysis and query, with query acceleration (MCQA) available. After the data enters MaxCompute, no computation is performed on it, only queries. This is the real-time data processing scenario based on MaxCompute.
The streaming data writing feature has been commercially released in China, and this feature is free to use.
The feature provides a mechanism for asynchronous processing of incremental data, which processes incremental data without service interruption. It also offers a Stream API that is stateless, supports concurrent writes, and makes data visible in real time.
The technical architecture is divided into three parts: data channels, data synchronization from stream computing, and self-developed applications. The tunnel currently supports Datahub, Kafka, TT, and SLS as data channels. Data synchronization from stream computing supports Blink, Spark, DTS, DataX, and kepler/DD.
When data is written to MaxCompute, it passes through a Tunnel cluster that sits in front of the computing cluster. The Stream Tunnel service writes data from the client to the Tunnel server. During the write, files are processed in an optimized way and then merged. This processing consumes computing resources in the data channel, but the consumption is free of charge.
MaxCompute enables real-time data writing and interactive analytics based on query acceleration. The query acceleration feature currently covers 80%-90% of daily query scenarios, and its syntax is the same as the built-in syntax of MaxCompute.
MCQA has an adaptive execution engine and a multi-level caching mechanism. When SQL is submitted to the MaxCompute computing engine, it runs in one of two modes: offline job (optimized for throughput) or short query (optimized for latency). A query acceleration job trims and optimizes the execution plan at the engine level. It runs on pre-pulled computing resources, uses vectorized execution, and relies on memory/network shuffle and multi-level caching. An offline job, by contrast, generates code, shuffles through disk, and then applies for resources from a queue. Query acceleration identifies whether a query qualifies and, if the conditions are met, sends it straight to the pre-pulled resources. For data caching, tables and fields are cached on top of the Pangu distributed file system.
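The two-mode routing described above can be pictured with a small sketch. The article does not say which conditions the engine actually checks, so everything here is an assumption made for illustration: `QueryProfile`, `choose_mode`, and the scan-size and instance-count limits are hypothetical names and criteria, not MaxCompute internals.

```python
from dataclasses import dataclass


@dataclass
class QueryProfile:
    """Hypothetical summary of a submitted SQL job."""
    estimated_scan_bytes: int
    estimated_instances: int


def choose_mode(
    profile: QueryProfile,
    max_scan_bytes: int,
    max_instances: int,
) -> str:
    """Route a job to the short-query path only if it fits both limits.

    The limits are caller-supplied assumptions; the real qualification
    rules are not given in the article.
    """
    fits = (
        profile.estimated_scan_bytes <= max_scan_bytes
        and profile.estimated_instances <= max_instances
    )
    # Qualifying jobs go to pre-pulled resources; others queue as offline jobs.
    return "short_query" if fits else "offline_job"


if __name__ == "__main__":
    small = QueryProfile(estimated_scan_bytes=10_000, estimated_instances=2)
    large = QueryProfile(estimated_scan_bytes=10_000_000, estimated_instances=500)
    print(choose_mode(small, max_scan_bytes=1_000_000, max_instances=10))
    print(choose_mode(large, max_scan_bytes=1_000_000, max_instances=10))
```

The point of the sketch is only the shape of the decision: a cheap check up front, with a fallback to the throughput-optimized path when a job does not qualify.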
A performance comparison with an industry-leading competitor on the TPC-DS test set:
Data application tools and BI analysis tools connect to MaxCompute through JDBC or the SDK and can then read table data in MaxCompute.
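For tools that connect over JDBC, query acceleration is typically switched on in the connection URL. The sketch below builds such a URL. It assumes the MaxCompute JDBC driver's `jdbc:odps:` URL scheme and its `interactiveMode=true` parameter for MCQA, so check both against the driver documentation for your version. The function name `build_jdbc_url`, the endpoint, and the project name are placeholders introduced here.

```python
from urllib.parse import urlencode


def build_jdbc_url(endpoint: str, project: str, use_mcqa: bool = True) -> str:
    """Build a MaxCompute JDBC URL, optionally requesting query acceleration.

    Assumes the driver accepts `project` and `interactiveMode` as URL
    parameters; confirm against the driver docs before relying on it.
    """
    params = {"project": project}
    if use_mcqa:
        params["interactiveMode"] = "true"
    return f"jdbc:odps:{endpoint}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder endpoint and project, not real values.
    print(build_jdbc_url("https://example-endpoint/api", "my_project"))
```

A BI tool would receive this string as its JDBC connection URL, with credentials supplied separately.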
Product-level integration with message services:
You can achieve high-performance analysis and decision support on changing data, and query 1 billion data entries in seconds using MaxCompute.
This demo was implemented with MaxCompute + QuickBI. QuickBI supports a direct connection to MaxCompute in query acceleration mode. QuickBI also has acceleration engines such as DLA and CK, but connecting directly to MaxCompute in query acceleration mode is currently the fastest option.