
The Practice of Real-Time Data Processing Based on MaxCompute

This article explains how to write real-time streaming data from BinLog, Flink, and Spark Streaming into MaxCompute.

By Long Zhiqiang, Alibaba Cloud Intelligent Senior Product Expert

I. An Introduction to Product Functions

Data Warehouse Architecture Based on Query Acceleration

Currently, the popular real-time data warehouses are built on Flink. The purpose of this article is not to position MaxCompute as a real-time data warehouse, but to explain how MaxCompute supports real-time data flows: how real-time data is ingested, queried, and applied in MaxCompute. Open-source real-time data warehouses are based on Flink, which is essentially a real-time compute engine that unifies stream and batch processing, so real-time architectures typically combine Flink, Kafka, and a storage layer. This article is not about the computing side; it explains how to write real-time streaming data from BinLog, Flink, and Spark Streaming into MaxCompute.

Data is written to MaxCompute in real time through the streaming tunnel and becomes visible as soon as the write completes. This is a product feature of MaxCompute: most data warehouse products on the market have a delay between write and query, while MaxCompute sustains high-QPS real-time writes whose data can be queried immediately. You can use MaxCompute Query Acceleration (MCQA) to query data written in real time, and after integration with BI tools, you can run ad hoc queries against it.

Binlog data is written to MaxCompute through DataX, which supports inserts, deletes, updates, and queries. In subsequent product iterations, MaxCompute will support upsert, covering inserts, updates, and deletes from business databases. When Flink writes its computed results to MaxCompute, the Streaming Tunnel plug-in handles the write without any code development. Kafka is also supported through a plug-in.

Currently, real-time writing performs no computation during the write; it simply takes the incoming streaming data, including Message Service data, and writes it to MaxCompute through the Streaming Tunnel service. Streaming Tunnel supports mainstream message services, such as Kafka and Flink, through plug-ins, while the Streaming Tunnel SDK is currently available for Java only. You can use the SDK to read data in your application, apply your own processing logic, and then write the results to MaxCompute. Once the data lands in MaxCompute, the main processing link runs direct-read queries on it, and you can also join the newly written data with offline data in MaxCompute for combined analysis. During querying, you can enable query acceleration (MCQA) when access goes through the SDK or JDBC; in the web console and DataWorks, MCQA is enabled by default. BI analysis tools and third-party application-layer analysis tools that connect to MaxCompute through the SDK or JDBC can turn on MCQA as well, so data written in real time can be queried within seconds.

On the whole, the current scenario is mainly real-time streaming writes: after data is written, it can be joined with offline data for combined analysis and queried with query acceleration (MCQA). After the data enters MaxCompute, no computation is performed on it; it is only queried. This is the real-time data processing scenario based on MaxCompute.


An Introduction to the Streaming Data Writing Feature

The streaming data writing feature has been commercially released in China, and this feature is free to use.

Features

  • Supports streaming data writes in high-concurrency, high-QPS (queries-per-second) scenarios: Data is visible as soon as it is written.
  • Provides streaming semantic APIs: You can easily develop distributed data synchronization services using the streaming service APIs.
  • Supports automatic partition creation: Solves the lock contention caused by concurrent partition creation in data synchronization services.
  • Supports asynchronous aggregation (merge) of incremental data: Improves data storage efficiency.
  • Supports asynchronous zorder by sorting of incremental data: For more information about zorder by, see Insert or Overwrite Data (INSERT INTO | INSERT OVERWRITE).
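
The zorder by sorting mentioned above clusters rows by interleaving the bits of the sort columns, so rows that are close in several columns end up close together on disk and can be skipped efficiently at query time. Below is a minimal, self-contained sketch of the bit-interleaving idea; it illustrates the general Z-order technique, not MaxCompute's actual implementation:

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two column values into one sort key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # even bit positions <- x
        key |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions  <- y
    return key

# Rows sorted by the interleaved key stay close when BOTH columns are close,
# which improves data skipping for range filters on either column.
rows = [(3, 7), (0, 0), (2, 2), (7, 3)]
rows.sort(key=lambda r: z_order_key(*r))
```

Sorting by such a key lets the storage layer prune files by min/max statistics on both columns at once, rather than on only the leading sort column.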

Performance Advantages

  • A more optimized data storage structure solves the fragmented-file problem caused by high-QPS writes.
  • The data link is completely isolated from metadata access, which eliminates the lock-contention delays and errors that metadata access causes in high-concurrency write scenarios.
  • An asynchronous mechanism processes incremental data without service interruption and supports the following features:

    • Data merging (Merge): Improves storage efficiency
    • zorder by sorting: Improves storage and query efficiency
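
The asynchronous merge can be pictured as a background compaction pass that packs the many small files produced by high-QPS streaming writes into a few larger ones. The following is a toy simulation of such a compaction policy; the 64 MB target is an illustrative number, not a MaxCompute parameter:

```python
def merge_small_files(file_sizes, target_mb=64):
    """Greedily pack small files into merged files of roughly target_mb."""
    merged, current, current_size = [], [], 0
    for size in sorted(file_sizes):
        if current and current_size + size > target_mb:
            merged.append(current)          # close the current merged file
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        merged.append(current)
    return merged

# 1000 tiny 1 MB fragments collapse into 16 merged files of ~64 MB each.
fragments = [1] * 1000
print(len(merge_small_files(fragments)))  # → 16
```

Because the pass runs asynchronously, writers keep appending small files at full speed while the reader-visible file count stays low.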

Streaming Data Writing – Technical Architecture

The Stream API is stateless and supports concurrent writes, and written data is visible in real time. The technical architecture has three parts: the data channel, stream-computing data synchronization, and self-developed applications. The data channel currently supports DataHub, Kafka, TT, and SLS. Stream-computing data synchronization supports Blink, Spark, DTS, DataX, and Kepler/DD.

When data is written to MaxCompute, it first passes through a Tunnel cluster that sits in front of the computing cluster; the Stream Tunnel service moves the data from the client to the Tunnel server. During writing, the service optimizes file layout and merges files. This merging consumes computing resources inside the data channel, but the consumption is free of charge.


An Introduction to the Query Acceleration Feature

Query acceleration enables interactive analytics over data that is written in real time. Currently, the query acceleration feature covers 80%-90% of daily query scenarios, and its syntax is identical to the built-in syntax of MaxCompute.

MaxCompute Query Acceleration – Full-Link Speedup of Query Execution for Real-Time Query Tasks

  • Uses the MaxCompute SQL syntax and engine, optimized for near real-time scenarios
  • The system optimizes queries automatically, and users can choose between a latency-first and a throughput-first execution mode.
  • Applies latency-based resource scheduling policies tailored to near real-time scenarios
  • Optimizes the full link for low-latency scenarios: independent resource pools, multi-level data and metadata caching, and interaction protocol optimization

Benefits

  • An integrated solution that simplifies the architecture and adapts to both query acceleration and massive-scale analysis
  • Several times (or dozens of times) faster than the normal offline mode
  • Combined with MaxCompute streaming upload, supports near real-time analysis
  • Supports multiple access methods and is easy to integrate
  • Automatically recognizes short queries within offline tasks. The pay-as-you-go mode is enabled by default, and during the free trial you can use MCQA free of charge to run query jobs that scan no more than 10 GB on instances that use subscription resources.
  • Low cost, no O&M, high elasticity
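
The automatic recognition of short queries can be thought of as a classifier over job characteristics that routes each job to the latency-optimized or throughput-optimized path. Here is a hedged sketch of that routing decision; the 10 GB figure echoes the free-tier note above, while the stage-count threshold is a made-up illustration:

```python
GB = 1024 ** 3

def choose_execution_mode(scanned_bytes: int, num_stages: int) -> str:
    """Route a job: short queries take the accelerated path, the rest go offline."""
    if scanned_bytes <= 10 * GB and num_stages <= 5:
        return "mcqa"     # pre-pulled resources, in-memory shuffle
    return "offline"      # queued resources, disk shuffle

print(choose_execution_mode(2 * GB, 3))      # → mcqa
print(choose_execution_mode(500 * GB, 20))   # → offline
```

The key point is that the user submits the same SQL either way; the engine inspects the job and picks the path.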

Query Acceleration – Technical Architecture

The architecture combines an adaptive execution engine with a multi-level caching mechanism. When SQL is submitted to the MaxCompute computing engine, it takes one of two paths: offline jobs (optimized for throughput) or short queries (optimized for latency). A query-accelerated job gets a reduced, optimized execution plan at the engine level: computing resources are pre-pulled, execution is vectorized, and shuffle runs in memory or over the network, backed by the multi-level cache. An offline job, by contrast, shuffles through disk and must apply for resources from a queue. Query acceleration identifies each incoming job, and if the job qualifies, it is routed directly to the pre-pulled resources. For data caching, tables and fields are cached on top of the Pangu distributed file system.
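
The multi-level caching idea (an in-memory level in front of a distributed-file-system level, in front of the source table) can be illustrated with a generic two-level cache. The class below is a sketch of the pattern, not MaxCompute code:

```python
class TwoLevelCache:
    """Look up in a small fast cache, then a larger slow cache, then the source."""
    def __init__(self, loader):
        self.l1 = {}          # stands in for the in-memory cache
        self.l2 = {}          # stands in for the distributed-FS cache layer
        self.loader = loader  # stands in for reading the source table

    def get(self, key):
        if key in self.l1:
            return self.l1[key]
        if key in self.l2:
            value = self.l2[key]
        else:
            value = self.loader(key)
            self.l2[key] = value
        self.l1[key] = value  # promote hot data to the fast level
        return value

cache = TwoLevelCache(loader=lambda k: f"rows-for-{k}")
cache.get("sales_2021")   # miss: loads from the source, fills both levels
cache.get("sales_2021")   # hit in L1: no source access
```

Repeated interactive queries over the same tables are exactly the workload that benefits: only the first query pays the cost of reading from storage.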


Query Acceleration – Performance Comparison

A performance comparison between the TPC-DS test set and an industry-leading competitor:

  • At the 100 GB scale, MaxCompute is more than 30% faster.
  • At the 1 TB scale, performance is similar.


II. Application Scenarios

Streaming Data Writing – Application Scenarios


Query Acceleration – Application Scenarios

Quick Query of Fixed Reports

  • ETL processes data into consumption-oriented aggregated data
  • Meets fixed-report and online data service requirements with second-level queries
  • Elastic concurrency / data caching / easy integration

Data application tools or BI analysis tools connect to MaxCompute through JDBC/SDK and read table data directly from MaxCompute.
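
A fixed report usually boils down to re-running one parameterized SQL statement on each refresh. Below is a hedged sketch of how an application might build that statement before sending it over JDBC/SDK; the table and column names are hypothetical, and the actual connection and execution code is omitted:

```python
def build_report_sql(table: str, ds: str) -> str:
    """Build the fixed daily-report query for one partition (illustrative)."""
    return (
        f"SELECT shop_id, SUM(order_amount) AS gmv "
        f"FROM {table} WHERE ds = '{ds}' "
        f"GROUP BY shop_id ORDER BY gmv DESC LIMIT 100"
    )

# A BI tool or data service would send this text over JDBC with
# query acceleration (MCQA) enabled and render the result as the report.
sql = build_report_sql("dw_orders_agg", "20210901")
```

Because the query shape is fixed, it benefits most from MCQA's data caching and pre-pulled resources across refreshes.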


Ad-Hoc Data Exploration and Analysis

  • Identifies job characteristics automatically and selects an execution mode based on data size and computational complexity, so that general queries run fast and complex queries still complete
  • Uses storage-layer modeling optimizations, such as partitioning and hash clustering, to improve query performance


Near Real-Time Operational Analysis

  • Supports both batch and streaming data access
  • Fuses historical data with near real-time data for analysis
  • Product-level integration with message services:

    • DataHub - logs/messages
    • DTS - database logs
    • SLS - behavior logs
    • Kafka - IoT/log access


III. Tools and Access

Streaming Data Write – Access

Message Services

  • Message Queue for Apache Kafka (supported by plug-in)
  • Logstash output plug-in (supported by plug-in)
  • Flink (built-in plug-in)
  • DataHub real-time tunnel (internal plug-in)

New SDK Interface – Java

  • Simple upload example
  • Multi-threaded upload example
  • Asynchronous I/O multi-threaded upload example
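
The multi-threaded upload pattern from the Java SDK examples can be sketched in a language-agnostic way: the workload is split across worker threads that each append records to a shared, thread-safe sink. Below is a Python simulation in which an in-memory queue stands in for the tunnel upload session; all names are stand-ins, not the real SDK API:

```python
import queue
import threading

def upload_worker(records, sink: queue.Queue):
    """Each thread plays the role of one upload writer."""
    for rec in records:
        sink.put(rec)  # in the real SDK this would append to the session

def parallel_upload(all_records, num_threads=4):
    sink = queue.Queue()
    # Split the records round-robin across the worker threads.
    chunks = [all_records[i::num_threads] for i in range(num_threads)]
    threads = [threading.Thread(target=upload_worker, args=(c, sink))
               for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink.qsize()  # number of records "flushed" to the server

print(parallel_upload(list(range(1000))))  # → 1000
```

The real SDK adds flushing, retries, and backpressure on top of this skeleton, but the thread-per-writer structure is the same.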

Query Acceleration – Access

Tool Class

  • DataWorks (enabled by default)
  • ODPS CMD (configuration required)
  • MaxCompute Studio (configuration required)

SDK Class Interface

  • ODPS JavaSDK
  • ODPS PythonSDK
  • JDBC

Compatibility with the Old Interface

  • Automatic recognition mode


IV. Demo and Summary

Real-Time Data Processing Practice Based on MaxCompute

With MaxCompute, you can achieve high-performance analysis and decision support over constantly changing data and query 1 billion data entries in seconds.

This demo was implemented with MaxCompute + Quick BI. Quick BI supports a direct-connection MaxCompute query acceleration mode. Quick BI also offers acceleration engines such as DLA and CK, but a direct connection to MaxCompute using query acceleration mode is currently the optimal and fastest option.


Practice Summary

Benefits

  • Streaming Tunnel: Writes are visible in real time, and the fragmented-file problem caused by high-QPS writes is solved.
  • Query Acceleration: Low latency (multi-level caching and fast resource scheduling), ease of use (a single set of SQL syntax), and elasticity (separation of storage and computing)

Enhancement

  • Currently, downstream applications can only consume or summarize the data through full queries; the data cannot yet be processed by real-time stream computing, and real-time warehousing does not support modification or deletion.
  • MaxCompute will subsequently provide a streaming SQL engine that runs real-time streaming jobs, achieving stream-batch integration.