
A Deep Dive into Apache Flink 1.11: Stream-Batch Integrated Hive Data Warehouse

Li Jinsong and Li Rui, Alibaba Technical Experts, talk about the features, revisions, and improvements of Apache Flink 1.11.

By Li Jinsong (Zhixin) and Li Rui (Tianli).

The feature set of Apache Flink 1.11 has been frozen, and the official version will be released soon. The stream-batch integrated data warehouse is a highlight of the new version. In this article, we will introduce the improvements that Apache Flink 1.11 brings to the stream-batch integrated data warehouse.

First, there is some good news! The Blink planner is now the default in the Table API/SQL.

In Apache Flink 1.11, stream computing combined with the Hive batch processing data warehouse brings the real-time and exactly-once capabilities of Apache Flink (Flink) stream processing to offline data warehouses. In addition, Apache Flink 1.11 has improved the Filesystem connector, which makes Flink easier to use.

Data Warehouse Architecture

Offline Data Warehouses


Traditional offline data warehouses are built on Hive and the Hadoop Distributed File System (HDFS). A Hive data warehouse provides mature and stable big data analytics capabilities. Combined with scheduling and upstream and downstream tools, it forms a complete data processing and analytics platform:

  • Flume imports data into a Hive data warehouse.
  • The scheduling tool schedules extract-transform-load (ETL) jobs for data processing.
  • Ad-hoc queries are performed flexibly on tables in a Hive data warehouse.
  • The scheduling tool schedules aggregations to the database at the business intelligence (BI) layer.

The problems with this process are:

  • The import process is not flexible enough; it should be a flexible SQL stream computing process.
  • Real-time performance is poor because of cascading computation based on job scheduling.
  • ETL does not support streaming incremental computing.

Real-Time Data Warehouses

With the popularity of real-time computing, an increasing number of companies have started to use real-time data warehouses because offline data warehouses can no longer meet their requirements. Based on Kafka and Flink Streaming, a real-time data warehouse defines the entire process as stream computing jobs with second-level or millisecond-level latency.

However, historical data can be kept in a real-time data warehouse for only 3 to 15 days, and ad-hoc queries cannot be performed on it. If you build a Lambda architecture that combines offline and real-time pipelines, maintenance, duplicated computing and storage, data consistency, and repeated development all become heavy burdens.

Real-Time Hive Data Warehouses

To solve the problems of offline data warehouses, Apache Flink 1.11 gives Hive data warehouses real-time capabilities. This enhances real-time performance at each stage without placing overly heavy burdens on the architecture.

Hive Streaming Sink

How do you import real-time data into a Hive data warehouse? Do you use Flume, Spark Streaming, or Flink DataStream? The streaming file sink at the Table API/SQL layer is finally coming! Apache Flink 1.11 supports the streaming sink [2] of the Filesystem connector [1] and the Hive connector.


(Note: The concept of a bucket in StreamingFileSink is equivalent to the concept of a partition in the Table API/SQL.)

The streaming sink at the Table API/SQL layer has the following features:

  • Enables real-time or quasi-real-time data writing with Flink Streaming.
  • Supports all formats of the Filesystem connector (CSV, JSON, Avro, Parquet, and ORC).
  • Supports all formats of Hive tables.
  • Inherits all features of the DataStream StreamingFileSink, such as exactly-once semantics and support for HDFS and S3, and introduces a new mechanism: partition commit.

A proper data import into a data warehouse involves both writing data files and committing partitions. When a partition finishes writing, Flink must notify the Hive Metastore that the partition has finished writing, or a SUCCESS file must be added to the corresponding folder. In Apache Flink 1.11, the partition commit mechanism allows you to do the following:

  • Trigger: You can control the timing of partition commit based on the watermark together with the time extracted from the partition, or based on processing time. You can choose to let downstream applications see partitions as soon as possible, even before they finish writing, or only after they finish writing.
  • Policy: The commit policy has built-in support for the SUCCESS file and Metastore commit. You can also extend commit implementations. For example, during commit, you can trigger the Hive analysis to generate statistics or merge small files.

For example:

-- Use the Hive DDL syntax with the Hive dialect.
SET table.sql-dialect=hive;
CREATE TABLE hive_table (
  user_id STRING,
  order_amount DOUBLE
) PARTITIONED BY (
  dt STRING,
  hour STRING
) STORED AS PARQUET TBLPROPERTIES (
  -- Determine when to commit a partition according to the time extracted from the partition and the watermark.
  'sink.partition-commit.trigger'='partition-time',
  -- Configure an hour-level partition time extraction policy. In this example, the dt field represents the day in yyyy-MM-dd format and hour is from 0 to 23. timestamp-pattern defines how to extract a complete timestamp from the two partition fields.
  'partition.time-extractor.timestamp-pattern'='$dt $hour:00:00',
  -- Configure delay to the hour level. When watermark is greater than the partition time plus one hour, you can commit the partition.
  'sink.partition-commit.delay'='1 h',
  -- The partition commit policy is to first update the metastore (add the partition) and then write the SUCCESS file.
  'sink.partition-commit.policy.kind'='metastore,success-file'
);
 
SET table.sql-dialect=default;
CREATE TABLE kafka_table (
  user_id STRING,
  order_amount DOUBLE,
  log_ts TIMESTAMP(3),
  WATERMARK FOR log_ts AS log_ts - INTERVAL '5' SECOND
) WITH (...);  -- The Kafka connector options are omitted here.
 
-- Streaming SQL: insert into the Hive table. You can also dynamically override table properties [3] by using table hints.
INSERT INTO TABLE hive_table SELECT user_id, order_amount, DATE_FORMAT(log_ts, 'yyyy-MM-dd'), DATE_FORMAT(log_ts, 'HH') FROM kafka_table;

Hive Streaming Source

A large number of ETL jobs exist in the Hive data warehouse. These jobs are often run periodically with a scheduling tool. This causes two major problems:

  1. The real-time performance is poor. The minimum scheduling period is often at the hour level.
  2. The process is complex and involves many components, which easily leads to failures.

For these offline ETL jobs, Apache Flink 1.11 provides real-time Hive streaming reading:

  • For a partitioned table, Flink monitors the generation of partitions and reads new partitions incrementally.
  • For a non-partitioned table, Flink monitors the generation of new files in the folder and reads new files incrementally.

For example, with a 10-minute partition policy, Flink's Hive streaming source and Hive streaming sink can greatly improve the real-time performance of a Hive data warehouse to the quasi-real-time minute level, while the table still supports full ad-hoc queries for better flexibility.

SELECT * FROM hive_table
/*+ OPTIONS('streaming-source.enable'='true',
'streaming-source.consume-start-offset'='2020-05-20') */;
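
As a further sketch, the same table hints can also tune how often new partitions are discovered and the order in which they are consumed. This assumes the streaming-source.monitor-interval and streaming-source.consume-order options are available in your Flink version:

SELECT * FROM hive_table
/*+ OPTIONS('streaming-source.enable'='true',
'streaming-source.monitor-interval'='1 min',
'streaming-source.consume-order'='partition-time',
'streaming-source.consume-start-offset'='2020-05-20') */;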

Real-Time Data Association with Hive Tables

After the integration of Flink and Hive was released, many users hoped to associate Flink's real-time data with offline Hive tables. Therefore, Apache Flink 1.11 supports the temporal join [6] of real-time tables and Hive tables. Let's use an example from Flink's official documentation. Assume that Orders is a real-time table and LatestRates is a Hive table. You can perform the temporal join with the following statement:

SELECT
  o.amount, o.currency, r.rate, o.amount * r.rate
FROM
  Orders AS o
  JOIN LatestRates FOR SYSTEM_TIME AS OF o.proctime AS r
  ON r.currency = o.currency;

Currently, the temporal join against Hive tables only supports processing time. We cache the data of the Hive table in memory and update the cached data at regular intervals. You can specify the interval for updating the cache with lookup.join.cache.ttl. The default interval is one hour.

The lookup.join.cache.ttl parameter is configured in the properties of the Hive table, so each table can have its own configuration. In addition, the entire Hive table must be loaded into memory, so this feature only applies to scenarios that involve small Hive tables.
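
For example, here is a minimal sketch that shortens the cache refresh interval to 30 minutes. It reuses the LatestRates table from the example above and assumes the property can be set with ALTER TABLE ... SET TBLPROPERTIES in the Hive dialect:

SET table.sql-dialect=hive;
-- Refresh the cached copy of the Hive dimension table every 30 minutes instead of the default one hour.
ALTER TABLE LatestRates SET TBLPROPERTIES ('lookup.join.cache.ttl' = '30 min');
SET table.sql-dialect=default;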

Hive Enhancements

Hive Dialect Syntax Compatibility

Flink on Hive users cannot use Data Definition Language (DDL) easily for the following reasons:

  • DDL was further improved in Apache Flink 1.10, but because Flink and Hive differ in metadata semantics, it is difficult to operate on Hive metadata with Flink DDL, so this approach only covers a few scenarios.
  • When Flink is integrated with Hive, users often have to fall back to the Hive CLI to run DDL.

To solve the preceding two problems, we proposed FLIP-123 [7], which provides users with Hive syntax compatibility through the Hive dialect. The ultimate goal of this feature is to provide users with an experience similar to Hive CLI or Beeline. With this feature, you do not have to switch between Flink and Hive CLI. You can migrate some Hive scripts to Flink and run them in it.

In Apache Flink 1.11, the Hive dialect supports most common DDL operations, such as CREATE TABLE, ALTER TABLE, CHANGE COLUMN, REPLACE COLUMN, ADD PARTITION, and DROP PARTITION. To this end, we implemented an independent parser for the Hive dialect. Based on the dialect you specify, Flink decides which parser to use for SQL statements. You can specify the SQL dialect with the table.sql-dialect parameter. Its default value is default, which represents Flink's native dialect; setting it to hive switches to the Hive dialect. If you are a SQL user, you can set table.sql-dialect in the yaml file to specify the initial dialect of the session, or use the SET command to switch dialects as needed without restarting the session.
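
For instance, here is a brief sketch of switching dialects within a single session (the table and column names are purely illustrative):

SET table.sql-dialect=hive;
-- Hive DDL, parsed by the Hive dialect and stored through HiveCatalog.
CREATE TABLE page_views (user_id STRING, url STRING) PARTITIONED BY (dt STRING) STORED AS ORC;
SET table.sql-dialect=default;
-- Back to Flink's native dialect for Flink-specific syntax.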

For more information about the functions supported by the Hive dialect, see FLIP-123 or Flink's official documentation. The following are some design principles and notes for using this feature:

  1. The Hive dialect can only be used to operate on Hive tables, not on Flink's native tables (such as Kafka and ES tables). This also means that the Hive dialect must be used together with HiveCatalog.
  2. When you use the Hive dialect, some of the original Flink syntax may become unavailable, such as the type aliases defined in Flink. When you need the Flink syntax, switch back to the default dialect.
  3. The DDL syntax of the Hive dialect is based on Hive's official documentation, but the syntax may vary slightly between different versions of Hive, so you may need to make some adjustments.
  4. The syntax implementation of the Hive dialect is based on Calcite, which has different reserved keywords from Hive. Some keywords, such as default, that can be used directly as identifiers in Hive may need to be escaped with backticks (`) in the Hive dialect.

Vectorized Reading

Apache Flink 1.10 supports vectorized reading of ORC (Hive 2+), but the support was very limited. Apache Flink 1.11 adds more vectorization support:

  • ORC for Hive 1.x [8]
  • Parquet for Hive 1.x, 2.x, and 3.x [9]

Vectorization is supported for all versions of Parquet and ORC. This feature is enabled by default and can be disabled manually.
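
If you need to turn vectorized reading off, a sketch in the SQL client could look like the following. We assume the table.exec.hive.fallback-mapred-reader option here, which falls back to Hive's MapReduce record reader:

-- Assumed option name: disables vectorized reads by falling back to the mapred record reader.
SET table.exec.hive.fallback-mapred-reader=true;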

Simplification of Hive Dependencies

In Apache Flink 1.10, Flink's documentation lists the required Hive dependencies, which we recommended that you download manually. Since this is still a little troublesome, Apache Flink 1.11 provides built-in dependencies [10]:

  • flink-sql-connector-hive-1.2.2_2.11-1.11.jar: the dependency of Hive 1.0.
  • flink-sql-connector-hive-2.2.0_2.11-1.11.jar: the dependency of Hive 2.0 to 2.2.
  • flink-sql-connector-hive-2.3.6_2.11-1.11.jar: the dependency of Hive 2.3.
  • flink-sql-connector-hive-3.1.2_2.11-1.11.jar: the dependency of Hive 3.0.

Download a single package, configure HADOOP_CLASSPATH, and you can run Flink on Hive.

Flink Enhancements

In addition to Hive-related features, Apache Flink 1.11 has also introduced a lot of enhancements for the stream-batch integration.

Flink Filesystem Connector

For a long time, the Filesystem connector of the Table API in Flink only supported the CSV format and did not support partitions. As a result, it could not cover common big data computing scenarios.

In Apache Flink 1.11, the entire Filesystem connector has been reworked [1]:

  • The Filesystem connector now supports partitions, including all partition-related semantics in SQL: partition-related DDL statements, partition pruning, insertion into static and dynamic partitions, and INSERT OVERWRITE.
  • Various formats are supported:
    • CSV
    • JSON
    • Apache Avro
    • Apache Parquet
    • Apache ORC
  • Batch read/write operations are supported.
  • The streaming sink is supported, including the Hive-style partition commit mentioned above and the writing of SUCCESS files.

For example:

CREATE TABLE fs_table (
  user_id STRING,
  order_amount DOUBLE,
  dt STRING,
  hour STRING
) PARTITIONED BY (dt, hour) WITH (
  'connector'='filesystem',
  'path'='...',
  'format'='parquet',
  'partition.time-extractor.timestamp-pattern'='$dt $hour:00:00',
  'sink.partition-commit.delay'='1 h',
  'sink.partition-commit.policy.kind'='success-file'
);

-- stream environment or batch environment
INSERT INTO TABLE fs_table SELECT user_id, order_amount, DATE_FORMAT(log_ts, 'yyyy-MM-dd'), DATE_FORMAT(log_ts, 'HH') FROM kafka_table;

-- Query by partition
SELECT * FROM fs_table WHERE dt='2020-05-20' and hour='12';

Max Slot

Before Apache Flink 1.11, a Flink cluster in YARN per-job or session mode could expand without limit, and you could only use the YARN queue to limit resource usage. However, traditional batch jobs are high-concurrency jobs that run in stages on a limited number of resources. Therefore, Apache Flink 1.11 introduces Max Slot [11] to limit the resource usage of YARN applications.

slotmanager.number-of-slots.max

With Max Slot, you can define the maximum number of slots allocated by a Flink cluster. This configuration option limits the resource consumption of batch workloads. We do not recommend configuring this option for streaming jobs; if there are not enough slots, streaming jobs may fail.
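
For example, a minimal sketch in flink-conf.yaml (the value 100 is only illustrative):

slotmanager.number-of-slots.max: 100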

Summary

Apache Flink 1.11 is another major release, in which the Flink community has introduced many features and improvements. Flink's biggest goal is to help businesses build a stream-batch integrated data warehouse that is comprehensive, smooth, high-performance, and all-in-one. We hope you will get more involved in our community activities and share your questions and ideas with us. You are also welcome to participate in our discussions and development to help make Flink even better.

References:

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-115%3A+Filesystem+connector+in+Table
[2] https://issues.apache.org/jira/browse/FLINK-14255
[3] https://cwiki.apache.org/confluence/display/FLINK/FLIP-113%3A+Supports+Dynamic+Table+Options+for+Flink+SQL
[4] https://issues.apache.org/jira/browse/FLINK-17434
[5] https://issues.apache.org/jira/browse/FLINK-17435
[6] https://issues.apache.org/jira/browse/FLINK-17387
[7] https://cwiki.apache.org/confluence/display/FLINK/FLIP-123%3A+DDL+and+DML+compatibility+for+Hive+connector
[8] https://issues.apache.org/jira/browse/FLINK-14802
[9] https://issues.apache.org/jira/browse/FLINK-16450
[10] https://issues.apache.org/jira/browse/FLINK-16455
[11] https://issues.apache.org/jira/browse/FLINK-16605

About the Authors

Li Jinsong (Zhixin) is an Apache Flink committer and Alibaba Technical Expert who has long been committed to stream-batch integrated computing and data warehouse architecture.

Li Rui (Tianli) is an Apache Hive PMC member and Alibaba Technical Expert who worked on open-source projects, such as Hive, HDFS, and Spark, at Intel and IBM before joining Alibaba.
