By Li Jinsong, an Apache Flink Committer and Alibaba Technical Expert
On July 6, the Apache Flink community announced the release of Flink 1.11, which, after months of planning and usability optimization starting in early March, marked a notable improvement in the Flink user experience.
As one of Flink's core modules, Flink SQL is essential to the unified processing of streaming and batch data. It has been significantly enhanced in this release with more than ten major new features, wider application scenarios, further streamlining, and simplified operations.
Some of Flink SQL's most significant new features are described in the sections below.
Since stream computing emerged as a popular computational paradigm, more people are using this technology in their programming work. Some of them are senior big data engineers, others may include developers with less experience, such as Java beginners or data analysts who are not yet familiar with programming. In this case, the challenges for stream programming are no longer limited to high computational intensity or massive volumes of data. One top priority for improving the adoption rate of real-time data processing is to increase developers' productivity by shortening the learning curve of stream computing.
As one of the most commonly used languages for data analysis tasks, SQL offers a concise way for users to express their business logic. Apache Flink is a unified computing engine for batch and stream data processing, and it is designed to provide full SQL support. The implementation of Flink SQL conforms to ANSI SQL. SQL helps programmers manage heavy workloads with only a few lines of code.
The SQL module has become increasingly important in Flink and is supported by a growing user base. Meanwhile, SQL-based Python and machine learning APIs are developing rapidly. As one of the most commonly used APIs in Flink, Flink SQL will become even more important for integrations within the Flink ecosystem.
After this release, Flink SQL supports a wider range of scenarios, such as ingesting CDC data and building quasi-real-time data warehouses on Hive, as described in the following sections.
In addition, Flink 1.11 improves the usability of Flink SQL, fixes bugs from previous versions, and improves the API experience.
Change data capture (CDC) is a set of software design patterns commonly used with databases. Debezium and Canal are popular CDC tools that are used to stream changes in specified formats to Kafka. Previously, specific logic was required to parse and transform CDC data to make it "INSERT-only". This workaround not only complicated downstream processing, but also made overall operations more difficult.
To solve this problem, Flink SQL must offer native support for streaming data changes when handling changelogs.
Stream computing handles continuous data streams and therefore produces constantly changing results. Each update emitted by an aggregate function is a new local value, and these updates form a changelog stream. The preceding figure shows the process of writing aggregated data to Kafka, which was impossible when Kafka only supported "INSERT-only" data.
Prior to Flink 1.11, the input of CDC data was not supported due to the limitations of sources and sinks. Flink 1.11 introduces new table sources and sink interfaces to ingest and emit CDC data and provides support for the Debezium and Canal formats (FLIP-105). This means that dynamic table sources are no longer limited to "append-only" operations and can ingest these external changelogs (INSERT events), interpret them into change operations (INSERT, UPDATE, and DELETE events), and emit them downstream with the change type.
As shown in the preceding figure, the data synchronized from CDC tools to Kafka is essentially an append-only stream whose records carry a change-type identifier as part of their format. The most common approach was to treat this changelog identifier as an ordinary field.
Starting from Flink 1.11, this identifier is declared as a format. The format tells Flink how to identify the records and interpret them as a changelog stream so that they can be processed according to SQL semantics. Similarly, Flink SQL can emit changelog streams (although there is no built-in changelog output format in Flink 1.11). This allows you to write the results of all types of SQL queries to Kafka, as long as the format complies with changelog semantics.
To consume changelogs using SQL DDL, specify either 'format' = 'debezium-json' or 'format' = 'canal-json' in your CREATE TABLE statement:
CREATE TABLE my_table (
  ...
) WITH (
  'connector' = '...',  -- e.g. 'kafka'
  'format' = 'debezium-json'
);
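Once declared this way, the table can be queried like any other. The following sketch assumes my_table contains a user_id column (its actual columns are elided above) and simply illustrates the effect: Debezium UPDATE and DELETE events are interpreted as retractions, so the aggregation always reflects the current state of the source database table.

-- Hypothetical query over the CDC-backed table declared above.
SELECT user_id, COUNT(*) AS order_cnt
FROM my_table
GROUP BY user_id;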
Flink 1.11 provides ready-to-use interfaces. However, the CDC support in this release still has some limitations to be aware of; see the official documentation for details.
Apart from offering support for the aforementioned changelogs, the redefinition of sources and sinks in Flink 1.11 also provides solutions to many known issues.
The following code demonstrates an example of using the new sources and sinks (see the official documentation for details):
CREATE TABLE kafka_table (
  ...
) WITH (
  'connector' = 'kafka-0.10',
  'topic' = 'test-topic',
  'scan.startup.mode' = 'earliest-offset',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json',
  'json.fail-on-missing-field' = 'false'
);
Flink 1.11 maintains backward compatibility by retaining the old source and sink implementations; you can use the connector.type key to fall back to them.
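For reference, a legacy-style definition looks roughly like the following sketch. The property keys follow the pre-1.11 descriptor style and the table schema is an assumption; the point is that using connector.type instead of connector makes the planner fall back to the old source and sink stack.

CREATE TABLE legacy_kafka_table (
  id BIGINT,
  name STRING
) WITH (
  'connector.type' = 'kafka',   -- legacy key: triggers the old connector stack
  'connector.version' = '0.10',
  'connector.topic' = 'test-topic',
  'format.type' = 'json'
);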
Before Flink 1.11, users often encountered the "NoMatchingTableFactoryException" error:
This message indicated that the system could not find a suitable table factory for a DDL statement, either because the factory class (that is, the connector JAR) was not in the classpath or because one of the keys in the WITH clause was invalid.
However, judging by the error message alone, it is impossible to know which key or even which part of the code caused this problem.
public interface Factory {
    String factoryIdentifier();
    ...
}
Flink 1.11 solves this problem by introducing the Factory interface shown above, which allows the system to look up a factory by its factory identifier. If no factory is found, the factory class is not in the classpath; if a factory is found, you can perform further checks on the remaining options to verify the definition.
In earlier versions of Flink, you could define your own data structures in source and sink interfaces, and the framework would convert these user-defined data structures into internal data structures recognized by Flink SQL:
public interface TableSource<T> {
    TypeInformation<T> getReturnType();
    ...
}
You can define a TableSource and implement getReturnType() to specify the conversion.
However, the type declared in the DDL statement sometimes did not match the one returned by getReturnType(), especially when two different mechanisms were involved in determining a type, such as runtime type information (RTTI) on one side and the SQL DataType on the other. Imprecise matching between the two could also cause type mismatches.
This problem is also fixed in Flink 1.11. Connector developers can no longer define their own data structures and must use Flink SQL's internal RowData structure instead. This guarantees that the default types match the DDL statements and removes the need for extra verification by the framework. The internal RowData structure is now the common data structure passed between connectors and Flink SQL's internal operators.
A typical stream computing scenario is writing aggregated data into upsert sinks, such as the Java Database Connectivity (JDBC) and HBase sinks. When processing complex SQL queries, users often received errors stating that the UpsertStreamTableSink requires the table to have complete primary key information; otherwise, the task throws an exception. This is caused by Flink's UpsertStreamTableSink mechanism, which, as the name suggests, is a sink that requires key information from the table in order to perform updates.
One way to retrieve the primary key is to use the optimizer to infer from the query, as is shown in the following figure.
This method is effective and conforms to the semantics of simple queries. However, for complex queries, such as an aggregation followed by a JOIN and then another aggregation, exceptions are likely because the optimizer often cannot infer a primary key, or the query semantics simply do not allow one to be derived.
Therefore, in Flink 1.11, this mechanism is replaced by declarations in the CREATE TABLE statement. For example, to create a jdbc_table, you must explicitly write the primary key in the definition. The phrase NOT ENFORCED indicates that the primary key constraint is not enforced, because connectors may lack the ability to enforce it. Once a primary key is specified, the framework knows to update the JDBC sink keyed on that primary key. In this way, the results are not affected by the complexity of queries, and the table definition alone tells you how updates will be performed.
CREATE TABLE jdbc_table (
  id BIGINT,
  ...
  PRIMARY KEY (id) NOT ENFORCED
)
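To illustrate, the following sketch writes an aggregation into the table above (the cnt column and the source_table it reads from are assumptions). Because id is declared as the primary key, the JDBC sink performs upserts keyed on id instead of appending duplicate rows, regardless of how complex the query is.

-- Hypothetical upsert write: rows with the same id overwrite one another
-- in the external database rather than being appended.
INSERT INTO jdbc_table
SELECT id, COUNT(*) AS cnt
FROM source_table
GROUP BY id;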
First, let's look at the traditional Hive data warehouse architecture. The following figure shows a typical Hive data warehouse. Generally, the extract, transform, load (ETL) procedure relies on scheduling tools to run jobs periodically, such as daily or hourly, and each scheduling step adds latency: in this example, Table1 is generated first, then Table2, and then Table3, so the computation latencies accumulate.
However, this is too slow and the latency is too high. Furthermore, loading data into the Hive warehouse and the upstream ETL scheduling also add latency to ad-hoc analysis. Consequently, the results of ad-hoc analysis are always based on stale data.
This is where real-time data warehouses come in. Data is written to and read from Kafka and then exported to the business intelligence (BI) database, enabling real-time queries over freshly processed data. The real-time ETL in Kafka provides millisecond-level latency, and storage is based on queues or real-time databases, which meets the requirements of real-time processing with minimal latency. However, real-time data warehouses still have drawbacks, especially for ad-hoc analysis over historical data.
Can we build a Lambda architecture that features the advantages of both offline and real-time warehouses?
The answer is yes. However, this would involve very high costs. First, the costs of maintenance, computing, and storage would add up quickly. Second, it would result in duplicated development: to keep data consistent, once the developers of the offline data warehouse finish writing their Hive SQL statements, corresponding SQL statements also need to be written for the real-time data warehouse. The necessary communication, migration, and data comparison between the two development teams would add further costs, and every new demand for real-time analysis would cause more repetition in development work. Such costs may be acceptable for a few important tasks, but for a large number of tasks, the maintenance burden would be prohibitive.
Is there a way to enjoy the benefits of both ad-hoc and real-time analysis? One solution is to combine real-time streaming data with offline Hive data, which is a major improvement because the data becomes quasi-real-time. This is implemented in Flink 1.11, as shown in the following figure: data is exported as streams into the BI database, while ad-hoc analysis can still be performed. The most significant change is that Flink now supports streaming reads and writes of Hive data, as well as joining streams with Hive tables.
In earlier Flink versions, the streaming file sink wrote incoming data into buckets at the DataStream layer. The streaming file sink is a quasi-real-time sink that guarantees exactly-once semantics.
It is based on the two-phase commit protocol (2PC): when a checkpoint is taken, in-progress files are closed and recorded as pending; once the checkpoint completes, the pending files are committed and become visible.
This file system writer is very useful. However, in a Hive data warehouse, data only becomes visible once its partition is registered in the Hive metastore. So how is the Hive metastore notified during this process?
The partition commit mechanism is added at the StreamingFileSink layer.
In addition to files, partitions are also committed. As shown in the preceding figure, the file writer corresponds to the previous StreamingFileSink and guarantees exactly-once semantics. The file writer is followed by a partition committer, which supports built-in commit policies: metastore, which registers the partition in the Hive metastore, and success-file, which writes a _SUCCESS marker file into the partition directory; custom policies are also supported. With the committer chained after the writer, a commit trigger determines when each partition is committed. A configuration sketch is given below.
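Putting this together, a streaming Hive sink can be configured roughly as in the following sketch (the table schema and values are assumptions; the option names follow the Flink 1.11 documentation): the commit trigger is driven by partition time, a delay controls how long to wait for late data, and the commit policies register the partition in the Hive metastore and drop a _SUCCESS file into the partition directory.

-- Requires the Hive dialect, which is discussed later in this article.
SET table.sql-dialect=hive;
CREATE TABLE hive_sink_table (
  user_id STRING,
  order_amount DOUBLE
) PARTITIONED BY (dt STRING, hr STRING) STORED AS parquet TBLPROPERTIES (
  'partition.time-extractor.timestamp-pattern' = '$dt $hr:00:00',
  'sink.partition-commit.trigger' = 'partition-time',
  'sink.partition-commit.delay' = '1 h',
  'sink.partition-commit.policy.kind' = 'metastore,success-file'
);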
A large number of ETL jobs exist in the Hive data warehouse, and they are often run periodically with a scheduling tool. This causes two major problems: high latency, because results are only refreshed when the scheduled job runs, and complex scheduling chains that are expensive to operate.
To speed up these offline ETL jobs, Flink 1.11 adds support for real-time streaming reads of Hive tables. You can even adopt a 10-minute partitioning strategy and combine Flink's Hive streaming source and Hive streaming sink to improve the freshness of the Hive data warehouse to the quasi-real-time level, with responses in minutes. At the same time, ad-hoc queries on all tables are still supported for better flexibility. For example, the following statement reads a Hive table as a stream:
SELECT * FROM hive_table /*+ OPTIONS('streaming-source.enable'='true', 'streaming-source.consume-start-offset'='2020-05-20') */;
In addition to streaming scans, Flink 1.11 also supports temporal joins against Hive tables, which are often referred to as streaming dimension-table joins:
SELECT
  o.amount, o.currency, r.rate, o.amount * r.rate
FROM
  Orders AS o
  JOIN LatestRates FOR SYSTEM_TIME AS OF o.proctime AS r
  ON r.currency = o.currency
Currently, the "cache all" mode is used and partition detection is not supported, so this feature is best suited to small dimension tables.
Flink SQL is ANSI SQL compliant, while Hive uses its own HiveQL syntax, so the two differ in both syntax and semantics.
This normally forces Hive users to learn the syntax and semantics of Flink SQL. To lower this barrier, Flink 1.11 supports a Hive dialect, which enables you to write HiveQL within the Flink ecosystem. Currently, the Hive dialect only supports DDL statements; support for queries will be added gradually in future versions.
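As a minimal sketch (the table name and columns are assumptions), switching the dialect lets you write familiar HiveQL DDL directly and then switch back for regular Flink SQL statements:

SET table.sql-dialect=hive;
CREATE TABLE hive_users (
  user_id BIGINT,
  name STRING
) PARTITIONED BY (dt STRING) STORED AS parquet;
-- Switch back to the default dialect for regular Flink SQL statements.
SET table.sql-dialect=default;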
The Hive integration is a major update with comprehensive functionality, but the runtime environment is complex. Is there a lightweight solution for file system reads and writes?
For a long time, Flink's file system table only supported the comma-separated values (CSV) format and had no support for partitions.
The revamped file system connector in Flink 1.11 supports partitioned reading and writing, streaming writes with the same partition-commit mechanism as the Hive sink, and common formats such as CSV, JSON, Avro, Parquet, and ORC.
The file system connector requires only a few SQL statements and saves you the effort of building a Hive integration environment:
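For example, a partitioned file system table can be declared with a single DDL statement and then written to and read from like any other table. The table name, columns, path, and the kafka_table columns referenced below are assumptions used only for illustration.

CREATE TABLE fs_table (
  user_id STRING,
  order_amount DOUBLE,
  dt STRING
) PARTITIONED BY (dt) WITH (
  'connector' = 'filesystem',
  'path' = 'file:///tmp/fs_table',
  'format' = 'csv'
);

-- Hypothetical streaming write into the partitioned table, then an ad-hoc read.
INSERT INTO fs_table SELECT user_id, order_amount, dt FROM kafka_table;
SELECT * FROM fs_table WHERE dt = '2020-07-06';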
Since Flink 1.10, the Hive metastore has been used to store table-related metadata for Flink streaming SQL. For example, you can use HiveCatalog to store the definition of a Kafka table so that it can be used directly when a job starts.
By using DDL statements, you can submit SQL statements to a cluster to write data to Kafka, MySQL, or a distributed file system (DFS). Does this mean that with HiveCatalog, you only need to write DDL statements once and will then be able to use Kafka tables directly?
In some cases, the answer is yes. However, there are limitations. A Kafka table normally has execution-related parameters, because the data retention period in Kafka is usually set to 15 days or shorter and you need to specify the start offset for each consumption, which changes constantly. Consequently, the parameters for consuming the Kafka table differ each time you submit a job, while the parameters stored in the catalog are fixed. As a result, you would have to create another table and rewrite all the fields and parameters. Flink 1.11 provides the following solutions to this issue:
Table hints:
SELECT id, name FROM kafka_table1 /*+ OPTIONS('scan.startup.mode'='earliest-offset') */;
LIKE syntax:
CREATE TABLE kafka_table2 WITH ('scan.startup.mode'='earliest-offset') LIKE kafka_table1;
These two approaches are a useful complement to HiveCatalog: they save you from rewriting the full schema and parameters every time you run a job.
Flink SQL 1.11 introduces three built-in connectors for easier debugging, testing, stress testing, and online observation: the datagen source, which generates random test data; the print sink, which prints every record to the task logs; and the blackhole sink, which discards all records it receives.
These connectors eliminate the impact of external systems on debugging or testing results, which helps solve the problem where connectors make it harder to identify and troubleshoot issues in stream computing. A short sketch follows.
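The following sketch (column names are assumptions) wires the new connectors together: datagen continuously generates random rows as a source, and print writes every row to the task logs as a sink, so you can observe a pipeline without any external system. For stress testing, the print sink can be swapped for blackhole to simply discard the output.

CREATE TABLE datagen_source (
  id BIGINT,
  name STRING
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '5'   -- throttle the generated data
);

CREATE TABLE print_sink (
  id BIGINT,
  name STRING
) WITH (
  'connector' = 'print'     -- swap for 'blackhole' to discard records instead
);

INSERT INTO print_sink SELECT id, name FROM datagen_source;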
TableEnvironment is very important, as it is the entry point to SQL programming. It caches execution plans, and in previous versions of Flink, users found it difficult to work with its multiple overlapping APIs. Therefore, the Flink community reorganized these APIs in version 1.11 to improve the experience and prevent bugs.
Now, executeSql is a unified interface that supports queries, DDL statements, and DML statements.
In addition, there are some rules to follow when mixing the Table API with the DataStream API; see the official documentation for details.
Apart from DDL support, the SQL Client in Flink 1.11 also supports tableau mode, which allows you to view query results printed on the console instead of having to switch pages.
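For example, in the SQL Client you can switch the result mode before running a query; the option below follows the Flink 1.11 SQL Client configuration, and the table queried is the hypothetical datagen_source from the earlier sketch.

SET execution.result-mode=tableau;
SELECT id, name FROM datagen_source;  -- results are printed directly in the console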
You can visit the official website for the detailed documentation of the user interfaces introduced in this article.
Flink 1.11 has laid a solid foundation for CDC support by improving Flink's internal mechanisms and APIs. In the future, more built-in CDC support will be introduced for binlogs, together with more concise and comprehensive Flink SQL syntax. The Flink community will continue its efforts to unify stream and batch processing at the underlying layer to empower the SQL layer with more possibilities.