The Past, Present, and Future of Apache Flink

This article is based on the keynote speech given by Feng Wang, Head of the Open Data Platform at Alibaba Cloud, at Flink Forward Asia 2024 in Jakarta.

Introduction

Welcome to Flink Forward Asia! This year's event is the first Flink Forward held in Jakarta, Indonesia, and a significant milestone: it is the first time the conference has been hosted in a Southeast Asian country. I'm Feng Wang from Alibaba Cloud, and I'm excited to engage with so many local developers who are passionate about technology.

Apache Flink, launched in 2014, celebrates its 10th anniversary this year. This occasion provides a perfect opportunity to explore the evolution of Flink, its current status, and its future directions.

The Past of Apache Flink

Let’s take a moment to reflect on the remarkable milestones that Apache Flink has achieved over the past decade. These milestones underscore Apache Flink's pivotal role in the evolution of data technology, facilitating the transition from batch processing to real-time processing.

Origins

Like Apache Spark, Apache Flink originated from a university research project. The project, called "Stratosphere," started at the Technische Universität Berlin in 2009. In 2014, Stratosphere’s core team donated the project to the Apache Software Foundation, where it was renamed Flink. The team then established a company, originally known as dataArtisans and now known as Ververica.

Alibaba’s Contributions

On the other side of the globe, Alibaba Group was growing rapidly, and the volume of data generated on its e-commerce platform was increasing exponentially. To find a next-generation data platform capable of processing data in real time, we conducted research, analysis, and experiments. From among various open-source projects, we ultimately selected Apache Flink as our unified stream processing engine. Flink was deployed in large-scale production at Alibaba for the first time in 2016.

In 2018, Alibaba organized the first Flink Forward Asia conference in Beijing to promote Flink's adoption in China. The following year, Alibaba made a significant investment by acquiring dataArtisans, further solidifying its commitment to Apache Flink. Alibaba also contributed its own production-proven version of Flink, named Blink, which comprised approximately 1.5 million lines of code, a vital contribution that positioned Apache Flink for global production readiness.

In 2023, Apache Flink received the annual Systems Award from SIGMOD, recognizing its value in both industry and academia.

Global Deployment and Community Growth

Over the last decade, Flink has been widely adopted across industries worldwide. It originated in Europe and expanded to China, the United States, and other regions, accumulating nearly 2,000 contributors globally. Notably, about half of these contributors reside in China, largely attributable to the country's sizable population and Alibaba's substantial influence since 2018.

The Flink Forward Asia conference has taken place seven times, consistently promoting Flink's capabilities even during the pandemic. This year's event is the first to be hosted outside China, and it points to a future with more Flink-focused conferences across Asia.

Meeting Market Demands

Apache Flink's recognition among developers is primarily due to its alignment with the growing demand for real-time data analytics. Most enterprise decision-makers want their business reports updated within seconds rather than on a T+1 (next-day) basis, so that they can make decisions in a more timely and efficient way. Fintech companies require rapid risk detection to prevent losses caused by delays. E-commerce businesses rely on real-time interactions to provide relevant product recommendations.

As a streaming data processing engine, Apache Flink has effectively addressed these challenges, demonstrating its market fit over the last decade.

Apache Flink: the De Facto Standard of Streaming Compute

Apache Flink has established itself as the de facto standard for real-time streaming computing across numerous industries. Its comprehensive capabilities allow users to process both bounded and unbounded data seamlessly, functioning as a unified engine for streaming and batch data.
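The idea of one engine for bounded and unbounded data can be illustrated with a small sketch. The following is plain Python, not Flink's API: the names `running_totals`, `final_totals`, and `event_stream` are hypothetical, and the "stream" is a generator that ends so the script terminates. It only shows that the same incremental logic can serve both a finite dataset and a source consumed one event at a time.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, int]  # (key, amount); hypothetical event shape


def running_totals(events: Iterable[Event]) -> Iterator[Tuple[str, int]]:
    """Emit the updated total for a key after every incoming event."""
    totals: Dict[str, int] = defaultdict(int)
    for key, amount in events:
        totals[key] += amount
        yield key, totals[key]


def final_totals(events: Iterable[Event]) -> Dict[str, int]:
    """Batch view: run the same logic and keep only the last total per key."""
    result: Dict[str, int] = {}
    for key, total in running_totals(events):
        result[key] = total
    return result


def event_stream() -> Iterator[Event]:
    """Stand-in for an unbounded source; finite here so the example ends."""
    yield ("shoes", 3)
    yield ("hats", 4)
    yield ("shoes", 2)


if __name__ == "__main__":
    bounded = [("shoes", 3), ("hats", 4), ("shoes", 2)]
    print("batch result:", final_totals(bounded))
    for update in running_totals(event_stream()):
        print("stream update:", update)
```

In this sketch the batch result is simply the last value the streaming computation would have produced, which is the intuition behind treating batch as a special case of streaming.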

Flink's rich set of connectors enables easy integration with the existing big data ecosystem, facilitating connections with mainstream databases, data lakes, data warehouses, messaging queues, and search engines. This adaptability positions Apache Flink as a backbone within modern data architecture.

Transition to Streaming Lakehouse Architecture

A significant development in recent years is the shift from streaming compute to streaming lakehouse architectures. The lakehouse has gained attention as a next-generation data architecture that combines the best of data lakes and data warehouses.

In 2022, the Apache Flink community initiated the Flink Table Store project, centered on a real-time data lake format. The project later evolved into Apache Paimon, an independent project designed for a streaming-oriented lakehouse.

With the new streaming lakehouse approach, developers can use Flink CDC to ingest data from external sources in real time and use Flink SQL for both streaming and batch data processing in a unified manner. This architecture simplifies data pipelines, allowing for real-time integration without excessive data copying across systems.
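To make the change data capture (CDC) idea concrete, here is a minimal sketch of applying a changelog to a keyed table. It is plain Python and does not use Flink CDC or Paimon: the event shape `(op, key, row)` and the function `apply_changelog` are assumptions made for illustration only.

```python
from typing import Dict, Iterable, Optional, Tuple

Row = Dict[str, object]
# Hypothetical change event: (operation, primary key, new row or None)
ChangeEvent = Tuple[str, int, Optional[Row]]


def apply_changelog(table: Dict[int, Row], events: Iterable[ChangeEvent]) -> Dict[int, Row]:
    """Apply insert/update/delete events in order to a table keyed by primary key."""
    for op, key, row in events:
        if op in ("insert", "update"):
            if row is None:
                raise ValueError(f"{op} event for key {key} has no row")
            table[key] = row  # upsert: the latest row wins
        elif op == "delete":
            table.pop(key, None)  # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown operation: {op}")
    return table


if __name__ == "__main__":
    changes = [
        ("insert", 1, {"id": 1, "status": "created"}),
        ("insert", 2, {"id": 2, "status": "created"}),
        ("update", 1, {"id": 1, "status": "paid"}),
        ("delete", 2, None),
    ]
    print(apply_changelog({}, changes))
```

Because the table is updated in place as each change arrives, downstream queries read one continuously fresh copy of the data instead of periodic exports.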

This real-time architecture has been successfully implemented at Alibaba, notably during this year’s Double 11 shopping festival. With it, Alibaba's Business Intelligence (BI) team can use a single, unified SQL definition to build both stream and batch pipelines: users write the SQL once to define both the real-time and offline business pipelines. They can also use a single data storage system to manage all the data without copying it between data systems. For instance, users no longer need to transfer data from Kafka to Hive. This approach significantly reduces the complexity of the data architecture and saves costs while preserving real-time capabilities.

Future Directions: Apache Flink 2.0

The future of Apache Flink looks promising. Flink 2.0, anticipated in early 2025, will mark the next generation of the project.

Several key advancements are in store:

Cloud-native: Flink 2.0 will feature a disaggregated storage architecture, delivering improved experiences for elastic serverless computing and fault tolerance, leading to a completely cloud-native architecture.
Unified Analytics: This update aims to streamline SQL usage across streaming and batch processing. Users will only need to declare business logic and specify data freshness, simplifying the process significantly.
Embracing AI: Flink 2.0 will include enhanced AI capabilities, enabling better interactions with AI systems, from large language models to vector databases.
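The "declare business logic and specify data freshness" idea in the Unified Analytics item can be sketched as follows. This is an illustrative Python sketch, not Flink's implementation: the function `choose_refresh_mode`, the mode names, and the 30-minute default threshold are all assumptions chosen for the example. It shows one way an engine could turn a declared freshness into an execution strategy, so that the user never chooses between streaming and batch directly.

```python
from datetime import timedelta

# Hypothetical cut-off: tighter freshness than this runs continuously.
DEFAULT_THRESHOLD = timedelta(minutes=30)


def choose_refresh_mode(freshness: timedelta,
                        threshold: timedelta = DEFAULT_THRESHOLD) -> str:
    """Pick an execution strategy from the freshness the user declared.

    Returns "continuous" (a long-running streaming job) when the data must be
    fresher than the threshold, otherwise "full" (periodic batch refresh).
    """
    if freshness <= timedelta(0):
        raise ValueError("freshness must be a positive duration")
    return "continuous" if freshness < threshold else "full"


if __name__ == "__main__":
    print(choose_refresh_mode(timedelta(seconds=5)))  # dashboard-style freshness
    print(choose_refresh_mode(timedelta(days=1)))     # T+1 reporting
```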

In conclusion, as we celebrate a decade of progress with Apache Flink, I am optimistic that developers in Indonesia and across Southeast Asia will engage with the project and contribute to the growth of the open-source community. Let’s look forward to the innovations Apache Flink 2.0 will bring to the data landscape!

Thank you for your time, and I hope to see increased participation in the Flink community from the region moving forward.
