
Apache Flink Has Become the De Facto Standard for Stream Computing

This article is based on a keynote speech given by WANG Feng, initiator of Apache Flink Community China and head of Open-Source Big Data Platform at Alibaba Cloud, at Flink Forward Asia 2023.

1. Global Growth of Apache Flink

The Apache Flink project has experienced steady and rapid growth over the past decade. Its community is diverse and collaborative, drawing on the expertise of more than 1,700 developers around the world, including in China, Europe, and the Americas. Meanwhile, Apache Flink's popularity has surged, reaching a record 22 million monthly downloads this year. An increasing number of organizations and individual users now employ Apache Flink for real-time data processing, whether for learning, testing, or production purposes.

2. Apache Flink Honored with 2023 SIGMOD Systems Award

The ACM Special Interest Group on Management of Data (SIGMOD) conference holds a distinguished position in the field of data management. The annual SIGMOD Systems Award recognizes a technology's broad utilization and worldwide influence on large-scale data management systems. This year, Apache Flink was honored with the award for its pivotal role in real-time stream data processing. This cements Apache Flink's status as the de facto global standard for stream computing.
Notably, Chinese developers account for 50% of the contributors listed for the award, underscoring their crucial role in advancing Apache Flink's development.

3. Apache Flink Community China's Fifth Anniversary

The substantial contributions of Chinese developers are instrumental to Apache Flink's success in the global open-source arena. Apache Flink Community China was founded in 2018 to promote the development and adoption of Apache Flink in China. Prominent Chinese companies like Tencent, Kuaishou, ByteDance, and Meituan have propelled the community's fast evolution through active engagement. A key initiative of this community is Flink Forward, a conference series that brings data streaming communities together. Alibaba Cloud has hosted the Flink Forward Asia (FFA) conference every year since 2018. Together, these efforts have vastly enriched the educational resources available to Apache Flink developers.

4. Apache Flink Major Releases in 2023

The robust growth of Apache Flink is fueled by a vibrant community, sustained operations, and, fundamentally, the strength of its technology. The Apache Flink community continues to roll out new features and innovations that deliver more value to users, which in turn inspires the user contributions that propel Apache Flink's development forward.
Here is a summary of Apache Flink's highlights in 2023:
Following its semi-annual release schedule, Apache Flink launched two major releases in 2023: 1.17 in the first half and 1.18 in the second half, drawing a significant number of new contributors. These releases underscore the community's dedication to advancing stream processing based on user requirements and real-world applications. In addition, the community stepped up its efforts to refine batch processing in terms of execution performance and feature completeness. Together, these enhancements position Apache Flink as an outstanding processing engine for computations over both bounded and unbounded data streams.
The community also actively explored groundbreaking use cases. To mobilize and process more data, Apache Flink is promoting collaborative synergies with the lakehouse architecture—the new frontier in data analytics. As more and more users are transitioning from legacy Hive-based data warehousing to the more agile data lakehouse architecture, Apache Flink's real-time compute capabilities come to the fore, poised to supercharge data flow and analytical performance within data lakehouses.
Moreover, the community refined Apache Flink for seamless cloud integration to fully embrace cloud native—the new bedrock for big data, artificial intelligence, databases, and a myriad of computational frameworks. As cloud native gains more popularity, it is imperative for cloud-enabled Apache Flink applications to deliver consistent and high-quality user experiences.

5. Streaming SQL: Striving for Excellence

Let's take a closer look at the key technical improvements in Apache Flink's two most recent versions, starting with a special focus on stream processing.
Streaming SQL has captivated the interest of Apache Flink users. Flink SQL, Apache Flink's implementation of Streaming SQL, is a robust engine designed for real-time analysis of streaming data. Over the past year, the Apache Flink community made significant investments to improve Flink SQL. Hundreds of issues were addressed, with help from more than 40 new contributors. One notable feature is PLAN_ADVICE, which analyzes a query for semantic correctness, warns about potential risks, and suggests optimizations. This is a highly practical feature that can help users improve the accuracy and reliability of their query results.
SQL users want Flink SQL to have the same flexibility as the DataStream API. In the past, building sophisticated Apache Flink programs meant calling the DataStream API from a programming language such as Java. Today, Apache Flink has taken significant strides to bring the same flexibility to SQL. For example, SQL users can now flexibly manage watermarks and configure a state time-to-live (TTL) for each operator. Apache Flink also upgraded Calcite, the underlying framework of Flink SQL, to strengthen plan optimization.
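To make the idea of a per-operator state TTL concrete, here is a minimal Python sketch. It is a toy model and not Flink's API: the `TtlState` class, the operator names, and the TTL values are all hypothetical. It only shows why it can help to give each operator its own expiry rather than one job-wide setting.

```python
class TtlState:
    """Toy keyed state whose entries expire `ttl_ms` after their last update.

    Hypothetical illustration only; this is not a Flink class.
    """

    def __init__(self, ttl_ms):
        self.ttl_ms = ttl_ms
        self._entries = {}  # key -> (value, last_update_ms)

    def put(self, key, value, now_ms):
        # Every write refreshes the entry's timestamp.
        self._entries[key] = (value, now_ms)

    def get(self, key, now_ms):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, updated_ms = entry
        if now_ms - updated_ms >= self.ttl_ms:
            del self._entries[key]  # expired: drop it lazily on read
            return None
        return value


# Two operators in the same job, each with its own (assumed) TTL.
join_state = TtlState(ttl_ms=60_000)      # short-lived join buffer
dedup_state = TtlState(ttl_ms=3_600_000)  # long-lived deduplication keys

join_state.put("order-1", "pending", now_ms=0)
dedup_state.put("order-1", True, now_ms=0)

# Two minutes later, the join entry has expired but the dedup entry has not.
print(join_state.get("order-1", now_ms=120_000))
print(dedup_state.get("order-1", now_ms=120_000))
```

With a single job-wide TTL, the join buffer would have to be kept as long as the deduplication keys, or the deduplication keys would expire as quickly as the join buffer. Separate TTLs avoid that trade-off.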

6. Generic Incremental Checkpoint (GIC): Wide Adoption

The core architecture of Apache Flink has been upgraded to enhance its capabilities for stateful computation, including state storage, access, and snapshots. A cornerstone of these enhancements is the checkpointing mechanism, which Apache Flink uses to snapshot the state of a distributed streaming job at regular intervals. The more frequent the checkpoints, the less data has to be replayed during recovery. For example, checkpointing every few seconds means that a recovering job replays only a few seconds of data, which dramatically improves the user experience. With the release of Apache Flink 1.17 and 1.18, the GIC feature has been widely adopted in production environments, particularly among companies in China. GIC uses a log-based approach to decouple state materialization from the checkpoint procedure. This not only speeds up checkpointing but also reduces the sudden spikes in resource consumption that occur when many tasks perform I/O at the same time. In short, checkpointing is now faster and smoother.
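The log-based decoupling described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Flink's implementation: the `ChangelogState` class and its method names are hypothetical. Every write is appended to a changelog, a full snapshot is materialized on its own schedule, and a checkpoint only references the last snapshot plus the log tail written since then.

```python
from dataclasses import dataclass, field


@dataclass
class ChangelogState:
    """Toy keyed state that logs every write (hypothetical, not Flink's API)."""
    table: dict = field(default_factory=dict)         # live in-memory state
    log: list = field(default_factory=list)           # append-only changelog
    materialized: dict = field(default_factory=dict)  # last full snapshot
    materialized_upto: int = 0                        # log offset the snapshot covers

    def put(self, key, value):
        # Each write is applied locally and appended to the changelog.
        self.table[key] = value
        self.log.append((key, value))

    def materialize(self):
        # Expensive full snapshot, run independently of checkpoints.
        self.materialized = dict(self.table)
        self.materialized_upto = len(self.log)

    def checkpoint(self):
        # Cheap: the last snapshot plus the changes logged since it was taken.
        return {
            "snapshot": self.materialized,
            "changelog": list(self.log[self.materialized_upto:]),
        }


def restore(cp):
    # Recovery replays the short changelog tail on top of the snapshot.
    state = dict(cp["snapshot"])
    for key, value in cp["changelog"]:
        state[key] = value
    return state


state = ChangelogState()
state.put("a", 1)
state.materialize()       # background snapshot covers {"a": 1}
state.put("b", 2)
cp = state.checkpoint()   # only the write to "b" travels with the checkpoint
print(restore(cp))
```

Because a checkpoint carries only the small log tail, its cost no longer depends on the size of the full state. This is the intuition behind checkpoints that are both faster and less prone to I/O spikes.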

7. Batch Engine: Production-Ready

Apache Flink has distinguished itself in the field of stream processing. However, our ambitions extend beyond the individual strengths of stream or batch processing; we want Apache Flink to stand out as a unified stream-batch processing engine. This year, the community put considerable effort into enriching features, improving stability, and streamlining usability for batch processing. These efforts established Apache Flink's position as an industry leader in batch processing.
Numerous companies have reported positive results from using Flink's batch engine. The community has worked extensively to improve the performance of the batch engine by leveraging the capabilities of stream processing and incorporating traditional optimization techniques.
These improvements have been measured with the TPC-DS benchmark. Compared with Apache Flink 1.16, Apache Flink 1.18 improves batch execution performance by more than 50% on a 10 TB dataset. With continued optimization, Apache Flink is on its way to becoming the most comprehensive and efficient computing engine for stream and batch processing.

8. Toward Cloud-Native Elasticity

Beyond the advancements of Apache Flink's core engine, we've seen numerous improvements in the cloud-native distributed deployment architecture, a domain of increasing significance as computational workloads move to the cloud. Cloud computing is recognized for its elasticity and seemingly boundless resource pool. It allows users to scale computing resources on demand. Apache Flink has dedicated considerable effort this year to keeping pace with this trend.
For example, Apache Flink introduced the dynamic rescaling feature. You can now change the parallelism of any individual task of a job through the REST API without a full job restart. Working together with the state backend, which accelerates state download and recovery, a rescaling operation takes effect almost instantly. The community understands that manual scaling is impractical because business workloads are unpredictable. To address this issue, Apache Flink introduced the Kubernetes Autoscaler, which monitors workloads in real time and adjusts job parallelism accordingly. With the Autoscaler, you can achieve autonomous elastic scaling and fully leverage the cloud environment to meet your business needs.
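As a rough illustration, the Python sketch below combines the two ideas: a sizing rule that picks a parallelism from the observed load, and a REST request that asks for that parallelism. Several things here are assumptions rather than facts from the talk. The sizing rule in `target_parallelism` is a simplified, hypothetical formula, not the Autoscaler's actual algorithm. The request shape assumes the `/jobs/<job-id>/resource-requirements` endpoint described in the Flink 1.18 release announcement, so check the path and body against the REST API documentation for your version. The host, job ID, and vertex ID are placeholders. The request is built but never sent.

```python
import json
import math
import urllib.request


def target_parallelism(observed_rate, capacity_per_task,
                       utilization=0.8, max_parallelism=128):
    """Hypothetical sizing rule: enough tasks that the observed input
    rate uses only `utilization` of the total processing capacity."""
    needed = math.ceil(observed_rate / (capacity_per_task * utilization))
    return max(1, min(needed, max_parallelism))


def build_rescale_request(base_url, job_id, vertex_parallelism):
    """Build (but do not send) a PUT request with per-vertex parallelism.

    Assumed body shape: {vertex_id: {"parallelism": {"lowerBound", "upperBound"}}}.
    """
    body = {
        vertex_id: {"parallelism": {"lowerBound": p, "upperBound": p}}
        for vertex_id, p in vertex_parallelism.items()
    }
    return urllib.request.Request(
        url=f"{base_url}/jobs/{job_id}/resource-requirements",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Placeholder values: 9,000 records/s arriving, each task handles about 1,000/s.
parallelism = target_parallelism(observed_rate=9000, capacity_per_task=1000)
request = build_rescale_request(
    "http://localhost:8081", "<job-id>", {"<vertex-id>": parallelism}
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

An autoscaler essentially runs this loop continuously: measure the load, compute a target, and apply it. That is why the manual REST call and the Kubernetes Autoscaler complement each other.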

9. Toward Streaming Lakehouse

Apache Flink has ventured into numerous novel business scenarios. One of the biggest innovations is its integration with the lakehouse, the next-generation data analytics architecture. As a premier real-time computing engine, Apache Flink can leverage this integration to accelerate data flow within a lakehouse. This year, Apache Flink introduced several APIs to facilitate the management of data lakehouses. These APIs extend Apache Flink's support for storage formats and streamline data ingestion and retrieval. Apache Flink also introduced a Java Database Connectivity (JDBC) driver. You can now connect traditional business intelligence (BI) applications to Apache Flink to analyze data in a lakehouse.

10. Big Data: From Batch to Real-Time Processing

Over the past years, Apache Flink has been at the forefront of technological innovation and architectural evolution, responding adeptly to changing user demands. Apache Flink's achievements in the field of stream processing echo the emerging trend in the big data industry: the shift of the business landscape from traditional batch processing to real-time analysis. This evolution is beginning to take root in more and more industries, including internet services, financial services, manufacturing, and transportation. By adopting real-time data processing, businesses can harness the full potential of their data assets.


Apache Flink Community
