This article is based on the keynote speech delivered by Ashish Sharma, Solution Architect at AI Rudder, and Ganireddy Jyothi Swaroop, Senior Research Associate at Asia School of Business, at Flink Forward Asia 2024 in Jakarta.
Swaroop
Hello everybody, and welcome again to Flink Forward. Real time is the future. Alongside me is Ashish. We will be talking about how we can harness streaming data for AI applications using Flink.
Swaroop
Our presentation will consist of three distinct sections. In the first section, we will discuss the evolution of Large Language Models (LLMs). In the second section, we will explain how Apache Flink can be utilized to stream data into AI applications. Finally, in the third section, we will demonstrate the framework for integrating both LLMs and Apache Flink.
Swaroop
Let's get started with the evolution of LLMs. How did the concept begin, given that AI has been around for over 50 years? The development of LLMs, however, only started recently. In 2017, Google published the paper "Attention Is All You Need". Before that, there were several statistical models which relied primarily on rule-based systems and had limited intuition. The primary goal of developing LLMs was to ensure that machines could mimic human actions and behaviors. This was the point when the idea of LLMs emerged, leading to the decline of purely statistical models. Now, let's move beyond the pre-ChatGPT era and discuss the GPT era. Over to Ashish.
Ashish
Yeah, so I won't go back 50 years; let's focus on the last three years. In 2022, there was an explosion of interest in ChatGPT, and people began to understand what Large Language Models (LLMs) are all about. AI and Machine Learning communities were already aware of LLMs, but ChatGPT sparked widespread attention.
By 2023, LLM-based solutions were being experimented with extensively. Major players like Microsoft and Alibaba Cloud introduced their own LLM solutions, and the market saw a significant influx of these offerings. As developers, we were engaged in testing, proof-of-concept (POC) projects, and developing prototypes for our enterprise customers.
Now, in 2024, it's the year of production. Enterprise customers have adopted LLM-based solutions, which are now running in production environments. We have fully developed solutions for LLMs, and enterprise customers have key performance indicators (KPIs) and budgets allocated for LLM solutions.
As you can see in this slide, industries across the board—banking, software, communication, media, travel, video production—are all utilizing LLM-based solutions. They are leveraging various data sources in the form of text, images, audio, and video to train their models. These models can be any LLM-based algorithm, and they are being used to develop applications such as content creation, Q&A chatbots, anomaly detection, and risk management. Numerous applications are being created specifically for enterprise customers. However, just as every positive aspect has its downsides, there are challenges and negatives to consider as well. Let's hear more about these challenges from Swaroop.
Swaroop
Let's explore the current challenges in the area of large language models (LLMs). The first issue is hallucination, the second is verifiability, and the third is the knowledge cutoff. All existing LLMs are trained on pre-existing data and are not updated with real-time inputs or data. We would like to highlight these three challenges.
Regarding hallucination, for example, when I asked an LLM for updates on a topic related to a "queen," it generated multiple interpretations, asking if I meant Queen Elizabeth, the band Queen, or a fictional character. This illustrates a common challenge in basic LLMs: their tendency to hallucinate due to a lack of recent training data.
The second challenge involves verifiability. Without real-time data, LLMs are unable to accurately answer contemporary queries. For instance, when I inquired about the winner of the 2024 U.S. presidential election, the LLM could not provide an answer due to its data limitations. Sometimes, it might even claim that the election hasn't occurred. We aim to develop solutions to address these issues. I believe Ashish is the right person to provide an introduction to potential solutions.
Ashish
Thanks to our previous speaker, who elaborated on Retrieval-Augmented Generation (RAG), understanding this concept should now be easier for you. I won't delve deeply into RAG: in essence, you retrieve documents from your data sources, so enterprise customers can pass their own data to the Large Language Model (LLM) and have it generate responses grounded in those specific documents. What are the benefits of this approach? We get up-to-date and accurate data, reduced hallucination, and an efficient, cost-effective solution compared to other options available in the market.
Our previous speaker outlined a two-step process: data ingestion and storage, followed by retrieval and generation. Let's discuss these steps for better understanding. The first step, data ingestion, involves creating a knowledge base of the available data and the latest information you want the LLM to know beyond what it has been trained on. Initially, you must collect data from various sources, whether PDFs, Excel files, CSVs, or images. This data is segmented using a recursive text splitter or a similar tool. The segments are then passed through an embedding model that converts them into arrays of floating-point numbers, known as vectors, which are stored in a vector database. Storing data this way helps the machine capture the relationships and positions of words within sentences, making vector databases highly effective for data retrieval.
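To make the ingestion step concrete, here is a minimal Python sketch of that pipeline. It assumes the documents are already loaded as plain text, uses the sentence-transformers library purely as an example embedding model, and stands in an in-memory list for a real vector database; the chunking helper is a simplified stand-in for a recursive text splitter.

```python
# Ingestion sketch: split documents, embed the chunks, store the vectors.
# Assumptions: sentence-transformers is installed; an in-memory list stands in
# for a real vector database.
from sentence_transformers import SentenceTransformer

def split_text(text, chunk_size=500, overlap=50):
    """Naive splitter: fixed-size chunks with a small overlap."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

documents = ["...text extracted from a PDF...", "...rows exported from a CSV..."]
vector_store = []  # placeholder for a real vector database

for doc in documents:
    for chunk in split_text(doc):
        vector = embedder.encode(chunk)  # floating-point vector
        vector_store.append({"text": chunk, "vector": vector})
```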
Now that we have the knowledge base, let's consider the application layer's role in retrieving data from the vector database. When a user submits a query, the application embeds the query with the same embedding model and sends it to the vector database, which returns the most relevant data chunks. Once these chunks reach the application layer, the application sends both the query and the retrieved chunks to the LLM to generate a response. Since the LLM now has the context of both the query and the retrieved chunks, it can generate an accurate answer without hallucinating. Finally, the LLM returns its response, which is displayed to the user through the application layer.
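Continuing the sketch above, the retrieval-and-generation step might look like the following. The cosine-similarity search stands in for a real vector database query, and `call_llm` is a hypothetical placeholder for whichever chat-completion client you actually use.

```python
# Retrieval and generation sketch, continuing from the ingestion example.
# `call_llm` is a hypothetical placeholder for your chat-completion client.
import numpy as np

def retrieve(query, top_k=3):
    """Return the top_k stored chunks most similar to the query (cosine similarity)."""
    q = embedder.encode(query)
    scored = []
    for item in vector_store:
        v = item["vector"]
        score = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        scored.append((score, item["text"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]

def answer(query):
    context = "\n\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)  # hypothetical: swap in your LLM client of choice
```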
Ashish
I believe everything has been resolved after implementing RAG, correct? All the issues should now be addressed. As you can see, our application can review logs to identify anomalies, and the LLM is providing the answers. The responses from our application logs are here, indicating that the issue is resolved, right?
Swaroop
Have you checked for the most recent log data?
Ashish
Latest data? Let me try, okay?
Swaroop
Let's consider data from one hour ago.
Ashish
Oh, it's not there.
Swaroop
So how come?
Ashish
Do we really need real-time data?
Swaroop
Yes, real-time data is crucial. That's the main objective of this entire event, isn't it? If we only send batches of data and store them in a vector database, the data is outdated because it isn't real-time. This is why we want to utilize Apache Flink for its real-time capabilities. It enhances the user experience by keeping users updated with the most recent developments and keeping the knowledge base current. We now understand the RAG architecture, right? The next important aspect to discuss is Apache Flink. Let's hear from Ashish about it.
Ashish
I believe everyone in this room is familiar with Apache Flink. Apache Flink is an open-source framework driven by the community, which supports both batch and real-time data processing. It can handle large volumes of data and provides low latency, making it a very efficient tool. For those who might not be familiar with Apache Flink, this event is an excellent opportunity to learn about it. There will be tech talks later where you can dive deeper into its functionalities.
Let me highlight some of the key features of Apache Flink:
Open Source: Being open-source, it allows developers to tailor solutions to their specific needs.
Streaming and Real-time Data Processing: It supports real-time data processing, crucial for modern applications.
Fault Tolerance: It ensures data processing continues smoothly even in case of failures.
Batch Processing: Besides real-time processing, it is also adept at handling batch data processing.
Distributed System: Its distributed nature ensures it can handle large-scale data across multiple nodes.
Stateful Operations: Supports stateful operations, which is vital for complex data processing tasks.
Low Latency: Provides low latency, which is especially important for real-time workloads such as serving Large Language Model (LLM) applications.
Apache Flink's numerous features make it a versatile tool for real-time LLM-based solutions and other data processing needs.
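As a small illustration of several of these features together, here is a minimal PyFlink sketch of a streaming job with a keyed, stateful operator that keeps a running count per key. The in-memory collection stands in for a real stream source, and the checkpointed keyed state is what gives the job its fault tolerance.

```python
# Minimal stateful streaming job sketch (assumes apache-flink is installed).
from pyflink.common import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import KeyedProcessFunction, RuntimeContext
from pyflink.datastream.state import ValueStateDescriptor


class CountPerKey(KeyedProcessFunction):
    def open(self, runtime_context: RuntimeContext):
        # Keyed state: restored from checkpoints after a failure.
        self.count = runtime_context.get_state(
            ValueStateDescriptor("count", Types.LONG()))

    def process_element(self, value, ctx):
        current = (self.count.value() or 0) + 1
        self.count.update(current)
        yield value[0], current


env = StreamExecutionEnvironment.get_execution_environment()
events = env.from_collection(
    [("login", 1), ("search", 1), ("login", 1)],
    type_info=Types.TUPLE([Types.STRING(), Types.INT()]))

events.key_by(lambda e: e[0]) \
      .process(CountPerKey(), output_type=Types.TUPLE([Types.STRING(), Types.LONG()])) \
      .print()

env.execute("stateful-count")
```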
One of the most valuable aspects of Apache Flink is its seamless integration with multiple service providers. This capability enables easy data ingestion from various sources and efficient data output to numerous sinks. Many partners and service providers support Apache Flink, allowing smooth data flow and processing. Apache Flink processes and analyzes incoming data and then sends the processed data to the designated sinks. The extensive support from different service providers ensures you have a robust ecosystem for data handling.
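As a hedged example of that source/sink integration, the following PyFlink sketch declares a Kafka source table and a print sink with Flink SQL connectors and moves records between them. The topic, broker address, and schema are placeholders, and the Kafka SQL connector jar is assumed to be on the classpath.

```python
# Source/sink integration sketch using Flink SQL connectors.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Kafka source: placeholder topic, broker, and schema.
t_env.execute_sql("""
    CREATE TABLE app_logs (
        log_time TIMESTAMP(3),
        message  STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'app-logs',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'log-reader',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# Print sink: swap in any other connector (JDBC, Elasticsearch, etc.) as needed.
t_env.execute_sql("""
    CREATE TABLE processed_logs (
        log_time TIMESTAMP(3),
        message  STRING
    ) WITH ('connector' = 'print')
""")

# Continuously move records from the source to the sink.
t_env.execute_sql(
    "INSERT INTO processed_logs SELECT log_time, message FROM app_logs"
).wait()
```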
The next important consideration when developing an application is how to deploy the solution. Some people prefer deploying on their own systems, while others opt for managed cloud services or third-party vendors. Apache Flink stands out because it can be deployed in a variety of environments, including Docker, public cloud, and other service providers like Decodable and Ververica. This flexibility and seamless end-to-end integration make Apache Flink a key player for real-time data streaming solutions.
The final topic is implementation. It's crucial to understand how to implement a solution that combines Apache Flink and LLM. Before diving into the implementation, let's review the components and architecture available for developing an LLM-based solution.
At a high level, the components include:
Data Sources: These can come from IoT applications, mobile apps, and other services.
Kafka: This acts as a message broker, facilitating the flow of data.
Apache Flink: processes and routes the streaming data.
Embedding Model: This can be either open-source or proprietary, depending on your business requirements.
Vector Databases: These are used to store vector representations of data.
LLM: In this example, we are using Qwen.
In this setup, users can submit queries, and the system will provide responses based on the data processed by Apache Flink and the LLM. Understanding the end-to-end solution can be very useful for implementing and managing such a system effectively.
Swaroop
Now that we have reviewed the components, let's discuss the flow and architecture we aim to implement, particularly focusing on streaming live, real-time data. The data will be sourced from various points and funneled through Kafka, which delivers the topics to Apache Flink. As mentioned in the previous talk, the inputs are then sent to the embedding model via asynchronous I/O.
In this architecture, the embedding model converts the data into vectors stored in a vector database, which is continuously updated with real-time data. When a user asks a question, the query goes to the embedding model first; the relevant data chunks are then retrieved and used to generate an appropriate answer. In this case, Qwen is the model answering the user's queries.
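One way to sketch the embedding leg of this pipeline in PyFlink is with a Python scalar UDF, as below. The datagen source and print sink are placeholders for the Kafka source and vector-database sink of the real architecture, the sentence-transformers model is only an example, and since Flink's Async I/O operator is part of the Java/Scala DataStream API, this Table API sketch keeps the embedding call synchronous for clarity.

```python
# Embedding step inside the streaming pipeline as a Python scalar UDF.
# Assumptions: PyFlink and sentence-transformers are available on the workers;
# the datagen source and print sink stand in for Kafka and the vector database.
from pyflink.table import DataTypes, EnvironmentSettings, TableEnvironment
from pyflink.table.udf import udf

_model = None  # lazily initialized once per worker process

@udf(result_type=DataTypes.ARRAY(DataTypes.FLOAT()))
def embed(text: str):
    global _model
    if _model is None:
        from sentence_transformers import SentenceTransformer
        _model = SentenceTransformer("all-MiniLM-L6-v2")  # example model
    return [float(x) for x in _model.encode(text)]

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t_env.create_temporary_function("embed", embed)

# Placeholder source generating random payload strings.
t_env.execute_sql("""
    CREATE TABLE events (
        payload STRING
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '1',
        'fields.payload.length' = '20'
    )
""")

# In the real pipeline this would be a vector-database sink table.
t_env.execute_sql("""
    CREATE TABLE embedded_events (
        payload STRING,
        embedding ARRAY<FLOAT>
    ) WITH ('connector' = 'print')
""")

t_env.execute_sql(
    "INSERT INTO embedded_events SELECT payload, embed(payload) FROM events"
).wait()
```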
We believe this is an interesting architecture to work on. However, we also want to consider the other side of the coin and discuss some of the challenges and considerations we might face while implementing this system.
As members of the open community, the first thing we considered is latency: each module has different response times and inference costs, and these variations have to be accounted for, especially when running multiple instances simultaneously against large data sources. This also adds to model complexity. Additionally, if you fine-tune the model on a specific set of instructions, there is a risk of bias and overfitting. Another critical aspect is ethics: guardrails must be in place to ensure that the model stays within its intended context and that real-time data is not misused.
With that said, we would like to conclude our presentation. Thank you so much for listening. We appreciate the opportunity provided by our organizers to share our use case on this platform, particularly regarding the development of Large Language Models (LLMs). Apache Flink is a tool that many companies, both large and small, are utilizing. Kudos to our organizers, everyone involved, and especially the community. The community's support is crucial for the growth of our solutions, helping us address daily challenges across various industries.
Thank you all very much.