This article is based on "Practices of Flink in iQiyi's Advertising Business", a talk given by Han Honggen, Technical Manager at iQiyi, at the Flink Meetup (Beijing) on May 22.
GitHub address: https://github.com/apache/flink
Your support is truly appreciated.
The use of real-time data in the advertising business can be divided into four main areas:
Business practices are mainly divided into two categories. The first is real-time data warehouses, and the second is feature engineering.
The goals of real-time data warehouses include data integrity, service stability, and query capabilities.
The figure above is the basic architecture diagram of the advertising data platform, which is viewed from bottom to top:
The advertising team consumes the services of the middle data-production layer, typically offline computing and real-time computing.
The data production pipeline is organized by time granularity. We started with an offline data warehouse pipeline; as real-time requirements grew, a real-time pipeline was added at the bottom layer. Overall, this is a typical Lambda architecture.
In addition, we use a multi-pipeline redundancy mode for some core metrics, such as billing metrics, because their stability is critical to downstream consumers. Starting from the source logs, both the computing layer and the downstream storage layer are fully redundant, and the results are unified in the subsequent query layer. This is what multi-pipeline redundancy means.
The preceding part described how the real-time metrics provided externally are kept stable and immutable. The core of the progress service is based on two signals: the trend of a metric within its time window and the status of the real-time computing tasks. In the real-time data warehouse, many metrics are aggregated over time windows.
For example, suppose a real-time metric uses a three-minute window. The metric at 4:00 covers the data from 4:00 to 4:03, and the metric at 4:03 covers the data from 4:03 to 4:06; each published value refers to the data of one time window. In real-time computing, data keeps arriving: the output for the 4:00 window starts to appear at 4:00, rises as more data comes in, and eventually stabilizes. We judge whether it has stabilized based on the rate of change of the window metric.
However, judging by this signal alone still has drawbacks.
The result table depends on many upstream computing tasks. If any task in the pipeline has a problem, the current metric may be incomplete even though its trend looks stable. Therefore, we additionally introduced the task status of real-time computing: when a metric tends to be stable and all computing tasks on the production pipeline are healthy, the metric for that time point is considered stable and can be served externally.
If a computing task is stuck, backlogged, or restarted due to an exception, we wait for subsequent iterations before publishing the metric.
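To make the stabilization rule concrete, here is a minimal Java sketch of the kind of check described above. The class and method names and the change-rate threshold are hypothetical illustrations, not iQiyi's actual implementation.

```java
// Hedged sketch: a window metric is published only when its value has stopped
// changing noticeably AND every task on the production pipeline is healthy.
public class WindowProgressChecker {

    private static final double STABLE_CHANGE_RATE = 0.001; // 0.1% change between checks (illustrative)

    public static boolean isWindowFinal(double previousValue,
                                        double currentValue,
                                        boolean pipelineHealthy) {
        if (!pipelineHealthy) {
            return false; // a stuck, backlogged, or restarting task may hide missing data
        }
        if (previousValue == 0) {
            return false; // not enough history to judge the trend yet
        }
        double changeRate = Math.abs(currentValue - previousValue) / previousValue;
        return changeRate < STABLE_CHANGE_RATE;
    }
}
```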
The preceding figure shows the diagram of query service architecture.
At the bottom is the data layer. The real-time storage engines include Druid. Offline data is stored in Hive, but it is synchronized to OLAP engines at query time. Two engines are used here: data is synchronized to the OLAP engine and queried with Impala, which performs Union queries together with Kudu; for more fixed scenarios, the data can also be imported into Kylin for analysis.
Multiple query nodes are built on top of this data, and an intelligent routing layer sits above the data layer. When a query request arrives, the query gateway at the top determines whether it is a complex scenario. For example, if a query's time range spans both offline and real-time computing, both offline tables and real-time tables are needed.
Moreover, offline tables involve more complex table-selection logic, such as choosing between hour-level and day-level tables. After this complex-scenario analysis, the target tables are roughly determined. Intelligent routing also consults some basic services shown on the left, such as metadata management and progress management.
For query-performance optimization, the amount of data scanned at the bottom has a huge impact on final performance. Therefore, we build dimension-reduced tables based on the analysis of historical queries, determining which dimensions a reduced table should contain and how many queries it can cover.
As mentioned earlier, real-time data reports are mainly produced through APIs. The Lambda architecture has a problem: the real-time and offline teams both have to develop against the same requirement, which causes several issues.
Therefore, our goal is to unify stream and batch processing: can one piece of logic express the same business requirement at the computing layer and be executed by both the stream and batch engines to achieve the same result?
In this pipeline, the raw data is ingested through Kafka, processed by unified ETL logic, and written into the data lake. The data lake itself supports both streaming and batch reads and writes and can be consumed in real time, so it can serve both real-time and offline computing, whose results are written back into the data lake in a unified way.
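As a rough illustration of this unified ETL step, here is a minimal Flink SQL sketch wrapped in Java. The data lake format (Iceberg), topic names, connector options, and field names are all assumptions for illustration; the article does not specify the exact technologies.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

// Hedged sketch: one piece of SQL reads the raw Kafka log and writes it into a
// data-lake table that supports both streaming and batch access downstream.
public class UnifiedEtlJob {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Raw ad logs from Kafka (illustrative schema and topic)
        tEnv.executeSql(
            "CREATE TABLE raw_ad_log (" +
            "  request_id STRING, ad_id BIGINT, event_type STRING, event_time TIMESTAMP(3)" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'ad_raw_log'," +
            "  'properties.bootstrap.servers' = 'kafka:9092'," +
            "  'format' = 'json')");

        // Data-lake table readable by both the real-time and offline warehouses
        tEnv.executeSql(
            "CREATE TABLE lake_ad_log (" +
            "  request_id STRING, ad_id BIGINT, event_type STRING, event_time TIMESTAMP(3)" +
            ") WITH (" +
            "  'connector' = 'iceberg'," +
            "  'catalog-name' = 'ad_catalog'," +
            "  'catalog-type' = 'hadoop'," +
            "  'warehouse' = 'hdfs:///warehouse/ad')");

        // The same ETL logic serves both downstream stream and batch consumers
        tEnv.executeSql(
            "INSERT INTO lake_ad_log " +
            "SELECT request_id, ad_id, event_type, event_time " +
            "FROM raw_ad_log WHERE event_type IS NOT NULL");
    }
}
```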
As mentioned earlier, queries combine offline and real-time data. Writing both into the same data-lake table therefore saves a lot of work at the storage level and saves storage space.
SQL support is a capability provided by Talos, our real-time data warehouse platform.
The page includes several functions, with project management on the left and Source, Transform, and Sink operators on the right.
For example, you can add a Kafka data source, then add a Filter operator to express the filtering logic, then apply some projection or union computing, and finally output the result to a certain sink.
Users with slightly stronger skills can implement higher-level computing, and the goal of a real-time data warehouse can also be achieved this way: create data sources, express the logic in SQL statements, and output the data to a certain storage medium.
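As a rough illustration of the Source → Filter → Sink pipeline described above (whether assembled from operators or expressed in SQL), here is a minimal DataStream sketch. The topic name, filter condition, and print sink are illustrative placeholders.

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

// Hedged sketch of a simple platform job: Kafka source -> Filter -> sink.
public class SimplePipelineJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");
        props.setProperty("group.id", "demo");

        // Source operator: raw ad events from Kafka
        FlinkKafkaConsumer<String> source =
            new FlinkKafkaConsumer<>("ad_events", new SimpleStringSchema(), props);
        DataStream<String> events = env.addSource(source);

        // Filter operator: keep only click events (illustrative condition)
        DataStream<String> clicks = events.filter(line -> line.contains("\"event\":\"click\""));

        // Sink operator: print stands in for whatever storage medium is configured
        clicks.print();

        env.execute("simple-pipeline");
    }
}
```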
The above is at the development level. At the system level, the platform provides other functions, such as rule verification, development, testing, and release, all managed uniformly. There is also monitoring: online real-time tasks expose many real-time metrics, which can be checked to determine whether a task is currently in a normal state.
Feature engineering has two requirements:
1. The first requirement is freshness, because the value of data decreases over time. For example, if a user likes watching children's content, the platform will recommend children-related advertisements. Users also give positive or negative feedback while watching advertisements; if this data is iterated into features in real time, the subsequent conversion effect can be improved.
2. The second requirement of feature engineering is service stability.
The following describes our real-time feature practices. The first case is the requirement for click-through rate (CTR) estimation.
The background of the CTR estimation case is shown above. In the delivery pipeline, when a user at the advertisement frontend shows viewing behavior, the frontend requests an advertisement from the advertisement engine. During advertisement recall, the engine retrieves user features and advertisement features for coarse and fine ranking. After the advertisement is returned to the frontend, the user may produce behavioral events, such as exposures and clicks. For CTR estimation, the features from the request stage must be associated with the exposure and click events in the subsequent user behavior stream to form Session data. This is our data requirement.
Two aspects are included in real-world practices:
What are the challenges in practice?
In terms of time ordering, the feature arrives earlier than the tracking event. The question is how long the feature must be retained so that the join success rate of the two streams stays above 99%. In the advertising business, users can download content for offline viewing; the advertisement request and response complete during the download, but if the user then watches without an Internet connection, the events are not reported immediately. Only when the connection is restored are the subsequent exposure and click events returned.
As a result, the feature stream must be retained for a very long time relative to the tracking stream. Offline data analysis showed that keeping the join rate of the two streams above 99% requires retaining the feature data for a relatively long period; we currently retain it for seven days, which is still quite long.
The figure above shows the overall structure of CTR estimation. The correlation we mentioned earlier includes two parts:
However, since the timelines of the two streams may be staggered, the feature may not have arrived yet when the exposure and click events appear, which means the feature cannot be fetched at that moment. Therefore, we implemented a multi-level retry queue to ensure the eventual completeness of the correlation between the two streams.
The right side of the figure above explains in more detail why the CEP scheme was chosen for correlating events within a stream. The business requirement is to associate exposure events with click events of the same advertisement in the user behavior stream: clicks generated within, for example, five minutes after the exposure are taken as positive samples, while clicks generated later than that are discarded.
One may wonder which scheme can achieve this effect in such a scenario. Processing multiple events in one stream can be implemented with windows, but windows have potential problems:
Therefore, after extensive technical research at that time, we found that CEP in Flink could achieve this effect. It uses a pattern-matching style to describe which conditions the event sequences must meet, and it allows specifying a time window for the event interval, such as a 15-minute interval between exposure and click events.
On the left side of the figure above, the begin clause defines an exposure followed by a click within five minutes of that exposure; the subsequent clauses describe a click that can occur multiple times, and the within clause specifies the duration of the correlation window.
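Below is a hedged Flink CEP sketch of this exposure-to-click pattern. The AdEvent class, its accessors, the key choice, and the five-minute window are illustrative assumptions rather than the exact production pattern.

```java
import java.util.List;
import java.util.Map;
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;

// Hedged sketch: exposure followed by one or more clicks within five minutes.
public class ExposureClickPattern {

    public static DataStream<String> correlate(DataStream<AdEvent> behaviorStream) {
        Pattern<AdEvent, AdEvent> pattern = Pattern
            .<AdEvent>begin("exposure")
            .where(new SimpleCondition<AdEvent>() {
                @Override
                public boolean filter(AdEvent e) {
                    return "exposure".equals(e.getEventType());
                }
            })
            // clicks may occur multiple times after the exposure
            .followedBy("click")
            .where(new SimpleCondition<AdEvent>() {
                @Override
                public boolean filter(AdEvent e) {
                    return "click".equals(e.getEventType());
                }
            })
            .oneOrMore()
            // only clicks inside the correlation window count as positive samples
            .within(Time.minutes(5));

        // key by ad id (illustrative; the real key may combine user and request id)
        PatternStream<AdEvent> patternStream =
            CEP.pattern(behaviorStream.keyBy(AdEvent::getAdId), pattern);

        return patternStream.select(new PatternSelectFunction<AdEvent, String>() {
            @Override
            public String select(Map<String, List<AdEvent>> match) {
                AdEvent exposure = match.get("exposure").get(0);
                List<AdEvent> clicks = match.get("click");
                return exposure.getAdId() + " matched " + clicks.size() + " click(s)";
            }
        });
    }
}
```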
During production practice, this scheme achieved the correlation in most cases, but when comparing data, we found that some exposure and click events were still not correlated as expected.
Data analysis showed that the exposure and click timestamps of these events are at millisecond granularity, and when an exposure and a click carry exactly the same millisecond timestamp, they cannot be matched normally. Our workaround is to manually add one millisecond to the click events; with this adjustment, the exposure and click events correlate successfully.
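A minimal sketch of this workaround, applied while assigning event timestamps (the AdEvent class and its fields are again illustrative):

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

// Hedged sketch: shift click timestamps forward by 1 ms so that an exposure and
// a click sharing the same millisecond timestamp can still be ordered and matched.
public class ClickTimestampShift {
    public static WatermarkStrategy<AdEvent> watermarks() {
        return WatermarkStrategy
            .<AdEvent>forBoundedOutOfOrderness(Duration.ofSeconds(30))
            .withTimestampAssigner((event, recordTimestamp) ->
                "click".equals(event.getEventType())
                    ? event.getEventTime() + 1   // +1 ms for clicks
                    : event.getEventTime());
    }
}
```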
As mentioned earlier, feature data needs to be retained for seven days with a size of hundreds of terabytes. Data needs to be stored in an external storage medium. As a result, there are specific requirements for the external storage medium when making technical selections:
Based on the points above, HBase was finally selected to form the solution shown in the preceding figure.
The top part shows the exposure sequence being correlated with the click sequence through CEP. At the bottom, the feature stream is written into HBase by Flink as external state storage. The core module in the middle joins the two results: after an exposure-click pair is produced, it looks up the corresponding feature in HBase, and if the feature is found, the record is emitted to the normal result stream. Records whose features cannot be found go into a multi-level retry queue, which degrades across levels during repeated retries, and the retry interval widens step by step to reduce the scanning pressure on HBase.
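The following is a hedged sketch of the core join-with-retry idea, assuming hypothetical Session and Sample types and a stubbed lookupFeature() call in place of the real HBase read; the retry levels and delays are illustrative.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// Hedged sketch: look up the feature in external storage; if missing, re-schedule
// the lookup with step-by-step widening delays, and route the record to an
// "expired" side output after the last level (the exit mechanism).
public class FeatureJoinFunction extends KeyedProcessFunction<String, Session, Sample> {

    // retry gaps widen step by step to reduce scanning pressure on HBase (illustrative values)
    private static final long[] RETRY_DELAYS_MS = {60_000, 300_000, 1_800_000, 7_200_000};
    public static final OutputTag<Session> EXPIRED = new OutputTag<Session>("expired") {};

    private transient ValueState<Integer> retryLevel;
    private transient ValueState<Session> pending;

    @Override
    public void open(Configuration parameters) {
        retryLevel = getRuntimeContext().getState(
            new ValueStateDescriptor<>("retry-level", Integer.class));
        pending = getRuntimeContext().getState(
            new ValueStateDescriptor<>("pending-session", Session.class));
    }

    @Override
    public void processElement(Session session, Context ctx, Collector<Sample> out) throws Exception {
        tryJoin(session, ctx, out);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Sample> out) throws Exception {
        Session session = pending.value();
        if (session != null) {
            tryJoin(session, ctx, out);
        }
    }

    private void tryJoin(Session session, Context ctx, Collector<Sample> out) throws Exception {
        String feature = lookupFeature(session.getRequestId()); // HBase get, stubbed here
        if (feature != null) {
            out.collect(new Sample(session, feature));          // normal result stream
            retryLevel.clear();
            pending.clear();
            return;
        }
        int level = retryLevel.value() == null ? 0 : retryLevel.value();
        if (level >= RETRY_DELAYS_MS.length) {
            ctx.output(EXPIRED, session);                       // exit mechanism
            retryLevel.clear();
            pending.clear();
        } else {
            pending.update(session);
            retryLevel.update(level + 1);
            ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + RETRY_DELAYS_MS[level]);
        }
    }

    private String lookupFeature(String requestId) {
        return null; // placeholder for the HBase lookup
    }
}
```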
An exit mechanism is also provided because retries are performed a limited number of times. The main reasons for the exit mechanism include:
Therefore, the exit mechanism means that after a record has been retried multiple times, its feature is treated as expired, and the expired data is output separately.
The valid-click scenario also involves the correlation of two streams, but the technical selection is completely different.
First, let's look at the project background. In this scenario, the advertised content is a film: after clicking the advertisement, the user lands on the playback page, where they can watch for six minutes free of charge; to continue watching, they need to become a VIP. Valid-click data is collected here: a click after which the user watches for more than six minutes is defined as a valid click.
This scenario technically involves the correlation of two streams, including the click stream and playing heartbeat stream.
In this scenario, the gap between the two streams is relatively small: a film generally lasts a bit more than two hours, so the behavior after a click is roughly complete within three hours. The state to keep is therefore relatively small, and it can be managed with Flink's own state.
Now, let's look at a specific solution:
In the figure above, the green part is the click stream, and the blue part is the playing heartbeat stream.
Operators give users a lot of flexibility to implement their own control logic. Compared with Interval Join, this approach is more flexible for users; a sketch of such an operator follows.
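Here is a hedged sketch of a Connect + CoProcessFunction style operator for this scenario, assuming hypothetical ClickEvent and HeartbeatEvent types; the six-minute threshold comes from the requirement above, while the three-hour cleanup timer reflects the state-size reasoning earlier.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// Hedged sketch: keep the click in Flink state, accumulate playback heartbeats,
// and emit a valid click once the watched time exceeds six minutes.
public class ValidClickFunction
        extends KeyedCoProcessFunction<String, ClickEvent, HeartbeatEvent, ClickEvent> {

    private static final long SIX_MINUTES_MS = 6 * 60 * 1000L;
    private static final long STATE_TTL_MS = 3 * 60 * 60 * 1000L; // post-click behavior ends within ~3 hours

    private transient ValueState<ClickEvent> clickState;
    private transient ValueState<Long> watchedMs;

    @Override
    public void open(Configuration parameters) {
        clickState = getRuntimeContext().getState(
            new ValueStateDescriptor<>("click", ClickEvent.class));
        watchedMs = getRuntimeContext().getState(
            new ValueStateDescriptor<>("watched-ms", Long.class));
    }

    @Override
    public void processElement1(ClickEvent click, Context ctx, Collector<ClickEvent> out) throws Exception {
        clickState.update(click);
        // clean up the per-key state once the film session is surely over
        ctx.timerService().registerProcessingTimeTimer(
            ctx.timerService().currentProcessingTime() + STATE_TTL_MS);
    }

    @Override
    public void processElement2(HeartbeatEvent heartbeat, Context ctx, Collector<ClickEvent> out) throws Exception {
        ClickEvent click = clickState.value();
        if (click == null) {
            return; // heartbeat arrived before (or without) a click for this key
        }
        long total = (watchedMs.value() == null ? 0L : watchedMs.value()) + heartbeat.getIntervalMs();
        watchedMs.update(total);
        if (total >= SIX_MINUTES_MS) {
            out.collect(click);   // valid click: watched more than six minutes
            clickState.clear();   // emit at most once per click
            watchedMs.clear();
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<ClickEvent> out) {
        clickState.clear();
        watchedMs.clear();
    }
}
```

The two streams would be keyed by the same session or request id and connected, for example clickStream.connect(heartbeatStream).keyBy(ClickEvent::getSessionId, HeartbeatEvent::getSessionId).process(new ValidClickFunction()); the field names are again illustrative.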
To summarize the parts above: correlating two streams is currently a very common requirement, and there are many options, such as Window Join, Interval Join, and Connect + CoProcessFunction, as well as some user-defined solutions.
When making a selection, we recommend starting from the business and choosing the technique accordingly: first consider the relationship among the events in the streams, then estimate the state size. To a certain extent, this lets us rule out the infeasible schemes from the options above.
In Flink, fault tolerance is mainly implemented through checkpoints. Checkpoints are designed for task-level fault tolerance within a job; when a job is restarted, either actively or after an exception, its state is not recovered from the historical state.
Therefore, we made an improvement here: when a job starts, it obtains the last successful historical state from the checkpoint and initializes from it, achieving state recovery across restarts.
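A minimal sketch of how such recovery is typically wired up, assuming externalized checkpoints and restart via the checkpoint path (paths and intervals are illustrative):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hedged sketch: retain checkpoints externally so that, when the job is
// resubmitted, the platform can locate the last successful checkpoint and
// start the job from it.
public class CheckpointRecoverySetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().enableExternalizedCheckpoints(
            CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        // ... build and execute the job; on resubmission the platform passes the
        // latest completed checkpoint, e.g.:
        //   flink run -s hdfs:///flink/checkpoints/<job-id>/chk-1234 ...
    }
}
```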
Flink can achieve end-to-end exactly-once semantics. First, you need to enable checkpointing and specify exactly-once semantics. In addition, if the downstream sink supports transactions, the two-phase commit protocol can be combined with checkpointing and the downstream transactions to achieve end-to-end exactly-once.
This process is shown on the right side of the figure above and corresponds to the pre-commit phase: when the checkpoint coordinator triggers a checkpoint, it injects barriers at the Source side. After each Source receives the barrier, it snapshots its state and reports completion back to the coordinator; every downstream operator does the same when the barrier reaches it.
When the barrier reaches the Sink, the Sink writes a pre-commit marker to Kafka, which is guaranteed mainly by Kafka's transaction mechanism. After all operators have completed the checkpoint, the coordinator sends an acknowledgment to all of them, and at that point the Sink commits the transaction.
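A minimal sketch of this setup with the transactional Kafka producer; the topic, brokers, and transaction timeout are illustrative:

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hedged sketch: exactly-once checkpoints plus a transactional Kafka sink, which
// pre-commits on each checkpoint and commits once the checkpoint completes.
public class ExactlyOnceSinkSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");
        // must cover the maximum checkpoint duration plus recovery time
        props.setProperty("transaction.timeout.ms", "900000");

        KafkaSerializationSchema<String> schema = new KafkaSerializationSchema<String>() {
            @Override
            public ProducerRecord<byte[], byte[]> serialize(String value, Long timestamp) {
                return new ProducerRecord<>("ad_result_topic", value.getBytes(StandardCharsets.UTF_8));
            }
        };

        FlinkKafkaProducer<String> sink = new FlinkKafkaProducer<>(
            "ad_result_topic", schema, props, FlinkKafkaProducer.Semantic.EXACTLY_ONCE);

        env.fromElements("a", "b", "c").addSink(sink);
        env.execute("exactly-once-sink-demo");
    }
}
```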
In previous practice, we found that when the number of partitions of the downstream Kafka topic was increased, no data was written to the new partitions.
This happens because FlinkKafkaProducer uses FlinkFixedPartitioner by default, so each subtask sends data to only one downstream partition. The problem occurs when the number of partitions of the downstream Kafka topic is greater than the parallelism of the task.
There are two solutions:
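The specific solutions are not reproduced here, but as one illustrative sketch, a common way to address this symptom is to stop using the default FlinkFixedPartitioner so that the Kafka producer's own partitioning (by record key, or round-robin) spreads records across all partitions, including newly added ones:

```java
import java.util.Optional;
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.partitioner.FlinkKafkaPartitioner;

// Hedged sketch: pass an empty partitioner so the default FlinkFixedPartitioner
// is not used and the Kafka producer distributes records across all partitions.
// Topic and brokers are illustrative.
public class AllPartitionsProducer {
    public static FlinkKafkaProducer<String> create() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");

        return new FlinkKafkaProducer<>(
            "ad_result_topic",
            new SimpleStringSchema(),
            props,
            Optional.<FlinkKafkaPartitioner<String>>empty());
    }
}
```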
To operate a Flink job, we need to observe its status. In the Flink UI, however, many metrics are at task granularity and provide no overall view.
The platform has aggregated these metrics and displayed them on one page.
As shown in the preceding figure, the displayed information includes the backpressure status, latency, and CPU/memory utilization of the JobManager and TaskManagers during operation. Checkpoint monitoring is also displayed, showing whether any checkpoint has timed out or failed recently. We will add alert notifications for these monitoring metrics later.
When an exception occurs while a real-time task is running, users need to know about it in time. As shown in the figure above, there are several alert items, including alert subscribers, alert levels, and the metrics below. If the configured metric values meet the alert policy rules, alerts are pushed to the subscribers through email, telephone calls, and internal communication tools, notifying them of the abnormal task status.
This way, when a task becomes abnormal, the user knows about it in time and can intervene manually.
Now, let's summarize the key nodes of iQiyi's advertising business in the real-time production procedure:
The first benefit of implementing stream-batch unification for ETL is improving the timeliness of the offline data warehouse. We need to apply anti-cheating processing to the data, and when we provide basic features to the advertising algorithm, the timeliness of the data after anti-cheating processing largely determines the subsequent overall effect. Therefore, making ETL real-time is of great significance for the subsequent stages.
After ETL achieves stream-batch unification, we store the data in the data lake, and the subsequent offline and real-time data warehouses can both be built on it. Stream-batch unification involves two stages. In the first stage, unification is achieved in ETL; the report side can then also be placed in the data lake, making our query service more powerful. Previously, Union computation over offline and real-time tables was required; with the data lake, the offline and real-time pipelines can simply write to the same table.
The first thing is stream-batch unification, which includes two aspects:
Currently, anti-cheating processing is mainly implemented offline. In the future, some anti-cheating models may be moved online and run in real time to minimize risks.