Abstract: This article is based on the keynote speech on AI feature engineering given by ZHAO Liangxingyun, a senior technical expert of Ant Group, at Flink Forward Asia 2023.
The Ant Group feature store is a high-performance AI data processing framework that integrates multiple computing paradigms. It can meet the requirements for low-latency feature output, high-concurrency access, and feature data consistency between online and offline stores in AI training and inference scenarios.
Ant Group built the feature store to enable algorithm engineers to be data self-sufficient. The feature store allows algorithm engineers to develop, test, create, and run features in a low-code manner without the assistance of a dedicated data engineering team.
After a feature starts to run, the feature store automatically completes high-performance real-time feature production tasks and queries, and ensures feature data consistency between offline and online stores, which is transparent to users.
Ant Group started to build a feature store in 2017. By leveraging years of risk management expertise and substantial data insights, Ant Group built feature store 1.0, which integrated the capabilities of its core data products for risk management. The feature store significantly bolstered the business risk management of Ant Group. However, between 2019 and 2020 it proved difficult to expand the feature store across all algorithm-related services of Ant Group. The core reason was that the feature store involved numerous syntaxes specific to risk management businesses, and its computing paradigms, including computing directed acyclic graphs (DAGs), data precision, and operator types, were designed for risk management. Starting in 2020, Ant Group began to rebuild the feature store with a well-architected design.
To date, the feature store has served many businesses of Ant Group, including search and recommendation, microcredit, international risk management, e-commerce banking, finance and insurance, and credit scoring and loyalty (Zhima Credit). The feature store contains more than 100,000 features, handles 2 million queries per second (QPS) for online serving, and sustains about 1 million transactions per second (TPS) on a daily basis.
To meet the requirements of all involved businesses for features, the feature store must provide the following capabilities:
To meet the requirements for the above-mentioned capabilities, Ant Group proposed to build a next-generation feature engine architecture, that is, the universal feature engine (UFE)-based architecture. This architecture involves both offline and online data systems. The offline system is used to simulate and track a large number of features. The online system separates the storage of write operations and read operations. A Flink-based real-time data production system is used for write operations. This system can be used together with a large-scale simulation system to build the Skyline architecture. A self-managed SQL engine is used for read operations. This engine is used to perform efficient feature queries for model inference services. The SQL engine is mainly responsible for returning a batch of features to the model services as soon as possible.
For feature serving, a feature insight system is provided to monitor feature quality. The system monitors feature invocations and their latencies in real time, and also analyzes the content distribution of features, generating an alert if the distribution of a feature changes drastically.
The unified metadata service at the bottom layer of the architecture abstracts all feature DevOps operations into interfaces, covering the R&D, definition, creation, verification, and running of features. The feature management system provided by the feature store is implemented on top of these interfaces, and enterprise users can also use them to build their own platform products based on the core data capabilities of the feature store. Although the features running on the feature store are sourced from different configuration platforms, the feature store uses the same feature metadata system to ensure metadata consistency, so that technological optimizations for production or consumption take effect globally across the feature store. The feature metadata system works out of the box and has been connected to multiple platforms within Ant Group. If you have sufficient resources and many personalized requirements, you can also develop your own product by leveraging the data technologies provided by the feature store.
Maintaining high performance is the first challenge of real-time feature computing. In Ant Group, a computing task often needs to handle hundreds of thousands or even millions of TPS. It is challenging to ensure the smooth running of such a task with minimal latency and stable output.
Another challenge is that customers want to define data requirements on the feature store without having to consider the details of data implementation. However, the optimal way to implement the same data requirement varies across scenarios, because each scenario has its own constraints, such as resource conditions, data accuracy, response time (RT), and query performance. It is challenging for a real-time feature production system to quickly derive the optimal computing method for each scenario.
Let's consider two scenarios for illustrative purposes.
The comparison of the two scenarios reveals that similar requirements for real-time features need different optimal implementation approaches in different scenarios. A single computing paradigm and a deployment mode cannot cater to all business needs. Therefore, the feature store must be able to provide scenario-specific optimal implementation approaches to suit the data requirements of users.
To address the preceding challenges, Ant Group proposed the Skyline feature computing architecture. Skyline receives definitions of real-time features from various platform products through the metadata service. Each definition is a directed acyclic graph (DAG) that describes the computing requirement. The DAG is instantiated into the optimal computing paradigm by the scenario-specific adaptor layer. For example, if a user wants to calculate the number of logons within seven days, the adaptor layer determines whether to produce a pre-computed key-value result or to calculate daily bills and store them for on-the-fly aggregation during a feature query. Then, the computing optimization module shared by streaming and batch tasks translates the instantiated DAG into tasks, performs logical optimizations such as filter push-up and column pruning, and normalizes the tasks. The results are logical execution plans that describe the data processing requirements and can be shared by streaming and batch tasks. The logical execution plans then undergo separate, scenario-specific optimizations for batch and stream processing and are converted into physical jobs for deployment.
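As a rough illustration, the seven-day logon count mentioned above could be declared in a SQL-like form similar to the following sketch. The view name user_logon_event is hypothetical; the user only states the requirement, and the adaptor layer decides whether it is materialized as a key-value result or as daily bills aggregated at query time.
-- hypothetical declarative feature definition: logon count in the last 7 days
select count(*) as logon_cnt_7D
from user_logon_event
where gmt_occur between now()-7D and now();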
Skyline involves three key stages: computing inference, computing normalization, and computing deployment.
The scenario-based rule plug-in instantiates the DAG into different computing tasks based on the aggregation (AGG) operators and the length of the time window. For example, the HOP function is used to aggregate data for windows shorter than one day, while for windows longer than one day, the TUMBLE function is used to calculate daily bills, and secondary aggregation is performed on the daily bills of multiple days during feature serving.
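The following sketch shows what the two instantiations could look like in Flink SQL, assuming a hypothetical source table trade_table with an event-time attribute gmt_occur. The first statement computes a one-hour sliding window directly with HOP; the second produces daily bills with TUMBLE, which are later combined by secondary aggregation at serving time.
-- window shorter than one day: aggregate directly with HOP (slide every minute)
select user_id,
       hop_end(gmt_occur, interval '1' minute, interval '1' hour) as window_end,
       sum(amount) as total_amount_1h
from trade_table
group by user_id, hop(gmt_occur, interval '1' minute, interval '1' hour);

-- window longer than one day: produce daily bills with TUMBLE
select user_id,
       tumble_end(gmt_occur, interval '1' day) as bill_date,
       sum(amount) as daily_amount
from trade_table
group by user_id, tumble(gmt_occur, interval '1' day);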
Skyline performs filter push-up, column pruning, and normalization (node order adjustment and link compression) on computing tasks to form a logical execution plan that consists of core skeleton nodes. Then, Skyline deploys the computing tasks. If absolute task isolation is required to prevent mutual impact between computing tasks in a scenario, the normalized logical execution plan is converted into a standalone Flink SQL task. To maximize cluster resource utilization when computing resources are insufficient, Skyline searches all computing metadata of the current cluster for a physical task that has the same skeleton structure as the logical execution plan. If such a physical task exists, the plan is merged into it; otherwise, Skyline creates a new physical task. Physical tasks are written with the stream API and can load new computing policies automatically without a restart.
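As a hypothetical example of skeleton matching, the two feature definitions below differ only in their filter conditions and output names. After filter push-up, column pruning, and normalization, they share the same skeleton (a keyed sum over daily tumbling windows on trade_table), so Skyline could merge them into one physical task instead of deploying two.
-- feature A: daily payment amount through the mobile channel (hypothetical)
select user_id, tumble_end(gmt_occur, interval '1' day) as bill_date, sum(amount) as mobile_daily_amount
from trade_table
where channel = 'mobile'
group by user_id, tumble(gmt_occur, interval '1' day);

-- feature B: daily payment amount for overseas transactions (hypothetical)
select user_id, tumble_end(gmt_occur, interval '1' day) as bill_date, sum(amount) as overseas_daily_amount
from trade_table
where region <> 'CN'
group by user_id, tumble(gmt_occur, interval '1' day);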
In Flink, the most direct optimization is to reduce the state size: a smaller state means higher task stability, so that real-time computing tasks with huge workloads can output data stably and with low latency. A large number of homogeneous sliding window features exist in business scenarios of Ant Group. A sliding window feature is the aggregation value of a specific behavior over a period of time ending at the current time. Homogeneous means that the computational logic of the tasks is the same, but the window lengths are different. If the native HOP function of Flink is used for such sliding windows, computing resources grow without bound, and I/O explosion may occur when the result data is exported to the external storage system. Therefore, the sliding window state is restructured into fixed panes: when data arrives at the window, it is merged into a fixed pane whose length equals the slide step of the sliding window, and when a window result is emitted, secondary aggregation is performed over the panes it covers. This significantly reduces the state size of the computing task that uses the sliding window, and homogeneous windows of different lengths can be computed from the same state. The flush mechanism of sliding windows is also changed: if two consecutive sliding windows produce the same data, the result of the latter window is not flushed, because feature serving only reads the latest window, and an unchanged result does not need to be re-written.
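A simplified sketch of the idea, assuming hourly panes and a hypothetical pane table trade_pane_1h: the streaming task only maintains fixed hourly panes, and sliding-window values of any homogeneous length are reconstructed by secondary aggregation over the panes.
-- streaming side: maintain fixed hourly panes instead of full sliding-window state
insert into trade_pane_1h
select user_id,
       tumble_end(gmt_occur, interval '1' hour) as pane_end,
       sum(amount) as pane_amount
from trade_table
group by user_id, tumble(gmt_occur, interval '1' hour);

-- secondary aggregation: reconstruct a 24-hour sliding value from the panes
select user_id, sum(pane_amount) as total_amount_24H
from trade_pane_1h
where pane_end between now() - interval '24' hour and now()
group by user_id;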
Cold start of features leverages the unified batch and stream processing of Flink. The production logic of a real-time feature is converted into an equivalent Flink batch SQL task. Before the streaming task is submitted, the Flink batch SQL task is submitted to supplement historical data. Then, the start offset of the streaming task is reset to 00:00 so that the data of the batch and streaming tasks can be merged.
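A minimal sketch of the backfill step, assuming a hypothetical partitioned history table trade_history and a daily bill sink daily_bill: the batch SQL expresses the same daily-bill logic as the streaming task and fills in the historical range before the streaming task takes over at 00:00.
-- batch backfill of daily bills over a hypothetical historical range
insert into daily_bill
select user_id,
       date_format(gmt_occur, 'yyyy-MM-dd') as bill_date,
       sum(amount) as daily_amount
from trade_history
where dt between '2023-01-01' and '2023-06-30'
group by user_id, date_format(gmt_occur, 'yyyy-MM-dd');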
Feature serving supports feature queries for online model inference. In actual scenarios, upper-layer services have strict requirements on feature query performance. A query request contains hundreds of features whose data may be scattered in different storage systems due to the complexity of the data link. The average RT must be less than 10 ms, and the response time for 99.99% of requests must be less than 100 ms. The UFE-serving engine is tailored to achieve low RT and low long-tail latency in the case of a large number of requests and highly concurrent accesses.
The UFE-serving engine involves the following layers:
The following section describes the I/O optimization process in a batch feature query.
The UFE-serving engine hierarchically abstracts data. The following sample code shows an example of feature-related SQL statements:
select sum(amount) as total_amount_24H
from trade_table
where gmt_occur between now()-24H and now();
In the SQL statements, trade_table specifies a view. One storage system can back multiple views, and one view can back multiple features. The UFE-serving engine builds a globally optimal I/O plan for all feature-related SQL statements in a batch feature query. During this process, the engine traverses all feature-related SQL statements to collect information about the columns and windows of the views, and then applies an I/O classification and merging algorithm. The algorithm classifies I/O requests by view storage type; requests against the same storage type that target different columns of the same row, or different rows of the same table, are merged into a single I/O operation. The scan range of a single I/O operation is also narrowed based on the valid columns and window ranges collected from the SQL statements. This reduces the number of interactions between the query engine and the storage systems during a single feature serving process and shrinks the scope of each data scan. The engine then queries the different storage systems concurrently and splits the results back into individual features.
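As a hypothetical illustration, suppose one batch query requests two features defined on the same view: total_amount_24H above and a second feature max_amount_7D computed as max(amount) over the last seven days. Instead of scanning trade_table twice, the engine could issue one merged scan that covers the union of the required columns and the widest window, then split the result back into the two features.
-- merged scan for both features: widest window (7 days), pruned columns
select amount, gmt_occur
from trade_table
where gmt_occur between now()-7D and now();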
Through I/O merging and the built-in technologies of the UFE-serving engine, such as automatic hotspot discovery and high-concurrency optimization, the proportion of long-tail feature queries is kept below one in ten thousand, and the average RT remains notably low.
The value of a real-time feature, whether consumed online or offline, changes along the timeline. Feature simulation calculates the instantaneous value of a feature at every historical time point based on a historical driving table (historical feature query traffic) and a historical information table (historical transaction events). In risk management and consumer credit scenarios, time travel computing is necessary because the impact of new features on online transactions must be fully evaluated before strategies are adjusted or new models are iterated. In most cases, the sample data required in these scenarios spans more than half a year.
If a user writes SQL statements in their own data warehouse and the amount of data is small, the point-in-time (PIT) value can be calculated directly. However, if the historical driving table contains tens of billions of records, a large amount of data is shuffled and joined, which causes serious data bloat. In this case, no native computing method of a compute engine can complete the computation within a short period of time.
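The following sketch shows what naive PIT computation looks like, assuming a hypothetical driving table query_log (historical feature queries) and event table trade_event. Every driving row restricts the aggregation to events inside the window that ends at its query time, so the event data is duplicated and shuffled once per driving row, which is exactly where the data bloat comes from.
-- naive PIT join: correct semantics, but it bloats on large driving tables
select q.query_id,
       q.query_time,
       sum(e.amount) as total_amount_24h_pit
from query_log as q
left join trade_event as e
  on  e.user_id = q.user_id
  and e.gmt_occur >  q.query_time - interval '24' hour
  and e.gmt_occur <= q.query_time
group by q.query_id, q.query_time;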
The core challenge in feature simulation is to ensure the performance and stability of large-scale data computing based on PIT semantics.
The preceding flowchart shows the core process of feature simulation. The engine first performs data pre-pruning based on the driving table, the feature logic, and the event table to remove events in the event table that are never needed. After pre-pruning, the engine splits the detailed data into hourly and daily bills and adds time partitions to the detailed data for subsequent pruning. At the same time, the engine splits the driving table by time partition to build multiple simulation computing tasks that can run in parallel. Then, the engine performs secondary aggregation on the driving table and the intermediate bills to calculate the final feature result. During secondary aggregation, the engine calculates the start time and end time of the feature window for each row in the driving table. Based on the calculated window, it merges the daily bills, then the hourly bills at both ends of the window, and finally the detail records at both ends of the hourly bills, because the simulation output must be accurate to the millisecond, consistent with the online output. This join method performs better than a native join written by users because the detailed data carries time partitions from the preceding processing: when the engine reads the detailed data, it prunes a large amount of data based on the hourly and daily partitions to which the data belongs. After data splitting optimization and secondary aggregation, the feature store can perform large-scale PIT computing. For example, it can produce features within 24 hours for 10 billion data records generated in a 90-day window.
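To make the merge step concrete, the sketch below shows the secondary aggregation for a single driving row, assuming hypothetical intermediate tables daily_bill and hourly_bill plus the pruned detail table trade_detail. The placeholder bounds are derived per row from the window start and end so that the three ranges cover the window without overlap; for brevity only the trailing edge of the window is shown, and the leading edge is handled symmetrically.
-- secondary aggregation for one driving row (trailing edge only)
select
    (select coalesce(sum(daily_amount), 0) from daily_bill
      where user_id = :user_id and bill_day between :full_days_start and :full_days_end)
  + (select coalesce(sum(hourly_amount), 0) from hourly_bill
      where user_id = :user_id and bill_hour between :edge_hours_start and :edge_hours_end)
  + (select coalesce(sum(amount), 0) from trade_detail
      where user_id = :user_id and gmt_occur between :edge_details_start and :edge_details_end)
  as total_amount_90d_pit;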