FeatureStore is a centralized data management and sharing platform provided by Platform for AI (PAI). You can use FeatureStore to organize, store, and manage the feature data that you use in machine learning and AI training. FeatureStore lets you share features across multiple people and teams, ensures the consistency of offline and online feature data, and helps you access online features efficiently.
What is FeatureStore?
FeatureStore is a feature management tool provided by PAI. You can use FeatureStore to store and manage machine learning features.
FeatureStore integrates with multiple Alibaba Cloud services, such as DataHub, Realtime Compute for Apache Flink, Hologres, and GraphCompute, to provide comprehensive feature management capabilities. These capabilities include receiving data such as user behavior logs, item characteristics, and real-time updated characteristics from DataHub, synchronizing the data to MaxCompute, and saving the data processed by Flink to online storage services. Applications such as recommendation engines, user growth systems, and financial risk control systems can call the FeatureStore SDK to access the feature data in online storage services.
The following figure describes the overall process of using FeatureStore. In the figure, the input data from MaxCompute and DataHub is processed by feature computing and sample model management modules, and published to online storage services for various client applications.
Terms
Feature entity
A feature entity is a collection of semantically related features that provide information about an object. For example, you can create two entities named user and item for a recommendation system.
Feature view
A feature view contains information about a set of features and their derived features. A feature view is a subset of features that are described by a specific feature entity. A feature view presents the mapping between an offline feature table and an online feature table.
Join ID
A field that associates a feature view with a feature entity. Each feature entity has a join ID. You can use join IDs to associate features across multiple feature views.
Note: Each feature view has a primary key (index) that can be used to retrieve features. The primary key can be different from the name of the join ID.
For example, you can set user_id (the primary key of the user table) and item_id (the primary key of the item table) as join IDs for a recommendation system.
Label table
A label table stores the labels used for model training. It contains the target attributes of model training and the join IDs of the feature entities. For a recommendation system, the label table is generally generated by grouping the data in the behavior table based on user_id, item_id, or request_id.
Scenarios
Recommendation system and advertisement sorting: The feature data in this scenario includes the browsing history, purchase records, and persona of users. FeatureStore allows you to update and manage multiple versions of real-time user features and item features in a centralized manner. This improves the timeliness of model features and, in turn, model performance. FeatureStore can also be used to implement targeted advertising and help improve its accuracy and effectiveness.
User growth and risk control: The feature data in this scenario includes the personal information, transaction behavior, and credit history of users. FeatureStore allows you to effectively manage and process multiple versions of user features and to perform risk assessment by using machine learning or deep learning models, such as XGBoost or GBDT models. This improves the accuracy and efficiency of business risk control.
Search engine ranking: The feature data in this scenario includes keyword matching, click-through rate, and sales volume. You can use the features managed in FeatureStore to train a ranking model. The results recalled from search engines, such as Elasticsearch and OpenSearch, are then sent to the scoring service of TensorFlow models in EAS. This helps you provide more accurate and personalized search results to your users based on their search intentions and preferences.
Offline key-value data synchronization: The feature data in this scenario includes commodity attribute tables and user attributes. FeatureStore simplifies the synchronization and scheduling of offline data to online storage services.
Features
Data stores
FeatureStore encapsulates the entire feature-to-model process and supports a variety of offline and real-time data stores. This allows you to register and manage feature tables in FeatureStore in a convenient manner. FeatureStore supports the following data stores:
Offline data store: MaxCompute
Online data stores: Hologres, GraphCompute, and Tablestore
After you register a feature table in FeatureStore:
FeatureStore can automatically create online and offline tables and ensure consistency between online and offline tables.
FeatureStore saves only one copy of the feature table and allows you to share features with multiple people, which reduces resource costs.
You can complete complex operations by using simple code in FeatureStore to increase efficiency. The operations include exporting training tables and importing data to online stores.
Integration with recommendation engine
FeatureStore is deeply integrated with EasyRec to support easy and efficient feature engineering (FG), model training, and model deployment. You can use FeatureStore to quickly build a cutting-edge recommendation system. The EasyRec processor (scoring service) can cache item feature tables in memory to keep model scoring efficient.
Offline and online feature management
Offline features include the attribute features and statistical features of users and items. Real-time features include updates of users or items that are written by Flink to online stores such as Hologres, and features based on time windows. For example, updates include newly published users and items, and time window-based features include clicks, forwarding, purchase quantity, and conversion rate in one hour. Online stores include Hologres, GraphCompute, and Tablestore.
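A time window-based feature such as "clicks in one hour" can be illustrated with a small sliding-window counter. The following is only a minimal sketch in plain Python, assuming that events arrive in timestamp order; in the architecture described above, this computation is performed by Flink, and the class name `SlidingWindowCounter`, the function `conversion_rate`, and the sample timestamps are hypothetical.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events per key within the last `window_seconds` (illustrative only)."""

    def __init__(self, window_seconds=3600):
        self.window_seconds = window_seconds
        self.events = {}  # key -> deque of event timestamps, oldest first

    def add(self, key, ts):
        # Assumption: timestamps for a key arrive in non-decreasing order.
        self.events.setdefault(key, deque()).append(ts)

    def count(self, key, now):
        queue = self.events.get(key)
        if not queue:
            return 0
        # Evict events that fall outside the window (now - window, now].
        while queue and queue[0] <= now - self.window_seconds:
            queue.popleft()
        return len(queue)


def conversion_rate(clicks, purchases, key, now):
    """Purchases divided by clicks within the window; 0.0 when there are no clicks."""
    click_count = clicks.count(key, now)
    if click_count == 0:
        return 0.0
    return purchases.count(key, now) / click_count


# Hypothetical event stream for item "i1" (timestamps in seconds).
clicks = SlidingWindowCounter()
purchases = SlidingWindowCounter()
for ts in (0, 1000, 3000, 4000):
    clicks.add("i1", ts)
purchases.add("i1", 3500)

now = 4200
click_count = clicks.count("i1", now)
rate = conversion_rate(clicks, purchases, "i1", now)
```

In a real deployment, the resulting values would be written to an online store such as Hologres so that online services can read them at scoring time.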
Old and new feature association
When you develop a new set of user or item features through algorithm development or business intelligence (BI), you can associate the old and new features required by the training set and use the FeatureStore SDK to export samples for offline training. You can also use the FeatureStore SDK to publish the samples to online stores for online services. A single feature view can be referenced by multiple models, which reduces online storage usage. FeatureStore helps you optimize models by adding features in an efficient manner.
Real-time statistical feature and user sequence feature management
Model features are becoming more complex and more time-sensitive. Therefore, you need to manage the real-time statistical features and user behavior sequence features that are computed by Flink. FeatureStore defines offline user sequence features, such as the sequence of item IDs that a user clicked. In addition to item ID sequences, the SideInfo of items is frequently used in models, and transmitting it online requires a large amount of data transfer. To handle this issue, you can use the FeatureStore SDK to cache item features in the EasyRec processor (scoring service), which accelerates inference responses and improves inference performance.
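The idea of caching item features next to the scoring service can be sketched as follows. This is an illustrative example in plain Python and not the FeatureStore SDK or the EasyRec processor: the class `ItemFeatureCache`, the function `expand_sequence`, and the dictionary that stands in for the online store are all hypothetical. The sketch shows that only the item ID sequence needs to be transmitted, while SideInfo is filled in from a local cache.

```python
class ItemFeatureCache:
    """Local cache of item features; reads the online store only on a miss."""

    def __init__(self, loader):
        self._loader = loader      # hypothetical function: item_id -> feature dict
        self._cache = {}
        self.remote_reads = 0      # counts how often the online store was read

    def get(self, item_id):
        if item_id not in self._cache:
            self._cache[item_id] = self._loader(item_id)
            self.remote_reads += 1
        return self._cache[item_id]


def expand_sequence(item_ids, cache):
    """Attach cached SideInfo to each item ID of a user behavior sequence."""
    return [{"item_id": item_id, **cache.get(item_id)} for item_id in item_ids]


# A dict stands in for the online store in this sketch.
online_store = {
    "i1": {"category": "books", "price": 12.5},
    "i2": {"category": "toys", "price": 30.0},
}
cache = ItemFeatureCache(lambda item_id: online_store[item_id])

# Only item IDs are sent with the request; SideInfo comes from the cache.
click_sequence = ["i1", "i2", "i1"]
expanded = expand_sequence(click_sequence, cache)
```

Because "i1" appears twice in the sequence, the stand-in online store is read only twice for three sequence entries. A production cache would also need an eviction and refresh policy, which this sketch omits.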
Multi-language SDKs
FeatureStore provides SDKs for Go, Java, and Python, which can help you use FeatureStore in the joint solution of PAI-REC and EasyRec Processor. You can use the SDK for Java to call the EasyRec Processor or scoring engines of other models in your client, such as the search, recommendation, or risk control engines. In addition, you can use the SDK for Python to access data in online stores and perform data analysis and modeling.
Automated FG
FeatureStore provides automated FG that uses machine learning to automatically develop new features, which reduces the FG workload of the research and development team.
Feature monitoring
FeatureStore provides feature monitoring and alerting that helps identify and resolve feature exceptions and issues in a timely manner. This also helps improve the efficiency of troubleshooting.
How FeatureStore works
FeatureStore provides data stores. FeatureStore integrates with offline storage and online storage services to allow you to read, write, and manage offline and online feature data in a centralized manner.
You can register offline and online feature tables as feature views in FeatureStore. This way, you can use a feature view to summarize and map offline and online feature data.
You can store label tables in MaxCompute and register the label tables in FeatureStore so that FeatureStore can read and process the label table data that is mapped to FeatureStore.
FeatureStore allows you to use the join IDs of feature entities to associate feature views in multiple feature projects. This way, you can associate all features of a feature entity with each other, use label tables to generate a model feature sample table, and store the sample table in the offline store MaxCompute.
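The way join IDs tie a label table to feature views can be sketched as a left join. The following plain-Python example is only an illustration under stated assumptions: it does not use the FeatureStore SDK, and the tables, feature names, and the function `build_sample_table` are hypothetical. In FeatureStore, the corresponding sample table is generated and stored in MaxCompute.

```python
# Hypothetical feature views, each keyed by the join ID of its feature entity.
user_view = {
    "u1": {"age": 31, "city": "Hangzhou"},
    "u2": {"age": 24, "city": "Beijing"},
}
item_view = {
    "i1": {"category": "books", "price": 12.5},
    "i2": {"category": "toys", "price": 30.0},
}

# Hypothetical label table: join IDs of both entities plus the training target.
label_table = [
    {"user_id": "u1", "item_id": "i2", "label": 1},
    {"user_id": "u2", "item_id": "i1", "label": 0},
    {"user_id": "u2", "item_id": "i9", "label": 0},  # item without features
]


def build_sample_table(labels, views_by_join_id):
    """Left-join every label row with each feature view on its join ID."""
    samples = []
    for row in labels:
        sample = dict(row)
        for join_id, view in views_by_join_id.items():
            # Rows with no matching features keep only the label columns.
            sample.update(view.get(row[join_id], {}))
        samples.append(sample)
    return samples


samples = build_sample_table(
    label_table,
    {"user_id": user_view, "item_id": item_view},
)
```

Each resulting row contains the label columns together with the user and item features that share the row's join IDs. The third row keeps its user features but has no item features, which mirrors the behavior of a left join.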
Regions
FeatureStore is available in the following regions: China (Beijing), China (Shanghai), China (Hangzhou), China (Shenzhen), and China (Hong Kong).
Procedure
Create data stores.
You need to create offline and online data stores. For more information, see Configure data stores.
Create a project and configure feature entities, feature views, and label tables, and generate a sample table for training. For more information, see Configure a FeatureStore project.
View the job details.
You can view the job status and details in the Jobs tab. For more information, see Job management.
Synchronize data to online stores. For more information, go to DSW Gallery.
If you need to read and use online data of FeatureStore in a Java or Go online engine, join the DingTalk group 34415007523 for technical assistance.
Contact us
If you encounter problems when you use FeatureStore, you can join the DingTalk group (Group ID: 34415007523) for technical assistance.