This topic describes how to use FeatureStore to build and publish a recommendation system from scratch. To do so, you need to create a project in FeatureStore, register feature tables, and then publish a trained model online.
Background information
A recommendation system recommends personalized content or products to users based on their interests and preferences. How user and item features are extracted and configured matters to the performance of a recommendation system. This topic provides a solution to help you use FeatureStore to build a recommendation system and understand how FeatureStore interacts with recommendation systems through FeatureStore SDKs for different programming languages. The solution consists of the following steps: create a project in FeatureStore, register feature tables, create a model feature, export a training dataset, synchronize features from the offline data store to the online data store, train a model by using the training dataset, deploy a service by using Elastic Algorithm Service (EAS), and then configure PAI-REC.
You can also directly run Python code in Notebook to complete the configuration. For more information, see DSW Gallery.
For more information about FeatureStore, see Overview.
If you have any questions when you use FeatureStore, join the DingTalk group (ID: 34415007523) for technical support.
Prerequisites
Before you perform the operations described in this topic, make sure that the following requirements are met:
| Service | Description |
| --- | --- |
| Platform for AI (PAI) | PAI is activated and a PAI workspace is created. For more information, see Activate PAI and create the default workspace. |
| MaxCompute | MaxCompute is activated and a MaxCompute project is created. |
| Hologres | Hologres is activated, and a Hologres instance and a database are created. |
| DataWorks | DataWorks is activated, a DataWorks workspace is created, and an exclusive resource group is created. |
| Object Storage Service (OSS) | OSS is activated. For more information, see Get started by using the OSS console. |
Step 1: Prepare data
Synchronize data from simulated tables
In most recommendation scenarios, you need to prepare the following tables: user feature table, item feature table, and label table.
In this example, three simulated tables, including a user table, an item table, and a label table, in the MaxCompute project pai_online_project are used. Each partition of the user table and the item table contains approximately 100,000 data records, and occupies about 70 MB of storage capacity in the MaxCompute project. Each partition of the label table contains approximately 450,000 data records, and occupies about 5 MB of storage capacity in the MaxCompute project.
You need to execute SQL statements in DataWorks to synchronize the user table, item table, and label table from the pai_online_project project to your own MaxCompute project. To synchronize data from the simulated tables, perform the following steps:
Log on to the DataWorks console.
In the left-side navigation pane, choose Data Development and Governance > DataStudio.
On the DataStudio page, select the DataWorks workspace that you created and click Go to DataStudio.
Move the pointer over Create and choose Create Node > MaxCompute > ODPS SQL. In the Create Node dialog box, configure the node parameters that are described in the following table.
| Parameter | Description |
| --- | --- |
| Engine Instance | Select the MaxCompute engine that you created. |
| Node Type | Select ODPS SQL from the Node Type drop-down list. |
| Path | Choose Business Flow > Workflow > MaxCompute. |
| Name | Specify a name. |
Click Confirm.
On the tab of the node that you created, run the following SQL statements to synchronize the user table, item table, and label table from the pai_online_project project to your MaxCompute project. Select the exclusive resource group that you created as the resource group.
- Synchronize the user table: rec_sln_demo_user_table_preprocess_all_feature_v1.
- Synchronize the item table: rec_sln_demo_item_table_preprocess_all_feature_v1.
- Synchronize the label table: rec_sln_demo_label_table.

The full SQL statements are available in the console. A hedged sketch of the pattern they follow appears after this list.
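The following sketch illustrates what such a synchronization statement might look like for the user table. It is an assumption about the pattern, not the exact statement from the console; the partition range and column handling may differ in the real script.

```sql
-- Hedged sketch: copy partitions of the simulated user table into your own
-- MaxCompute project. Not the exact statement from the console.
CREATE TABLE IF NOT EXISTS rec_sln_demo_user_table_preprocess_all_feature_v1
LIKE pai_online_project.rec_sln_demo_user_table_preprocess_all_feature_v1;

INSERT OVERWRITE TABLE rec_sln_demo_user_table_preprocess_all_feature_v1 PARTITION (ds)
SELECT *  -- in MaxCompute, SELECT * on a partitioned table includes ds as the last column
FROM pai_online_project.rec_sln_demo_user_table_preprocess_all_feature_v1
WHERE ds BETWEEN '20231022' AND '20231024';
```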
Backfill data on the tables to which the data is synchronized.
Log on to the DataWorks console. In the left-side navigation pane, choose Data Development and Governance > Operation Center. On the Operation Center page, select a workspace from the drop-down list and click Go to Operation Center.
In the left-side navigation pane, choose Auto Triggered Node O&M > Auto Triggered Nodes. The Auto Triggered Nodes page appears.
On the Auto Triggered Nodes page, find the node that you want to manage and click DAG in the Actions column.
Right-click the desired node and choose Run > Backfill Data for Current Node.
In the Backfill Data dialog box, set Data Timestamp to 2023-10-22 through 2023-10-24 and click OK.
After you perform the preceding steps, you can view the user table rec_sln_demo_user_table_preprocess_all_feature_v1, item table rec_sln_demo_item_table_preprocess_all_feature_v1, and label table rec_sln_demo_label_table in your workspace. These three tables are used as examples to describe the operations.
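After backfilling, you can quickly confirm that the three partitions contain data. A minimal check, using the table and partition names from this step:

```sql
-- Count records per backfilled partition of the user table. Repeat for the
-- item table and the label table if needed.
SELECT ds, COUNT(*) AS record_count
FROM rec_sln_demo_user_table_preprocess_all_feature_v1
WHERE ds BETWEEN '20231022' AND '20231024'
GROUP BY ds;
```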
Configure data stores
In most cases, you need to configure an offline data store, such as a MaxCompute project, and an online data store, such as a Hologres instance, a GraphCompute instance, or a Tablestore instance, in FeatureStore. In this example, a MaxCompute project is configured as an offline data store and a Hologres instance is configured as an online data store.
Log on to the PAI console. In the left-side navigation pane, choose Data Preparation > FeatureStore.
On the FeatureStore page, select a workspace from the drop-down list and click Enter FeatureStore.
Configure a MaxCompute data store.
On the Store tab, click Create Store. In the Create Store panel, configure the parameters that are described in the following table for the MaxCompute data store.
| Parameter | Description |
| --- | --- |
| Type | Select MaxCompute from the Type drop-down list. |
| Name | Specify a name. |
| MaxCompute Project Name | Select the MaxCompute project that you created. |
Copy the authorization statement and click Go to. After you execute the copied statement in DataWorks, the Hologres instance is authorized to synchronize data from the MaxCompute project.
Note: To grant permissions to the Hologres instance, make sure that your account has admin permissions. For more information, see Manage user permissions by using commands or Manage user permissions in the MaxCompute console.
Click Submit.
Configure a Hologres data store.
On the Store tab, click Create Store. In the Create Store panel, configure the parameters that are described in the following table for the Hologres data store.
| Parameter | Description |
| --- | --- |
| Type | Select Hologres from the Type drop-down list. |
| Name | Specify a name. |
| Instance ID | Select the Hologres instance that you created. |
| Database Name | Select the database that you created in the Hologres instance. |
Click Submit.
Grant the permissions to access the Hologres instance. For more information, see Configure data sources.
Install FeatureStore SDK for Python
Log on to the DataWorks console.
In the left-side navigation pane, click Resource Groups.
On the Exclusive Resource Groups tab, find the resource group that you want to manage. Move the pointer over the icon in the Actions column and select O&M Assistant.
Click Create Command. In the Create Command panel, configure the parameters that are described in the following table.
| Parameter | Description |
| --- | --- |
| Command Name | Specify a name. In this example, install is used. |
| Command Type | Select Manual Installation (You cannot run pip commands to install third-party packages.). |
| Command Content | Enter the following command: `/home/tops/bin/pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple https://feature-store-py.oss-cn-beijing.aliyuncs.com/package/feature_store_py-1.3.1-py3-none-any.whl` |
| Timeout | Specify a timeout period. |
Click Create.
Click Run Command. In the message that appears, click Run.
Click Refresh to view the latest status of the command. If the state of the command changes to Successful, FeatureStore SDK is installed.
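If you want to double-check the installation, you can run one more command in the O&M Assistant. A minimal sketch; the package name is inferred from the wheel file name above:

```shell
/home/tops/bin/pip3 show feature-store-py
```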
Step 2: Create a project and register feature tables in FeatureStore
You can create a project and register feature tables in FeatureStore in the PAI console or by using FeatureStore SDK, based on your business requirements. Because you must use FeatureStore SDK to export a training dataset and synchronize data, you still need to install FeatureStore SDK for Python even if you create the project and register feature tables in the PAI console.
Method 1: Use the PAI console
Create a FeatureStore project.
Log on to the PAI console. In the left-side navigation pane, choose Data Preparation > FeatureStore.
On the FeatureStore page, select a workspace from the drop-down list and click Enter FeatureStore.
Click Create Project. On the Create Project page, configure the project parameters that are described in the following table.
| Parameter | Description |
| --- | --- |
| Name | Specify a name. In this example, fs_demo is used. |
| Description | Enter a custom description. |
| Offline Store | Select the MaxCompute data store that you configured. |
| Online Store | Select the Hologres data store that you configured. |
Click Submit.
Create feature entities.
On the FeatureStore page, find the created project and click the project name to go to the Project Details page.
On the Feature Entity tab, click Create Feature Entity. In the Create Feature Entity panel, configure the parameters that are described in the following table for the user feature entity.
| Parameter | Description |
| --- | --- |
| Feature Entity Name | Specify a name. In this example, user is used. |
| Join Id | Set this parameter to user_id. |
Click Submit.
Click Create Feature Entity. In the Create Feature Entity panel, configure the parameters that are described in the following table for the item feature entity.
| Parameter | Description |
| --- | --- |
| Feature Entity Name | Specify a name. In this example, item is used. |
| Join Id | Set this parameter to item_id. |
Click Submit.
Create feature views.
On the Feature View tab of the Project Details page, click Create Feature View. In the Create Feature View panel, configure the parameters that are described in the following table for the user feature view.
| Parameter | Description |
| --- | --- |
| View Name | Specify a name. In this example, user_table_preprocess_all_feature_v1 is used. |
| Type | Select Offline. |
| Write Mode | Select Use Offline Table. |
| Store | Select the MaxCompute data store that you configured. |
| Feature Table | Select the prepared user table rec_sln_demo_user_table_preprocess_all_feature_v1. |
| Feature Field | Select the user_id primary key field. |
| Synchronize Online Feature Table | Select Yes. |
| Feature Entity | Select user. |
| Feature Lifecycle | Use the default value. |
Click Submit.
Click Create Feature View. In the Create Feature View panel, configure the parameters that are described in the following table for the item feature view.
| Parameter | Description |
| --- | --- |
| View Name | Specify a name. In this example, item_table_preprocess_all_feature_v1 is used. |
| Type | Select Offline. |
| Write Mode | Select Use Offline Table. |
| Store | Select the MaxCompute data store that you configured. |
| Feature Table | Select the prepared item table rec_sln_demo_item_table_preprocess_all_feature_v1. |
| Feature Field | Select the item_id primary key field. |
| Synchronize Online Feature Table | Select Yes. |
| Feature Entity | Select item. |
| Feature Lifecycle | Use the default value. |
Click Submit.
Create a label table.
On the Label Table tab of the Project Details page, click Create Label Table. In the Create Label Table panel, configure the parameters that are described in the following table for the label table.
| Parameter | Description |
| --- | --- |
| Store | Select the MaxCompute data store that you configured. |
| Table Name | Select the prepared label table rec_sln_demo_label_table. |
Click Submit.
Create a model feature.
On the Model Features tab of the Project Details page, click Create Model Feature. In the Create Model Feature panel, configure the parameters that are described in the following table for the model feature.
| Parameter | Description |
| --- | --- |
| Model Feature Name | Specify a name. In this example, fs_rank_v1 is used. |
| Select Feature | Select the user feature view and item feature view that you created. |
| Label Table Name | Select the label table rec_sln_demo_label_table that you created. |
Click Submit.
On the Model Features tab, find the model feature that you created and click the name of the model feature.
On the Basic Information tab of the Model Feature Details panel, view the value of the Export Table Name parameter. In this example, the value of the Export Table Name parameter is fs_demo_fs_rank_v1_trainning_set. You can use this table to generate features and train a model.
Install FeatureStore SDK for Python. For more information, see the Method 2: Use FeatureStore SDK for Python section of this topic.
Method 2: Use FeatureStore SDK for Python
For more information about how to use FeatureStore SDK, see Feature Store.
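The linked documentation shows how to create the project and register feature tables with the SDK. As a quick sanity check that the SDK is installed and can reach your FeatureStore project, you can run a minimal sketch in a PyODPS 3 node. It uses only the client calls that appear elsewhere in this topic; the project and model feature names assume that you completed Method 1 with the example values.

```python
# Minimal connectivity check for FeatureStore SDK for Python. Assumes the
# fs_demo project and fs_rank_v1 model feature from this topic already exist.
from feature_store_py.fs_client import FeatureStoreClient

# o is the ODPS entry object that is available in PyODPS nodes.
fs = FeatureStoreClient(access_key_id=o.account.access_id,
                        access_key_secret=o.account.secret_access_key,
                        region='cn-beijing')  # adjust to your region
project = fs.get_project('fs_demo')
cur_model = project.get_model('fs_rank_v1')
print('FeatureStore project and model feature are reachable.')
```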
Step 3: Export a training dataset and train a model
Export a training dataset.
Log on to the DataWorks console.
In the left-side navigation pane, choose Data Development and Governance > DataStudio.
On the DataStudio page, select the DataWorks workspace that you created and click Go to DataStudio.
Move the pointer over Create and choose Create Node > MaxCompute > PyODPS 3. In the Create Node dialog box, configure the node parameters that are described in the following table.
| Parameter | Description |
| --- | --- |
| Engine Instance | Select the MaxCompute engine that you created. |
| Node Type | Set this parameter to PyODPS 3. |
| Path | Choose Business Flow > Workflow > MaxCompute. |
| Name | Specify a name. |
Click Confirm.
Copy the following code to the code editor:
```python
from feature_store_py.fs_client import FeatureStoreClient
from feature_store_py.fs_project import FeatureStoreProject
from feature_store_py.fs_datasource import LabelInput, MaxComputeDataSource, TrainingSetOutput
from feature_store_py.fs_features import FeatureSelector
from feature_store_py.fs_config import LabelInputConfig, PartitionConfig, FeatureViewConfig
from feature_store_py.fs_config import TrainSetOutputConfig, EASDeployConfig
import datetime
import sys

# dt is passed in by the DataWorks scheduling parameter.
cur_day = args['dt']
print('cur_day = ', cur_day)
offset = datetime.timedelta(days=-1)
pre_day = (datetime.datetime.strptime(cur_day, "%Y%m%d") + offset).strftime('%Y%m%d')
print('pre_day = ', pre_day)

# o is the ODPS entry object that is available in PyODPS nodes.
access_key_id = o.account.access_id
access_key_secret = o.account.secret_access_key
fs = FeatureStoreClient(access_key_id=access_key_id, access_key_secret=access_key_secret, region='cn-beijing')
cur_project_name = 'fs_demo'
project = fs.get_project(cur_project_name)

# Read labels from the current day's partition, and user and item features from
# the previous day's partitions, so that features are generated before the label events.
label_partitions = PartitionConfig(name='ds', value=cur_day)
label_input_config = LabelInputConfig(partition_config=label_partitions)

user_partitions = PartitionConfig(name='ds', value=pre_day)
feature_view_user_config = FeatureViewConfig(name='user_table_preprocess_all_feature_v1',
                                             partition_config=user_partitions)
item_partitions = PartitionConfig(name='ds', value=pre_day)
feature_view_item_config = FeatureViewConfig(name='item_table_preprocess_all_feature_v1',
                                             partition_config=item_partitions)
feature_view_config_list = [feature_view_user_config, feature_view_item_config]

# Write the joined training data to the current day's partition of the export table.
train_set_partitions = PartitionConfig(name='ds', value=cur_day)
train_set_output_config = TrainSetOutputConfig(partition_config=train_set_partitions)

model_name = 'fs_rank_v1'
cur_model = project.get_model(model_name)
task = cur_model.export_train_set(label_input_config, feature_view_config_list, train_set_output_config)
task.wait()
print("task_summary = ", task.task_summary)
```
Click Properties on the right side of the tab. In the Properties panel, configure the scheduling parameters that are described in the following table.
| Parameter | Description |
| --- | --- |
| Scheduling Parameter: Parameter Name | Set this parameter to dt. |
| Scheduling Parameter: Parameter Value | Set this parameter to $[yyyymmdd-1]. |
| Resource Group | Select the exclusive resource group that you created. |
| Dependencies | Select the user and item tables that you created. |
After the node is configured and tested, save and submit the node configurations.
Backfill data for the node. For more information, see the Synchronize data from simulated tables section of this topic.
Optional. View the export job.
On the FeatureStore page, find the created project and click the project name to go to the Project Details page.
On the Project Details page, click Jobs.
On the Jobs tab, find the job that you want to manage and click the name of the job. In the panel that appears, view the basic information, configurations, and logs of the job.
Train a model.
EasyRec is an open source recommendation system framework that can be seamlessly connected to FeatureStore to train, export, and publish models. We recommend that you use EasyRec to train a model by using the fs_demo_fs_rank_v1_trainning_set table as the training dataset.
For more information about the open source code of EasyRec, see EasyRec.
For more information about EasyRec, see What is EasyRec?
For more information about how to use EasyRec to train models, see train_config. A hedged sketch of a training command appears after this list.
If you have other questions about EasyRec, join the DingTalk group (ID: 32260796) for technical support.
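EasyRec training on MaxCompute is typically submitted as a PAI command from a DataWorks ODPS SQL node. The following is a hedged sketch only: the config path, OSS bucket, role ARN, cluster settings, and the -D parameter set are placeholders and assumptions based on the EasyRec documentation, not values from this topic.

```sql
-- Hedged sketch of an EasyRec training command. Replace every <placeholder>
-- and verify the parameter list against the EasyRec documentation.
pai -name easy_rec_ext
    -Dcmd='train'
    -Dconfig='oss://<your-bucket>/configs/fs_rank_v1.config'
    -Dtrain_tables='odps://<your-project>/tables/fs_demo_fs_rank_v1_trainning_set/ds=20231023'
    -Deval_tables='odps://<your-project>/tables/fs_demo_fs_rank_v1_trainning_set/ds=20231024'
    -Dcluster='{"ps":{"count":1,"cpu":800},"worker":{"count":3,"cpu":800,"memory":20000}}'
    -Darn='acs:ram::<your-account-id>:role/<your-oss-access-role>'
    -Dbuckets='oss://<your-bucket>/'
    -DossHost='oss-cn-beijing-internal.aliyuncs.com';
```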
Step 4: Publish the model
After you train and export the model, you can deploy and publish it. If you use a self-managed recommendation system, you can connect it to FeatureStore by using FeatureStore SDK for Python, Go, C++, or Java. You can also join the DingTalk group (ID: 32260796) for technical support on connecting your recommendation system to FeatureStore. FeatureStore is also seamlessly integrated with other Alibaba Cloud services, which you can use to quickly build and publish a recommendation system.
In this example, Alibaba Cloud services are used to publish a model.
Step 1: Configure routine data synchronization nodes
Before you publish a model, you must configure routine data synchronization nodes to synchronize data from the offline data store to the online data store on a regular basis. Then, data can be read from the online data store in real time. In this example, data in the user and item tables needs to be synchronized on a regular basis. To configure routine data synchronization nodes, perform the following steps:
Log on to the DataWorks console.
In the left-side navigation pane, choose Data Development and Governance > DataStudio.
On the DataStudio page, select the DataWorks workspace that you created and click Go to DataStudio.
Synchronize data from the user table on a regular basis.
Move the pointer over Create and choose Create Node > MaxCompute > PyODPS 3.
Copy the following code to the code editor. The code synchronizes data from the user_table_preprocess_all_feature_v1 feature view on a regular basis. The full code is available in the console; a hedged sketch follows.
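The following minimal sketch shows what the synchronization node might look like. The get_feature_view and publish_table calls are assumptions based on the FeatureStore SDK documentation rather than code from this topic; refer to the console sample for the authoritative code.

```python
# Hedged sketch: publish one day's partition of the offline user feature view
# to the online store. get_feature_view()/publish_table() are assumed API names.
from feature_store_py.fs_client import FeatureStoreClient
from feature_store_py.fs_config import PartitionConfig

cur_day = args['dt']  # provided by the dt scheduling parameter

# o is the ODPS entry object that is available in PyODPS nodes.
fs = FeatureStoreClient(access_key_id=o.account.access_id,
                        access_key_secret=o.account.secret_access_key,
                        region='cn-beijing')
project = fs.get_project('fs_demo')

feature_view = project.get_feature_view('user_table_preprocess_all_feature_v1')
task = feature_view.publish_table(partition_config=PartitionConfig(name='ds', value=cur_day))
task.wait()
print('task_summary = ', task.task_summary)
```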
Click Properties on the right side of the tab. In the Properties panel, configure the scheduling parameters that are described in the following table.
| Parameter | Description |
| --- | --- |
| Scheduling Parameter: Parameter Name | Set this parameter to dt. |
| Scheduling Parameter: Parameter Value | Set this parameter to $[yyyymmdd-1]. |
| Resource Group | Select the exclusive resource group that you created. |
| Dependencies | Select the user table that you created. |
After the node is configured and tested, save and submit the node configurations.
Backfill data for the node. For more information, see the Synchronize data from simulated tables section of this topic.
Synchronize data from the item table on a regular basis.
Move the pointer over Create and choose Create Node > MaxCompute > PyODPS 3. In the Create Node dialog box, configure the node parameters.
Click Confirm.
Copy the following code to the code editor. The code synchronizes data from the item_table_preprocess_all_feature_v1 feature view on a regular basis. It mirrors the user synchronization sketch above, with the item feature view name substituted; the full code is available in the console.
Click Properties on the right side of the tab. In the Properties panel, configure the scheduling parameters that are described in the following table.
| Parameter | Description |
| --- | --- |
| Scheduling Parameter: Parameter Name | Set this parameter to dt. |
| Scheduling Parameter: Parameter Value | Set this parameter to $[yyyymmdd-1]. |
| Resource Group | Select the exclusive resource group that you created. |
| Dependencies | Select the item table that you created. |
After the node is configured and tested, save and submit the node configurations.
Backfill data for the node. For more information, see the Synchronize data from simulated tables section of this topic.
After the data is synchronized, you can view the latest features in the Hologres data store, for example with a spot-check query like the hedged sketch below.
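The table name below is an assumption about FeatureStore's naming for online tables; look up the actual name on the feature view's details page or in the Hologres console.

```sql
-- Hedged sketch: spot-check synchronized user features in Hologres.
-- Replace the table name with the actual online table of the feature view.
SELECT * FROM fs_demo.user_table_preprocess_all_feature_v1_online LIMIT 10;
```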
Step 2: Create and deploy a service by using EAS
The service is used to receive requests from the recommendation engine, score the items based on the requests, and then return scores. The EasyRec processor is integrated with FeatureStore SDK for C++, which implements feature extraction with low latency and high performance. After the EasyRec processor extracts features by using FeatureStore SDK for C++, the EasyRec processor sends the extracted features to the model for inference and returns the scores to the recommendation engine.
To deploy the service, perform the following steps:
Log on to the DataWorks console.
In the left-side navigation pane, choose Data Development and Governance > DataStudio.
On the DataStudio page, select the DataWorks workspace that you created and click Go to DataStudio.
Move the pointer over Create and choose Create Node > MaxCompute > PyODPS 3.
Copy the following code to the code editor:
```python
import os
import json

config = {
    "name": "fs_demo_v1",
    "metadata": {
        "cpu": 4,
        "rpc.max_queue_size": 256,
        "rpc.enable_jemalloc": 1,
        "gateway": "default",
        "memory": 16000
    },
    # The path of the trained model. You can specify a custom path.
    # dt is the DataWorks scheduling parameter configured for this node.
    "model_path": f"oss://beijing0009/EasyRec/deploy/rec_sln_demo_dbmtl_v1/{args['dt']}/export/final_with_fg",
    "model_config": {
        "access_key_id": f'{o.account.access_id}',
        "access_key_secret": f'{o.account.secret_access_key}',
        "region": "cn-beijing",  # Replace the value with the ID of the region in which PAI resides.
        "fs_project": "fs_demo",  # Replace the value with the name of your project in FeatureStore.
        "fs_model": "fs_rank_v1",  # Replace the value with the name of your model feature in FeatureStore.
        "fs_entity": "item",
        "load_feature_from_offlinestore": True,
        "steady_mode": True,
        "period": 2880,
        "outputs": "probs_is_click,y_ln_playtime,probs_is_praise",
        "fg_mode": "tf"
    },
    "processor": "easyrec-1.9",
    "processor_type": "cpp"
}

with open("echo.json", "w") as output_file:
    json.dump(config, output_file)

# Run the following command when you deploy the service for the first time:
os.system(f"/home/admin/usertools/tools/eascmd -i {o.account.access_id} -k {o.account.secret_access_key} "
          f"-e pai-eas.cn-beijing.aliyuncs.com create echo.json")

# For routine updates, comment out the create command above and uncomment the following modify command:
# os.system(f"/home/admin/usertools/tools/eascmd -i {o.account.access_id} -k {o.account.secret_access_key} "
#           f"-e pai-eas.cn-beijing.aliyuncs.com modify fs_demo_v1 -s echo.json")
```
Click Properties on the right side of the tab. In the Properties panel, configure the scheduling parameters that are described in the following table.
| Parameter | Description |
| --- | --- |
| Scheduling Parameter: Parameter Name | Set this parameter to dt. |
| Scheduling Parameter: Parameter Value | Set this parameter to $[yyyymmdd-1]. |
| Resource Group | Select the exclusive resource group that you created. |
| Dependencies | Select the training job and the item_table_preprocess_all_feature_v1 feature view. |
After the node is configured and tested, run the node to view the deployment status.
After the first deployment succeeds, comment out the eascmd create command in the code and uncomment the eascmd modify command so that subsequent runs update the service on a regular basis.
Optional. View the deployed service on the Inference Service tab of the Elastic Algorithm Service (EAS) page. For more information, see Deploy a model service in the PAI console.
Optional. Connect EAS to the virtual private cloud (VPC) in which the data store resides. Data stores such as Hologres can be accessed only over the specified VPC. In this example, a Hologres data store is used. Perform the following steps:
1. In the Hologres console, view the basic information about the Hologres instance, such as the VPC ID and vSwitch ID.
2. In the upper-right corner of the Elastic Algorithm Service (EAS) page of the PAI console, click Configure Direct Connection.
3. In the Configure Direct Connection dialog box, enter the VPC ID and vSwitch ID of the Hologres instance in the VPC and vSwitch fields, and configure the Security Group Name parameter. You can select an existing security group or create a new one. The security group must allow the port that is used to connect to the Hologres data store. In most cases, port 80 is used, so select a security group for which port 80 is enabled.
4. Click OK. The service becomes available after it is updated.
Step 3: Configure PAI-REC
PAI-REC is a recommendation engine service that integrates FeatureStore SDK for Go and connects seamlessly with FeatureStore and EAS.
To configure PAI-REC, perform the following steps:
Configure the FeatureStoreConfs parameter. Replace ${AccessKey} and ${AccessSecret} with your AccessKey pair.
- RegionId: the ID of the region in which FeatureStore resides. In this example, cn-beijing is used.
- ProjectName: the name of the project that you created in FeatureStore. In this example, fs_demo is used.

```json
"FeatureStoreConfs": {
  "pairec-fs": {
    "RegionId": "cn-beijing",
    "AccessId": "${AccessKey}",
    "AccessKey": "${AccessSecret}",
    "ProjectName": "fs_demo"
  }
},
```
Configure the FeatureConfs parameter.
- FeatureStoreName: Set this parameter to pairec-fs, which is the name specified in the FeatureStoreConfs parameter.
- FeatureStoreModelName: the name of the model feature that you created. In this example, fs_rank_v1 is used.
- FeatureStoreEntityName: the name of the feature entity that you created. In this example, user is used.

These settings enable PAI-REC to extract features from the user feature entity in the fs_rank_v1 model feature by using FeatureStore SDK for Go.

```json
"FeatureConfs": {
  "recreation_rec": {
    "AsynLoadFeature": true,
    "FeatureLoadConfs": [
      {
        "FeatureDaoConf": {
          "AdapterType": "featurestore",
          "FeatureStoreName": "pairec-fs",
          "FeatureKey": "user:uid",
          "FeatureStoreModelName": "fs_rank_v1",
          "FeatureStoreEntityName": "user",
          "FeatureStore": "user"
        }
      }
    ]
  }
},
```
Configure the AlgoConfs parameter. The AlgoConfs parameter specifies the scoring service in EAS to which PAI-REC connects.
- Name: the name of the service that you deployed by using EAS.
- Url and Auth: the URL and token that are used to access the service that you deployed by using EAS. You can click the service name on the Elastic Algorithm Service (EAS) page, and then click View Endpoint Information on the Service Details tab to obtain the URL and token. For more information, see FAQ about EAS.

```json
"AlgoConfs": [
  {
    "Name": "fs_demo_v1",
    "Type": "EAS",
    "EasConf": {
      "Processor": "EasyRec",
      "Timeout": 300,
      "ResponseFuncName": "easyrecMutValResponseFunc",
      "Url": "eas_url_xxx",
      "EndpointType": "DIRECT",
      "Auth": "eas_token"
    }
  }
],
```