This topic describes how to use FeatureStore SDK to manage features in a recommendation system without the need to use other Alibaba Cloud services.
Background information
A recommendation system recommends personalized content or products to users based on their interests and preferences. How user and item features are extracted and configured directly affects the performance of a recommendation system. This topic provides a solution that helps you build a recommendation system with FeatureStore and understand how FeatureStore SDKs of different versions manage feature data.
For more information about FeatureStore, see Overview.
If you have any questions when you use FeatureStore, join the DingTalk group (ID 34415007523) for technical support.
Prerequisites
Before you perform the operations described in this topic, make sure that the requirements that are described in the following table are met.
| Service | Description |
| --- | --- |
| Platform for AI (PAI) | PAI is activated and a workspace is created. |
| MaxCompute | MaxCompute is activated and a MaxCompute project is created and associated with the DataWorks workspace as a compute engine instance. |
| Hologres | Hologres is activated and a Hologres instance and database are created. |
| DataWorks | DataWorks is activated, a DataWorks workspace is created, and an exclusive resource group is created. |
1. Prepare data
Synchronize data from simulated tables
In most recommendation scenarios, you need to prepare three tables: a user feature table, an item feature table, and a label table.
In this example, three simulated tables in the MaxCompute project pai_online_project are used: a user table, an item table, and a label table. Each partition of the user table and the item table contains approximately 100,000 data records and occupies about 70 MB of storage in the MaxCompute project. Each partition of the label table contains approximately 450,000 data records and occupies about 5 MB of storage.
You need to execute SQL statements in DataWorks to synchronize data in the user table, item table, and label table from the pai_online_project project to your MaxCompute project. To synchronize data from the simulated tables, perform the following steps:
Log on to the DataWorks console.
In the left-side navigation pane, choose Data Development and Governance > DataStudio.
On the DataStudio page, select the DataWorks workspace that you created and click Go to DataStudio.
Move the pointer over Create and choose Create Node > MaxCompute > ODPS SQL. In the Create Node dialog box, configure the node parameters that are described in the following table.
| Parameter | Description |
| --- | --- |
| Engine Instance | Select the MaxCompute compute engine instance that you created. |
| Node Type | Select ODPS SQL from the Node Type drop-down list. |
| Path | Choose Business Flow > Workflow > MaxCompute. |
| Name | Specify a custom name. |
Click Confirm.
On the tab of the node that you created, execute the following SQL statements to synchronize data in the user table, item table, and label table from the pai_online_project project to your MaxCompute project. Select the exclusive resource group that you created as the resource group.
Synchronize data from the user table rec_sln_demo_user_table_preprocess_all_feature_v1.
Synchronize data from the item table rec_sln_demo_item_table_preprocess_all_feature_v1.
Synchronize data from the label table rec_sln_demo_label_table.
After you perform the preceding steps, you can view the user table rec_sln_demo_user_table_preprocess_all_feature_v1, item table rec_sln_demo_item_table_preprocess_all_feature_v1, and label table rec_sln_demo_label_table in your workspace. These tables are used in the subsequent operations.
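Optionally, you can confirm that the three tables were synchronized before you continue. The following is a minimal verification sketch for a DataWorks PyODPS 3 node; it assumes that `o` is the MaxCompute entry point that DataWorks preconfigures for PyODPS nodes and that the table names match the ones listed above:

```python
# Minimal verification sketch for a DataWorks PyODPS 3 node.
# Assumption: `o` is the MaxCompute entry point that DataWorks injects into PyODPS nodes.
tables = [
    'rec_sln_demo_user_table_preprocess_all_feature_v1',
    'rec_sln_demo_item_table_preprocess_all_feature_v1',
    'rec_sln_demo_label_table',
]
for name in tables:
    if o.exist_table(name):
        # Print the table name and the number of synchronized partitions.
        print(name, 'exists, partitions:', len(list(o.get_table(name).partitions)))
    else:
        print(name, 'does not exist yet')
```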
Configure data stores
In most cases, you need to configure an offline data store, such as a MaxCompute project, and an online data store, such as a Hologres instance, a GraphCompute instance, or a Tablestore instance, in FeatureStore. In this example, a MaxCompute project is configured as an offline data store and a Hologres instance is configured as an online data store.
Log on to the PAI console. In the left-side navigation pane, choose Data Preparation > FeatureStore.
On the FeatureStore page, select a workspace from the drop-down list and click Enter FeatureStore.
Configure a MaxCompute data store.
On the Store tab, click Create Store. In the Create Store panel, configure the parameters that are described in the following table for the MaxCompute data store.
| Parameter | Description |
| --- | --- |
| Type | Select MaxCompute from the Type drop-down list. |
| Name | Specify a custom name. |
| MaxCompute Project Name | Select the MaxCompute project that you created. |
Copy the authorization statement and click Go to execute it in DataWorks. After you execute the copied statement, the Hologres instance is authorized to synchronize data from the MaxCompute project.
Note: To grant permissions to the Hologres instance, make sure that your account has admin permissions. For more information, see Manage user permissions by using commands or Manage user permissions in the MaxCompute console.
Click Submit.
Configure a Hologres data store.
On the Store tab, click Create Store. In the Create Store panel, configure the parameters that are described in the following table for the Hologres data store.
| Parameter | Description |
| --- | --- |
| Type | Select Hologres from the Type drop-down list. |
| Name | Specify a custom name. |
| Instance ID | Select the Hologres instance that you created. |
| Database Name | Select the database that you created in the Hologres instance. |
Click Submit.
Grant the permissions that are required to access the Hologres instance. For more information, see Configure data sources.
2. Create a project and register feature tables in FeatureStore
You can create a project and register feature tables in FeatureStore in the PAI console or by using FeatureStore SDK based on your business requirements. You must use FeatureStore SDK to export a training dataset and synchronize data. Therefore, you still need to install FeatureStore SDK for Python after you create a project and register feature tables in the PAI console.
Method 1: Use the PAI console
Create a project in FeatureStore.
Log on to the PAI console. In the left-side navigation pane, choose Data Preparation > FeatureStore.
On the FeatureStore page, select a workspace from the drop-down list and click Enter FeatureStore.
Click Create Project. On the Create Project page, configure the project parameters that are described in the following table.
| Parameter | Description |
| --- | --- |
| Name | Specify a custom name. In this example, fs_demo is used. |
| Description | Enter a custom description. |
| Offline Store | Select the MaxCompute data store that you configured. |
| Online Store | Select the Hologres data store that you configured. |
Click Submit.
Create feature entities.
On the FeatureStore page, find the created project and click the project name to go to the Project Details page.
On the Feature Entity tab, click Create Feature Entity. In the Create Feature Entity panel, configure the parameters that are described in the following table for the user feature entity.
| Parameter | Description |
| --- | --- |
| Feature Entity Name | Specify a custom name. In this example, user is used. |
| Join Id | Set this parameter to user_id. |
Click Submit.
Click Create Feature Entity. In the Create Feature Entity panel, configure the parameters that are described in the following table for the item feature entity.
| Parameter | Description |
| --- | --- |
| Feature Entity Name | Specify a custom name. In this example, item is used. |
| Join Id | Set this parameter to item_id. |
Click Submit.
Create feature views.
On the Feature View tab of the Project Details page, click Create Feature View. In the Create Feature View panel, configure the parameters that are described in the following table for the user feature view.
| Parameter | Description |
| --- | --- |
| View Name | Specify a custom name. In this example, user_table_preprocess_all_feature_v1 is used. |
| Type | Select Offline. |
| Write Mode | Select Use Offline Table. |
| Store | Select the MaxCompute data store that you configured. |
| Feature Table | Select the prepared user table rec_sln_demo_user_table_preprocess_all_feature_v1. |
| Feature Field | Select the user_id primary key field. |
| Synchronize Online Feature Table | Select Yes. |
| Feature Entity | Select user. |
| Feature Lifecycle | Use the default value. |
Click Submit.
Click Create Feature View. In the Create Feature View panel, configure the parameters that are described in the following table for the item feature view.
| Parameter | Description |
| --- | --- |
| View Name | Specify a custom name. In this example, item_table_preprocess_all_feature_v1 is used. |
| Type | Select Offline. |
| Write Mode | Select Use Offline Table. |
| Store | Select the MaxCompute data store that you configured. |
| Feature Table | Select the prepared item table rec_sln_demo_item_table_preprocess_all_feature_v1. |
| Feature Field | Select the item_id primary key field. |
| Synchronize Online Feature Table | Select Yes. |
| Feature Entity | Select item. |
| Feature Lifecycle | Use the default value. |
Click Submit.
Create a label table.
On the Label Table tab of the Project Details page, click Create Label Table. In the Create Label Table panel, configure the parameters that are described in the following table for the label table.
| Parameter | Description |
| --- | --- |
| Store | Select the MaxCompute data store that you configured. |
| Table Name | Select the prepared label table rec_sln_demo_label_table. |
Click Submit.
Create a model feature.
On the Model Features tab of the Project Details page, click Create Model Feature. In the Create Model Feature panel, configure the parameters that are described in the following table for the model feature.
| Parameter | Description |
| --- | --- |
| Model Feature Name | Specify a custom name. In this example, fs_rank_v1 is used. |
| Select Feature | Select the user feature view and item feature view that you created. |
| Label Table Name | Select the label table rec_sln_demo_label_table that you created. |
Click Submit.
On the Model Features tab, find the model feature that you created and click the name of the model feature.
On the Basic Information tab of the Model Feature Details panel, view the value of the Export Table Name parameter. In this example, the value of the Export Table Name parameter is fs_demo_fs_rank_v1_trainning_set. You can use this table to generate features and train a model.
Install FeatureStore SDK for Python. For more information, see Method 2: Use FeatureStore SDK for Python in this topic.
Method 2: Use FeatureStore SDK for Python
Log on to the DataWorks console.
In the left-side navigation pane, click Resource Groups.
On the Exclusive Resource Groups tab, find the resource group that you want to manage. Move the pointer over the icon in the Actions column and select O&M Assistant.
Click Create Command. In the Create Command panel, configure the command parameters that are described in the following table.
| Parameter | Description |
| --- | --- |
| Command Name | Specify a custom name. In this example, install is used. |
| Command Type | Select Manual Installation (You cannot run pip commands to install third-party packages). |
| Command Content | `/home/tops/bin/pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple https://feature-store-py.oss-cn-beijing.aliyuncs.com/package/feature_store_py-1.3.1-py3-none-any.whl` |
| Timeout | Specify a timeout period. |
Click Create.
Click Run Command. In the message that appears, click Run.
Click Refresh to view the latest status of the command. If the state of the command changes to Successful, FeatureStore SDK is installed.
For more information about how to use FeatureStore SDK, see DSW Gallery.
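After the command succeeds, you can run a quick connectivity check from a PyODPS 3 node. The following is a minimal sketch that reuses the calls shown in the synchronization code later in this topic; the fs_demo project name and the cn-beijing region are examples and must match your own configuration:

```python
from feature_store_py.fs_client import FeatureStoreClient

# Assumption: `o` is the MaxCompute entry point preconfigured in DataWorks PyODPS nodes,
# and its credentials can access the FeatureStore project.
access_key_id = o.account.access_id
access_key_secret = o.account.secret_access_key

fs = FeatureStoreClient(access_key_id=access_key_id,
                        access_key_secret=access_key_secret,
                        region='cn-beijing')

# fs_demo is the project created in this topic; replace it with your project name.
project = fs.get_project('fs_demo')
feature_view = project.get_feature_view('user_table_preprocess_all_feature_v1')
print('feature view loaded:', feature_view is not None)
```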
3. Configure routine data synchronization nodes
Before you publish a model, you must configure routine data synchronization nodes to synchronize data from the offline data store to the online data store on a regular basis. This way, features can be read from the online data store in real time. In this example, data in the user and item tables must be synchronized on a regular basis. To configure routine data synchronization nodes, perform the following steps:
Log on to the DataWorks console.
In the left-side navigation pane, choose Data Development and Governance > DataStudio.
On the DataStudio page, select the DataWorks workspace that you created and click Go to DataStudio.
Synchronize data from the user table on a regular basis.
Move the pointer over Create and choose Create Node > MaxCompute > PyODPS 3.
Copy the following code to the code editor. The code is used to synchronize data from the user_table_preprocess_all_feature_v1 feature view on a regular basis.
```python
from feature_store_py.fs_client import FeatureStoreClient
import datetime
from feature_store_py.fs_datasource import MaxComputeDataSource
import sys

cur_day = args['dt']
print('cur_day = ', cur_day)

access_key_id = o.account.access_id
access_key_secret = o.account.secret_access_key
fs = FeatureStoreClient(access_key_id=access_key_id, access_key_secret=access_key_secret, region='cn-beijing')
cur_project_name = 'fs_demo'
project = fs.get_project(cur_project_name)

feature_view_name = 'user_table_preprocess_all_feature_v1'
batch_feature_view = project.get_feature_view(feature_view_name)
task = batch_feature_view.publish_table(partitions={'ds': cur_day}, mode='Overwrite')
task.wait()
task.print_summary()
```
Click Properties on the right side of the tab. In the Properties panel, configure the scheduling parameters that are described in the following table.
| Parameter | Description |
| --- | --- |
| Scheduling Parameter | Set Parameter Name to dt and Parameter Value to $[yyyymmdd-1]. |
| Resource Group | Select the exclusive resource group that you created. |
| Dependencies | Select the user table that you created. |
After the node is configured and tested, save and submit the node configurations.
Backfill data for the node. For more information, see the Synchronize data tables section of this topic.
Synchronize data from the item table on a regular basis.
Move the pointer over Create and choose Create Node > MaxCompute > PyODPS 3. In the Create Node dialog box, configure the node parameters.
Click Confirm.
Copy the following code to the code editor:
Synchronize data from the item_table_preprocess_all_feature_v1 feature view on a regular basis; a sketch adapted from the user feature view code is shown below.
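The following is a minimal sketch adapted from the user feature view synchronization code in the previous step; it assumes the same fs_demo project, cn-beijing region, and ds partition column, and only the feature view name changes:

```python
from feature_store_py.fs_client import FeatureStoreClient
from feature_store_py.fs_datasource import MaxComputeDataSource
import datetime
import sys

# The dt scheduling parameter is passed in by DataWorks, as in the user table node.
cur_day = args['dt']
print('cur_day = ', cur_day)

access_key_id = o.account.access_id
access_key_secret = o.account.secret_access_key
fs = FeatureStoreClient(access_key_id=access_key_id,
                        access_key_secret=access_key_secret,
                        region='cn-beijing')
project = fs.get_project('fs_demo')

# Publish the ds=cur_day partition of the item feature view to the online store.
feature_view_name = 'item_table_preprocess_all_feature_v1'
batch_feature_view = project.get_feature_view(feature_view_name)
task = batch_feature_view.publish_table(partitions={'ds': cur_day}, mode='Overwrite')
task.wait()
task.print_summary()
```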
Click Properties on the right side of the tab. In the Properties panel, configure the scheduling parameters that are described in the following table.
| Parameter | Description |
| --- | --- |
| Scheduling Parameter | Set Parameter Name to dt and Parameter Value to $[yyyymmdd-1]. |
| Resource Group | Select the exclusive resource group that you created. |
| Dependencies | Select the item table that you created. |
After the node is configured and tested, save and submit the node configurations.
Backfill data for the node. For more information, see the Synchronize data tables section of this topic.
After the data is synchronized, you can view the latest synchronized features in the Hologres data store.
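Because Hologres is compatible with the PostgreSQL protocol, you can also check the synchronized features with any PostgreSQL client. The following is a minimal sketch, assuming the psycopg2 package, a Hologres endpoint and credentials provided through environment variables, and that you look up the generated online table name before you query it (the exact name depends on your project and feature view):

```python
import os
import psycopg2

# Assumptions: these environment variables hold your Hologres endpoint, port, database
# name, and AccessKey pair; adjust them to your own setup.
conn = psycopg2.connect(
    host=os.environ['HOLO_HOST'],
    port=os.environ.get('HOLO_PORT', '80'),
    dbname=os.environ['HOLO_DB'],
    user=os.environ['ACCESS_KEY_ID'],
    password=os.environ['ACCESS_KEY_SECRET'],
)
with conn.cursor() as cur:
    # List the online tables that FeatureStore created for the feature views.
    cur.execute(
        "SELECT table_schema, table_name FROM information_schema.tables "
        "WHERE table_name LIKE '%table_preprocess_all_feature_v1%'"
    )
    for schema, table in cur.fetchall():
        print(schema, table)
    # Replace the placeholders with an actual schema and table name found above.
    # cur.execute('SELECT * FROM <schema>.<online_table_name> LIMIT 10')
conn.close()
```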
4. Export a training dataset
Log on to the DataWorks console.
In the left-side navigation pane, choose Data Development and Governance > DataStudio.
On the DataStudio page, select the DataWorks workspace that you created and click Go to DataStudio.
Move the pointer over Create and choose Create Node > MaxCompute > PyODPS 3. In the Create Node dialog box, configure the node parameters that are described in the following table.
| Parameter | Description |
| --- | --- |
| Engine Instance | Select the MaxCompute compute engine instance that you created. |
| Node Type | Set this parameter to PyODPS 3. |
| Path | Choose Business Flow > Workflow > MaxCompute. |
| Name | Specify a custom name. |
Click Confirm.
Copy the following code to the code editor:
```python
from feature_store_py.fs_client import FeatureStoreClient
from feature_store_py.fs_project import FeatureStoreProject
from feature_store_py.fs_datasource import LabelInput, MaxComputeDataSource, TrainingSetOutput
from feature_store_py.fs_features import FeatureSelector
from feature_store_py.fs_config import LabelInputConfig, PartitionConfig, FeatureViewConfig
from feature_store_py.fs_config import TrainSetOutputConfig, EASDeployConfig
import datetime
import sys

cur_day = args['dt']
print('cur_day = ', cur_day)
offset = datetime.timedelta(days=-1)
pre_day = (datetime.datetime.strptime(cur_day, "%Y%m%d") + offset).strftime('%Y%m%d')
print('pre_day = ', pre_day)

access_key_id = o.account.access_id
access_key_secret = o.account.secret_access_key
fs = FeatureStoreClient(access_key_id=access_key_id, access_key_secret=access_key_secret, region='cn-beijing')
cur_project_name = 'fs_demo'
project = fs.get_project(cur_project_name)

label_partitions = PartitionConfig(name='ds', value=cur_day)
label_input_config = LabelInputConfig(partition_config=label_partitions)

user_partitions = PartitionConfig(name='ds', value=pre_day)
feature_view_user_config = FeatureViewConfig(name='user_table_preprocess_all_feature_v1',
                                             partition_config=user_partitions)

item_partitions = PartitionConfig(name='ds', value=pre_day)
feature_view_item_config = FeatureViewConfig(name='item_table_preprocess_all_feature_v1',
                                             partition_config=item_partitions)

feature_view_config_list = [feature_view_user_config, feature_view_item_config]
train_set_partitions = PartitionConfig(name='ds', value=cur_day)
train_set_output_config = TrainSetOutputConfig(partition_config=train_set_partitions)

model_name = 'fs_rank_v1'
cur_model = project.get_model(model_name)
task = cur_model.export_train_set(label_input_config, feature_view_config_list, train_set_output_config)
task.wait()
print("task_summary = ", task.task_summary)
```
Click Properties on the right side of the tab. In the Properties panel, configure the scheduling parameters that are described in the following table.
| Parameter | Description |
| --- | --- |
| Scheduling Parameter | Set Parameter Name to dt and Parameter Value to $[yyyymmdd-1]. |
| Resource Group | Select the exclusive resource group that you created. |
| Dependencies | Select the user and item tables that you created. |
After the node is configured and tested, save and submit the node configurations.
Backfill data for the node. For more information, see the Synchronize data from simulated tables section of this topic.
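To check the exported training dataset, you can inspect the export table from a PyODPS 3 node. The following is a minimal sketch, assuming the export table name fs_demo_fs_rank_v1_trainning_set shown on the model feature details page and a ds partition that matches the dt value used during the export:

```python
# Minimal inspection sketch for a DataWorks PyODPS 3 node.
# Assumptions: `o` is the preconfigured MaxCompute entry point and the table name
# matches the Export Table Name shown on the model feature details page.
table_name = 'fs_demo_fs_rank_v1_trainning_set'
t = o.get_table(table_name)

# Print the schema and the exported partitions.
print(t.schema)
for partition in t.partitions:
    print('partition:', partition.name)

# Preview a few rows; adjust the partition spec to the dt value that you exported.
with t.open_reader(partition='ds=20240101') as reader:  # hypothetical partition value
    for record in reader[:5]:
        print(record)
```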
5. Install and use FeatureStore SDK
FeatureStore SDK for Go
FeatureStore SDK for Go is open source. For more information, see aliyun-pai-featurestore-go-sdk.
Install FeatureStore SDK for Go
Run the following code to install FeatureStore SDK for Go:
go get github.com/aliyun/aliyun-pai-featurestore-go-sdk/v2
Use FeatureStore SDK for Go
Run the following code to initialize the client:
```go
accessId := os.Getenv("AccessId")
accessKey := os.Getenv("AccessKey")
regionId := "cn-hangzhou"
projectName := "fs_test_ots"

client, err := NewFeatureStoreClient(regionId, accessId, accessKey, projectName)
```
Note: The FeatureStore client must run in a virtual private cloud (VPC) so that FeatureStore SDK can directly connect to online data stores. For example, FeatureStore SDK can access a Hologres instance or a GraphCompute instance only over a specific VPC.
Retrieve features from a feature view.
```go
// get project by name
project, err := client.GetProject("fs_test_ots")
if err != nil {
	// t.Fatal(err)
}

// get featureview by name
user_feature_view := project.GetFeatureView("user_fea")
if user_feature_view == nil {
	// t.Fatal("feature view not exist")
}

// get online features
features, err := user_feature_view.GetOnlineFeatures([]interface{}{"100043186", "100060369"}, []string{"*"}, nil)
```
In the preceding code, `[]string{"*"}` indicates that all features in the feature view are retrieved. You can also specify the features that you want to retrieve.
Sample response:
[ { "city":"Hefei", "follow_cnt":1, "gender":"male", "user_id":"100043186" }, { "city":"", "follow_cnt":5, "gender":"male", "user_id":"100060369" } ]
Retrieve feature data from a model feature.
Each model feature can be associated with multiple feature entities. You can specify multiple join IDs to retrieve the corresponding features at a time.
The following sample code specifies two join IDs: `user_id` and `item_id`. For each join ID, the same number of values must be specified.
```go
// get project by name
project, err := client.GetProject("fs_test_ots")
if err != nil {
	// t.Fatal(err)
}

// get ModelFeature by name
model_feature := project.GetModelFeature("rank")
if model_feature == nil {
	// t.Fatal("model feature not exist")
}

// get online features
features, err := model_feature.GetOnlineFeatures(map[string][]interface{}{"user_id": {"100000676", "100004208"}, "item_id": {"238038872", "264025480"}})
```
Sample response:
[ { "age":26, "author":100015828, "category":"14", "city":"Shenyang", "duration":63, "gender":"male", "item_id":"238038872", "user_id":"100000676" }, { "age":23, "author":100015828, "category":"15", "city":"Xi'an", "duration":22, "gender":"male", "item_id":"264025480", "user_id":"100004208" } ]
You can also specify a feature entity to retrieve only the features of that entity. For example, if you specify the user feature entity for the same join IDs, only user features are returned.
Sample response:
[ { "age":26, "city":"Shenyang", "gender":"male", "user_id":"100000676" }, { "age":23, "city":"Xi'an", "gender":"male", "user_id":"100004208" } ]
FeatureStore SDK for Java
FeatureStore SDK for Java is open source. For more information, see aliyun-pai-featurestore-java-sdk.
In this example, a Hologres data store is used.
Run the following code to load environment variables and initialize the service.
```java
public static String accessId = "";
public static String accessKey = "";
// Configure the host based on the region in which the service resides.
public static String host = "";

// Obtain the AccessKey ID and AccessKey secret from the configured environment variables.
static {
    accessId = System.getenv("ACCESS_KEY_ID");
    accessKey = System.getenv("ACCESS_KEY_SECRET");
}
```
Initialize the Configuration class, which includes the region ID, AccessKey ID, AccessKey secret, and project name.
```java
Configuration cf = new Configuration("cn-hangzhou", Constants.accessId, Constants.accessKey, "ele28");
cf.setDomain(Constants.host); // By default, the VPC environment is used.
```
Initialize the client.
```java
ApiClient apiClient = new ApiClient(cf);
// Initialize the FeatureStore client.
FeatureStoreClient featureStoreClient = new FeatureStoreClient(apiClient);
```
Obtain the project by name. In this example, the project is named ele28.
```java
Project project = featureStoreClient.getProject("ele28");
if (project == null) {
    throw new RuntimeException("Project not found");
}
```
Retrieve the feature view of the project. In this example, the feature view is named mc_test.
```java
FeatureView featureView = project.getFeatureView("mc_test");
if (featureView == null) {
    throw new RuntimeException("FeatureView not found");
}
```
Retrieve feature data from a real-time feature view.
Map<String,String> m1=new HashMap<>(); m1.put("gender","gender1"); // Configure an alias. user_id='100027781'(FS_INT64) age='28'(FS_INT64) city='null'(FS_STRING) item_cnt='0'(FS_INT64) follow_cnt='0'(FS_INT64) follower_cnt='2'(FS_INT64) register_time='1697641608'(FS_INT64) tags='0'(FS_STRING) gender1='female'(FS_STRING) ---------------
You can use `new String[]{"*"}` to retrieve all features from the feature view. You can also specify the features that you want to retrieve.
```java
FeatureResult featureResult1 = featureView.getOnlineFeatures(
        new String[]{"100017768", "100027781", "100072534"}, new String[]{"*"}, m1);
```
Iterate over the results to print each feature name, value, and data type:
```java
while (featureResult1.next()) {
    System.out.println("---------------");
    // Iterate over the feature names and print each value and its data type.
    for (String m : featureResult1.getFeatureFields()) {
        System.out.print(String.format("%s=%s(%s) ", m, featureResult1.getObject(m), featureResult1.getType(m)));
    }
    System.out.println("---------------");
}
```
Sample response:
```
---------------
user_id='100017768'(FS_INT64) age='28'(FS_INT64) city='Dongguan'(FS_STRING) item_cnt='1'(FS_INT64) follow_cnt='1'(FS_INT64) follower_cnt='0'(FS_INT64) register_time='1697202320'(FS_INT64) tags='1,2'(FS_STRING) gender1='female'(FS_STRING)
---------------
---------------
user_id='100027781'(FS_INT64) age='28'(FS_INT64) city='null'(FS_STRING) item_cnt='0'(FS_INT64) follow_cnt='0'(FS_INT64) follower_cnt='2'(FS_INT64) register_time='1697641608'(FS_INT64) tags='0'(FS_STRING) gender1='female'(FS_STRING)
---------------
```
Retrieve a model feature. In this example, the model feature is named model_t1.
```java
Model model = project.getModelFeature("model_t1");
if (model == null) {
    throw new RuntimeException("Model not found");
}
```
Retrieve data from a model feature.
The following sample code specifies two join IDs: user_id and item_id. The number of values that are specified for user_id must be the same as that for item_id. In this example, only one value is specified for user_id and item_id.
Map<String, List<String>> m2=new HashMap<>(); m2.put("user_id",Arrays.asList("101683057")); m2.put("item_id",Arrays.asList("203665415"));
Retrieve all the feature data of the user feature entity that is associated with a model feature.
```java
FeatureResult featureResult2 = model.getOnlineFeaturesWithEntity(m2, "user");
```
Sample response:
```
---------------
user_id='101683057' age='28' city='Shenzhen' follower_cnt='234' follow_cnt='0' gender='male' item_cnt='0' register_time='1696407642' tags='2' item_id='203665415' author='132920407' category='14' click_count='0' duration='18.0' praise_count='10' pub_time='1698218997' title='#Idiom story'
---------------
```
FeatureStore SDK for C++
FeatureStore SDK for C++ is integrated with the EasyRec processor that is deployed as a scoring service. FeatureStore SDK for C++ is optimized for feature extraction, cache management, and read operations to provide high-performance and low-latency solutions for large-scale recommendation scenarios. FeatureStore SDK for C++ provides the following capabilities:
Reduced memory usage: FeatureStore SDK for C++ greatly reduces memory usage when a large amount of complex feature data is processed, especially under high feature load.
Accelerated feature extraction: FeatureStore SDK for C++ loads feature data from a MaxCompute data store into the Elastic Algorithm Service (EAS) cache instead of reading it from an online data store such as a Hologres instance or a GraphCompute instance. This shortens the time that is required to load feature data. In addition, MaxCompute provides higher stability and better scalability, which reduces the impact of scaling out online storage.
Improved model scoring: FeatureStore SDK for C++ further improves the TP100 scoring metric, provides more stable response times, and reduces the number of timeout requests. This improves the reliability of recommendation services and the user experience.
References
You can use FeatureStore with other Alibaba Cloud services to build a recommendation system. For more information, see Use FeatureStore to manage features in a recommendation system.