Use Bipartite GraphSAGE for matching recall - Platform For AI

This topic describes how to use Bipartite Graph SAmple and aggreGatE (GraphSAGE) to obtain feature vectors of users and items for matching recall.

Background information

Graph neural network is a widely discussed concept in deep learning. The open source Graph-Learn framework of Platform for AI (PAI) provides a large number of graph learning algorithms. GraphSAGE is a matching algorithm for graph neural networks. Bipartite GraphSAGE is an extension of GraphSAGE and is used to process bipartite graphs. Bipartite GraphSAGE is used by Taobao for matching recall.

In a bipartite graph, each user or item is represented by a vertex. The correlation, such as clicking or purchasing, between a user and an item is represented by an edge. The system samples adjacent vertices of each vertex that represents a user and each vertex that represents an item based on the User-Item-User-Item... and Item-User-Item-User… meta paths.

Prerequisites

A workspace is created. For more information, see Create a workspace.
MaxCompute resources are associated with the workspace. For more information, see Manage workspaces.

Procedure

Go to the Machine Learning Designer page.
1. Log on to the PAI console.
2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
3. In the left-side navigation pane, choose Model Training > Visualized Modeling (Designer).

Create a pipeline.

On the Visualized Modeling (Designer) page, click the Preset Templates tab.
In the Preset Templates section, click Create on the RecSys-GraphEmbedding card.
In the Create Pipeline dialog box, configure the parameters. You can use their default values.
The value specified for the Pipeline Data Path parameter is the Object Storage Service (OSS) bucket path of the temporary data and models generated during the runtime of the pipeline.
Click OK.
It requires about 10 seconds to create the pipeline.
On the pipelines tab, double-click the RecSys-GraphEmbedding pipeline to open the pipeline.

View the components of the pipeline on the canvas. The following figure shows the pipeline that is automatically created based on the preset template.

图神经网络召回

Component	Description
1	The component imports data from the table that records user behavior on items. The table contains the following fields: user: the ID of the user. The value must be of the BIGINT type. item: the ID of the item. The value must be of the BIGINT type. weight: the behavior performed by the user on the item. The value must be of the DOUBLE type. For example, a value of 1 specifies that the user purchased the item and a value of 2 specifies that the user added the item to favorites.
2	The component imports data from the user feature table. The table contains the following fields: user: the ID of the user. The value must be of the BIGINT type. feature: one or more features of the user. The value must be of the STRING type. Each user must be added with at least one feature. Separate multiple features with colons (:). Each feature must be indicated by a FLOAT-type number. The system processes the features as continuous features. Example: `1:1:1`.
3	The component imports data from the item feature table. The table contains the following fields: item: the ID of the item. The value must be of the BIGINT type. feature: one or more features of the item. The value must be of the STRING type. Each item must be added with at least one feature. Separate multiple features with colons (:). Each feature must be indicated by a FLOAT-type number. The system processes the features as continuous features. Example: `1:1:2`.
4	The component generates a user vector table and an item vector table for matching recall. The following section describes these parameters. User feature number: the number of features of the user. The value is the total number of features in the user feature table. Item feature number: the number of features of the item. The value is the total number of features in the item feature table. epoch: the number of epochs for model training. batch_size: the size of data samples in each batch of the training job. learning_rate: the learning rate for model training. drop_out: the dropout rate of each layer during training. hidden_dim the dimension of the hidden layer. output_dim: the dimension of the final output embedding. Number of neighbors in user multi-hop sampling: the number of sampled neighbors for user. For example, a value of [10,2] specifies that the first hop samples 10 neighbors and the second hop samples 2 neighbors. The system samples neighbors of each vertex that represents a user based on the User-Item-User-Item... path. Item multi-hop sampling neighbor number: the number of sampled neighbors for item. The system samples neighbors of each vertex that represents an item based on the Item-User-Item-User... path. Negative samples: the number of negative samples that correspond to a positive sample. We recommend that you specify a value within the range from 5 to 10. Aggregation type: the aggregation type of neighbors. Valid values: gcn, sum and mean. 'gcn' specifies that neighbors are aggregated in a Graph Convolutional Network (GCN) manner. 'mean' specifies that the neighborhood aggregation operation is performed by calculating an average. 'sum' specifies that the neighborhood aggregation operation is performed by calculating the sum of the neighbor embeddings. In most cases, 'mean' is preferred. Whether to do feature batch normalization: specifies whether to normalize the input features. Whether to use synchronized training: specifies whether to use synchronous training. Default value: False. The epoch parameter becomes invalid during synchronized training. When you set this parameter to False, if the training result does not converge, switch this parameter to True to enable synchronous training. Note: If you set this parameter to False, the training provides better results. Maximum number of synchronized training steps: Configure this parameter only if you use synchronous training. The epoch parameter becomes invalid when you configure this parameter. You can use the number of edges or multiply the number of workers by the value of the batch_size parameter to estimate the number of steps required to traverse the bipartite graph.

Run the pipeline and view the results.
1. In the upper-left corner of the canvas, click the icon.
2. After you run the pipeline, right-click the graphSage component on the canvas and choose View Data > user_embedding. In the dialog box that appears, you can view the feature vectors that are generated for users.
3. After you run the pipeline, right-click the graphSage component on the canvas and choose View Data > item_embedding. In the dialog box that appears, you can view the feature vectors that are generated for items.