Label propagation algorithm

A label propagation algorithm (LPA) is a semi-supervised machine learning algorithm. The label (community) of a vertex depends on the labels of its neighboring vertices, and the degree of dependence is determined by the similarity between the vertices. Labels are updated through iterative propagation until they become stable. After all vertices in a graph converge, the Label Propagation Clustering component outputs the group to which each vertex belongs.

Algorithm description

Graph clustering divides a graph into subgraphs based on the topology of the graph, so that the vertices within a subgraph are connected by more links than the vertices in different subgraphs.
This algorithm initializes each vertex with a unique label, iterates through the vertices, and assigns each vertex the label that appears most frequently among its neighboring vertices. The algorithm stops when every vertex has the label that appears most frequently among its neighboring vertices.
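The procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used by the Label Propagation Clustering component: the function name `label_propagation`, the fixed visiting order, the rule that ties go to the largest label, and the sample graph are all assumptions made for the example. Edge weights stand in for the similarity between vertices, so each neighbor votes for its label with the weight of the connecting edge.

```python
from collections import defaultdict


def label_propagation(edges, max_iter=30):
    """Cluster an undirected graph given as (u, v, weight) triples."""
    neighbors = defaultdict(dict)
    for u, v, w in edges:
        neighbors[u][v] = w
        neighbors[v][u] = w

    # Every vertex starts with a unique label: its own identifier.
    labels = {v: v for v in neighbors}

    for _ in range(max_iter):
        changed = False
        for v in sorted(neighbors):  # fixed visiting order (assumption)
            # Sum the edge weights behind each label seen among the neighbors.
            score = defaultdict(float)
            for n, w in neighbors[v].items():
                score[labels[n]] += w
            best = max(score.values())
            candidates = [l for l, s in score.items() if best - s < 1e-9]
            if labels[v] in candidates:
                continue  # the vertex already holds a top-scoring label
            labels[v] = max(candidates)  # tie-break: largest label (assumption)
            changed = True
        if not changed:  # stable: no vertex changed its label in this pass
            break
    return labels


# Illustrative graph: two triangles joined by one weak link (c-d).
edges = [
    ("a", "b", 1.0), ("a", "c", 1.0), ("b", "c", 1.0),
    ("c", "d", 0.2),
    ("d", "e", 1.0), ("d", "f", 1.0), ("e", "f", 1.0),
]
labels = label_propagation(edges)
print(labels)
```

On this sample graph the sketch places a, b, and c in one group and d, e, and f in another. Label values are arbitrary identifiers, so a different visiting order or tie-breaking rule can produce the same groups under different labels.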

Configure the component

Method 1: Configure the component on the pipeline page

You can add the Label Propagation Clustering component on the pipeline page of Machine Learning Designer in the Platform for AI (PAI) console. The following table describes the parameters.

Tab	Parameter	Description
Fields Setting	Vertex Table: Vertex Column	The vertex column in the vertex table.
	Vertex Table: Weight Column	The vertex weight column in the vertex table.
	Edge Table: Source Vertex Column	The start vertex column in the edge table.
	Edge Table: Target Vertex Column	The end vertex column in the edge table.
	Edge Table: Weight Column	The edge weight column in the edge table.
Parameters Setting	Maximum Iterations	The maximum number of iterations. Default value: 30.
Tuning	Workers	The number of workers for parallel job execution. A larger value increases both the degree of parallelism and the framework communication overhead.
	Memory Size per Worker (MB)	The maximum size of memory that a single worker can use. Unit: MB. Default value: 4096. If the memory usage exceeds the value of this parameter, an `OutOfMemory` error is reported.

Method 2: Configure the component by using PAI commands

You can configure the Label Propagation Clustering component by using PAI commands. You can use the SQL Script component to run PAI commands. For more information, see Scenario 4: Execute PAI commands within the SQL script component in the "SQL Script" topic.

PAI -name LabelPropagationClustering
    -project algo_public
    -DinputEdgeTableName=LabelPropagationClustering_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DinputVertexTableName=LabelPropagationClustering_func_test_node
    -DvertexCol=node
    -DoutputTableName=LabelPropagationClustering_func_test_result
    -DhasEdgeWeight=true
    -DedgeWeightCol=edge_weight
    -DhasVertexWeight=true
    -DvertexWeightCol=node_weight
    -DrandSelect=true
    -DmaxIter=100;

Parameter	Required	Default value	Description
inputEdgeTableName	Yes	No default value	The name of the input edge table.
inputEdgeTablePartitions	No	Full table	The partitions in the input edge table.
fromVertexCol	Yes	No default value	The start vertex column in the input edge table.
toVertexCol	Yes	No default value	The end vertex column in the input edge table.
inputVertexTableName	Yes	No default value	The name of the input vertex table.
inputVertexTablePartitions	No	Full table	The partitions in the input vertex table.
vertexCol	Yes	No default value	The vertex column in the input vertex table.
outputTableName	Yes	No default value	The name of the output table.
outputTablePartitions	No	No default value	The partitions in the output table.
lifecycle	No	No default value	The lifecycle of the output table.
workerNum	No	No default value	The number of workers for parallel job execution. A larger value increases both the degree of parallelism and the framework communication overhead.
workerMem	No	4096	The maximum size of memory that a single worker can use. Unit: MB. If the memory usage exceeds the value of this parameter, an `OutOfMemory` error is reported.
splitSize	No	64	The data split size. Unit: MB.
hasEdgeWeight	No	false	Specifies whether the edges in the input edge table have weights.
edgeWeightCol	No	No default value	The edge weight column in the input edge table.
hasVertexWeight	No	false	Specifies whether the vertices in the input vertex table have weights.
vertexWeightCol	No	No default value	The vertex weight column in the input vertex table.
randSelect	No	false	Specifies whether the maximum label value is to be randomly selected.
maxIter	No	30	The maximum number of iterations.
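When the command is generated by a script rather than typed by hand, a small helper can check the required parameters before the command is submitted. The following Python sketch is a hypothetical convenience function and not part of PAI: the name `build_pai_command` is an assumption, and the helper only assembles the command text from the parameter names listed in the preceding table. It does not submit anything.

```python
# Parameters marked as required in the table above.
REQUIRED = (
    "inputEdgeTableName", "fromVertexCol", "toVertexCol",
    "inputVertexTableName", "vertexCol", "outputTableName",
)


def build_pai_command(params, name="LabelPropagationClustering",
                      project="algo_public"):
    """Return the PAI command text for the given parameter dictionary."""
    missing = [key for key in REQUIRED if key not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))

    lines = [f"PAI -name {name}", f"    -project {project}"]
    for key, value in params.items():
        if isinstance(value, bool):
            # The command uses lowercase true/false.
            value = "true" if value else "false"
        lines.append(f"    -D{key}={value}")
    return "\n".join(lines) + ";"


command = build_pai_command({
    "inputEdgeTableName": "LabelPropagationClustering_func_test_edge",
    "fromVertexCol": "flow_out_id",
    "toVertexCol": "flow_in_id",
    "inputVertexTableName": "LabelPropagationClustering_func_test_node",
    "vertexCol": "node",
    "outputTableName": "LabelPropagationClustering_func_test_result",
    "hasEdgeWeight": True,
    "edgeWeightCol": "edge_weight",
    "maxIter": 100,
})
print(command)
```

The generated text can be pasted into the SQL Script component in the same way as a handwritten command.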

Example

Add the SQL Script component to the canvas and execute the following SQL statements to generate training data.

drop table if exists LabelPropagationClustering_func_test_edge;
create table LabelPropagationClustering_func_test_edge as
select * from
(
    select '1' as flow_out_id,'2' as flow_in_id,0.7 as edge_weight
    union all
    select '1' as flow_out_id,'3' as flow_in_id,0.7 as edge_weight
    union all
    select '1' as flow_out_id,'4' as flow_in_id,0.6 as edge_weight
    union all
    select '2' as flow_out_id,'3' as flow_in_id,0.7 as edge_weight
    union all
    select '2' as flow_out_id,'4' as flow_in_id,0.6 as edge_weight
    union all
    select '3' as flow_out_id,'4' as flow_in_id,0.6 as edge_weight
    union all
    select '4' as flow_out_id,'6' as flow_in_id,0.3 as edge_weight
    union all
    select '5' as flow_out_id,'6' as flow_in_id,0.6 as edge_weight
    union all
    select '5' as flow_out_id,'7' as flow_in_id,0.7 as edge_weight
    union all
    select '5' as flow_out_id,'8' as flow_in_id,0.7 as edge_weight
    union all
    select '6' as flow_out_id,'7' as flow_in_id,0.6 as edge_weight
    union all
    select '6' as flow_out_id,'8' as flow_in_id,0.6 as edge_weight
    union all
    select '7' as flow_out_id,'8' as flow_in_id,0.7 as edge_weight
)tmp
;
drop table if exists LabelPropagationClustering_func_test_node;
create table LabelPropagationClustering_func_test_node as
select * from
(
    select '1' as node,0.7 as node_weight
    union all
    select '2' as node,0.7 as node_weight
    union all
    select '3' as node,0.7 as node_weight
    union all
    select '4' as node,0.5 as node_weight
    union all
    select '5' as node,0.7 as node_weight
    union all
    select '6' as node,0.5 as node_weight
    union all
    select '7' as node,0.7 as node_weight
    union all
    select '8' as node,0.7 as node_weight
)tmp;

Data structure

Add another SQL Script component to the canvas and run the following PAI commands to train the model.

drop table if exists ${o1};
PAI -name LabelPropagationClustering
    -project algo_public
    -DinputEdgeTableName=LabelPropagationClustering_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DinputVertexTableName=LabelPropagationClustering_func_test_node
    -DvertexCol=node
    -DoutputTableName=${o1}
    -DhasEdgeWeight=true
    -DedgeWeightCol=edge_weight
    -DhasVertexWeight=true
    -DvertexWeightCol=node_weight
    -DrandSelect=true
    -DmaxIter=100;

Right-click the SQL Script component and choose View Data > SQL Script Output to view the training results.

| node | group_id |
| ---- | -------- |
| 1    | 3        |
| 3    | 3        |
| 5    | 7        |
| 7    | 7        |
| 2    | 3        |
| 4    | 3        |
| 6    | 7        |
| 8    | 7        |
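As a quick sanity check, the result can be compared with the goal stated in the algorithm description: links within a group should outnumber links between groups. The following Python sketch copies the edge list and the result table from this example into plain Python structures. The function name `count_links` is an assumption made for the illustration, and edge weights are ignored.

```python
from collections import defaultdict

# Edges from LabelPropagationClustering_func_test_edge (weights omitted).
edges = [
    ("1", "2"), ("1", "3"), ("1", "4"), ("2", "3"), ("2", "4"), ("3", "4"),
    ("4", "6"),
    ("5", "6"), ("5", "7"), ("5", "8"), ("6", "7"), ("6", "8"), ("7", "8"),
]

# node -> group_id, copied from the result table above.
group_of = {
    "1": "3", "3": "3", "5": "7", "7": "7",
    "2": "3", "4": "3", "6": "7", "8": "7",
}


def count_links(edges, group_of):
    """Return (links inside a group, links between groups)."""
    intra = sum(1 for u, v in edges if group_of[u] == group_of[v])
    return intra, len(edges) - intra


# Collect the members of each group for easier reading.
members = defaultdict(list)
for node in sorted(group_of):
    members[group_of[node]].append(node)

print(dict(members))   # {'3': ['1', '2', '3', '4'], '7': ['5', '6', '7', '8']}
print(count_links(edges, group_of))   # (12, 1)
```

Twelve of the thirteen edges connect vertices in the same group, and only the edge between vertices 4 and 6 crosses the two groups.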