The label propagation algorithm (LPA) is a graph-based semi-supervised machine learning algorithm. The label (community) of a vertex depends on the labels of its neighboring vertices, and the degree of dependence is determined by the similarity between vertices. Labels are updated through iterative propagation until they become stable. After the labels of all vertices in a graph converge, the Label Propagation Clustering component outputs the group to which each vertex belongs.
Algorithm description
Graph clustering divides a graph into subgraphs based on the topology of the graph, so that the vertices within a subgraph are more densely linked to each other than to vertices in other subgraphs.
The algorithm initializes each vertex with a unique label. It then iterates over the vertices and assigns each vertex the label that appears most frequently among its neighboring vertices. The algorithm stops when every vertex has the label that appears most frequently among its neighboring vertices.
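The update rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used by the component. It assumes that edges are undirected, that "most frequent" means the largest sum of edge weights, that vertices are updated one at a time in sorted order, and that ties go to the largest label. Vertex weights are ignored. The function name `label_propagation` and the `EDGES` list are hypothetical; the edge values are taken from the example in this topic.

```python
from collections import defaultdict

# (source, target, weight) triples copied from the example edge table.
EDGES = [
    (1, 2, 0.7), (1, 3, 0.7), (1, 4, 0.6), (2, 3, 0.7), (2, 4, 0.6),
    (3, 4, 0.6), (4, 6, 0.3), (5, 6, 0.6), (5, 7, 0.7), (5, 8, 0.7),
    (6, 7, 0.6), (6, 8, 0.6), (7, 8, 0.7),
]


def label_propagation(edges, max_iter=30):
    """Return a {vertex: label} mapping (illustrative sketch only)."""
    # Assumption: every edge is treated as undirected.
    neighbors = defaultdict(list)
    for u, v, w in edges:
        neighbors[u].append((v, w))
        neighbors[v].append((u, w))

    # Each vertex starts with its own unique label.
    labels = {v: v for v in neighbors}

    for _ in range(max_iter):
        changed = False
        for v in sorted(neighbors):
            # Sum the edge weights contributed by each neighboring label.
            scores = defaultdict(float)
            for n, w in neighbors[v]:
                scores[labels[n]] += w
            # Assumption: ties are broken in favor of the largest label.
            best = max(scores, key=lambda lab: (scores[lab], lab))
            if best != labels[v]:
                labels[v] = best
                changed = True
        if not changed:
            # Every vertex already carries its dominant neighbor label.
            break
    return labels


labels = label_propagation(EDGES)
```

On the example graph this sketch places vertices 1 to 4 in one group and vertices 5 to 8 in another. The label values themselves depend on the update order and the tie-breaking rule, so they are not expected to match the group IDs that the component reports.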
Configure the component
Method 1: Configure the component on the pipeline page
You can add the Label Propagation Clustering component on the pipeline page of Machine Learning Designer in the Platform for AI (PAI) console. The following table describes the parameters.
Tab | Parameter | Description |
Fields Setting | Vertex Table: Vertex Column | The vertex column in the vertex table. |
Fields Setting | Vertex Table: Weight Column | The vertex weight column in the vertex table. |
Fields Setting | Edge Table: Source Vertex Column | The start vertex column in the edge table. |
Fields Setting | Edge Table: Target Vertex Column | The end vertex column in the edge table. |
Fields Setting | Edge Table: Weight Column | The edge weight column in the edge table. |
Parameters Setting | Maximum Iterations | The maximum number of iterations. Default value: 30. |
Tuning | Workers | The number of worker nodes for parallel job execution. The degree of parallelism and framework communication costs increase with the value of this parameter. |
Tuning | Memory Size per Worker (MB) | The maximum size of memory that a single job can use. Unit: MB. Default value: 4096. If the size of used memory exceeds the value of this parameter, the |
Method 2: Configure the component by using PAI commands
You can configure the Label Propagation Clustering component by using PAI commands. You can use the SQL Script component to run PAI commands. For more information, see Scenario 4: Execute PAI commands within the SQL script component in the "SQL Script" topic.
PAI -name LabelPropagationClustering
-project algo_public
-DinputEdgeTableName=LabelPropagationClustering_func_test_edge
-DfromVertexCol=flow_out_id
-DtoVertexCol=flow_in_id
-DinputVertexTableName=LabelPropagationClustering_func_test_node
-DvertexCol=node
-DoutputTableName=LabelPropagationClustering_func_test_result
-DhasEdgeWeight=true
-DedgeWeightCol=edge_weight
-DhasVertexWeight=true
-DvertexWeightCol=node_weight
-DrandSelect=true
-DmaxIter=100;
Parameter | Required | Default value | Description |
inputEdgeTableName | Yes | No default value | The name of the input edge table. |
inputEdgeTablePartitions | No | Full table | The partitions in the input edge table. |
fromVertexCol | Yes | No default value | The start vertex column in the input edge table. |
toVertexCol | Yes | No default value | The end vertex column in the input edge table. |
inputVertexTableName | Yes | No default value | The name of the input vertex table. |
inputVertexTablePartitions | No | Full table | The partitions in the input vertex table. |
vertexCol | Yes | No default value | The vertex column in the input vertex table. |
outputTableName | Yes | No default value | The name of the output table. |
outputTablePartitions | No | No default value | The partitions in the output table. |
lifecycle | No | No default value | The lifecycle of the output table. |
workerNum | No | No default value | The number of worker nodes for parallel job execution. The degree of parallelism and framework communication costs increase with the value of this parameter. |
workerMem | No | 4096 | The maximum size of memory that a single job can use. Unit: MB. Default value: 4096. If the size of used memory exceeds the value of this parameter, the |
splitSize | No | 64 | The data split size. Unit: MB. |
hasEdgeWeight | No | false | Specifies whether the edges in the input edge table have weights. |
edgeWeightCol | No | No default value | The edge weight column in the input edge table. |
hasVertexWeight | No | false | Specifies whether the vertices in the input vertex table have weights. |
vertexWeightCol | No | No default value | The vertex weight column in the input vertex table. |
randSelect | No | false | Specifies whether the maximum label value is to be randomly selected. |
maxIter | No | 30 | The maximum number of iterations. |
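When the command is generated from a script, a small helper can check that the required parameters are present before the command text is assembled. The following Python sketch is only an illustration: `build_pai_command` and `REQUIRED` are hypothetical names and not part of any PAI SDK. The helper only produces the command text, which still has to be run in an SQL Script component.

```python
# Parameters marked "Yes" in the Required column of the preceding table.
REQUIRED = (
    "inputEdgeTableName", "fromVertexCol", "toVertexCol",
    "inputVertexTableName", "vertexCol", "outputTableName",
)


def build_pai_command(params, name="LabelPropagationClustering",
                      project="algo_public"):
    """Assemble the PAI command text from a dict (hypothetical helper)."""
    missing = [key for key in REQUIRED if key not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))

    parts = ["PAI -name " + name, "-project " + project]
    for key, value in params.items():
        # The command uses lowercase true/false for Boolean switches.
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append("-D{}={}".format(key, value))
    return "\n    ".join(parts) + ";"


command = build_pai_command({
    "inputEdgeTableName": "LabelPropagationClustering_func_test_edge",
    "fromVertexCol": "flow_out_id",
    "toVertexCol": "flow_in_id",
    "inputVertexTableName": "LabelPropagationClustering_func_test_node",
    "vertexCol": "node",
    "outputTableName": "LabelPropagationClustering_func_test_result",
    "hasEdgeWeight": True,
    "edgeWeightCol": "edge_weight",
    "maxIter": 100,
})
print(command)
```

Optional parameters that are left out of the dictionary are simply omitted from the command, so the defaults listed in the table apply.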
Example
Add an SQL Script component to the canvas and execute the following SQL statements to generate training data.
drop table if exists LabelPropagationClustering_func_test_edge;
create table LabelPropagationClustering_func_test_edge as
select * from
(
  select '1' as flow_out_id,'2' as flow_in_id,0.7 as edge_weight
  union all select '1' as flow_out_id,'3' as flow_in_id,0.7 as edge_weight
  union all select '1' as flow_out_id,'4' as flow_in_id,0.6 as edge_weight
  union all select '2' as flow_out_id,'3' as flow_in_id,0.7 as edge_weight
  union all select '2' as flow_out_id,'4' as flow_in_id,0.6 as edge_weight
  union all select '3' as flow_out_id,'4' as flow_in_id,0.6 as edge_weight
  union all select '4' as flow_out_id,'6' as flow_in_id,0.3 as edge_weight
  union all select '5' as flow_out_id,'6' as flow_in_id,0.6 as edge_weight
  union all select '5' as flow_out_id,'7' as flow_in_id,0.7 as edge_weight
  union all select '5' as flow_out_id,'8' as flow_in_id,0.7 as edge_weight
  union all select '6' as flow_out_id,'7' as flow_in_id,0.6 as edge_weight
  union all select '6' as flow_out_id,'8' as flow_in_id,0.6 as edge_weight
  union all select '7' as flow_out_id,'8' as flow_in_id,0.7 as edge_weight
)tmp;
drop table if exists LabelPropagationClustering_func_test_node;
create table LabelPropagationClustering_func_test_node as
select * from
(
  select '1' as node,0.7 as node_weight
  union all select '2' as node,0.7 as node_weight
  union all select '3' as node,0.7 as node_weight
  union all select '4' as node,0.5 as node_weight
  union all select '5' as node,0.7 as node_weight
  union all select '6' as node,0.5 as node_weight
  union all select '7' as node,0.7 as node_weight
  union all select '8' as node,0.7 as node_weight
)tmp;
Data structure
Add an SQL Script component to the canvas and run the following PAI commands to train the model.
drop table if exists ${o1};
PAI -name LabelPropagationClustering
    -project algo_public
    -DinputEdgeTableName=LabelPropagationClustering_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DinputVertexTableName=LabelPropagationClustering_func_test_node
    -DvertexCol=node
    -DoutputTableName=${o1}
    -DhasEdgeWeight=true
    -DedgeWeightCol=edge_weight
    -DhasVertexWeight=true
    -DvertexWeightCol=node_weight
    -DrandSelect=true
    -DmaxIter=100;
Right-click the SQL Script component and choose View Data > SQL Script Output to view the training results.
| node | group_id |
| ---- | -------- |
| 1 | 3 |
| 3 | 3 |
| 5 | 7 |
| 7 | 7 |
| 2 | 3 |
| 4 | 3 |
| 6 | 7 |
| 8 | 7 |
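To read the result, it can help to group the vertices by `group_id` and count how many edges stay inside a group. The Python sketch below does this for the rows shown above and the example edge list, ignoring weights. The variable names are illustrative, and the rows are typed in by hand rather than read from the output table.

```python
from collections import defaultdict

# (node, group_id) rows copied from the result table above.
RESULT_ROWS = [(1, 3), (3, 3), (5, 7), (7, 7), (2, 3), (4, 3), (6, 7), (8, 7)]

# (source, target) pairs copied from the example edge table.
EDGE_PAIRS = [
    (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 6),
    (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8),
]

group_of = dict(RESULT_ROWS)

# Collect the members of each group.
members = defaultdict(list)
for node, group_id in RESULT_ROWS:
    members[group_id].append(node)
communities = {group_id: sorted(nodes) for group_id, nodes in members.items()}

# Count edges whose endpoints share a group versus edges that cross groups.
intra = sum(1 for u, v in EDGE_PAIRS if group_of[u] == group_of[v])
inter = len(EDGE_PAIRS) - intra

print(communities)  # {3: [1, 2, 3, 4], 7: [5, 6, 7, 8]}
print(intra, inter)  # 12 1
```

Only the edge between vertices 4 and 6 crosses the two groups, which is consistent with the goal that links inside a subgraph outnumber links between subgraphs.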