The vertex clustering coefficient measures the degree to which the vertices in an undirected graph tend to cluster together, and therefore indicates how dense the neighborhood of each vertex is. You can use the vertex clustering coefficient to compare the densities of different groups and to estimate the maximum density of a group. For each vertex, the Vertex Clustering Coefficient component returns the number of neighboring vertices, the number of edges between those neighbors, and the density.
Algorithm description
The vertex clustering coefficient calculates the density of the neighborhood around each vertex in an undirected graph. The density of a star network is 0, and the density of a fully meshed network is 1.
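Based on the sample output in the Example section of this topic, the density that the component returns for a vertex matches the standard local clustering coefficient: density = 2 × edge_cnt / (node_cnt × (node_cnt − 1)), where node_cnt is the number of neighbors of the vertex and edge_cnt is the number of edges that exist between those neighbors. At the center of a star network, no two neighbors are connected, so the density is 0. In a fully meshed network, every pair of neighbors is connected, so the density is 1.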
Configure the component
Method 1: Configure the component on the pipeline page
You can add the Vertex Clustering Coefficient component on the pipeline page of Machine Learning Designer in the Platform for AI (PAI) console. The following table describes the parameters.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Start Vertex | The start vertex column in the edge table. |
| Fields Setting | End Vertex | The end vertex column in the edge table. |
| Parameters Setting | Largest Vertex Degree | If the degree of a vertex is greater than the value of this parameter, sampling is performed for that vertex. Default value: 500. |
| Tuning | Workers | The number of workers for parallel job execution. The degree of parallelism and the framework communication costs increase with the value of this parameter. |
| Tuning | Memory Size per Worker (MB) | The maximum memory size that a single worker can use. Unit: MB. Default value: 4096. If the size of the used memory exceeds the value of this parameter, an OutOfMemory error is returned. |
| Tuning | Data Split Size (MB) | The data split size. Unit: MB. Default value: 64. |
Method 2: Configure the component by using PAI commands
You can configure the Vertex Clustering Coefficient component by using PAI commands. You can use the SQL Script component to run PAI commands. For more information, see Scenario 4: Execute PAI commands within the SQL script component in the "SQL Script" topic.
PAI -name NodeDensity
-project algo_public
-DinputEdgeTableName=NodeDensity_func_test_edge
-DfromVertexCol=flow_out_id
-DtoVertexCol=flow_in_id
-DoutputTableName=NodeDensity_func_test_result
-DmaxEdgeCnt=500;
| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| inputEdgeTableName | Yes | No default value | The name of the input edge table. |
| inputEdgeTablePartitions | No | Full table | The partitions in the input edge table. |
| fromVertexCol | Yes | No default value | The start vertex column in the input edge table. |
| toVertexCol | Yes | No default value | The end vertex column in the input edge table. |
| outputTableName | Yes | No default value | The name of the output table. |
| outputTablePartitions | No | No default value | The partitions in the output table. |
| lifecycle | No | No default value | The lifecycle of the output table. |
| maxEdgeCnt | No | 500 | If the degree of a vertex is greater than the value of this parameter, sampling is performed for that vertex. |
| workerNum | No | No default value | The number of workers for parallel job execution. The degree of parallelism and the framework communication costs increase with the value of this parameter. |
| workerMem | No | 4096 | The maximum memory size that a single worker can use. Unit: MB. If the size of the used memory exceeds the value of this parameter, an OutOfMemory error is returned. |
| splitSize | No | 64 | The data split size. Unit: MB. |
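The following command is a hypothetical variant of the command above that also specifies the optional parameters. The values of lifecycle, workerNum, workerMem, and splitSize are placeholders that you can adjust to your data size and cluster resources.
PAI -name NodeDensity
-project algo_public
-DinputEdgeTableName=NodeDensity_func_test_edge
-DfromVertexCol=flow_out_id
-DtoVertexCol=flow_in_id
-DoutputTableName=NodeDensity_func_test_result
-DmaxEdgeCnt=500
-Dlifecycle=28
-DworkerNum=8
-DworkerMem=8192
-DsplitSize=64;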
Example
Add the SQL Script component to the canvas as a node and execute the following SQL statements to generate the training data.
drop table if exists NodeDensity_func_test_edge;
create table NodeDensity_func_test_edge as
select * from (
    select '1' as flow_out_id, '2' as flow_in_id
    union all
    select '1' as flow_out_id, '3' as flow_in_id
    union all
    select '1' as flow_out_id, '4' as flow_in_id
    union all
    select '1' as flow_out_id, '5' as flow_in_id
    union all
    select '1' as flow_out_id, '6' as flow_in_id
    union all
    select '2' as flow_out_id, '3' as flow_in_id
    union all
    select '3' as flow_out_id, '4' as flow_in_id
    union all
    select '4' as flow_out_id, '5' as flow_in_id
    union all
    select '5' as flow_out_id, '6' as flow_in_id
    union all
    select '5' as flow_out_id, '7' as flow_in_id
    union all
    select '6' as flow_out_id, '7' as flow_in_id
) tmp;
drop table if exists NodeDensity_func_test_result;
create table NodeDensity_func_test_result
(
    node string,
    node_cnt bigint,
    edge_cnt bigint,
    density double,
    log_density double
);
Data structure
The generated edge table describes an undirected graph that consists of 7 vertices and 11 edges.
Add the SQL Script component to the canvas as a node and run the following statements to train the model.
drop table if exists ${o1};
PAI -name NodeDensity
-project algo_public
-DinputEdgeTableName=NodeDensity_func_test_edge
-DfromVertexCol=flow_out_id
-DtoVertexCol=flow_in_id
-DoutputTableName=${o1}
-DmaxEdgeCnt=500;
Right-click the SQL Script component and choose View Data > SQL Script Output to view the training results.
| node | node_cnt | edge_cnt | density | log_density |
| ---- | -------- | -------- | ------- | ----------- |
| 1 | 5 | 4 | 0.4 | 1.45657 |
| 2 | 2 | 1 | 1.0 | 1.24696 |
| 3 | 3 | 2 | 0.66667 | 1.35204 |
| 4 | 3 | 2 | 0.66667 | 1.35204 |
| 5 | 4 | 3 | 0.5 | 1.41189 |
| 6 | 3 | 2 | 0.66667 | 1.35204 |
| 7 | 2 | 1 | 1.0 | 1.24696 |
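In this output, node_cnt is the number of neighbors of a vertex, edge_cnt is the number of edges between those neighbors, and density is consistent with 2 × edge_cnt / (node_cnt × (node_cnt − 1)). For example, vertex 1 has the five neighbors 2, 3, 4, 5, and 6, which are connected by the four edges 2-3, 3-4, 4-5, and 5-6, so its density is 2 × 4 / (5 × 4) = 0.4. The following Python sketch is not part of the component; it rebuilds the example graph from the training data and reproduces node_cnt, edge_cnt, and density for each vertex (log_density is omitted).

```python
from itertools import combinations

# Edge list from the training data that is generated in this example (undirected).
edges = [("1", "2"), ("1", "3"), ("1", "4"), ("1", "5"), ("1", "6"),
         ("2", "3"), ("3", "4"), ("4", "5"), ("5", "6"), ("5", "7"),
         ("6", "7")]

# Build an adjacency map that treats every edge as undirected.
adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

for node in sorted(adjacency):
    neighbors = adjacency[node]
    node_cnt = len(neighbors)
    # Count the neighbor pairs that are directly connected by an edge.
    edge_cnt = sum(b in adjacency[a] for a, b in combinations(sorted(neighbors), 2))
    density = 2.0 * edge_cnt / (node_cnt * (node_cnt - 1)) if node_cnt > 1 else 0.0
    print(node, node_cnt, edge_cnt, round(density, 5))
```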