All Products
Search
Document Center

Platform For AI:Vertex Clustering Coefficient

Last Updated:Nov 28, 2024

The Vertex Clustering Coefficient is a metric in network analysis used to quantify the degree of clustering among a node's neighbors. Specifically, it represents the ratio of the actual number of edges present among a node's neighbors to the total possible number of edges that could exist between them. The value of this coefficient ranges from 0 to 1, with a higher value indicating a tighter connection among the neighbors of the node, reflecting the local clustering characteristics within the network.

Algorithm description

In an undirected graph, the clustering coefficient of a node represents the density of connections around that node. A star network has a density of 0, while a fully connected network has a density of 1.

In network analysis, star networks and fully connected networks are two typical network topologies:

  • Star network: This structure consists of a central node and multiple peripheral nodes, with all peripheral nodes connected only to the central node. The characteristic of a star network is that the clustering coefficient of the central node is 0 because its neighbors (the peripheral nodes) do not have direct connections with each other.

  • Fully connected network: In this structure, every node is directly connected to all other nodes. The characteristic of a fully connected network is that all nodes have a clustering coefficient of 1 because all possible connections exist between each node and its neighbors.

These two structures represent the extreme cases in network topology, with the star network having the lowest local clustering and the fully connected network having the highest local clustering.

Configure the component

Method 1: Configure the component on the pipeline page

Add a Vertex Clustering Coefficient component on the pipeline page and configure the following parameters:

Category

Parameter

Description

Fields Setting

Start Vertex

The start vertex column in the edge table.

End Vertex

The end vertex column in the edge table.

Parameters Setting

Largest Vertex Degree

If the vertex degree is larger than the value of this parameter, sampling is required. Default value: 500.

Tuning

Workers

The number of vertices for parallel job execution. The degree of parallelism and framework communication costs increase with the value of this parameter.

Memory Size per Worker (MB)

The maximum size of memory that a single job can use. Unit: MB. Default value: 4096.

If the size of used memory exceeds the value of this parameter, the OutOfMemory error is reported.

Data Split Size (MB)

The data split size. Unit: MB. Default value: 64.

Method 2: Configure the component by using PAI commands

You can configure the Vertex Clustering Coefficient component by using PAI commands. You can use the SQL Script component to run PAI commands. For more information, see Scenario 4: Execute PAI commands within the SQL script component in the "SQL Script" topic.

PAI -name NodeDensity
    -project algo_public
    -DinputEdgeTableName=NodeDensity_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=NodeDensity_func_test_result
    -DmaxEdgeCnt=500;

Parameter

Required

Default value

Description

inputEdgeTableName

Yes

No default value

The name of the input edge table.

inputEdgeTablePartitions

No

Full table

The partitions in the input edge table.

fromVertexCol

Yes

No default value

The start vertex column in the input edge table.

toVertexCol

Yes

No default value

The end vertex column in the input edge table.

outputTableName

Yes

No default value

The name of the output table.

outputTablePartitions

No

No default value

The partitions in the output table.

lifecycle

No

No default value

The lifecycle of the output table.

maxEdgeCnt

No

500

If the vertex degree is greater than the value of this parameter, sampling is required.

workerNum

No

No default value

The number of vertices for parallel job execution. The degree of parallelism and framework communication costs increase with the value of this parameter.

workerMem

No

4096

The maximum size of memory that a single job can use. Unit: MB. Default value: 4096.

If the size of used memory exceeds the value of this parameter, the OutOfMemory error is reported.

splitSize

No

64

The data split size. Unit: MB.

Example

  1. Add a SQL Script component. Deselect Use Script Mode and Whether the system adds a create table statement. Enter the following SQL statements.

    drop table if exists NodeDensity_func_test_edge;
    create table NodeDensity_func_test_edge as
    select * from
    (
      select '1' as flow_out_id, '2' as flow_in_id
      union all
      select '1' as flow_out_id, '3' as flow_in_id
      union all
      select '1' as flow_out_id, '4' as flow_in_id
      union all
      select '1' as flow_out_id, '5' as flow_in_id
      union all
      select '1' as flow_out_id, '6' as flow_in_id
      union all
      select '2' as flow_out_id, '3' as flow_in_id
      union all
      select '3' as flow_out_id, '4' as flow_in_id
      union all
      select '4' as flow_out_id, '5' as flow_in_id
      union all
      select '5' as flow_out_id, '6' as flow_in_id
      union all
      select '5' as flow_out_id, '7' as flow_in_id
      union all
      select '6' as flow_out_id, '7' as flow_in_id
    )tmp;
    drop table if exists NodeDensity_func_test_result;
    create table NodeDensity_func_test_result
    (
      node string,
      node_cnt bigint,
      edge_cnt bigint,
      density double,
      log_density double
    );

    Data structure

    image

  2. Add a SQL Script component. Deselect Use Script Mode and Whether the system adds a create table statement. Enter the following PAI commands and connect the two SQL Script components.

    drop table if exists ${o1};
    PAI -name NodeDensity
        -project algo_public
        -DinputEdgeTableName=NodeDensity_func_test_edge
        -DfromVertexCol=flow_out_id
        -DtoVertexCol=flow_in_id
        -DoutputTableName=${o1}
        -DmaxEdgeCnt=500;
  3. Click image in the upper left corner to run the pipeline.

  4. Right-click the SQL Script component created in Step 2 and choose View Data > SQL Script Output to view the training results.

    | node | node_cnt | edge_cnt | density | log_density |
    | ---- | -------- | -------- | ------- | ----------- |
    | 1    | 5        | 4        | 0.4     | 1.45657     |
    | 2    | 2        | 1        | 1.0     | 1.24696     |
    | 3    | 3        | 2        | 0.66667 | 1.35204     |
    | 4    | 3        | 2        | 0.66667 | 1.35204     |
    | 5    | 4        | 3        | 0.5     | 1.41189     |
    | 6    | 3        | 2        | 0.66667 | 1.35204     |
    | 7    | 2        | 1        | 1.0     | 1.24696     |