The Maximum Connected Subgraph algorithm identifies the largest connected component of an undirected graph, that is, the largest set of nodes in which every pair of nodes is connected by a path. For example, in a graph with the edges 1-2, 2-3, and a-b, the connected components are {1, 2, 3} and {a, b}, and the maximum connected subgraph is {1, 2, 3}. The algorithm is commonly used in scenarios such as network analysis and image processing. It traverses the graph by using depth-first search (DFS) or breadth-first search (BFS), identifies all connected components, and returns the component that contains the largest number of nodes.
Configure the component
Method 1: Configure the component on the pipeline page
Configure the parameters of the Maximum Connected Subgraph component on the pipeline page of Machine Learning Designer in the Platform for AI (PAI) console. The following table describes the parameters.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Start Vertex | The start vertex column in the edge table. |
| Fields Setting | End Node | The end vertex column in the edge table. |
| Tuning | Workers | The number of workers that run the job in parallel. The degree of parallelism and the framework communication overhead increase with the value of this parameter. |
| Tuning | Memory Size per Worker (MB) | The maximum amount of memory that a single worker can use. Unit: MB. Default value: 4096. If the memory used by a worker exceeds this value, an OutOfMemory error occurs and the job fails. |
| Tuning | Data Split Size (MB) | The size of each data split. Unit: MB. Default value: 64. |
Method 2: Use PAI commands
Configure the parameters of the Maximum Connected Subgraph component by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see Scenario 4: Execute PAI commands within the SQL script component.
PAI -name MaximalConnectedComponent
-project algo_public
-DinputEdgeTableName=MaximalConnectedComponent_func_test_edge
-DfromVertexCol=flow_out_id
-DtoVertexCol=flow_in_id
-DoutputTableName=MaximalConnectedComponent_func_test_result;
| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| inputEdgeTableName | Yes | No default value | The name of the input edge table. |
| inputEdgeTablePartitions | No | Full table | The partitions of the input edge table that are read. |
| fromVertexCol | Yes | No default value | The start vertex column in the input edge table. |
| toVertexCol | Yes | No default value | The end vertex column in the input edge table. |
| outputTableName | Yes | No default value | The name of the output table. |
| outputTablePartitions | No | No default value | The partitions of the output table to which the results are written. |
| lifecycle | No | No default value | The lifecycle of the output table, in days. |
| workerNum | No | No default value | The number of workers that run the job in parallel. The degree of parallelism and the framework communication overhead increase with the value of this parameter. |
| workerMem | No | 4096 | The maximum amount of memory that a single worker can use. Unit: MB. If the memory used by a worker exceeds this value, an OutOfMemory error occurs and the job fails. |
| splitSize | No | 64 | The size of each data split. Unit: MB. |
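For reference, the following command is a minimal sketch that shows how the optional parameters can be combined with the required ones. It reuses the table and column names from the example in the next section, and the values of the tuning parameters are illustrative only, not recommendations.
PAI -name MaximalConnectedComponent
-project algo_public
-DinputEdgeTableName=MaximalConnectedComponent_func_test_edge
-DfromVertexCol=flow_out_id
-DtoVertexCol=flow_in_id
-DoutputTableName=MaximalConnectedComponent_func_test_result
-Dlifecycle=28
-DworkerNum=4
-DworkerMem=4096
-DsplitSize=64;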
Example
Add the SQL Script component as a node to the canvas and execute the following SQL statements to generate training data.
drop table if exists MaximalConnectedComponent_func_test_edge;
create table MaximalConnectedComponent_func_test_edge as
select * from (
    select '1' as flow_out_id, '2' as flow_in_id
    union all
    select '2' as flow_out_id, '3' as flow_in_id
    union all
    select '3' as flow_out_id, '4' as flow_in_id
    union all
    select '1' as flow_out_id, '4' as flow_in_id
    union all
    select 'a' as flow_out_id, 'b' as flow_in_id
    union all
    select 'b' as flow_out_id, 'c' as flow_in_id
) tmp;
drop table if exists MaximalConnectedComponent_func_test_result;
create table MaximalConnectedComponent_func_test_result
(
    node   string,
    grp_id string
);
Data structure
The generated edge table describes two connected components: nodes 1, 2, 3, and 4 form one component, and nodes a, b, and c form the other.
Add the SQL Script component as a node to the canvas and run the following PAI command to execute the Maximum Connected Subgraph algorithm.
drop table if exists ${o1};
PAI -name MaximalConnectedComponent
-project algo_public
-DinputEdgeTableName=MaximalConnectedComponent_func_test_edge
-DfromVertexCol=flow_out_id
-DtoVertexCol=flow_in_id
-DoutputTableName=${o1};
Right-click the SQL Script component and choose View Data > SQL Script Output to view the results.
| node | grp_id |
| ---- | ------ |
| a | c |
| b | c |
| c | c |
| 1 | 4 |
| 2 | 4 |
| 3 | 4 |
| 4 | 4 |
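The output contains one row for each node, and the grp_id column identifies the connected component that the node belongs to: nodes a, b, and c form one component, and nodes 1, 2, 3, and 4 form the larger one. To find the maximum connected subgraph, you can aggregate the output. The following query is a sketch that assumes the results were written to the MaximalConnectedComponent_func_test_result table created earlier; in the pipeline above, the output table is referenced as ${o1}.
-- count the nodes in each connected component and keep the largest component
select grp_id, count(*) as node_cnt
from MaximalConnectedComponent_func_test_result
group by grp_id
order by node_cnt desc
limit 1;
For the sample data, this returns grp_id 4 with a node count of 4, which corresponds to the component that contains nodes 1, 2, 3, and 4.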