The PageRank algorithm measures the importance of a web page by analyzing hyperlinks: the score of a page depends on both the number and the quality of the links that point to it. More inbound links generally yield a higher rank, and links from highly weighted sources contribute more to the final PageRank score. The Page Rank component calculates the weight of each node.
Description
The PageRank algorithm analyzes the links to a web page to evaluate the relative importance of the web page. The PageRank algorithm works based on the following core principles:
A larger number of links from other web pages to a web page indicates higher importance or quality of the web page.
The PageRank algorithm counts the links from other web pages to a web page and takes the weights of those linking pages into account. The weight that a linking page contributes depends on its own PageRank score and on the number of links from it to other web pages.
The PageRank algorithm can also be applied to social networks. In a social network, the influence of a user is determined by the user's personal attributes and the quality of the user's social connections. For example, the influence of a Sina Weibo user on their followers depends on how close the user is to those followers. In most cases, a Sina Weibo user has more influence on family members, classmates, and colleagues. The edge weight reflects the closeness of the relationship between two users and serves as the relationship strength index.
PageRank formula that includes the link weight
W(i): the weight of Node i.
C(Ai): the link weight.
d: the damping coefficient.
W(A): the influence index of each user and the node weight after the algorithm iteration becomes stable.
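One form consistent with the symbol definitions above is the standard damped PageRank update with weighted links. It is given here as an assumption, not as the component's documented formula. The symbols N (the total number of nodes) and In(A) (the set of nodes that link to A) are introduced for this sketch, and C(A_i) is assumed to be the weight of the link from Node i to Node A, normalized so that the outgoing link weights of Node i sum to 1. Some presentations use (1 - d) in place of (1 - d)/N.

```latex
W(A) = \frac{1 - d}{N} + d \sum_{i \in \mathrm{In}(A)} W(i)\, C(A_i)
```

Iterating this update until W stops changing gives the stable node weights described above.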
Configure the component
Method 1: Configure the component on the pipeline page
Configure the parameters of the Page Rank component on the pipeline page of Machine Learning Designer in the Platform for AI (PAI) console. The following table describes the parameters.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Source Vertex Column | The start vertex column in the edge table. |
| Fields Setting | Target Vertex Column | The end vertex column in the edge table. |
| Fields Setting | Edge Weight Column | The edge weight column in the edge table. |
| Parameters Setting | Maximum Iterations | The maximum number of iterations that the algorithm runs. Default value: 30. |
| Parameters Setting | Damping Coefficient | The probability that a user continues browsing. |
| Tuning | Workers | The number of nodes for parallel job execution. The degree of parallelism and framework communication costs increase with the value of this parameter. |
| Tuning | Memory Size per Worker (MB) | The maximum size of memory that can be used by a job. Unit: MB. Default value: 4096. If the size of the used memory exceeds the value of this parameter, the |
Method 2: Use PAI commands
Configure the parameters of the Page Rank component by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see Scenario 4: Execute PAI commands within the SQL script component.
PAI -name PageRankWithWeight
-project algo_public
-DinputEdgeTableName=PageRankWithWeight_func_test_edge
-DfromVertexCol=flow_out_id
-DtoVertexCol=flow_in_id
-DoutputTableName=PageRankWithWeight_func_test_result
-DhasEdgeWeight=true
-DedgeWeightCol=weight
-DmaxIter=100;
| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| inputEdgeTableName | Yes | No default value | The name of the input edge table. |
| inputEdgeTablePartitions | No | Full table | The partitions in the input edge table. |
| fromVertexCol | Yes | No default value | The start vertex column in the input edge table. |
| toVertexCol | Yes | No default value | The end vertex column in the input edge table. |
| outputTableName | Yes | No default value | The name of the output table. |
| outputTablePartitions | No | No default value | The partitions in the output table. |
| lifecycle | No | No default value | The lifecycle of the output table. |
| workerNum | No | No default value | The number of nodes for parallel job execution. The degree of parallelism and framework communication costs increase with the value of this parameter. |
| workerMem | No | 4096 | The maximum size of memory that can be used by a job. Unit: MB. Default value: 4096. If the size of the used memory exceeds the value of this parameter, the |
| splitSize | No | 64 | The data split size. Unit: MB. |
| hasEdgeWeight | No | false | Specifies whether the edges in the input edge table have weights. |
| edgeWeightCol | No | No default value | The edge weight column in the input edge table. |
| maxIter | No | 30 | The maximum number of iterations. |
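When the command is generated from a script, the parameter table above can be treated as a small schema. The following is a minimal sketch of that idea: `build_pagerank_command` is a hypothetical helper, not part of PAI, and it only assembles the command text with the `-Dname=value` form used by the flags in the example. It checks that the required parameters are present and rejects names that are not in the table.

```python
# Parameter names are taken from the table above; the helper itself is hypothetical.
REQUIRED = ("inputEdgeTableName", "fromVertexCol", "toVertexCol", "outputTableName")
OPTIONAL = (
    "inputEdgeTablePartitions", "outputTablePartitions", "lifecycle",
    "workerNum", "workerMem", "splitSize",
    "hasEdgeWeight", "edgeWeightCol", "maxIter",
)


def build_pagerank_command(params):
    """Render a PAI command string from a dict of parameter names to values."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    unknown = [name for name in params if name not in REQUIRED + OPTIONAL]
    if unknown:
        raise ValueError(f"unknown parameters: {unknown}")

    lines = ["PAI -name PageRankWithWeight", "    -project algo_public"]
    for name in REQUIRED + OPTIONAL:
        if name in params:
            value = params[name]
            # Render Python booleans as the lowercase literals used in the example.
            if isinstance(value, bool):
                value = "true" if value else "false"
            lines.append(f"    -D{name}={value}")
    return "\n".join(lines) + ";"


command = build_pagerank_command({
    "inputEdgeTableName": "PageRankWithWeight_func_test_edge",
    "fromVertexCol": "flow_out_id",
    "toVertexCol": "flow_in_id",
    "outputTableName": "PageRankWithWeight_func_test_result",
    "hasEdgeWeight": True,
    "edgeWeightCol": "weight",
    "maxIter": 100,
})
print(command)
```

The generated text can then be pasted into the SQL Script component. The helper does not validate value types or ranges, because the table does not specify them.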
Example
Add the SQL Script component as a node to the canvas. Clear the Use Script Mode and Whether the system adds a create table statement options. Execute the following SQL statements to generate the input edge table.
drop table if exists PageRankWithWeight_func_test_edge;
create table PageRankWithWeight_func_test_edge as
select * from
(
  select 'a' as flow_out_id, 'b' as flow_in_id, 1.0 as weight
  union all
  select 'a' as flow_out_id, 'c' as flow_in_id, 1.0 as weight
  union all
  select 'b' as flow_out_id, 'c' as flow_in_id, 1.0 as weight
  union all
  select 'b' as flow_out_id, 'd' as flow_in_id, 1.0 as weight
  union all
  select 'c' as flow_out_id, 'd' as flow_in_id, 1.0 as weight
) tmp;
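Data structure: the edge table describes a directed graph with five edges (a to b, a to c, b to c, b to d, and c to d), each with a weight of 1.0.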
Add another SQL Script component as a node to the canvas. Clear the Use Script Mode and Whether the system adds a create table statement options. Connect the nodes in step 1 and 2, and then enter the following PAI command in the new node.
drop table if exists ${o1};
PAI -name PageRankWithWeight
    -project algo_public
    -DinputEdgeTableName=PageRankWithWeight_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=${o1}
    -DhasEdgeWeight=true
    -DedgeWeightCol=weight
    -DmaxIter=100;
Run the pipeline.
After the run is complete, right-click the component in step 2 and choose View Data > SQL Script Output to view the results.
| node | weight |
| ---- | ---------- |
| a | 0.12841452 |
| b | 0.18299069 |
| c | 0.26076174 |
| d | 0.42783305 |
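To make these numbers easier to interpret, here is a minimal sketch of weighted PageRank by power iteration on the same five edges. It is not the component's implementation, and it rests on three assumptions that the text does not state: a damping coefficient of 0.85, a teleport term of (1 - d)/N, and uniform redistribution of the score held by nodes with no outgoing edges (node d here). Under these assumptions the sketch converges to values that agree with the table above to roughly six decimal places.

```python
from collections import defaultdict


def weighted_pagerank(edges, damping=0.85, max_iter=100, tol=1e-12):
    """Power iteration over (source, target, weight) edges with positive weights.

    Assumptions (not confirmed by the component documentation):
    damping defaults to 0.85, and the score of dangling nodes is spread
    uniformly over all nodes on every iteration.
    """
    nodes = sorted({v for s, t, _ in edges for v in (s, t)})
    n = len(nodes)

    # Total outgoing weight per source, used to normalize each link weight.
    out_weight = defaultdict(float)
    for s, _, w in edges:
        out_weight[s] += w

    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Score currently held by nodes that have no outgoing edges.
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0.0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for s, t, w in edges:
            # Each source passes on its score in proportion to the link weight.
            new_rank[t] += damping * rank[s] * w / out_weight[s]
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


EDGES = [
    ("a", "b", 1.0),
    ("a", "c", 1.0),
    ("b", "c", 1.0),
    ("b", "d", 1.0),
    ("c", "d", 1.0),
]

scores = weighted_pagerank(EDGES)
for node, score in scores.items():
    print(node, round(score, 8))
```

Node d ranks highest because it receives links from both b and c and has no outgoing links of its own. Node a has no inbound links, so its score comes only from the teleport term and from the redistributed score of d.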