The PMI algorithm component of Platform for AI (PAI) is used to count the co-occurrence of all words in several documents and calculate the pointwise mutual information (PMI). This topic describes how to configure the PMI algorithm component.
Background information
In information theory, mutual information (MI) measures the amount of information that one random variable contains about another random variable. Equivalently, it is the reduction in uncertainty about one random variable when the other is known.
PMI quantifies the relevance between two words. Definition: PMI(x,y) = ln(p(x,y) / (p(x)p(y))) = ln(#(x,y) * D / (#x * #y)). In the definition, #(x,y) indicates the number of occurrences of the pair (x,y), #x and #y indicate the counts of x and y, and D indicates the total number of pairs. Each time x and y appear in the same window, the counters are updated as follows: #x+=1, #y+=1, and #(x,y)+=1. For more information about PMI, see PMI.
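As a quick check of the definition, the following Python sketch evaluates the formula directly from counts. The function name pmi_from_counts and its argument names are illustrative assumptions and are not part of PAI.

```python
import math


def pmi_from_counts(pair_count, x_count, y_count, total_pairs):
    """Evaluate PMI(x,y) = ln(#(x,y) * D / (#x * #y)) from raw counts."""
    if min(pair_count, x_count, y_count, total_pairs) <= 0:
        raise ValueError("all counts must be positive")
    return math.log(pair_count * total_pairs / (x_count * y_count))


# A pair seen once among D = 32 pairs, where each word was counted twice:
# ln(1 * 32 / (2 * 2)) = ln(8), which is about 2.079.
example = pmi_from_counts(pair_count=1, x_count=2, y_count=2, total_pairs=32)

# When #(x,y) * D equals #x * #y, the ratio is 1 and the PMI is 0.
neutral = pmi_from_counts(pair_count=1, x_count=4, y_count=8, total_pairs=32)
```

A positive value means that the pair occurs more often than the individual word counts would suggest, and a negative value means that it occurs less often.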
Configure the component
You can use one of the following methods to configure the PMI component:
Method 1: Configure the component in the PAI console
You can configure the parameters of the PMI component on the pipeline page of Machine Learning Designer.
Tab | Parameter | Description |
Fields Setting | Columns of Documents with Words Separated with Spaces | N/A |
Parameters Setting | Minimum Frequency of Words | Words that appear fewer times than this value are filtered out. Default value: 5. |
Parameters Setting | Window Size | The window size. For example, a value of 5 means that the window covers the five words to the right of the current word. Words that appear in the window are considered related to the current word. |
Tuning | Computing Cores | The number of cores used for calculation. By default, the system determines the value. |
Tuning | Memory Size per Core (Unit: MB) | The memory size of each core. By default, the system determines the value. |
Method 2: Configure the parameters by using PAI commands
The following section describes the parameters. You can use SQL scripts to call PAI commands. For more information, see SQL Script.
PAI -name PointwiseMutualInformation
-project algo_public
-DinputTableName=maple_test_pmi_basic_input
-DdocColName=doc
-DoutputTableName=maple_test_pmi_basic_output
-DminCount=0
-DwindowSize=2
-DcoreNum=1
-DmemSizePerCore=110;
Parameter | Required | Description | Default value |
inputTableName | Yes | The name of the input table. | N/A |
outputTableName | Yes | The name of the output table. | N/A |
docColName | Yes | The name of the document column after word segmentation, in which words are separated with spaces. | N/A |
windowSize | No | The window size. For example, a value of 5 means that the window covers the five words to the right of the current word. Words that appear in the window are considered related to the current word. | All content in a row |
minCount | No | The minimum frequency of words for truncation. Words that appear fewer times than this value are filtered out. | 5 |
inputTablePartitions | No | The partitions selected from the input table for training, which are in the Partition_name=value format. To specify multi-level partitions, use the following format: name1=value1/name2=value2. Separate multiple partitions with commas (,). | All partitions |
lifecycle | No | The lifecycle of the output table. | N/A |
coreNum | No | The number of cores used for calculation. Valid values: [1,9999]. | Determined by the system |
memSizePerCore | No | The memory size of each core. Unit: MB. Valid values: [1024,65536]. | Determined by the system |
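If you generate PAI commands from a script, the parameter table can be mirrored in a small helper. The following Python sketch is an illustration only: build_pmi_command is a hypothetical function, not a PAI API. It only assembles the command text, and it does not validate values or submit anything.

```python
def build_pmi_command(input_table, doc_col, output_table, **optional):
    """Return a PAI command string for PointwiseMutualInformation.

    Optional parameters that are not passed are omitted from the command,
    so the defaults listed in the parameter table apply.
    """
    allowed = ("windowSize", "minCount", "inputTablePartitions",
               "lifecycle", "coreNum", "memSizePerCore")
    unknown = sorted(set(optional) - set(allowed))
    if unknown:
        raise ValueError(f"unsupported parameters: {unknown}")
    parts = [
        "PAI -name PointwiseMutualInformation",
        "-project algo_public",
        f"-DinputTableName={input_table}",
        f"-DdocColName={doc_col}",
        f"-DoutputTableName={output_table}",
    ]
    # Emit optional parameters in the same order as the parameter table.
    parts += [f"-D{name}={optional[name]}" for name in allowed if name in optional]
    return " ".join(parts) + ";"


command = build_pmi_command(
    "maple_test_pmi_basic_input", "doc", "maple_test_pmi_basic_output",
    minCount=0, windowSize=2,
)
```

The resulting string can then be pasted into an SQL script, in the same way as a handwritten command.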
Sample command
Input
Create a table named maple_test_pmi_basic_input by using the ODPS SQL node. For more information, see Develop a MaxCompute SQL task. Sample command:
create table maple_test_pmi_basic_input as
select * from
(
  select "w1 w2 w3 w4 w5 w6 w7 w8 w8 w9" as doc
  union all
  select "w1 w3 w5 w6 w9" as doc
  union all
  select "w0" as doc
  union all
  select "w0 w0" as doc
  union all
  select "w9 w1 w9 w1 w9" as doc
) tmp;
Sample data in the maple_test_pmi_basic_input table after you run the command:
doc
w1 w2 w3 w4 w5 w6 w7 w8 w8 w9
w1 w3 w5 w6 w9
w0
w0 w0
w9 w1 w9 w1 w9
Run the PAI command
You can use an SQL script component or an ODPS SQL node to run the following PAI command.
PAI -name PointwiseMutualInformation
-project algo_public
-DinputTableName=maple_test_pmi_basic_input
-DdocColName=doc
-DoutputTableName=maple_test_pmi_basic_output
-DminCount=0
-DwindowSize=2
-DcoreNum=1
-DmemSizePerCore=110;
Output
Sample output table maple_test_pmi_basic_output:
word1 | word2 | word1_count | word2_count | co_occurrences_count | pmi |
w0 | w0 | 2 | 2 | 1 | 2.0794415416798357 |
w1 | w1 | 10 | 10 | 1 | -1.1394342831883648 |
w1 | w2 | 10 | 3 | 1 | 0.06453852113757116 |
w1 | w3 | 10 | 7 | 2 | -0.08961215868968704 |
w1 | w5 | 10 | 8 | 1 | -0.916290731874155 |
w1 | w9 | 10 | 12 | 4 | 0.06453852113757116 |
w2 | w3 | 3 | 7 | 1 | 0.4212134650763035 |
w2 | w4 | 3 | 4 | 1 | 0.9808292530117262 |
w3 | w4 | 7 | 4 | 1 | 0.13353139262452257 |
w3 | w5 | 7 | 8 | 2 | 0.13353139262452257 |
w3 | w6 | 7 | 7 | 1 | -0.42608439531090014 |
w4 | w5 | 4 | 8 | 1 | 0.0 |
w4 | w6 | 4 | 7 | 1 | 0.13353139262452257 |
w5 | w6 | 8 | 7 | 2 | 0.13353139262452257 |
w5 | w7 | 8 | 4 | 1 | 0.0 |
w5 | w9 | 8 | 12 | 1 | -1.0986122886681098 |
w6 | w7 | 7 | 4 | 1 | 0.13353139262452257 |
w6 | w8 | 7 | 7 | 1 | -0.42608439531090014 |
w6 | w9 | 7 | 12 | 1 | -0.9650808960435872 |
w7 | w8 | 4 | 7 | 2 | 0.8266785731844679 |
w8 | w8 | 7 | 7 | 1 | -0.42608439531090014 |
w8 | w9 | 7 | 12 | 2 | -0.2719337154836418 |
w9 | w9 | 12 | 12 | 2 | -0.8109302162163288 |
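The sample output appears consistent with a simple counting rule, sketched below in Python. The rule is an assumption inferred from the sample data, not a description of how PAI implements the component: each word is paired with up to windowSize words to its right, each pair is stored in sorted order, and both word counters are incremented once per pair. The helper name pmi_table is hypothetical, and minCount filtering is not modeled because the sample uses minCount=0.

```python
import math
from collections import Counter


def pmi_table(docs, window_size):
    """Count windowed word pairs and return rows shaped like the output table."""
    word_count = Counter()   # #x: incremented once per pair that contains x
    pair_count = Counter()   # #(x,y): keyed by the sorted pair
    total_pairs = 0          # D: total number of pairs
    for doc in docs:
        words = doc.split()
        for i, x in enumerate(words):
            # Pair x with at most window_size words to its right.
            for y in words[i + 1 : i + 1 + window_size]:
                word_count[x] += 1
                word_count[y] += 1
                pair_count[tuple(sorted((x, y)))] += 1
                total_pairs += 1
    rows = []
    for (w1, w2), n in sorted(pair_count.items()):
        pmi = math.log(n * total_pairs / (word_count[w1] * word_count[w2]))
        rows.append((w1, w2, word_count[w1], word_count[w2], n, pmi))
    return rows


docs = [
    "w1 w2 w3 w4 w5 w6 w7 w8 w8 w9",
    "w1 w3 w5 w6 w9",
    "w0",
    "w0 w0",
    "w9 w1 w9 w1 w9",
]
rows = pmi_table(docs, window_size=2)
for row in rows:
    print(*row, sep=" | ")
```

Under these assumptions, the five sample documents produce 32 pairs in total (D = 32) and 23 distinct word pairs, which agrees with the counts and PMI values in the sample output table.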
References
For more information about Machine Learning Designer components, see Overview of Machine Learning Designer.
Machine Learning Designer provides various preset algorithm components. You can select a component for data processing based on your business scenarios. For more information, see Component reference: overview of all components.