This topic describes the Conditional Random Field component provided by Machine Learning Designer (formerly known as Machine Learning Studio).
A conditional random field (CRF) is a conditional probability distribution model of a group of output random variables given a group of input random variables. The model assumes that the output random variables constitute a Markov random field (MRF). CRFs can be used in different prediction scenarios. The linear-chain CRF is the most commonly used variant, especially in sequence annotation scenarios. For more information, see Wikipedia.
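The conditional distribution described above can be illustrated with a minimal sketch of a linear-chain CRF. The weights, feature names, and label set below are hypothetical values chosen for illustration and do not come from the component. The normalization is computed by brute-force enumeration, which is practical only for toy inputs; real implementations use dynamic programming.

```python
import itertools
import math

LABELS = ["B-NP", "I-NP", "O"]

# Hypothetical weights, chosen only for illustration.
EMISSION = {("NNP", "B-NP"): 1.0, ("NNP", "I-NP"): 0.8, (".", "O"): 2.0}
TRANSITION = {("B-NP", "I-NP"): 1.5, ("I-NP", "I-NP"): 1.0, ("O", "I-NP"): -2.0}


def score(features, labels):
    """Unnormalized log-score: sum of emission and transition weights."""
    total = 0.0
    for t, (feat, lab) in enumerate(zip(features, labels)):
        total += EMISSION.get((feat, lab), 0.0)
        if t > 0:
            total += TRANSITION.get((labels[t - 1], lab), 0.0)
    return total


def conditional_probability(features, labels):
    """p(labels | features), normalized over every possible label sequence."""
    z = sum(
        math.exp(score(features, cand))
        for cand in itertools.product(LABELS, repeat=len(features))
    )
    return math.exp(score(features, labels)) / z


features = ["NNP", "NNP", "."]
probs = {
    cand: conditional_probability(features, cand)
    for cand in itertools.product(LABELS, repeat=len(features))
}
best = max(probs, key=probs.get)
print(best)
```

Because each label's score depends only on the input and on the previous label, the output variables form a chain-structured Markov random field.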
Configure the component
You can use one of the following methods to configure the Conditional Random Field component.
Method 1: Configure the component on the pipeline page
You can configure the parameters of the Conditional Random Field component on the pipeline page of Machine Learning Designer in Machine Learning Platform for AI (PAI). The following table describes the parameters.
Tab | Parameter | Description |
Fields Setting | ID Columns | The column that contains the ID of each sample. Samples are stored in n-tuples. |
Fields Setting | Feature Columns | The columns that contain the word to be annotated and, if any, its features. |
Fields Setting | Target Columns | The column that contains the label of each word. |
Parameters Setting | Feature Generation Template | The template that is used to generate features. A default template is provided. |
Parameters Setting | Infrequently Used Word Filtering Threshold | Default value: 1. |
Parameters Setting | L1 Regularization Coefficient | Default value: 1. |
Parameters Setting | L2 Regularization Coefficient | Default value: 0. |
Parameters Setting | Maximum Iterations | Default value: 100. |
Parameters Setting | Convergence Threshold | Default value: 0.00001. |
Tuning | Cores | The number of cores. By default, the system determines the value. |
Tuning | Memory Size per Core | The memory size of each core. By default, the system determines the value. |
Method 2: Use PAI commands
Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name=linearcrf
-project=algo_public
-DinputTableName=crf_input_table
-DidColName=sentence_id
-DfeatureColNames=word,f1
-DlabelColName=label
-DoutputTableName=crf_model
-Dlifecycle=28
-DcoreNum=10
Parameter | Required | Description | Default value |
inputTableName | Yes | The table that contains the input features. | No default value |
inputTablePartitions | No | The partitions selected from the table that contains the input features. | All partitions |
featureColNames | No | The feature columns selected from the input table. | All columns, excluding the label column |
labelColName | Yes | The label column in the input table. | No default value |
idColName | Yes | The column that contains the ID of each sample. | No default value |
outputTableName | Yes | The table that contains output models. | No default value |
outputTablePartitions | No | The partitions selected from the output model table. | All partitions |
template | No | The template that is used to generate features. | |
freq | No | The threshold for filtering infrequent features. Only features whose frequency is greater than or equal to the freq value are retained. | 1 |
iterations | No | The maximum number of optimization iterations. | 100 |
l1Weight | No | The L1 regularization coefficient. | 1.0 |
l2Weight | No | The L2 regularization coefficient. | 1.0 |
epsilon | No | The convergence threshold. The Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) optimization stops when the difference between the log-likelihood values of two consecutive iterations falls below this value. | 0.0001 |
lbfgsStep | No | The number of historical steps that the L-BFGS algorithm retains for optimization. Only the L-BFGS algorithm supports this parameter. | 10 |
threadNum | No | The number of parallel threads used for model training. | 3 |
lifecycle | No | The lifecycle of the output table. | No default value |
coreNum | No | The number of cores. | Determined by the system |
memSizePerCore | No | The memory size of each core. | Determined by the system |
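When the command is generated from a script, a small helper can assemble it from the parameters in the preceding table. The following is a minimal sketch: `build_pai_command` is a hypothetical helper and not part of PAI, and it only formats text; it does not validate parameter names or submit the command.

```python
def build_pai_command(name, params, project="algo_public"):
    """Render a PAI command string from a parameter mapping (hypothetical helper)."""
    parts = [f"PAI -name={name}", f"-project={project}"]
    for key, value in params.items():
        if value is None:
            continue  # skip unset parameters so that their defaults apply
        if isinstance(value, (list, tuple)):
            value = ",".join(map(str, value))  # for example, featureColNames=word,f1
        parts.append(f"-D{key}={value}")
    return "\n".join(parts)


train_cmd = build_pai_command(
    "linearcrf",
    {
        "inputTableName": "crf_input_table",
        "idColName": "sentence_id",
        "featureColNames": ["word", "f1"],
        "labelColName": "label",
        "outputTableName": "crf_model",
        "l1Weight": None,  # left unset on purpose
        "lifecycle": 28,
        "coreNum": 10,
    },
)
print(train_cmd)
```

The printed text has the same layout as the training command shown above, with one parameter per line.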
Example
Input data
sentence_id | word | f1 | label |
1 | Rockwell | NNP | B-NP |
1 | International | NNP | I-NP |
1 | Corp | NNP | I-NP |
1 | 's | POS | B-NP |
... | ... | ... | ... |
823 | Ohio | NNP | B-NP |
823 | grew | VBD | B-VP |
823 | 3.8 | CD | B-NP |
823 | % | NN | I-NP |
823 | . | . | O |
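The input table stores one token per row, and all rows of a sentence share the same sentence_id. The following minimal sketch shows one way to flatten tagged sentences into that layout before loading them into a table. The function name `sentences_to_rows` is hypothetical, the tokens are fragments taken from the sample above, and the IDs are assigned sequentially for illustration.

```python
def sentences_to_rows(sentences, first_id=1):
    """Flatten tagged sentences into (sentence_id, word, f1, label) rows."""
    rows = []
    for sentence_id, sentence in enumerate(sentences, start=first_id):
        for word, pos_tag, label in sentence:
            rows.append((sentence_id, word, pos_tag, label))
    return rows


# Token fragments from the sample table; sentence IDs here are illustrative.
sentences = [
    [
        ("Rockwell", "NNP", "B-NP"),
        ("International", "NNP", "I-NP"),
        ("Corp", "NNP", "I-NP"),
        ("'s", "POS", "B-NP"),
    ],
    [("Ohio", "NNP", "B-NP"), ("grew", "VBD", "B-VP")],
]

rows = sentences_to_rows(sentences)
for row in rows:
    print(row)
```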
Prediction algorithm PAI command
PAI -name=crf_predict
-project=algo_public
-DinputTableName=crf_test_input_table
-DmodelTableName=crf_model
-DidColName=sentence_id
-DfeatureColNames=word,f1
-DlabelColName=label
-DoutputTableName=crf_predict_result
-DdetailColName=prediction_detail
-Dlifecycle=28
-DcoreNum=10
Parameter | Required | Description | Default value |
inputTableName | Yes | The table that contains the input features. | No default value |
inputTablePartitions | No | The partitions selected from the table that contains the input features. | All partitions |
featureColNames | No | The feature columns selected from the input table. | All columns, excluding the label column |
labelColName | No | The label column in the input table. | No default value |
idColName | Yes | The column that contains the ID of each sample. | No default value |
resultColName | No | The result column in the output table. | prediction_result |
scoreColName | No | The score column in the output table. | prediction_score |
detailColName | No | The detail column in the output table. | No default value |
outputTableName | Yes | The output prediction result table. | No default value |
outputTablePartitions | No | The partitions selected from the output prediction result table. | All partitions |
modelTableName | Yes | The algorithm model table. | No default value |
modelTablePartitions | No | The partitions selected from the algorithm model table. | All partitions |
lifecycle | No | The lifecycle of the output table. | No default value |
coreNum | No | The number of cores. | Determined by the system |
memSizePerCore | No | The memory size of each core. | Determined by the system |
Output data
sentence_id | word | f1 | label |
1 | Confidence | NN | B-NP |
1 | in | IN | B-PP |
1 | the | DT | B-NP |
1 | pound | NN | I-NP |
... | ... | ... | ... |
77 | have | VBP | B-VP |
77 | announced | VBN | I-VP |
77 | similar | JJ | B-NP |
77 | increases | NNS | I-NP |
77 | . | . | O |
Note: The label column is optional.
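Labels such as B-NP and I-NP in the sample data appear to follow the common IOB chunking convention, in which B- begins a chunk, I- continues it, and O marks a token outside any chunk. Under that assumption, the following minimal sketch groups rows by sentence_id and merges tagged tokens into chunks. The function `extract_chunks` and its `label_index` argument are hypothetical. To process prediction results, point `label_index` at the column named by resultColName.

```python
from itertools import groupby


def extract_chunks(rows, label_index=3):
    """Group rows by sentence_id and merge B-/I- tagged tokens into chunks.

    Rows of the same sentence must be adjacent and in token order.
    """
    chunks = {}
    for sentence_id, group in groupby(rows, key=lambda row: row[0]):
        current_type, current_words, found = None, [], []
        for row in group:
            word, label = row[1], row[label_index]
            prefix, _, chunk_type = label.partition("-")
            if prefix == "I" and chunk_type == current_type:
                current_words.append(word)  # continue the open chunk
                continue
            if current_words:
                found.append((current_type, " ".join(current_words)))
            if prefix in ("B", "I"):
                current_type, current_words = chunk_type, [word]  # start a chunk
            else:
                current_type, current_words = None, []  # "O": outside any chunk
        if current_words:
            found.append((current_type, " ".join(current_words)))
        chunks[sentence_id] = found
    return chunks


rows = [
    (1, "Confidence", "NN", "B-NP"),
    (1, "in", "IN", "B-PP"),
    (1, "the", "DT", "B-NP"),
    (1, "pound", "NN", "I-NP"),
    (77, "have", "VBP", "B-VP"),
    (77, "announced", "VBN", "I-VP"),
    (77, "similar", "JJ", "B-NP"),
    (77, "increases", "NNS", "I-NP"),
    (77, ".", ".", "O"),
]
chunks = extract_chunks(rows)
print(chunks)
```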