The Random Forest component is a classifier that consists of multiple decision trees. The classification result is determined by the mode of output classes of individual trees.
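The majority-vote rule can be illustrated with a minimal sketch. The lambda "trees" below are hypothetical stand-ins for trained decision trees, not PAI output; only the voting logic is the point.

```python
from collections import Counter

def forest_predict(trees, sample):
    """Return the class predicted by the forest: the mode
    (majority vote) of the individual trees' predictions."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical stand-in trees: each maps a sample to a class label.
trees = [
    lambda s: "good" if s["f1"] <= 3 else "bad",
    lambda s: "good" if s["f1"] in (2, 3) else "bad",
    lambda s: "bad" if s["f0"] <= 0.5 else "good",
]

print(forest_predict(trees, {"f0": 1, "f1": 4}))  # "bad": two of three trees vote bad
```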
Configure the component
You can use one of the following methods to configure the Random Forest component.
Method 1: Configure the component on the pipeline page
You can configure the parameters of the Random Forest component on the pipeline page of Machine Learning Designer of Machine Learning Platform for AI (PAI). Machine Learning Designer is formerly known as Machine Learning Studio. The following table describes the parameters.
Tab | Parameter | Description
Fields Setting | Feature Columns | The feature columns used for training. By default, all columns except the label column and the weight column are selected.
 | Excluded Columns | The columns that are excluded from training. These columns cannot be used as feature columns.
 | Forced Conversion Column | Comply with the following rules to parse columns: Note: To parse columns of the BIGINT type as categorical columns, you must use the forceCategorical parameter to specify the type.
 | Weight Column | The column that contains the weight of each row of samples. Columns of numeric data types are supported.
 | Label Column | The label column in the input table. Columns of the STRING type and of numeric data types are supported.
Parameters Setting | Number of Decision Trees in the Forest | The number of trees in the forest. Valid values: 1 to 1000.
 | Single Decision Tree Algorithm | The allocation of tree algorithms across the forest. If the forest has N trees and the value is [a,b], trees whose IDs are in [0,a) use the ID3 algorithm, trees whose IDs are in [a,b) use the CART algorithm, and trees whose IDs are in [b,N) use the C4.5 algorithm. For example, if a forest has five trees and the value is [2,4], trees 0 and 1 use the ID3 algorithm, trees 2 and 3 use the CART algorithm, and tree 4 uses the C4.5 algorithm. If the value is None, the tree algorithms are evenly allocated across the forest.
 | Number of Random Features for Each Decision Tree | The number of random features that are selected for each split. Valid values: [1,N], where N is the number of features.
 | Minimum Number of Leaf Nodes | Valid values: positive integers. Default value: 2.
 | Minimum Ratio of Leaf Nodes to Parent Nodes | Valid values: [0,1]. Default value: 0.
 | Maximum Decision Tree Depth | Valid values: [1,+∞). Default value: ∞.
 | Number of Random Data Inputs for Each Decision Tree | Valid values: (1000,1000000]. Default value: 100000.
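The [a,b] allocation rule for Single Decision Tree Algorithm can be sketched as follows. The helper name is illustrative, and the even three-way split used when no value is given is an assumption about how "evenly allocated" is implemented.

```python
def tree_algorithm(tree_id, n_trees, algorithm_types=None):
    """Map a tree's ID to its algorithm under the algorithmTypes=[a, b]
    convention: IDs in [0, a) use ID3, [a, b) use CART, [b, N) use C4.5.
    If algorithm_types is None, assume an even three-way allocation."""
    if algorithm_types is None:
        a = n_trees // 3
        b = 2 * n_trees // 3
    else:
        a, b = algorithm_types
    if tree_id < a:
        return "ID3"
    if tree_id < b:
        return "CART"
    return "C4.5"

# Five trees with algorithmTypes=[2, 4], as in the example above:
print([tree_algorithm(i, 5, [2, 4]) for i in range(5)])
# ['ID3', 'ID3', 'CART', 'CART', 'C4.5']
```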
Method 2: Use PAI commands
Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name randomforests
-project algo_public
-DinputTableName="pai_rf_test_input"
-DmodelName="pai_rf_test_model"
-DforceCategorical="f1"
-DlabelColName="class"
-DfeatureColNames="f0,f1"
-DmaxRecordSize="100000"
-DminNumPer="0"
-DminNumObj="2"
-DtreeNum="3";
Parameter | Required | Description | Default value |
inputTableName | Yes | The name of the input table. | N/A |
inputTablePartitions | No | The partitions that are selected from the input table for training. Specify this parameter in one of the following formats:
Note If you specify multiple partitions, separate these partitions with commas (,). | All partitions |
labelColName | Yes | The name of the label column that is selected from the input table. | N/A |
modelName | Yes | The name of the output model. | N/A |
treeNum | Yes | The number of trees in the forest. Valid values: 1 to 1000. | 100 |
excludedColNames | No | The columns that are not used for training. The columns cannot be used as feature columns. | Empty string |
weightColName | No | The name of the weight column in the input table. | N/A |
featureColNames | No | The feature columns that are selected from the input table for training. | All columns except the label columns specified by the labelColName parameter and weight column specified by the weightColName parameter. |
forceCategorical | No | Comply with the following rules to parse columns: Note: To parse columns of the BIGINT type as categorical columns, you must use the forceCategorical parameter to specify the type. | By default, columns of the INT type are parsed as continuous.
algorithmTypes | No | The positions of the tree algorithms in the forest. If the forest has N trees and the value is [a,b], trees whose IDs are in [0,a) use the ID3 algorithm, trees whose IDs are in [a,b) use the CART algorithm, and trees whose IDs are in [b,N) use the C4.5 algorithm. For example, if a forest has five trees and the value is [2,4], trees 0 and 1 use the ID3 algorithm, trees 2 and 3 use the CART algorithm, and tree 4 uses the C4.5 algorithm. If the value is None, the tree algorithms are evenly allocated across the forest. | Evenly allocated
randomColNum | No | The number of random features that are selected for each split when a single tree is generated. Valid values: [1,N], where N is the number of features. | log2(N)
minNumObj | No | The minimum amount of data on leaf nodes. The parameter value must be a positive integer. | 2 |
minNumPer | No | The minimum ratio of data on leaf nodes to data on a parent node. Valid values: [0,1]. | 0.0 |
maxTreeDeep | No | The maximum depth of a single tree. Valid values: [1,+∞). | ∞ |
maxRecordSize | No | The number of random data inputs for a tree. Valid values: (1000,1000000]. | 100000 |
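The documented value ranges can be checked locally before a command is submitted. The helper below is an illustrative sketch, not part of PAI; the `log2(N)` fallback mirrors the documented default for randomColNum.

```python
import math

def validate_rf_params(tree_num, random_col_num=None, min_num_obj=2,
                       min_num_per=0.0, max_record_size=100000, n_features=2):
    """Sanity-check Random Forest parameters against the documented ranges.
    Illustrative helper only; PAI performs its own validation."""
    assert 1 <= tree_num <= 1000, "treeNum must be in [1, 1000]"
    if random_col_num is None:
        # Documented default: log2(N), where N is the number of features.
        random_col_num = max(1, int(math.log2(n_features)))
    assert 1 <= random_col_num <= n_features, "randomColNum must be in [1, N]"
    assert min_num_obj >= 1, "minNumObj must be a positive integer"
    assert 0 <= min_num_per <= 1, "minNumPer must be in [0, 1]"
    assert 1000 < max_record_size <= 1000000, "maxRecordSize must be in (1000, 1000000]"
    return True

validate_rf_params(tree_num=3, min_num_per=0.0, min_num_obj=2)  # matches the command above
```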
Example
Execute the following SQL statements to generate training data:
create table pai_rf_test_input as
select * from (
    select 1 as f0, 2 as f1, "good" as class
    union all
    select 1 as f0, 3 as f1, "good" as class
    union all
    select 1 as f0, 4 as f1, "bad" as class
    union all
    select 0 as f0, 3 as f1, "good" as class
    union all
    select 0 as f0, 4 as f1, "bad" as class
) tmp;
Run the following PAI command to submit the parameters of the Random Forest component:
PAI -name randomforests
-project algo_public
-DinputTableName="pai_rf_test_input"
-DmodelName="pai_rf_test_model"
-DforceCategorical="f1"
-DlabelColName="class"
-DfeatureColNames="f0,f1"
-DmaxRecordSize="100000"
-DminNumPer="0"
-DminNumObj="2"
-DtreeNum="3";
View the Predictive Model Markup Language (PMML) of the model.
<?xml version="1.0" encoding="utf-8"?>
<PMML xmlns="http://www.dmg.org/PMML-4_2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="4.2" xsi:schemaLocation="http://www.dmg.org/PMML-4_2 http://www.dmg.org/v4-2/pmml-4-2.xsd">
  <Header copyright="Copyright (c) 2014, Alibaba Inc." description="">
    <Application name="ODPS/PMML" version="0.1.0"/>
    <Timestamp>Tue, 12 Jul 2016 07:04:48 GMT</Timestamp>
  </Header>
  <DataDictionary numberOfFields="2">
    <DataField name="f0" optype="continuous" dataType="integer"/>
    <DataField name="f1" optype="continuous" dataType="integer"/>
    <DataField name="class" optype="categorical" dataType="string">
      <Value value="bad"/>
      <Value value="good"/>
    </DataField>
  </DataDictionary>
  <MiningModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
    <MiningSchema>
      <MiningField name="f0" usageType="active"/>
      <MiningField name="f1" usageType="active"/>
      <MiningField name="class" usageType="target"/>
    </MiningSchema>
    <Segmentation multipleModelMethod="majorityVote">
      <Segment id="0">
        <True/>
        <TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
          <MiningSchema>
            <MiningField name="f0" usageType="active"/>
            <MiningField name="f1" usageType="active"/>
            <MiningField name="class" usageType="target"/>
          </MiningSchema>
          <Node id="1">
            <True/>
            <ScoreDistribution value="bad" recordCount="2"/>
            <ScoreDistribution value="good" recordCount="3"/>
            <Node id="2" score="good">
              <SimplePredicate field="f1" operator="equal" value="2"/>
              <ScoreDistribution value="good" recordCount="1"/>
            </Node>
            <Node id="3" score="good">
              <SimplePredicate field="f1" operator="equal" value="3"/>
              <ScoreDistribution value="good" recordCount="2"/>
            </Node>
            <Node id="4" score="bad">
              <SimplePredicate field="f1" operator="equal" value="4"/>
              <ScoreDistribution value="bad" recordCount="2"/>
            </Node>
          </Node>
        </TreeModel>
      </Segment>
      <Segment id="1">
        <True/>
        <TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
          <MiningSchema>
            <MiningField name="f0" usageType="active"/>
            <MiningField name="f1" usageType="active"/>
            <MiningField name="class" usageType="target"/>
          </MiningSchema>
          <Node id="1">
            <True/>
            <ScoreDistribution value="bad" recordCount="2"/>
            <ScoreDistribution value="good" recordCount="3"/>
            <Node id="2" score="good">
              <SimpleSetPredicate field="f1" booleanOperator="isIn">
                <Array n="2" type="integer">2 3</Array>
              </SimpleSetPredicate>
              <ScoreDistribution value="good" recordCount="3"/>
            </Node>
            <Node id="3" score="bad">
              <SimpleSetPredicate field="f1" booleanOperator="isNotIn">
                <Array n="2" type="integer">2 3</Array>
              </SimpleSetPredicate>
              <ScoreDistribution value="bad" recordCount="2"/>
            </Node>
          </Node>
        </TreeModel>
      </Segment>
      <Segment id="2">
        <True/>
        <TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
          <MiningSchema>
            <MiningField name="f0" usageType="active"/>
            <MiningField name="f1" usageType="active"/>
            <MiningField name="class" usageType="target"/>
          </MiningSchema>
          <Node id="1">
            <True/>
            <ScoreDistribution value="bad" recordCount="2"/>
            <ScoreDistribution value="good" recordCount="3"/>
            <Node id="2" score="bad">
              <SimplePredicate field="f0" operator="lessOrEqual" value="0.5"/>
              <ScoreDistribution value="bad" recordCount="1"/>
              <ScoreDistribution value="good" recordCount="1"/>
            </Node>
            <Node id="3" score="good">
              <SimplePredicate field="f0" operator="greaterThan" value="0.5"/>
              <ScoreDistribution value="bad" recordCount="1"/>
              <ScoreDistribution value="good" recordCount="2"/>
            </Node>
          </Node>
        </TreeModel>
      </Segment>
    </Segmentation>
  </MiningModel>
</PMML>
View the visualized output of the Random Forest component.