Random Forest - Platform for AI (PAI)

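Random forest is a classifier composed of multiple decision trees. Its classification result is the mode of the classes output by the individual trees.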
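
The voting rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the component's implementation: the function name `majority_vote` is hypothetical, and the tie-breaking behavior (first label encountered wins) is an assumption, since the documentation does not state how ties are resolved.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common class label among the per-tree predictions."""
    if not tree_predictions:
        raise ValueError("at least one tree prediction is required")
    # most_common(1) yields [(label, count)] for the modal label.
    # Assumption: ties go to the label seen first; the real rule is undocumented.
    label, _count = Counter(tree_predictions).most_common(1)[0]
    return label


# Three trees vote; two of them say "good".
print(majority_vote(["good", "bad", "good"]))
```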

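Component configuration

You can configure the parameters of the Random Forest component in either of the following ways.

Method 1: Visual configuration

Configure the component parameters on the Designer workflow page.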

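Tab	Parameter	Description
Field settings	Feature columns	Defaults to all columns except the label column and the weight column.
	Excluded columns	Columns that do not participate in training. Cannot be used together with Feature columns.
	Force-converted columns	Parsing rules: columns of type STRING, BOOLEAN, or DATETIME are parsed as discrete; columns of type DOUBLE or BIGINT are parsed as continuous. Note: to parse a BIGINT column as CATEGORICAL, you must specify it with the forceCategorical parameter.
	Weight column name	A column that weights each sample row. Numeric types are supported.
	Label column	The label column of the input table. STRING and numeric types are supported.
Parameter settings	Number of trees in the forest	Value range: 1~1000.
	Position of each tree's algorithm in the forest	If the forest has N trees (indexed from 0) and algorithmTypes=[a,b], then trees in [0,a) use ID3, trees in [a,b) use CART, and trees in [b,N) use C4.5. For example, in a forest of 5 trees, [2,4] means that trees 0 and 1 use ID3, trees 2 and 3 use CART, and tree 4 uses C4.5. If None is entered, the algorithms are split evenly across the forest.
	Random features per tree	Value range: [1,N], where N is the number of features.
	Minimum number of samples in a leaf node	A positive integer. Default: 2.
	Minimum ratio of leaf-node samples to parent-node samples	Value range: [0,1]. Default: 0.
	Maximum depth of a single tree	Value range: [1,+∞). Default: unlimited.
	Number of random samples input to a single tree	Value range: (1000,1000000]. Default: 100000.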
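
The mapping from algorithmTypes=[a,b] to per-tree algorithms can be illustrated with a short Python sketch. The function `assign_algorithms` is a hypothetical helper written only to reproduce the 5-tree example from the table. It assumes trees are indexed from 0 and does not model the None case, because the exact even-split rule is not specified here.

```python
def assign_algorithms(tree_num, algorithm_types):
    """Map each tree index to ID3, CART, or C4.5 according to [a, b]."""
    a, b = algorithm_types
    if not 0 <= a <= b <= tree_num:
        raise ValueError("expected 0 <= a <= b <= tree_num")
    # Trees in [0, a) use ID3, trees in [a, b) use CART, the rest use C4.5.
    return [
        "ID3" if i < a else "CART" if i < b else "C4.5"
        for i in range(tree_num)
    ]


# The documented example: 5 trees with algorithmTypes=[2,4].
print(assign_algorithms(5, (2, 4)))
```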

Method 2: PAI command

Configure the component parameters with a PAI command. You can call PAI commands from the SQL Script component. For details, see SQL Script.

PAI -name randomforests
     -project algo_public
     -DinputTableName="pai_rf_test_input"
     -DmodelName="pai_rf_test_model"
     -DforceCategorical="f1"
     -DlabelColName="class"
     -DfeatureColNames="f0,f1"
     -DmaxRecordSize="100000"
     -DminNumPer="0"
     -DminNumObj="2"
     -DtreeNum="3";

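Parameter	Required	Description	Default
inputTableName	Yes	The input table.	None
inputTablePartitions	No	The partitions of the input table that participate in training. Supported formats: Partition_name=value for a single level; name1=value1/name2=value2 for multiple levels. Note: separate multiple partitions with commas (,).	All partitions
labelColName	Yes	The name of the label column in the input table.	None
modelName	Yes	The name of the output model.	None
treeNum	Yes	The number of trees in the forest. Value range: 1~1000.	100
excludedColNames	No	Columns that do not participate in training. Cannot be used together with featureColNames.	Empty
weightColName	No	The name of the weight column in the input table.	None
featureColNames	No	The names of the feature columns in the input table that are used for training.	All columns except labelColName and weightColName
forceCategorical	No	Parsing rules: columns of type STRING, BOOLEAN, or DATETIME are parsed as discrete; columns of type DOUBLE or BIGINT are parsed as continuous. Note: to parse a BIGINT column as CATEGORICAL, you must specify it with the forceCategorical parameter.	INT is treated as continuous
algorithmTypes	No	The position of each tree's algorithm in the forest. If the forest has N trees (indexed from 0) and algorithmTypes=[a,b], then trees in [0,a) use ID3, trees in [a,b) use CART, and trees in [b,N) use C4.5. For example, in a forest of 5 trees, [2,4] means that trees 0 and 1 use ID3, trees 2 and 3 use CART, and tree 4 uses C4.5. If None is entered, the algorithms are split evenly across the forest.	Algorithms split evenly across the forest
randomColNum	No	The number of random features chosen at each split when a single tree is generated. Value range: [1,N], where N is the number of features.	log₂N
minNumObj	No	The minimum number of samples in a leaf node. Must be a positive integer.	2
minNumPer	No	The minimum ratio of leaf-node samples to parent-node samples. Value range: [0,1].	0.0
maxTreeDeep	No	The maximum depth of a single tree. Value range: [1,+∞).	Unlimited
maxRecordSize	No	The number of random samples input to a single tree. Value range: (1000,1000000].	100000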
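
The constraints in the parameter table can be checked before a command is submitted. The following Python sketch is one way to do that. `build_pai_command` is a hypothetical helper, not part of any PAI SDK: it only validates a few of the documented ranges and renders the command text in the same layout as the example above. The parameter names and ranges come from the table; everything else is an assumption.

```python
def build_pai_command(params):
    """Validate a dict of -D parameters and render a randomforests command."""
    required = ("inputTableName", "labelColName", "modelName", "treeNum")
    missing = [name for name in required if name not in params]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    if not 1 <= int(params["treeNum"]) <= 1000:
        raise ValueError("treeNum must be in 1~1000")
    if "featureColNames" in params and "excludedColNames" in params:
        raise ValueError("featureColNames and excludedColNames are mutually exclusive")
    if not 0 <= float(params.get("minNumPer", 0)) <= 1:
        raise ValueError("minNumPer must be in [0,1]")
    if not 1000 < int(params.get("maxRecordSize", 100000)) <= 1000000:
        raise ValueError("maxRecordSize must be in (1000,1000000]")
    # Render one -D option per line, mirroring the documented command layout.
    lines = ["PAI -name randomforests", "     -project algo_public"]
    lines += [f'     -D{name}="{value}"' for name, value in params.items()]
    return "\n".join(lines) + ";"


print(build_pai_command({
    "inputTableName": "pai_rf_test_input",
    "modelName": "pai_rf_test_model",
    "labelColName": "class",
    "featureColNames": "f0,f1",
    "treeNum": 3,
}))
```

The helper only produces text; submitting the command still happens through the SQL Script component or another PAI entry point.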

Example

Use the following SQL statement to generate training data.

create table pai_rf_test_input as
select * from
(
  select 1 as f0,2 as f1, "good" as class
  union all
  select 1 as f0,3 as f1, "good" as class
  union all
  select 1 as f0,4 as f1, "bad" as class
  union all
  select 0 as f0,3 as f1, "good" as class
  union all
  select 0 as f0,4 as f1, "bad" as class
)tmp;

Use a PAI command to submit the parameters of the Random Forest component.

PAI -name randomforests
     -project algo_public
     -DinputTableName="pai_rf_test_input"
     -DmodelName="pai_rf_test_model"
     -DforceCategorical="f1"
     -DlabelColName="class"
     -DfeatureColNames="f0,f1"
     -DmaxRecordSize="100000"
     -DminNumPer="0"
     -DminNumObj="2"
     -DtreeNum="3";

View the model PMML (Predictive Model Markup Language).

<?xml version="1.0" encoding="utf-8"?>
<PMML xmlns="http://www.dmg.org/PMML-4_2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="4.2" xsi:schemaLocation="http://www.dmg.org/PMML-4_2 http://www.dmg.org/v4-2/pmml-4-2.xsd">
  <Header copyright="Copyright (c) 2014, Alibaba Inc." description="">
    <Application name="ODPS/PMML" version="0.1.0"/>
    <Timestamp>Tue, 12 Jul 2016 07:04:48 GMT</Timestamp>
  </Header>
  <DataDictionary numberOfFields="2">
    <DataField name="f0" optype="continuous" dataType="integer"/>
    <DataField name="f1" optype="continuous" dataType="integer"/>
    <DataField name="class" optype="categorical" dataType="string">
      <Value value="bad"/>
      <Value value="good"/>
    </DataField>
  </DataDictionary>
  <MiningModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
    <MiningSchema>
      <MiningField name="f0" usageType="active"/>
      <MiningField name="f1" usageType="active"/>
      <MiningField name="class" usageType="target"/>
    </MiningSchema>
    <Segmentation multipleModelMethod="majorityVote">
      <Segment id="0">
        <True/>
        <TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
          <MiningSchema>
            <MiningField name="f0" usageType="active"/>
            <MiningField name="f1" usageType="active"/>
            <MiningField name="class" usageType="target"/>
          </MiningSchema>
          <Node id="1">
            <True/>
            <ScoreDistribution value="bad" recordCount="2"/>
            <ScoreDistribution value="good" recordCount="3"/>
            <Node id="2" score="good">
              <SimplePredicate field="f1" operator="equal" value="2"/>
              <ScoreDistribution value="good" recordCount="1"/>
            </Node>
            <Node id="3" score="good">
              <SimplePredicate field="f1" operator="equal" value="3"/>
              <ScoreDistribution value="good" recordCount="2"/>
            </Node>
            <Node id="4" score="bad">
              <SimplePredicate field="f1" operator="equal" value="4"/>
              <ScoreDistribution value="bad" recordCount="2"/>
            </Node>
          </Node>
        </TreeModel>
      </Segment>
      <Segment id="1">
        <True/>
        <TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
          <MiningSchema>
            <MiningField name="f0" usageType="active"/>
            <MiningField name="f1" usageType="active"/>
            <MiningField name="class" usageType="target"/>
          </MiningSchema>
          <Node id="1">
            <True/>
            <ScoreDistribution value="bad" recordCount="2"/>
            <ScoreDistribution value="good" recordCount="3"/>
            <Node id="2" score="good">
              <SimpleSetPredicate field="f1" booleanOperator="isIn">
                <Array n="2" type="integer">2 3</Array>
              </SimpleSetPredicate>
              <ScoreDistribution value="good" recordCount="3"/>
            </Node>
            <Node id="3" score="bad">
              <SimpleSetPredicate field="f1" booleanOperator="isNotIn">
                <Array n="2" type="integer">2 3</Array>
              </SimpleSetPredicate>
              <ScoreDistribution value="bad" recordCount="2"/>
            </Node>
          </Node>
        </TreeModel>
      </Segment>
      <Segment id="2">
        <True/>
        <TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
          <MiningSchema>
            <MiningField name="f0" usageType="active"/>
            <MiningField name="f1" usageType="active"/>
            <MiningField name="class" usageType="target"/>
          </MiningSchema>
          <Node id="1">
            <True/>
            <ScoreDistribution value="bad" recordCount="2"/>
            <ScoreDistribution value="good" recordCount="3"/>
            <Node id="2" score="bad">
              <SimplePredicate field="f0" operator="lessOrEqual" value="0.5"/>
              <ScoreDistribution value="bad" recordCount="1"/>
              <ScoreDistribution value="good" recordCount="1"/>
            </Node>
            <Node id="3" score="good">
              <SimplePredicate field="f0" operator="greaterThan" value="0.5"/>
              <ScoreDistribution value="bad" recordCount="1"/>
              <ScoreDistribution value="good" recordCount="2"/>
            </Node>
          </Node>
        </TreeModel>
      </Segment>
    </Segmentation>
  </MiningModel>
</PMML>
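
To show how the three trees in this PMML combine into one prediction, the sketch below evaluates them in Python with the standard library. It is an illustration under stated assumptions, not a PMML engine: the embedded XML is a condensed, namespace-free copy of the trees above (MiningSchema and ScoreDistribution elements are dropped), only the predicates and operators that appear in this model are handled, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Condensed copy of the three root nodes from the PMML above (assumption:
# namespace and bookkeeping elements removed to keep the example short).
FOREST_XML = """
<Segmentation multipleModelMethod="majorityVote">
  <Node id="1"><True/>
    <Node id="2" score="good"><SimplePredicate field="f1" operator="equal" value="2"/></Node>
    <Node id="3" score="good"><SimplePredicate field="f1" operator="equal" value="3"/></Node>
    <Node id="4" score="bad"><SimplePredicate field="f1" operator="equal" value="4"/></Node>
  </Node>
  <Node id="1"><True/>
    <Node id="2" score="good"><SimpleSetPredicate field="f1" booleanOperator="isIn">
      <Array n="2" type="integer">2 3</Array></SimpleSetPredicate></Node>
    <Node id="3" score="bad"><SimpleSetPredicate field="f1" booleanOperator="isNotIn">
      <Array n="2" type="integer">2 3</Array></SimpleSetPredicate></Node>
  </Node>
  <Node id="1"><True/>
    <Node id="2" score="bad"><SimplePredicate field="f0" operator="lessOrEqual" value="0.5"/></Node>
    <Node id="3" score="good"><SimplePredicate field="f0" operator="greaterThan" value="0.5"/></Node>
  </Node>
</Segmentation>
"""

PREDICATE_TAGS = ("True", "SimplePredicate", "SimpleSetPredicate")


def matches(node, row):
    """Evaluate the predicate attached to one tree node against a row."""
    pred = next((child for child in node if child.tag in PREDICATE_TAGS), None)
    if pred is None or pred.tag == "True":
        return True
    value = float(row[pred.get("field")])
    if pred.tag == "SimplePredicate":
        target = float(pred.get("value"))
        return {
            "equal": value == target,
            "lessOrEqual": value <= target,
            "greaterThan": value > target,
        }[pred.get("operator")]
    # SimpleSetPredicate: membership test against the whitespace-separated array.
    members = {float(item) for item in pred.find("Array").text.split()}
    return (value in members) == (pred.get("booleanOperator") == "isIn")


def score(node, row):
    """Descend into the first matching child; return the deepest node's score."""
    for child in node.findall("Node"):
        if matches(child, row):
            return score(child, row)
    return node.get("score")


def predict(row):
    """Return the per-tree votes and the majority label for one row."""
    roots = ET.fromstring(FOREST_XML).findall("Node")
    votes = [score(root, row) for root in roots]
    return votes, Counter(votes).most_common(1)[0][0]


print(predict({"f0": 0, "f1": 3}))
```

For the row f0=0, f1=3, the first two trees vote "good" and the third votes "bad", so the majority label is "good".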

View the visualized model output.