Random Forest - Platform For AI

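A random forest is a classifier made up of multiple decision trees. Its classification result is the mode of the classes output by the individual trees.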
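The voting rule can be illustrated with a minimal sketch. The function name `majority_vote` and the tie-breaking behavior (first class encountered wins) are assumptions made for illustration; the component's own tie handling is not documented here.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common class among the per-tree predictions.

    Ties are broken by first occurrence. That is an assumption of this
    sketch, not a documented behavior of the component.
    """
    if not tree_predictions:
        raise ValueError("at least one tree prediction is required")
    return Counter(tree_predictions).most_common(1)[0][0]

# Three hypothetical trees vote on one sample.
print(majority_vote(["good", "bad", "good"]))  # good
```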

Configure the component

You can configure the parameters of the Random Forest component in either of the following ways.

Method 1: Visual configuration

Configure the component parameters on the workflow page in Designer.

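Tab	Parameter	Description
Fields Setting	Feature Columns	By default, all columns except the label column and the weight column are selected.
	Excluded Columns	The columns that do not participate in training. This parameter cannot be used together with Feature Columns.
	Forced Conversion Columns	Columns are parsed according to the following rules: Columns of the STRING, BOOLEAN, and DATETIME types are parsed as discrete. Columns of the DOUBLE and BIGINT types are parsed as continuous. Note: To parse a BIGINT column as CATEGORICAL, you must specify the column with the forceCategorical parameter.
	Weight Column	The column used to weight each sample row. Numeric types are supported.
	Label Column	The label column of the input table. The STRING type and numeric types are supported.
Parameters Setting	Number of Trees in the Forest	Valid values: 1 to 1000.
	Position of Each Tree's Algorithm in the Forest	If the forest has N trees and algorithmTypes=[a,b], then: trees [0,a) use the ID3 algorithm, trees [a,b) use the CART algorithm, and trees [b,N) use the C4.5 algorithm. For example, in a forest with 5 trees, [2,4] means that trees 0 and 1 use ID3, trees 2 and 3 use CART, and tree 4 uses C4.5. If you enter None, the algorithms are distributed evenly across the forest.
	Number of Random Features per Tree	Valid values: [1,N], where N is the number of features.
	Minimum Number of Samples on a Leaf Node	Must be a positive integer. Default value: 2.
	Minimum Ratio of Leaf Node Samples to Parent Node Samples	Valid values: [0,1]. Default value: 0.
	Maximum Depth of a Tree	Valid values: [1,+∞). Default value: unlimited.
	Number of Random Samples Input to a Tree	Valid values: (1000,1000000]. Default value: 100000.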
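The algorithm position rule in the table above can be sketched as follows. The function name `assign_algorithms` is hypothetical, trees are assumed to be indexed from 0 to N-1 as in the documented 5-tree example, and the None case (even distribution) is not modeled because its exact split is not specified.

```python
def assign_algorithms(n_trees, a, b):
    """Map each tree index to an algorithm name for algorithmTypes=[a,b].

    Trees [0,a) use ID3, trees [a,b) use CART, and the remaining trees
    use C4.5. Tree indexes are assumed to start at 0.
    """
    if not 0 <= a <= b <= n_trees:
        raise ValueError("expected 0 <= a <= b <= n_trees")
    algorithms = []
    for index in range(n_trees):
        if index < a:
            algorithms.append("ID3")
        elif index < b:
            algorithms.append("CART")
        else:
            algorithms.append("C4.5")
    return algorithms

# The documented example: 5 trees with algorithmTypes=[2,4].
print(assign_algorithms(5, 2, 4))
# ['ID3', 'ID3', 'CART', 'CART', 'C4.5']
```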

Method 2: PAI command

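Configure the component parameters by using a PAI command. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.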

PAI -name randomforests
     -project algo_public
     -DinputTableName="pai_rf_test_input"
     -DmodelName="pai_rf_test_model"
     -DforceCategorical="f1"
     -DlabelColName="class"
     -DfeatureColNames="f0,f1"
     -DmaxRecordSize="100000"
     -DminNumPer="0"
     -DminNumObj="2"
     -DtreeNum="3";

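Parameter	Required	Description	Default value
inputTableName	Yes	The input table.	None
inputTablePartitions	No	The partitions of the input table that are used for training. The following formats are supported: Partition_name=value, and name1=value1/name2=value2 for multi-level partitions. Note: If you specify multiple partitions, separate them with commas (,).	All partitions
labelColName	Yes	The name of the label column in the input table.	None
modelName	Yes	The name of the output model.	None
treeNum	Yes	The number of trees in the forest. Valid values: 1 to 1000.	100
excludedColNames	No	The columns that do not participate in training. This parameter cannot be used together with featureColNames.	Empty
weightColName	No	The name of the weight column in the input table.	None
featureColNames	No	The names of the feature columns in the input table that are used for training.	All columns except labelColName and weightColName
forceCategorical	No	Columns are parsed according to the following rules: Columns of the STRING, BOOLEAN, and DATETIME types are parsed as discrete. Columns of the DOUBLE and BIGINT types are parsed as continuous. Note: To parse a BIGINT column as CATEGORICAL, you must specify the column with the forceCategorical parameter.	INT columns are parsed as continuous
algorithmTypes	No	The position of each tree's algorithm in the forest. If the forest has N trees and algorithmTypes=[a,b], then: trees [0,a) use the ID3 algorithm, trees [a,b) use the CART algorithm, and trees [b,N) use the C4.5 algorithm. For example, in a forest with 5 trees, [2,4] means that trees 0 and 1 use ID3, trees 2 and 3 use CART, and tree 4 uses C4.5. If you enter None, the algorithms are distributed evenly across the forest.	Algorithms are distributed evenly across the forest
randomColNum	No	The number of random features selected at each split when a single tree is generated. Valid values: [1,N], where N is the number of features.	log₂N
minNumObj	No	The minimum number of samples on a leaf node. Must be a positive integer.	2
minNumPer	No	The minimum ratio of the number of samples on a leaf node to the number of samples on its parent node. Valid values: [0,1].	0.0
maxTreeDeep	No	The maximum depth of a single tree. Valid values: [1,+∞).	Unlimited
maxRecordSize	No	The number of random samples input to a single tree. Valid values: (1000,1000000].	100000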
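To make the documented ranges concrete, the following sketch checks a parameter set before rendering it as a PAI command. The helper `build_pai_command` is hypothetical and is not part of any PAI SDK. It checks only the constraints listed in the parameter table and assumes the command layout used in this document.

```python
def build_pai_command(params):
    """Render a randomforests PAI command from a dict of -D parameters.

    Hypothetical helper: only the ranges listed in the parameter table
    are validated.
    """
    for required in ("inputTableName", "labelColName", "modelName", "treeNum"):
        if required not in params:
            raise ValueError(f"missing required parameter: {required}")
    if not 1 <= int(params["treeNum"]) <= 1000:
        raise ValueError("treeNum must be between 1 and 1000")
    if "minNumObj" in params and int(params["minNumObj"]) < 1:
        raise ValueError("minNumObj must be a positive integer")
    if "minNumPer" in params and not 0 <= float(params["minNumPer"]) <= 1:
        raise ValueError("minNumPer must be in [0,1]")
    if "maxRecordSize" in params and not 1000 < int(params["maxRecordSize"]) <= 1000000:
        raise ValueError("maxRecordSize must be in (1000,1000000]")
    if "featureColNames" in params and "excludedColNames" in params:
        raise ValueError("featureColNames and excludedColNames cannot be combined")

    lines = ["PAI -name randomforests", "     -project algo_public"]
    lines += [f'     -D{name}="{value}"' for name, value in params.items()]
    return "\n".join(lines) + ";"

command = build_pai_command({
    "inputTableName": "pai_rf_test_input",
    "modelName": "pai_rf_test_model",
    "labelColName": "class",
    "featureColNames": "f0,f1",
    "treeNum": 3,
})
print(command)
```

Validating locally in this way only catches out-of-range values early; the service remains the authority on which combinations it accepts.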

Example

Use the following SQL statement to generate training data.

create table pai_rf_test_input as
select * from
(
  select 1 as f0,2 as f1, "good" as class
  union all
  select 1 as f0,3 as f1, "good" as class
  union all
  select 1 as f0,4 as f1, "bad" as class
  union all
  select 0 as f0,3 as f1, "good" as class
  union all
  select 0 as f0,4 as f1, "bad" as class
)tmp;

Use the following PAI command to submit the parameters of the Random Forest component.

PAI -name randomforests
     -project algo_public
     -DinputTableName="pai_rf_test_input"
     -DmodelName="pai_rf_test_model"
     -DforceCategorical="f1"
     -DlabelColName="class"
     -DfeatureColNames="f0,f1"
     -DmaxRecordSize="100000"
     -DminNumPer="0"
     -DminNumObj="2"
     -DtreeNum="3";

View the model in PMML (Predictive Model Markup Language) format.

<?xml version="1.0" encoding="utf-8"?>
<PMML xmlns="http://www.dmg.org/PMML-4_2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="4.2" xsi:schemaLocation="http://www.dmg.org/PMML-4_2 http://www.dmg.org/v4-2/pmml-4-2.xsd">
  <Header copyright="Copyright (c) 2014, Alibaba Inc." description="">
    <Application name="ODPS/PMML" version="0.1.0"/>
    <Timestamp>Tue, 12 Jul 2016 07:04:48 GMT</Timestamp>
  </Header>
  <DataDictionary numberOfFields="2">
    <DataField name="f0" optype="continuous" dataType="integer"/>
    <DataField name="f1" optype="continuous" dataType="integer"/>
    <DataField name="class" optype="categorical" dataType="string">
      <Value value="bad"/>
      <Value value="good"/>
    </DataField>
  </DataDictionary>
  <MiningModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
    <MiningSchema>
      <MiningField name="f0" usageType="active"/>
      <MiningField name="f1" usageType="active"/>
      <MiningField name="class" usageType="target"/>
    </MiningSchema>
    <Segmentation multipleModelMethod="majorityVote">
      <Segment id="0">
        <True/>
        <TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
          <MiningSchema>
            <MiningField name="f0" usageType="active"/>
            <MiningField name="f1" usageType="active"/>
            <MiningField name="class" usageType="target"/>
          </MiningSchema>
          <Node id="1">
            <True/>
            <ScoreDistribution value="bad" recordCount="2"/>
            <ScoreDistribution value="good" recordCount="3"/>
            <Node id="2" score="good">
              <SimplePredicate field="f1" operator="equal" value="2"/>
              <ScoreDistribution value="good" recordCount="1"/>
            </Node>
            <Node id="3" score="good">
              <SimplePredicate field="f1" operator="equal" value="3"/>
              <ScoreDistribution value="good" recordCount="2"/>
            </Node>
            <Node id="4" score="bad">
              <SimplePredicate field="f1" operator="equal" value="4"/>
              <ScoreDistribution value="bad" recordCount="2"/>
            </Node>
          </Node>
        </TreeModel>
      </Segment>
      <Segment id="1">
        <True/>
        <TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
          <MiningSchema>
            <MiningField name="f0" usageType="active"/>
            <MiningField name="f1" usageType="active"/>
            <MiningField name="class" usageType="target"/>
          </MiningSchema>
          <Node id="1">
            <True/>
            <ScoreDistribution value="bad" recordCount="2"/>
            <ScoreDistribution value="good" recordCount="3"/>
            <Node id="2" score="good">
              <SimpleSetPredicate field="f1" booleanOperator="isIn">
                <Array n="2" type="integer">2 3</Array>
              </SimpleSetPredicate>
              <ScoreDistribution value="good" recordCount="3"/>
            </Node>
            <Node id="3" score="bad">
              <SimpleSetPredicate field="f1" booleanOperator="isNotIn">
                <Array n="2" type="integer">2 3</Array>
              </SimpleSetPredicate>
              <ScoreDistribution value="bad" recordCount="2"/>
            </Node>
          </Node>
        </TreeModel>
      </Segment>
      <Segment id="2">
        <True/>
        <TreeModel modelName="xlab_m_random_forests_1_75078_v0" functionName="classification" algorithmName="RandomForests">
          <MiningSchema>
            <MiningField name="f0" usageType="active"/>
            <MiningField name="f1" usageType="active"/>
            <MiningField name="class" usageType="target"/>
          </MiningSchema>
          <Node id="1">
            <True/>
            <ScoreDistribution value="bad" recordCount="2"/>
            <ScoreDistribution value="good" recordCount="3"/>
            <Node id="2" score="bad">
              <SimplePredicate field="f0" operator="lessOrEqual" value="0.5"/>
              <ScoreDistribution value="bad" recordCount="1"/>
              <ScoreDistribution value="good" recordCount="1"/>
            </Node>
            <Node id="3" score="good">
              <SimplePredicate field="f0" operator="greaterThan" value="0.5"/>
              <ScoreDistribution value="bad" recordCount="1"/>
              <ScoreDistribution value="good" recordCount="2"/>
            </Node>
          </Node>
        </TreeModel>
      </Segment>
    </Segmentation>
  </MiningModel>
</PMML>
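To see how the three segments in the PMML above combine under `multipleModelMethod="majorityVote"`, the following sketch transcribes each tree by hand and replays the five training rows. The function names are hypothetical, the trees are copied manually rather than parsed from the XML, and `tree_0` handles only the `f1` values that appear in the model (2, 3, and 4).

```python
from statistics import mode

def tree_0(f0, f1):
    # Segment 0: one child node per f1 value (2, 3 -> good; 4 -> bad).
    return {2: "good", 3: "good", 4: "bad"}[f1]

def tree_1(f0, f1):
    # Segment 1: f1 in {2, 3} -> good, otherwise bad.
    return "good" if f1 in (2, 3) else "bad"

def tree_2(f0, f1):
    # Segment 2: f0 <= 0.5 -> bad, f0 > 0.5 -> good.
    return "bad" if f0 <= 0.5 else "good"

def predict(f0, f1):
    """Majority vote over the three hand-transcribed trees."""
    votes = [tree(f0, f1) for tree in (tree_0, tree_1, tree_2)]
    return mode(votes)

# The rows created by the SQL statement in this example: (f0, f1, class).
training_rows = [
    (1, 2, "good"),
    (1, 3, "good"),
    (1, 4, "bad"),
    (0, 3, "good"),
    (0, 4, "bad"),
]

for f0, f1, label in training_rows:
    print(f0, f1, label, predict(f0, f1))
```

With three trees and two classes the vote cannot tie. On rows (1, 4) and (0, 3), the third tree is outvoted by the other two.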

View the visualized model output.