Use MaxCompute SQLML with a logistic regression binary classification model

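This topic uses the open-source Mushroom Data Set as an example to show how to quickly use MaxCompute SQLML and a machine learning logistic regression binary classification model to predict whether a mushroom is poisonous.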

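Prerequisites

Log on with your Alibaba Cloud account and complete real-name verification. For more information, see Prepare an Alibaba Cloud account.

If you want to perform the operations as a RAM user, make sure that the account is available and has been granted the required permissions. For more information, see Prepare a RAM user.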

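Procedure

Optional: Activate the pay-as-you-go MaxCompute service, the DataWorks (Basic Edition) service, and the Platform for AI service (pay-as-you-go PAI: Designer, DLC, and EAS). Activate all three services in the same region.
1. Go to the Alibaba Cloud MaxCompute product page and click Buy Now.
  For more information about activating MaxCompute, see Activate MaxCompute and DataWorks.
  Note
  If you have never activated MaxCompute, activating it this way also activates the DataWorks Basic Edition service (free) and the pay-as-you-go MaxCompute service by default.
  If you have already activated the pay-as-you-go MaxCompute service, skip this step.
2. Go to the DataWorks buy page and purchase the Basic Edition service.
  For more information about activating DataWorks, see Activate DataWorks.
  Note
  If you have already activated the DataWorks Basic Edition service, skip this step.
3. Go to the Platform for AI buy page, activate PAI, and create a default workspace.
  For more information about activating Platform for AI, see Activate PAI and create a default workspace.
  Note
  If you have already activated PAI and created a workspace, skip this step.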
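Download the Mushroom Data Set file agaricus-lepiota.data and save it as a TXT, CSV, or LOG file, for example, agaricus-lepiota.data.txt.
Log on to the DataWorks console, and create or configure a DataWorks workspace.
- If you already have a DataWorks workspace, configure the MaxCompute compute engine for the target workspace and turn on the Schedule PAI Algorithm Tasks switch.
  1. In the left-side navigation pane, click Workspaces to open the workspace list page.
  2. On the workspace list page, click Manage in the Actions column of the target workspace.
  3. In the left-side navigation pane, select Workspace. On the Basic Configuration tab, turn on Schedule PAI Algorithm Tasks in the Basic Properties section.
  4. In the left-side navigation pane, click Data Sources > Data Source List to open the data source page, and create a MaxCompute data source. For more information, see Create a MaxCompute data source.
- If you do not have a DataWorks workspace, create one. Set the compute engine service to MaxCompute and turn on the Schedule PAI Algorithm Tasks switch. For more information about creating a DataWorks workspace, see Create a workspace.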

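Use DataWorks to create the table mushroom_classification and import the prepared dataset.

In the Actions column of the target DataWorks workspace, click Quick Access > Data Development, and create the table mushroom_classification.

For more information about creating tables, see Create and use MaxCompute tables.

Sample DDL statement for creating the table: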

```
create table mushroom_classification (
    label      string               comment 'poisonous=p,edible=e',
    cap_shape  string               comment 'bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s',
    cap_surface string              comment 'fibrous=f,grooves=g,scaly=y,smooth=s',
    cap_color string                comment 'brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y',
    bruises string                  comment 'bruises=t,no=f',
    odor string                     comment 'almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s',
    gill_attachment string          comment 'attached=a,descending=d,free=f,notched=n',
    gill_spacing string             comment 'close=c,crowded=w,distant=d',
    gill_size string                comment 'broad=b,narrow=n',
    gill_color string               comment 'black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y',
    stalk_shape string              comment 'enlarging=e,tapering=t',
    stalk_root string               comment 'bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?',
    stalk_surface_above_ring string comment 'fibrous=f,scaly=y,silky=k,smooth=s',
    stalk_surface_below_ring string comment 'fibrous=f,scaly=y,silky=k,smooth=s',
    stalk_color_above_ring string   comment 'brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y',
    stalk_color_below_ring string   comment 'brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y',
    veil_type string                comment 'partial=p,universal=u',
    veil_color string               comment 'brown=n,orange=o,white=w,yellow=y',
    ring_number string              comment 'none=n,one=o,two=t',
    ring_type string                comment 'cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z',
    spore_print_color string        comment 'black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y',
    population string               comment 'abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y',
    habitat string                  comment 'grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d'
);
```

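Import the contents of the dataset file agaricus-lepiota.data.txt into the table mushroom_classification. Set the field matching method to match by position.
For more information about uploading data, see Upload local data.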
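
Matching by position pairs the first value of each line with the first column of the table, the second value with the second column, and so on. The following is a minimal Java sketch of that idea, not the DataWorks import implementation. The class name `PositionMappingSketch`, the method `mapByPosition`, and the sample row are illustrative assumptions; the column order is taken from the DDL above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PositionMappingSketch {
    // Column names in the same order as the DDL for mushroom_classification.
    static final String[] COLUMNS = {
        "label", "cap_shape", "cap_surface", "cap_color", "bruises", "odor",
        "gill_attachment", "gill_spacing", "gill_size", "gill_color",
        "stalk_shape", "stalk_root", "stalk_surface_above_ring",
        "stalk_surface_below_ring", "stalk_color_above_ring",
        "stalk_color_below_ring", "veil_type", "veil_color", "ring_number",
        "ring_type", "spore_print_color", "population", "habitat"
    };

    // Pairs the i-th comma-separated value with the i-th column name.
    static Map<String, String> mapByPosition(String line) {
        String[] values = line.split(",", -1);
        if (values.length != COLUMNS.length) {
            throw new IllegalArgumentException(
                "expected " + COLUMNS.length + " fields, got " + values.length);
        }
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < COLUMNS.length; i++) {
            row.put(COLUMNS[i], values[i].trim());
        }
        return row;
    }

    public static void main(String[] args) {
        // Illustrative row: the label first, then the 22 feature codes.
        String line = "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u";
        Map<String, String> row = mapByPosition(line);
        System.out.println(row.get("label") + " " + row.get("cap_shape") + " " + row.get("habitat"));
    }
}
```

Because the mapping depends only on order, the file must list its 23 values in the same order as the table columns.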
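Use the ad hoc query feature of DataWorks to create a MaxCompute ODPS SQL node, and run an SQL command to verify the import.
For more information about ad hoc queries, see Use an ad hoc query to run SQL statements (optional).
Sample command: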
```
select * from mushroom_classification;
```
The command returns the rows that were imported into mushroom_classification.

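Process the data imported into the table mushroom_classification by using one-hot encoding.

A logistic regression binary classification model requires numeric fields, so one-hot encoding is used here to convert enumerated values into numeric values. For example, cap_shape takes six values: b, c, x, f, k, and s. One-hot encoding turns these six enumerated values into six columns, one per value. A column is set to 1 when the value of cap_shape equals the enumerated value of that column, and to 0 otherwise.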
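
The encoding of a single field can be sketched in a few lines of Java. This is an illustrative, standalone example rather than the UDTF used later in this topic; the class name `OneHotSketch` and the method `encode` are assumptions made for the sketch.

```java
import java.util.Arrays;

public class OneHotSketch {
    // Enumerated values of cap_shape, in the order listed in the table comment.
    static final char[] CAP_SHAPE = {'b', 'c', 'x', 'f', 'k', 's'};

    // Returns one slot per category: 1 where the category equals the value, 0 elsewhere.
    static long[] encode(char value, char[] categories) {
        long[] encoded = new long[categories.length];
        for (int i = 0; i < categories.length; i++) {
            encoded[i] = (categories[i] == value) ? 1L : 0L;
        }
        return encoded;
    }

    public static void main(String[] args) {
        // 'x' (convex) is the third category, so only the third slot is 1.
        System.out.println(Arrays.toString(encode('x', CAP_SHAPE))); // prints [0, 0, 1, 0, 0, 0]
    }
}
```

Applying the same rule to all 22 feature columns produces 126 numeric columns in total, which matches the f1 to f126 output columns of the script in this topic.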

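Optional: Create a workflow, for example, mc_test.
For more information about creating a workflow, see Create an auto triggered workflow.
Note
If you already have a workflow, you can use it directly and skip this step.

Create a MaxCompute ODPS Script node and write code that one-hot encodes the imported data and writes the result to the new table mushroom_classification_one_hot.

For more information about creating an ODPS Script node, see Develop an ODPS Script task.

Sample command: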

```
create temporary function one_hot as 'onehot.OneHotEncoding' using
#CODE ('lang'='JAVA')
package onehot;

import com.aliyun.odps.udf.UDFException;
import com.aliyun.odps.udf.UDTF;
import com.aliyun.odps.udf.annotation.Resolve;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// 22 string inputs (one per feature column) -> 126 bigint outputs (one per enumerated value).
@Resolve({"string,string,string,string,string,string,string,string,string,string," +
        "string,string,string,string,string,string,string,string,string,string,string,string" +
        "->" +
        "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,"+
        "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint," +
        "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint," +
        "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint," +
        "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint," +
        "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint," +
        "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint," +
        "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint," +
        "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint," +
        "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint," +
        "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint," +
        "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint," +
        "bigint,bigint,bigint,bigint,bigint,bigint"})
public class OneHotEncoding extends UDTF {
  // One array per input column, listing its enumerated values in output order.
  private static char[][] features = {
          { 'b','c','x','f','k','s'}, //cap-shape
          { 'f','g','y','s'}, //cap-surface
          { 'n','b','c','g','r','p','u','e','w','y'}, //cap-color
          { 't','f'}, //bruises
          { 'a','l','c','y','f','m','n','p','s'}, //odor
          { 'a','d','f','n'}, //gill-attachment
          { 'c','w','d'}, //gill-spacing
          { 'b','n'}, //gill-size
          { 'k','n','b','h','g','r','o','p','u','e','w','y'}, //gill-color
          { 'e','t'}, //stalk-shape
          { 'b','c','u','e','z','r','?'}, //stalk-root
          { 'f','y','k','s'}, //stalk-surface-above-ring
          { 'f','y','k','s'}, //stalk-surface-below-ring
          { 'n','b','c','g','o','p','e','w','y'}, //stalk-color-above-ring
          { 'n','b','c','g','o','p','e','w','y'}, //stalk-color-below-ring
          { 'p','u'}, //veil-type
          { 'n','o','w','y'}, //veil-color
          { 'n','o','t'}, //ring-number
          { 'c','e','f','l','n','p','s','z'}, //ring-type
          { 'k','n','b','h','r','o','u','w','y'}, //spore-print-color
          { 'a','c','n','s','v','y'}, //population
          { 'g','l','m','p','u','w','d'}, //habitat
  };
  @Override
  public void process(Object[] objects) throws UDFException, IOException {
    List<Long> featuresEncoding = new ArrayList<>(126);
    for (int i = 0; i < objects.length; i++) {
      String value = (String)objects[i];
      // A null or empty field matches no enumerated value and yields all zeros.
      boolean hasValue = value != null && !value.isEmpty();
      char[] feature = features[i];
      for (char c : feature) {
        featuresEncoding.add(hasValue && value.charAt(0) == c ? 1L : 0L);
      }
    }
    // Emit a single output row with the 126 encoded values.
    forward(featuresEncoding.toArray());
  }
}

#END CODE;

create table mushroom_classification_one_hot as
select t.*, label
from mushroom_classification
lateral view 
one_hot(cap_shape,cap_surface,cap_color,bruises,odor,gill_attachment, 
        gill_spacing, gill_size, gill_color, stalk_shape,stalk_root ,
        stalk_surface_above_ring,stalk_surface_below_ring,stalk_color_above_ring,
        stalk_color_below_ring,veil_type,veil_color,ring_number,ring_type,spore_print_color,
        population,habitat) t
AS f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15,f16,f17,f18,f19,f20,
f21,f22,f23,f24,f25,f26,f27,f28,f29,f30,f31,f32,f33,f34,f35,f36,f37,f38,f39,f40,
f41,f42,f43,f44,f45,f46,f47,f48,f49,f50,f51,f52,f53,f54,f55,f56,f57,f58,f59,f60,
f61,f62,f63,f64,f65,f66,f67,f68,f69,f70,f71,f72,f73,f74,f75,f76,f77,f78,f79,f80,
f81,f82,f83,f84,f85,f86,f87,f88,f89,f90,f91,f92,f93,f94,f95,f96,f97,f98,f99,f100,
f101,f102,f103,f104,f105,f106,f107,f108,f109,f110,f111,f112,f113,f114,f115,f116,
f117,f118,f119,f120,f121,f122,f123,f124,f125,f126;
```

Use the ad hoc query feature of DataWorks to create a MaxCompute ODPS SQL node, and run an SQL command to verify the one-hot encoding result.
Sample command:
```
select * from mushroom_classification_one_hot;
```
The command returns the one-hot encoded rows: the columns f1 through f126, followed by label.

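Use the ad hoc query feature of DataWorks to create a MaxCompute ODPS SQL node, and create a training dataset and a test dataset from the data in the table mushroom_classification_one_hot.

Sample command: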

```
-- Training dataset: 1/4 of the data is used for model training.
create table mushroom_training as 
select * from mushroom_classification_one_hot where sample(4,1);

-- Test dataset: the remaining 3/4 of the data is used for prediction and evaluation.
create table mushroom_predict as 
select * from mushroom_classification_one_hot except all select * from mushroom_training;
```
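
The split above can be pictured as assigning every row to one of four buckets, keeping one bucket for training and the other three for testing. The following Java sketch illustrates only that idea. It assumes a simple bucketing by row index, which is not how the MaxCompute sample function selects rows, and the names `SplitSketch` and `split` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitSketch {
    // Returns two lists: index 0 holds the training rows, index 1 the test rows.
    static List<List<String>> split(List<String> rows, int parts) {
        List<String> training = new ArrayList<>();
        List<String> test = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            // Assumption: bucket by row index. A real engine would hash or randomize rows.
            if (i % parts == 0) {
                training.add(rows.get(i));
            } else {
                test.add(rows.get(i));
            }
        }
        return Arrays.asList(training, test);
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7");
        List<List<String>> parts = split(rows, 4);
        System.out.println(parts.get(0).size() + " training, " + parts.get(1).size() + " test");
    }
}
```

Every row lands in exactly one of the two lists, which mirrors how except all removes the training rows from the full table to form the test set.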

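Create a machine learning model and make predictions.
1. Use the ad hoc query feature of DataWorks to create a MaxCompute ODPS SQL node, and create the logistic regression binary classification model lr_test_model based on the training dataset.
  Sample command: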
```
create model lr_test_model
with properties('model_type'='logisticregression_binary', 'goodValue'='p','maxIter'='1000')
as select * from mushroom_training;
```
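  Note
  You can specify more parameters in properties. The parameters are the same as those on the Platform for AI platform. For more information, see Linear Support Vector Machine.
  The SQL engine extracts the query that follows as and runs it separately. The result is stored in a temporary table, which you can find in the Summary information of the job's Logview. The temporary table has a lifecycle of 1 day and is automatically reclaimed after it expires.
  To delete the model later, run the drop offlinemodel lr_test_model command.
2. Use the ad hoc query feature of DataWorks to create a MaxCompute ODPS SQL node, and use the built-in function ml_predict with the model lr_test_model to predict the data in the test dataset.
  Sample command: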
```
create table mushroom_predict_result as 
select * from ml_predict(
    lr_test_model, 
    (select * from mushroom_predict)
);
```
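  Note
  The SQL engine saves the result of the subquery passed to ml_predict in a temporary table. The temporary table has a lifecycle of 1 day and is automatically reclaimed after it expires.
  The result of ml_predict can be used directly in the from clause of an SQL query, or stored in another table by using an insert or create table as statement. For more information about ml_predict, see Supported prediction model functions.
3. Use the ad hoc query feature of DataWorks to create a MaxCompute ODPS SQL node, and run an SQL command to view the prediction results in the table mushroom_predict_result.
  Sample command: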
```
select * from mushroom_predict_result;
```
  The command returns the prediction results stored in mushroom_predict_result.
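
As background for the model used in the steps above, a logistic regression binary classifier scores a row by passing a weighted sum of its features through the sigmoid function. The Java sketch below shows that general formula only. It is not the PAI implementation, and the class name `LogisticScoreSketch`, the weights, and the bias are hypothetical values chosen for illustration.

```java
public class LogisticScoreSketch {
    // Maps any real number to the interval (0, 1).
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Probability of the positive class for one one-hot encoded feature vector.
    static double score(double[] weights, double bias, long[] features) {
        if (weights.length != features.length) {
            throw new IllegalArgumentException("weights and features must have the same length");
        }
        double z = bias;
        for (int i = 0; i < features.length; i++) {
            z += weights[i] * features[i];
        }
        return sigmoid(z);
    }

    public static void main(String[] args) {
        // Hypothetical weights for three one-hot columns; a trained model learns these.
        double[] weights = {2.0, -1.0, 0.5};
        long[] features = {1L, 0L, 0L};
        double p = score(weights, 0.0, features);
        System.out.println(p > 0.5 ? "positive class" : "negative class");
    }
}
```

In this topic the positive class is the one named by the goodValue property, which is p (poisonous).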
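Use the built-in function ml_evaluate to evaluate the prediction accuracy of the model.
For more information about ml_evaluate, see Supported evaluation model functions.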
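
Prediction accuracy is the share of rows whose predicted label equals the actual label. The following Java sketch computes that ratio on a few made-up labels. It illustrates the metric only and does not reproduce the output of ml_evaluate; the class name `AccuracySketch` and the sample labels are assumptions.

```java
public class AccuracySketch {
    // Fraction of positions where the predicted label equals the actual label.
    static double accuracy(String[] actual, String[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("inputs must be non-empty and of equal length");
        }
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i].equals(predicted[i])) {
                correct++;
            }
        }
        return (double) correct / actual.length;
    }

    public static void main(String[] args) {
        // Made-up labels: p = poisonous, e = edible.
        String[] actual = {"p", "e", "p", "e"};
        String[] predicted = {"p", "e", "e", "e"};
        System.out.println(accuracy(actual, predicted)); // prints 0.75
    }
}
```

For a task such as this one, it is also worth checking how many poisonous mushrooms are predicted as edible, because accuracy alone does not distinguish between the two kinds of error.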