This topic provides an example of how to use MaxCompute SQLML and the logistic regression algorithm for binary classification of Platform for AI (PAI). In this example, the open source Mushroom Data Set is used to predict whether mushrooms are poisonous.
Prerequisites
An Alibaba Cloud account is created, and real-name verification is performed for the account. For more information, see Create an Alibaba Cloud account.
If you want to perform operations as a RAM user, make sure that the RAM user is available and is granted the related permissions. For more information, see Prepare a RAM user.
Procedure
Optional: Activate the MaxCompute service that uses the pay-as-you-go billing method, DataWorks Basic Edition, and the services of Platform for AI that use the pay-as-you-go billing method. The services of PAI include Machine Learning Designer, Deep Learning Containers (DLC), and Elastic Algorithm Service (EAS). All of these services must be activated in the same region.
Go to the product page of MaxCompute and click Buy Now.
For more information about how to activate MaxCompute, see Activate MaxCompute and DataWorks.
Note: If you have not activated MaxCompute and use this method to activate it, DataWorks Basic Edition and the MaxCompute service that uses the pay-as-you-go billing method are activated by default.
If you have activated the MaxCompute service that uses the pay-as-you-go billing method, skip this step.
Go to the buy page of DataWorks and activate DataWorks Basic Edition.
For more information about how to activate DataWorks, see Activate DataWorks.
Note: If you have activated DataWorks Basic Edition, skip this step.
Go to the Platform for AI console, activate the services of PAI, and create the default workspace.
For more information about how to activate the services of Platform for AI, see Activate PAI and create a default workspace.
Note: If you have activated the services of PAI and created a workspace, skip this step.
Download the Mushroom Data Set file named agaricus-lepiota.data and save the file as a TXT, CSV, or LOG file, such as agaricus-lepiota.data.txt.
Log on to the DataWorks console and create or configure a DataWorks workspace.
If a DataWorks workspace is available, go to the Workspace page to associate the MaxCompute compute engine with the DataWorks workspace and turn on Schedule PAI Task for the workspace.
In the left-side navigation pane of the DataWorks console, click Workspaces to go to the Workspaces page.
On the Workspaces page, find the desired workspace and click Manage in the Actions column.
In the left-side navigation pane of the SettingCenter page, click Workspace. In the Basic Properties section of the Basic Settings tab, turn on Schedule PAI Task.
In the left-side navigation pane, choose Data Sources > Data Sources. On the Data Sources page, add a MaxCompute data source. For more information about how to add a MaxCompute data source, see Add a MaxCompute data source.
If no DataWorks workspaces are available, create one. In the Create Workspace panel, associate the MaxCompute compute engine with the workspace and turn on Schedule PAI Nodes. For more information about how to create a DataWorks workspace, see Create a workspace.
In the DataWorks console, create a table named mushroom_classification and import the prepared dataset into the table.
On the Workspaces page, find the desired workspace and choose Shortcuts > Data Development in the Actions column. On the DataStudio page, create the table mushroom_classification.
For more information about how to create a table, see Create and manage MaxCompute tables.
The following DDL statement is used to create a table:
create table mushroom_classification (
    label                    string comment 'poisonous=p,edible=e',
    cap_shape                string comment 'bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s',
    cap_surface              string comment 'fibrous=f,grooves=g,scaly=y,smooth=s',
    cap_color                string comment 'brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y',
    bruises                  string comment 'bruises=t,no=f',
    odor                     string comment 'almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s',
    gill_attachment          string comment 'attached=a,descending=d,free=f,notched=n',
    gill_spacing             string comment 'close=c,crowded=w,distant=d',
    gill_size                string comment 'broad=b,narrow=n',
    gill_color               string comment 'black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y',
    stalk_shape              string comment 'enlarging=e,tapering=t',
    stalk_root               string comment 'bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?',
    stalk_surface_above_ring string comment 'fibrous=f,scaly=y,silky=k,smooth=s',
    stalk_surface_below_ring string comment 'fibrous=f,scaly=y,silky=k,smooth=s',
    stalk_color_above_ring   string comment 'brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y',
    stalk_color_below_ring   string comment 'brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y',
    veil_type                string comment 'partial=p,universal=u',
    veil_color               string comment 'brown=n,orange=o,white=w,yellow=y',
    ring_number              string comment 'none=n,one=o,two=t',
    ring_type                string comment 'cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z',
    spore_print_color        string comment 'black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y',
    population               string comment 'abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y',
    habitat                  string comment 'grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d'
);
Import the data in the agaricus-lepiota.data.txt file to the mushroom_classification table. In the Data Import Wizard dialog box, select Match by location and click Import Data.
For more information about how to upload data, see Create tables and upload data.
Use the ad hoc query feature of DataWorks, create an ODPS SQL node, and then execute SQL statements to verify the data import results.
For more information about the ad hoc query feature, see Use the ad-hoc query feature to execute SQL statements (optional).
Sample statement:
select * from mushroom_classification;
The following result is returned.
Use one-hot encoding to process data in the mushroom_classification table.
The logistic regression algorithm for binary classification requires that input fields be of a numeric type. Therefore, one-hot encoding is used to convert the enumerated string values into numeric values. For example, the valid values of cap_shape are b, c, x, f, k, and s. One-hot encoding converts these six enumerated values into six columns, one column per enumerated value. If the value of cap_shape is equal to the enumerated value that a column represents, the column is set to 1. Otherwise, it is set to 0. For example, cap_shape = x is encoded as 0, 0, 1, 0, 0, 0. A conceptual SQL sketch of this encoding follows the workflow step below.
Optional: Create a workflow. For example, you can create a workflow named mc_test.
For more information about how to create a workflow, see Create an auto triggered workflow.
Note: If you have created a workflow, skip this step.
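To make the encoding concrete before the UDTF-based conversion in the next step, the following sketch one-hot encodes only the cap_shape column with plain SQL CASE expressions. This is an illustration rather than part of the documented procedure, and the cap_shape_* column aliases are hypothetical names:
-- Conceptual sketch: one-hot encode the six cap_shape values with CASE
-- expressions. The next step converts all 22 feature columns at once
-- with a Java UDTF.
select
    case when cap_shape = 'b' then 1 else 0 end as cap_shape_b,
    case when cap_shape = 'c' then 1 else 0 end as cap_shape_c,
    case when cap_shape = 'x' then 1 else 0 end as cap_shape_x,
    case when cap_shape = 'f' then 1 else 0 end as cap_shape_f,
    case when cap_shape = 'k' then 1 else 0 end as cap_shape_k,
    case when cap_shape = 's' then 1 else 0 end as cap_shape_s,
    label
from mushroom_classification;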
Create an ODPS Script node, write code that uses one-hot encoding to process the imported data, and then write the processed data to a new table named mushroom_classification_one_hot.
For more information about how to create an ODPS Script node, see Develop a MaxCompute script task.
Sample statements:
create temporary function one_hot as 'onehot.OneHotEncoding' using
#CODE ('lang'='JAVA')
package onehot;

import com.aliyun.odps.udf.UDFException;
import com.aliyun.odps.udf.UDTF;
import com.aliyun.odps.udf.annotation.Resolve;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// The UDTF takes the 22 string feature columns as input and emits
// 126 bigint columns, one per enumerated value.
@Resolve({"string,string,string,string,string,string,string,string,string,string,"
        + "string,string,string,string,string,string,string,string,string,string,string,string"
        + "->"
        + "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,"
        + "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,"
        + "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,"
        + "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,"
        + "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,"
        + "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,"
        + "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,"
        + "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,"
        + "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,"
        + "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,"
        + "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,"
        + "bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,bigint,"
        + "bigint,bigint,bigint,bigint,bigint,bigint"})
public class OneHotEncoding extends UDTF {
    // Valid enumerated values of each input column, in input order.
    private static char[][] features = {
        { 'b','c','x','f','k','s'},                          // cap-shape
        { 'f','g','y','s'},                                  // cap-surface
        { 'n','b','c','g','r','p','u','e','w','y'},          // cap-color
        { 't','f'},                                          // bruises
        { 'a','l','c','y','f','m','n','p','s'},              // odor
        { 'a','d','f','n'},                                  // gill-attachment
        { 'c','w','d'},                                      // gill-spacing
        { 'b','n'},                                          // gill-size
        { 'k','n','b','h','g','r','o','p','u','e','w','y'},  // gill-color
        { 'e','t'},                                          // stalk-shape
        { 'b','c','u','e','z','r','?'},                      // stalk-root
        { 'f','y','k','s'},                                  // stalk-surface-above-ring
        { 'f','y','k','s'},                                  // stalk-surface-below-ring
        { 'n','b','c','g','o','p','e','w','y'},              // stalk-color-above-ring
        { 'n','b','c','g','o','p','e','w','y'},              // stalk-color-below-ring
        { 'p','u'},                                          // veil-type
        { 'n','o','w','y'},                                  // veil-color
        { 'n','o','t'},                                      // ring-number
        { 'c','e','f','l','n','p','s','z'},                  // ring-type
        { 'k','n','b','h','r','o','u','w','y'},              // spore-print-color
        { 'a','c','n','s','v','y'},                          // population
        { 'g','l','m','p','u','w','d'},                      // habitat
    };

    @Override
    public void process(Object[] objects) throws UDFException, IOException {
        // For each input column, emit 1 for the matching enumerated value
        // and 0 for every other value, for a total of 126 outputs.
        List<Long> featuresEncoding = new ArrayList<>(126);
        for (int i = 0; i < objects.length; i++) {
            String value = (String) objects[i];
            char[] feature = features[i];
            for (char c : feature) {
                featuresEncoding.add(value.charAt(0) == c ? 1L : 0L);
            }
        }
        forward(featuresEncoding.toArray());
    }
}
#END CODE;

create table mushroom_classification_one_hot as
select t.*, label
from mushroom_classification
lateral view one_hot(
    cap_shape, cap_surface, cap_color, bruises, odor, gill_attachment,
    gill_spacing, gill_size, gill_color, stalk_shape, stalk_root,
    stalk_surface_above_ring, stalk_surface_below_ring, stalk_color_above_ring,
    stalk_color_below_ring, veil_type, veil_color, ring_number, ring_type,
    spore_print_color, population, habitat
) t as
    f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15,f16,f17,f18,f19,f20,
    f21,f22,f23,f24,f25,f26,f27,f28,f29,f30,f31,f32,f33,f34,f35,f36,f37,f38,f39,f40,
    f41,f42,f43,f44,f45,f46,f47,f48,f49,f50,f51,f52,f53,f54,f55,f56,f57,f58,f59,f60,
    f61,f62,f63,f64,f65,f66,f67,f68,f69,f70,f71,f72,f73,f74,f75,f76,f77,f78,f79,f80,
    f81,f82,f83,f84,f85,f86,f87,f88,f89,f90,f91,f92,f93,f94,f95,f96,f97,f98,f99,f100,
    f101,f102,f103,f104,f105,f106,f107,f108,f109,f110,f111,f112,f113,f114,f115,f116,
    f117,f118,f119,f120,f121,f122,f123,f124,f125,f126;
Use the ad hoc query feature of DataWorks, create an ODPS SQL node, and then execute SQL statements to verify the results of the one-hot encoding.
Sample statement:
select * from mushroom_classification_one_hot;
The following result is returned.
Use the ad hoc query feature of DataWorks, create an ODPS SQL node, and then create a training dataset and a test dataset based on the data in mushroom_classification_one_hot.
Sample statements:
-- Create the training dataset. sample(4,1) selects about 25% of the data to train the model.
create table mushroom_training as
select * from mushroom_classification_one_hot
where sample(4,1);

-- Create the test dataset. The remaining 75% of the data is used for prediction and evaluation.
create table mushroom_predict as
select * from mushroom_classification_one_hot
except all
select * from mushroom_training;
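The split is random, so the exact row counts vary between runs. If you want to confirm the approximate 25/75 proportion, the following check is a sketch that is not part of the original procedure:
-- Optional sanity check: compare the sizes of the two datasets.
select * from (
    select 'training' as dataset, count(*) as cnt from mushroom_training
    union all
    select 'predict' as dataset, count(*) as cnt from mushroom_predict
) t;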
Create a machine learning model and make predictions.
Use the ad hoc query feature of DataWorks, create an ODPS SQL node, and then create a logistic regression model for binary classification based on the training dataset. The model is named lr_test_model.
Sample statement:
create model lr_test_model
with properties('model_type'='logisticregression_binary', 'goodValue'='p', 'maxIter'='1000')
as select * from mushroom_training;
Note: You can configure more parameters in properties. The parameters must be the same as the parameters in Platform for AI. For more information, see Linear SVM. The SQL engine extracts and executes the statement after as and saves the results in a temporary table. You can view the results on the Summary tab in the LogView of a job. The lifecycle of the temporary table is one day. After the lifecycle elapses, the temporary table is automatically deleted. If you want to drop the model, execute the drop offlinemodel lr_test_model statement.
Use the ad hoc query feature of DataWorks, create an ODPS SQL node, and then use ml_predict to predict the data in the test dataset based on the lr_test_model model.
Sample statement:
create table mushroom_predict_result as
select * from ml_predict(
    lr_test_model,
    (select * from mushroom_predict)
);
Note: The SQL engine saves the subquery results of ml_predict to a temporary table. The lifecycle of the temporary table is one day. After the lifecycle elapses, the temporary table is automatically deleted. The results of ml_predict can be placed in the FROM clause of an SQL statement. You can also execute an INSERT or CREATE TABLE AS statement to save the results to another table. For more information about ml_predict, see Supported model prediction functions.
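To illustrate the FROM clause usage that the note describes, the following sketch queries the ml_predict output directly without materializing it. This is an illustration rather than part of the documented procedure:
-- Hedged sketch: use ml_predict in a FROM clause instead of CREATE TABLE AS.
select count(*) as predicted_rows
from ml_predict(lr_test_model, (select * from mushroom_predict));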
Use the ad hoc query feature of DataWorks, create an ODPS SQL node, and then execute an SQL statement to view the prediction results in the mushroom_predict_result table.
Sample statement:
select * from mushroom_predict_result;
The following result is returned.
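You can also estimate the prediction accuracy manually from the saved table before running ml_evaluate. The following sketch assumes that the predicted label is stored in a column named prediction_result, as in PAI prediction components; verify the actual column name in your table:
-- Hedged sketch: fraction of test rows whose predicted label matches the
-- actual label. Assumes the output column is named prediction_result.
select sum(case when prediction_result = label then 1 else 0 end) / count(*) as accuracy
from mushroom_predict_result;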
Use ml_evaluate to evaluate the prediction accuracy of the model.
For more information about ml_evaluate, see Supported model evaluation functions.
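A hedged sample, assuming ml_evaluate accepts the same arguments as ml_predict (the model name and a query that returns labeled data):
-- Hedged sketch: evaluate lr_test_model against the labeled test dataset.
select * from ml_evaluate(
    lr_test_model,
    (select * from mushroom_predict)
);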