Quickly create a custom pipeline

Machine Learning Designer uses pipelines to build and debug models. Start by planning and creating a pipeline, then organize the various components according to your processing and scheduling logic. In this example, a blank pipeline is created to build a binary classification model for heart disease prediction.

Prerequisites

Platform for AI (PAI) is activated and a workspace has been created.

Step 1: Create a pipeline

Go to Visualized Modeling (Designer), select a workspace, and open the Visualized Modeling (Designer) page. On that page, create a pipeline with the following parameters and open it.

Parameter	Description
Pipeline Name	Enter a custom name.
Data Storage	We recommend that you set this parameter to an OSS bucket path to store temporary data and models generated during runs. If not specified, the default storage of the workspace is used. The system automatically creates a temporary folder named `Data Storage Path/Task ID/Node ID` for each run, so you do not have to configure OSS paths for all components in your pipeline. This also facilitates unified management of your data.
Visibility	Visible to Me: The pipeline is created under the My Pipelines folder. Only you and the workspace administrator can see the pipeline. Visible to Current Workspace: The pipeline is created under the Pipelines Visible to Workspaces folder. Everyone in the workspace can see the pipeline.
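The folder layout described for the Data Storage parameter can be pictured with a short sketch. The bucket path and both IDs below are hypothetical placeholders. The system generates the real values, and the exact format of those IDs is not stated in this topic.

```python
def temp_folder(data_storage_path: str, task_id: str, node_id: str) -> str:
    """Join the parts in the order Data Storage Path/Task ID/Node ID."""
    # Drop a trailing slash on the base path so the result has no "//" segment.
    parts = [data_storage_path.rstrip("/"), task_id, node_id]
    return "/".join(parts)


# Hypothetical values, for illustration only.
example = temp_folder("oss://examplebucket/pai-designer/", "task-001", "node-sql-1")
print(example)
```

Because every node writes under its own folder, a single Data Storage setting is enough for the whole pipeline.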

Step 2: Prepare and preprocess data

Prepare the data source and complete data preprocessing before you build a model. This facilitates subsequent model training based on your business requirements.

Prepare data

In the pipeline that you created, add components in the Data Source/Target category to read data from a data source, such as MaxCompute or Object Storage Service (OSS). For more information, see the specific component documentation under Component reference: data source or destination. This topic uses the Read Table component to read the public data related to heart disease cases provided by PAI. For more information about datasets, see Heart disease datasets.


1. Add the Read Table component to the pipeline.
  In the left-side component list, click Data Source/Target and drag the Read Table component to the canvas on the right to read MaxCompute table data. A pipeline node named Read Table-1 is automatically generated on the canvas.
2. Configure the source data table on the node configuration page.
  Click the Read Table-1 node on the canvas, and enter the MaxCompute table name in the Table Name field in the node configuration section on the right. This topic uses the pai_online_project.heart_disease_prediction table, which contains the public heart disease data provided by PAI.
  Switch to the Fields Information tab in the node configuration section to view the field details of the public data.

Preprocess data

The heart disease prediction described in this topic is a binary classification problem. The logistic regression model component requires input data of the DOUBLE or BIGINT type. This section describes how to perform preprocessing operations, such as data type conversion, on data related to heart disease cases for model training.

Data preprocessing: Convert non-numeric fields into numeric fields.
1. Search for the SQL Script component and drag it to the canvas. A pipeline node named SQL Script-1 is generated.
2. Draw a line from the Read Table-1 node to the t1 input port of the SQL Script-1 node. This way, the Read Table-1 node becomes the data source of the SQL Script-1 node.
3. Configure the SQL Script-1 node.
  Click the SQL Script-1 node, and enter the following code in the SQL script editor on the right. On the Parameters Setting tab, t1 is displayed in the Input Source field.
```
-- Map each text-valued field to a numeric code; numeric fields pass through unchanged.
-- ifHealth is the label: 1 means status = 'sick', 0 means any other status.
select age,
(case sex when 'male' then 1 else 0 end) as sex,
(case cp when 'angina' then 0  when 'notang' then 1 else 2 end) as cp,
trestbps,
chol,
(case fbs when 'true' then 1 else 0 end) as fbs,
(case restecg when 'norm' then 0  when 'abn' then 1 else 2 end) as restecg,
thalach,
(case exang when 'true' then 1 else 0 end) as exang,
oldpeak,
(case slop when 'up' then 0  when 'flat' then 1 else 2 end) as slop,
ca,
(case thal when 'norm' then 0  when 'fix' then 1 else 2 end) as thal,
(case status  when 'sick' then 1 else 0 end) as ifHealth
from  ${t1};
```
4. Click Save in the upper-left corner of the canvas to save the pipeline settings.
5. Right-click the SQL Script-1 component, and click Run from Root Node To Here to debug and run the pipeline.
  The nodes in the pipeline run in sequence. After a node runs successfully, a success icon appears in the upper-right corner of the node box.
  Note
  You can also click the Run icon in the upper-left corner of the canvas to run the entire pipeline. If the pipeline is complex, we recommend that you run a specific node or a subset of nodes instead, which makes the pipeline easier to debug. If a node fails to run, right-click the node and select View Log to diagnose the failure.
6. After the pipeline is run, right-click a node, such as SQL Script-1, and select View Data to check whether the output data of the node is correct.
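To check the output of SQL Script-1 by hand, it can help to apply the same mapping rules to a single record. The sketch below mirrors the CASE expressions of the SQL script in Python. The sample record is made up for illustration and is not taken from the dataset.

```python
def encode_row(row: dict) -> dict:
    """Apply the same text-to-number rules as the SQL script to one record."""

    def pick(value, mapping, default):
        # Equivalent of: case <field> when ... then ... else <default> end
        return mapping.get(value, default)

    return {
        "age": row["age"],
        "sex": pick(row["sex"], {"male": 1}, 0),
        "cp": pick(row["cp"], {"angina": 0, "notang": 1}, 2),
        "trestbps": row["trestbps"],
        "chol": row["chol"],
        "fbs": pick(row["fbs"], {"true": 1}, 0),
        "restecg": pick(row["restecg"], {"norm": 0, "abn": 1}, 2),
        "thalach": row["thalach"],
        "exang": pick(row["exang"], {"true": 1}, 0),
        "oldpeak": row["oldpeak"],
        "slop": pick(row["slop"], {"up": 0, "flat": 1}, 2),
        "ca": row["ca"],
        "thal": pick(row["thal"], {"norm": 0, "fix": 1}, 2),
        # Label: 1 when status is 'sick', otherwise 0.
        "ifhealth": pick(row["status"], {"sick": 1}, 0),
    }


# Made-up record, for illustration only.
sample = {
    "age": 63, "sex": "male", "cp": "angina", "trestbps": 145, "chol": 233,
    "fbs": "true", "restecg": "other", "thalach": 150, "exang": "no",
    "oldpeak": 2.3, "slop": "flat", "ca": 0, "thal": "fix", "status": "sick",
}
encoded = encode_row(sample)
print(encoded)
```

Any value that the script does not list explicitly falls into the else branch, so unexpected spellings in the source data are silently encoded as the default code.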
Data preprocessing: Convert the fields into the DOUBLE data type to meet the input data requirements of the logistic regression model.
Drag the Data Type Conversion component to the canvas and connect the SQL Script-1 node to the Data Type Conversion-1 node by referring to the previous step. This way, the Data Type Conversion-1 node becomes the downstream node of the SQL Script-1 node. Click the Data Type Conversion-1 node. On the Fields Setting tab, click Select Fields in the Convert To Double Type Columns field, select all fields, and then convert the fields into the DOUBLE data type.
Data preprocessing: Normalize the data to convert the values of each feature to values ranging from 0 to 1. This removes the impact of dimensions on the prediction results.
Drag the Normalization component to the canvas and connect the Data Type Conversion-1 node to the Normalization-1 node by referring to the previous step. This way, the Normalization-1 node becomes the downstream node of the Data Type Conversion-1 node. Click the Normalization-1 node. On the Fields Setting tab, select all fields.
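This topic does not state the exact formula that the Normalization component uses. Min-max scaling is one common way to map a feature to the range 0 to 1, and the sketch below assumes that technique to show the effect on a single column.

```python
def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0 everywhere.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up ages, for illustration only.
ages = [20.0, 30.0, 40.0, 60.0]
normalized_ages = min_max_normalize(ages)
print(normalized_ages)  # [0.0, 0.25, 0.5, 1.0]
```

After scaling, a feature measured in large units (such as chol) no longer dominates a feature measured in small units (such as oldpeak).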
Data preprocessing: Split the data into a training dataset and a prediction dataset for subsequent model training and prediction.
Drag the Split component to the canvas and connect the Normalization-1 node to the Split-1 node. This way, the Split-1 node becomes the downstream node of the Normalization-1 node. After the Split-1 node is run, two data tables are generated.
By default, the Split component splits data into a model training set and a model prediction set at a ratio of 4:1. Click the Split-1 node. On the Parameters Setting tab on the right, specify the Splitting Fraction parameter. For more information about other parameters, see Split.
Right-click the Data Type Conversion-1 node, and click Run from Here to run the nodes in the pipeline from the Data Type Conversion-1 node.
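The 4:1 default corresponds to a splitting fraction of 0.8. The following sketch shows what such a split does to a list of rows. Random shuffling with a fixed seed is an assumption made for illustration; this topic does not describe how the Split component samples rows.

```python
import random


def split_rows(rows, fraction=0.8, seed=0):
    """Return (training_rows, prediction_rows) split at the given fraction."""
    shuffled = list(rows)
    # A fixed seed keeps the illustration reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


train_rows, predict_rows = split_rows(range(10))
print(len(train_rows), len(predict_rows))  # 8 2
```

The two returned lists play the roles of Output Table 1 (training) and Output Table 2 (prediction) of the Split-1 node.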

Step 3: Train the model

Each sample represents a patient who is either sick or healthy. Therefore, heart disease prediction is a binary classification problem. This section describes how to use the Logistic Regression for Binary Classification component to build a heart disease prediction model.


Drag the Logistic Regression for Binary Classification component to the canvas and connect Output Table 1 of the Split-1 node to the Logistic Regression for Binary Classification-1 node. This way, the Logistic Regression for Binary Classification-1 node becomes the downstream node of Output Table 1 of the Split-1 node.
Configure the Logistic Regression for Binary Classification-1 node.
Click the Logistic Regression for Binary Classification-1 node. On the Fields Setting tab on the right, select the ifhealth field for the Target Columns parameter, and select all fields except ifhealth for the Training Feature Columns parameter. For more information about other parameters, see Logistic Regression for Binary Classification.
Run the Logistic Regression for Binary Classification-1 node.
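For readers who want to see what a logistic regression model learns, the following is a minimal sketch that fits one with plain batch gradient descent on toy data. It is not the component's implementation. The optimizer, learning rate, epoch count, and data are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    """Probability that the label is 1 for feature vector x."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic_regression(features, labels, lr=0.5, epochs=2000):
    """Minimize the average log loss with batch gradient descent."""
    n, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            err = predict_proba(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy data: one normalized feature; the label is 1 when the feature is large.
X = [[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic_regression(X, y)
print(predict_proba(w, b, [0.0]), predict_proba(w, b, [1.0]))
```

In the pipeline, the feature vector holds every field except ifhealth, and ifhealth supplies the labels.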

Step 4: Use the model for prediction

Drag the Prediction component to the canvas and connect Output Table 2 of the Split-1 node and the Logistic Regression for Binary Classification-1 node to the Prediction-1 node. This way, the Prediction-1 node becomes the downstream node of Output Table 2 of the Split-1 node and the Logistic Regression for Binary Classification-1 node.
Click the Prediction-1 node. On the Fields Setting tab, select the ifhealth field for the Reserved Columns parameter, and select all fields except the ifhealth field for the Feature Columns parameter.
Run the Prediction-1 node and view the prediction results.
After the Prediction-1 node is run, right-click the Prediction-1 node, select View Data > Prediction Result Output Port, and then view the prediction data.
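The roles of the Feature Columns and Reserved Columns parameters can be sketched as follows: feature columns feed the model, while reserved columns are copied to the output unchanged so that the true label stays next to each prediction. The weights, the 0.5 threshold, and the output names `probability_sick` and `predicted_label` are hypothetical and are not the component's actual output schema.

```python
import math


def predict_rows(rows, weights, bias, feature_cols, reserved_cols, threshold=0.5):
    """Score each row and carry the reserved columns through unchanged."""
    results = []
    for row in rows:
        z = bias + sum(weights[c] * row[c] for c in feature_cols)
        probability = 1.0 / (1.0 + math.exp(-z))
        out = {c: row[c] for c in reserved_cols}
        out["probability_sick"] = probability
        out["predicted_label"] = 1 if probability >= threshold else 0
        results.append(out)
    return results


# Made-up model and normalized rows, for illustration only.
model_weights = {"age": 2.0, "chol": 1.0}
model_bias = -1.5
rows = [
    {"age": 0.9, "chol": 0.8, "ifhealth": 1},
    {"age": 0.1, "chol": 0.2, "ifhealth": 0},
]
predictions = predict_rows(rows, model_weights, model_bias, ["age", "chol"], ["ifhealth"])
for p in predictions:
    print(p)
```

Keeping ifhealth in the output is what allows a downstream evaluation step to compare predictions with the original labels.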

Step 5: Evaluate the model

Drag the Binary Classification Evaluation component to the canvas and connect the Prediction-1 node to the Binary Classification Evaluation-1 node. This way, the Binary Classification Evaluation-1 node becomes the downstream node of the Prediction-1 node.
Click the Binary Classification Evaluation-1 node. On the Fields Setting tab on the right, select the ifhealth field for the Original Label Column parameter.
Run the Binary Classification Evaluation-1 node and view the model evaluation results.
After the Binary Classification Evaluation-1 node is run, right-click the Binary Classification Evaluation-1 node, select Visual Analysis, and then view different evaluation metrics in a visualized manner.
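This topic does not list the metrics that the evaluation view shows. Confusion-matrix counts, precision, recall, F1 score, and AUC are common choices for binary classification, and the sketch below computes them from original labels and predicted scores. The labels, scores, and the 0.5 threshold are made up for illustration.

```python
def confusion_counts(labels, predicted):
    """Return (tp, fp, fn, tn) for 0/1 labels and 0/1 predictions."""
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up evaluation data, for illustration only.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.8, 0.2]
predicted = [1 if s >= 0.5 else 0 for s in scores]

counts = confusion_counts(labels, predicted)
precision, recall, f1 = precision_recall_f1(*counts[:3])
auc_value = auc(labels, scores)
print(counts, precision, recall, round(f1, 3), round(auc_value, 3))
```

Precision, recall, and F1 depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds.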

References

Machine Learning Designer provides a variety of templates that you can use to build models. For more information, see Demo for creating a pipeline by using a template.
Pipelines can be scheduled in Machine Learning Designer by using DataWorks tasks. For more information, see Use DataWorks tasks to schedule pipelines in Machine Learning Designer.
You can configure global variables in pipelines. This feature helps you manage online pipelines and use DataWorks tasks to schedule pipelines. This way, the flexibility and efficiency of pipelines are improved. For more information, see Advanced feature: global variable.
For billing details, see Billing of Machine Learning Designer.
For a list of all available components, see Component reference: Overview of all components.