In most cases, you need to prepare and preprocess the data that is required to build and test a model. The prepared data is then processed based on your business requirements for model development. This topic describes how to prepare and preprocess data in Machine Learning Platform for AI (PAI). In this example, public data provided by PAI is used.
Prerequisites
A pipeline is created. For more information, see Prepare data.
Step 1: Go to the pipeline configuration page
Log on to the PAI console. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to view.
In the left-side navigation pane of the workspace page, choose Visualized Modeling (Designer) to go to the Machine Learning Designer page.
On the Visualized Modeling (Designer) page, select the pipeline that you created and click Open.
Step 2: Prepare data
In this example, public data provided by PAI on heart disease cases is used. You can use the Read Table component to read the public data without the need to create a table or write data to the table.
In most cases, when you work with your own data, you need to prepare a table in MaxCompute or Object Storage Service (OSS). Then, you use a data source or destination component, such as Read Table, Write Table, or Read File Data, to read data from or write data to the table. For more information, see the topics in Component reference: data source or destination.
In the left-side component list, enter a keyword in the search box to search for the Read Table component.
Drag the Read Table component to the canvas on the right. A pipeline node named Read Table-1 is automatically generated.
Click the Read Table-1 component. On the Select Table tab in the right-side pane of the canvas, set the Table Name parameter to pai_online_project.heart_disease_prediction. This allows you to read the public data on heart disease cases. You can also click the Fields Information tab in the right-side pane of the canvas to view the details of the columns in the public data.
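If you want to inspect the public data outside of Designer, you can run a query in a MaxCompute SQL environment. The following is a minimal sketch that assumes your account is allowed to read the pai_online_project project:
-- Preview the first rows of the public heart disease dataset.
select * from pai_online_project.heart_disease_prediction limit 10;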
Step 3: Preprocess data
In this example, the public data on heart disease cases is used as raw data, and all field values of the raw data are normalized during preprocessing. To normalize the field values, perform the following steps:
Convert all non-numeric fields in the raw data into numeric fields by using an SQL statement. This ensures that all fields are numeric fields after preprocessing.
Convert the data in columns into the Double type. This ensures that the data meets the requirements for normalization.
Normalize the values in all columns in the table.
The following section describes the operations.
Convert non-numeric fields into numeric fields.
In the left-side component list, enter a keyword in the search box to search for the SQL Script component.
Drag the SQL Script component to the canvas on the right. A pipeline node named SQL Script-1 is automatically generated.
Draw a line from the Read Table-1 node to the t1 input port of the SQL Script-1 node. This connects the output of Read Table-1 to the t1 input port and makes SQL Script-1 a downstream node of Read Table-1.
Select the SQL Script-1 node. On the Parameters Setting tab in the right-side pane, t1 is automatically populated into the Input Source field. Then, enter the following SQL code in the SQL Script code editor:
select  age,
        (case sex when 'male' then 1 else 0 end) as sex,
        (case cp when 'angina' then 0 when 'notang' then 1 else 2 end) as cp,
        trestbps,
        chol,
        (case fbs when 'true' then 1 else 0 end) as fbs,
        (case restecg when 'norm' then 0 when 'abn' then 1 else 2 end) as restecg,
        thalach,
        (case exang when 'true' then 1 else 0 end) as exang,
        oldpeak,
        (case slop when 'up' then 0 when 'flat' then 1 else 2 end) as slop,
        ca,
        (case thal when 'norm' then 0 when 'fix' then 1 else 2 end) as thal,
        (case status when 'sick' then 1 else 0 end) as ifHealth
from ${t1};
Note: The SQL Script-1 node has four input ports: t1, t2, t3, and t4. In the preceding sample code, ${t1} indicates that the t1 port is used. If you use a different input port, that port name is automatically populated into the Input Source field on the Parameters Setting tab of the SQL Script-1 node, and you must update the port name in the preceding code to match.
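For example, if you connect the Read Table-1 node to the t2 port instead, only the final line of the query changes:
from ${t2};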
Click the Run icon in the upper part of the canvas. The Read Table-1 and SQL Script-1 nodes are run in sequence.
Convert the data in all columns into the Double type.
In the left-side component list, enter a keyword in the search box to search for the Data Type Conversion component.
Drag the Data Type Conversion component to the canvas on the right. A pipeline node named Data Type Conversion-1 is automatically generated.
Draw a line from the SQL Script-1 node to the Data Type Conversion-1 node.
Click the Data Type Conversion-1 component on the canvas. On the Fields Setting tab in the right-side pane, click Select Fields in the Convert to Double Type Columns section and select all columns. This converts the data in all columns to the Double type.
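For reference, the conversion that this node performs is equivalent to casting each column in SQL. The following sketch illustrates the idea for three of the columns and is not what the component runs internally; the remaining columns follow the same pattern:
-- Cast columns to DOUBLE so that they meet the input requirements
-- of the Normalization component.
select  cast(age as double) as age,
        cast(trestbps as double) as trestbps,
        cast(chol as double) as chol
        -- repeat for the remaining columns
from ${t1};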
Normalize field values.
In the left-side component list, enter a keyword in the search box to search for the Normalization component.
Drag the Normalization component to the canvas on the right. A pipeline node named Normalization-1 is automatically generated.
Draw a line from the Data Type Conversion-1 node to the Normalization-1 node.
Click the Normalization-1 component on the canvas. On the Fields Setting tab in the right-side pane, select all columns.
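The Normalization-1 node scales each selected column. Assuming the default min-max scaling, each value x is mapped to (x - min) / (max - min) so that the column falls in the [0, 1] range. The following SQL sketch illustrates the computation for the age column; the component performs the equivalent for every selected column:
-- Min-max normalization of one column: x' = (x - min) / (max - min).
select  (t.age - s.min_age) / (s.max_age - s.min_age) as age
from ${t1} t
cross join (
    select min(age) as min_age, max(age) as max_age from ${t1}
) s;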
In the left-side component list, enter a keyword in the search box to search for the Split component. Drag the Split component to the canvas on the right. A pipeline node named Split-1 is automatically generated. Draw a line from the Normalization-1 node to the Split-1 node.
By default, the Split component splits the raw data into a model training set and a model prediction set at a ratio of 4:1. To change the ratio, click the Split component and configure the Splitting Fraction parameter on the Parameters Setting tab.
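Conceptually, the split is similar to random sampling in SQL. The following sketch shows an approximate 4:1 split with hypothetical table names; unlike this query, the Split component produces an exact, complementary split on its two output ports:
-- Tag each row with a random number, then filter (hypothetical names).
create table tagged_rows as
select t.*, rand() as r from ${t1} t;

create table training_set as
select * from tagged_rows where r < 0.8;   -- about 4/5 of the rows

create table prediction_set as
select * from tagged_rows where r >= 0.8;  -- about 1/5 of the rows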
In the top toolbar of the canvas, click Save.
Step 4: Debug and run the pipeline
Right-click the Data Type Conversion-1 component on the canvas and select Run from Here to run the pipeline. The system runs the components in the pipeline in sequence. After a node runs successfully, a success icon appears in the node box. You can right-click a node that ran successfully and select View Data to check whether the output data is correct.
If the pipeline is complex, you can save and run the pipeline each time you add a node to the pipeline. If a node fails to run, you can right-click the node and select View Log to troubleshoot the failure.
What to do next
After the data is preprocessed, you can visualize the data to explore it. For more information, see Data Visualization.