Build models to predict hazy weather

This topic describes how to build models that predict hazy weather by analyzing air quality data collected in Beijing over one year. The models can also be used to identify the pollutant that is most likely to cause hazy weather. In this topic, the severity of hazy weather is measured by the PM 2.5 concentration.

Datasets

The sample experiment in this topic uses air quality data that was collected every hour in Beijing in 2016. The following table describes the fields of the data.

Field	Data type	Description
time	STRING	The date. This field is accurate to the day.
hour	STRING	The hour in which the data is collected.
pm2	STRING	The PM 2.5 index.
pm10	STRING	The PM 10 index.
so2	STRING	The sulfur dioxide index.
co	STRING	The carbon monoxide index.
no2	STRING	The nitrogen dioxide index.
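
Because every field is stored as STRING, the values must be converted to numbers before they can be analyzed. The following Python snippet is a minimal sketch of that conversion for a single row. The `AirQualityRecord` class, the `parse_record` function, and the sample values (including the date format) are hypothetical and are not part of the dataset or the platform.

```python
from dataclasses import dataclass

# Pollutant columns listed in the field table above.
POLLUTANT_FIELDS = ("pm2", "pm10", "so2", "co", "no2")


@dataclass
class AirQualityRecord:
    time: str    # the date, accurate to the day
    hour: int    # the hour in which the data was collected
    pm2: float
    pm10: float
    so2: float
    co: float
    no2: float


def parse_record(raw: dict) -> AirQualityRecord:
    """Convert one row whose values are all strings into typed values."""
    return AirQualityRecord(
        time=raw["time"],
        hour=int(raw["hour"]),
        **{name: float(raw[name]) for name in POLLUTANT_FIELDS},
    )


# Hypothetical row: the values are invented for illustration only.
sample = {"time": "2016-01-01", "hour": "8", "pm2": "215",
          "pm10": "260.5", "so2": "40", "co": "3.2", "no2": "95"}
record = parse_record(sample)
print(record)
```

A malformed value such as an empty string makes `float()` raise a `ValueError`, so bad rows surface early instead of silently entering the analysis.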

Go to the Machine Learning Designer page.
1. Log on to the Machine Learning Platform for AI console.
2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
3. In the left-side navigation pane, choose Model Training > Visualized Modeling (Designer) to go to the Machine Learning Designer page.

Create a pipeline.

1. On the Visualized Modeling (Designer) page, click the Preset Templates tab.
2. In the Air Quality Prediction section of the Preset Templates tab, click Create.
3. In the Create Pipeline dialog box, configure the parameters. You can use their default values.
  The Pipeline Data Path parameter specifies the Object Storage Service (OSS) bucket path that stores the temporary data and models generated when the pipeline runs.
4. Click OK.
  It takes about 10 seconds to create the pipeline.
5. On the Pipelines tab, double-click the pipeline that is created from the Air Quality Prediction template to open it.

View the components of the pipeline on the canvas as shown in the following figure. The system automatically creates the pipeline based on the preset template.

Section	Description
①	The components displayed in this section read and preprocess data. The data source component reads the source data. The type transform component converts the source data from the STRING type to the DOUBLE type. The sql component converts the values in the label column to binary values of 0 or 1. In this pipeline, the pm2 column is the label column, and values greater than 200 in this column indicate heavy hazy weather. The sql component marks values greater than 200 in the pm2 column as 1 and values less than or equal to 200 as 0. The following SQL statement provides an example: `select time,hour,(case when pm2>200 then 1 else 0 end),pm10,so2,co,no2 from ${t1};` The normalize component converts pollutant concentrations that have different units to unitless normalized values.
②	The components displayed in this section perform statistical analysis. The histograms component visualizes the distribution of each pollutant. For example, the following figure shows that most PM 2.5 concentrations fall in the interval of 11.74 to 15.61. A total of 430 PM 2.5 concentrations fall in this interval. The data view component visualizes the impact of different intervals of each pollutant on the results. For example, the following figure shows the data of the nitrogen dioxide concentration. When the nitrogen dioxide concentration falls in the interval of 112.33 to 113.9, the label value is 0 for seven records and 1 for nine records. This indicates that when the nitrogen dioxide concentration falls in the interval of 112.33 to 113.9, the occurrence probability of heavy hazy weather is high. Entropy and Gini indicate the impact of the feature interval on the target value in terms of the information amount. A larger value indicates a greater impact.
③	The components displayed in this section train models and make predictions. In this pipeline, the random forests and logistic regression components train the models.
④	The components displayed in this section evaluate the models.
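
The preprocessing and interval statistics described in the table can be sketched in plain Python. The snippet runs the SQL statement from section ① against a local SQLite table, applies min-max scaling, and computes label statistics for the nitrogen dioxide interval from section ②. Several things here are assumptions: the rows are invented, SQLite stands in for the platform's SQL engine, min-max scaling is only one possible normalization because the text does not name the method, and the entropy and Gini formulas are the textbook definitions, which may differ from what the data view component reports.

```python
import math
import sqlite3

# Hypothetical rows (time, hour, pm2, pm10, so2, co, no2); values are invented.
rows = [
    ("2016-01-01", 0, 250.0, 300.0, 60.0, 4.0, 110.0),
    ("2016-01-01", 1, 200.0, 180.0, 30.0, 2.0, 80.0),
    ("2016-01-01", 2, 35.0, 60.0, 10.0, 1.0, 40.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table t1 (time text, hour integer, pm2 real, pm10 real,"
    " so2 real, co real, no2 real)"
)
conn.executemany("insert into t1 values (?, ?, ?, ?, ?, ?, ?)", rows)

# Statement from section 1 with ${t1} replaced by the local table name.
# "order by hour" is added only to make the row order deterministic here.
labeled = conn.execute(
    "select time,hour,(case when pm2>200 then 1 else 0 end),pm10,so2,co,no2"
    " from t1 order by hour"
).fetchall()
labels = [row[2] for row in labeled]


def min_max_normalize(values):
    """Scale values to the range [0, 1] (assumed normalization method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def interval_stats(n_zero, n_one):
    """Return the share of label 1, Shannon entropy, and Gini impurity."""
    total = n_zero + n_one
    shares = [n_zero / total, n_one / total]
    entropy = -sum(p * math.log2(p) for p in shares if p > 0)
    gini = 1.0 - sum(p * p for p in shares)
    return shares[1], entropy, gini


pm10_scaled = min_max_normalize([row[3] for row in labeled])
share_one, entropy, gini = interval_stats(7, 9)  # counts from section 2

print(labels)
print(pm10_scaled)
print(share_one, entropy, gini)
```

The second row has a pm2 value of exactly 200 and therefore receives the label 0, because the statement marks only values greater than 200 as 1.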

Run the pipeline and view the results.
1. In the upper-left corner of the canvas, click the Run icon.
2. After the pipeline is run, right-click the evaluate component that is connected as a downstream component of the random forests component. In the shortcut menu that appears, click Visual Analysis.
3. In the evaluate section, click the Evaluation Chart tab to view the prediction results of the model that is trained by the random forests component.
  The area under the curve (AUC) value in the preceding figure indicates that the model trained by the random forests component predicts air quality with an accuracy higher than 99%.
4. Right-click the evaluate component that is connected as a downstream component of the logistic regression component. In the shortcut menu that appears, click Visual Analysis.
5. In the evaluate section, click the Evaluation Chart tab to view the prediction results of the model that is trained by the logistic regression component.
  The AUC value in the preceding figure indicates that the model trained by the logistic regression component predicts hazy weather with an accuracy higher than 98%.
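
To make the AUC value on the Evaluation Chart tab easier to interpret, the following Python snippet sketches one common way to compute it: as the share of (positive, negative) record pairs in which the positive record receives the higher predicted score. The `auc` function, the labels, and the scores are hypothetical, and the evaluate component may compute the value differently.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = heavy hazy weather) and predicted scores.
sample_labels = [1, 1, 0, 0, 0]
sample_scores = [0.9, 0.4, 0.5, 0.2, 0.1]
sample_auc = auc(sample_labels, sample_scores)
print(sample_auc)
```

In this sample, five of the six positive-negative pairs are ranked correctly, so the result is 5/6. A value of 1.0 means that every heavy haze record is scored above every other record, and 0.5 corresponds to random ranking.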