Alibaba Cloud Machine Learning Platform for AI: Air Quality Forecasting

By Garvin Li

Air pollution is almost always among today's top trending topics. In cities such as Beijing, it is common to see people wearing masks while walking in the streets. Haze not only disrupts people's travel and leisure, but also does significant harm to their health. By analyzing real weather data from Beijing over the past year, this article finds that nitrogen dioxide is the pollutant most strongly associated with haze (that is, PM2.5), which points to it as the culprit for haze.

In this article, we will be creating a haze prediction model using the template from Alibaba Cloud Machine Learning Platform.

Dataset Introduction

Data source: Beijing weather index for the whole year of 2016.

The air index data for each hour since January 1, 2016 is collected. The detailed fields are as follows.

Data Exploration Procedure

The experiment process is as follows.

The entire experiment is divided into four parts:

Data import and preprocessing
Statistical analysis
Model training and prediction
Model evaluation and analysis

The details are as follows.

1. Data Import and Preprocessing

Data Import

Click Data Source and select Create Table. Data can be uploaded as .txt or .csv files.

After the data is imported, right click the component and select View Data. The result is as follows.

Data Preprocessing

Convert string-type data to a double type through the "Type Conversion" component.

Convert the target column to a binary label of 0 or 1 through the "SQL Script" component. In this experiment, "pm2" is the target column. Values above 200 are marked as 1 for heavy haze, and values of 200 or below are marked as 0. The SQL statement is as follows.

select time, hour, (case when pm2 > 200 then 1 else 0 end) as pm2, pm10, so2, co, no2 from ${t1};
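The same labeling rule can be sketched outside the platform in a few lines of Python. This is only an illustration of the threshold logic in the SQL statement; the function name and the sample readings are hypothetical and do not come from the data set.

```python
# Threshold from the article: PM2.5 readings above 200 count as heavy haze.
HEAVY_HAZE_THRESHOLD = 200.0


def label_heavy_haze(pm2, threshold=HEAVY_HAZE_THRESHOLD):
    """Return 1 when pm2 is strictly above the threshold, otherwise 0."""
    return 1 if pm2 > threshold else 0


# Hypothetical readings, for illustration only.
readings = [35.0, 200.0, 200.1, 412.0]
labels = [label_heavy_haze(value) for value in readings]
print(labels)  # [0, 0, 1, 1]
```

Note that a reading of exactly 200 is labeled 0, because the comparison is strict.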

Normalization

The main purpose of normalization is to remove units, so that pollutant indexes measured on different scales become directly comparable.
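As a rough illustration, assuming the component applies min-max scaling to the range 0 to 1 (the article does not state the exact formula), normalization can be sketched as follows. The sample values are made up.

```python
def min_max_normalize(values):
    """Scale a list of numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical readings on two very different scales.
so2 = [2.0, 10.0, 18.0]
co = [0.5, 1.0, 1.5]
print(min_max_normalize(so2))  # [0.0, 0.5, 1.0]
print(min_max_normalize(co))   # [0.0, 0.5, 1.0]
```

After scaling, both columns cover the same range, so their model coefficients can be compared with each other.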

2. Statistical Analysis

Histogram

The Histogram component allows you to visually view the distribution of different data in different ranges.

This experiment visually presents the distribution of data in each field. As shown in the figure below, taking PM2.5 as an example, the most populated range of values is 11.74 to 15.61, with a total of 430 records.
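The counting behind a histogram can be sketched as below, assuming equal-width bins between the smallest and largest value. The binning rule used by the Histogram component is not described in the article, and the sample values are hypothetical.

```python
def histogram(values, num_bins):
    """Count values in num_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        if width == 0:
            index = 0  # all values identical: everything goes in the first bin
        else:
            # The maximum value is placed in the last bin.
            index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


# Hypothetical PM2.5 readings, for illustration only.
pm2 = [4.0, 12.0, 13.0, 15.0, 30.0, 44.0]
edges, counts = histogram(pm2, 4)
print(edges)   # [4.0, 14.0, 24.0, 34.0, 44.0]
print(counts)  # [3, 1, 1, 1]
```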

Data View

The data view allows you to see the impact of different ranges of different indexes on the results.

As shown in the following figure, taking NO2 as an example, the range 112.33 to 113.9 contains 7 records whose target column is 0 and 9 records whose target column is 1. That is, when NO2 is in the range of 112.33 to 113.9, more than half of the records (9 of 16) correspond to heavy haze, so there is a high chance of severe haze. The entropy and Gini coefficient indicate the impact of this feature range on the target value from an information perspective, and the larger the value, the greater the impact.
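For reference, the textbook definitions of entropy and Gini impurity can be applied to the 7 and 9 counts from this range. This is a sketch under the assumption that the standard formulas apply; the data view may compute its entropy and Gini values differently (for example, relative to the whole data set), so the numbers below need not match the figure.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a class-count distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def gini(counts):
    """Gini impurity of a class-count distribution."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Counts from the NO2 range 112.33 to 113.9: 7 records with target 0, 9 with target 1.
bin_counts = [7, 9]
print(round(entropy(bin_counts), 4))  # 0.9887
print(round(gini(bin_counts), 4))     # 0.4922
```

Under these definitions, both values describe how mixed the two classes are within the range: a range containing only one class scores 0 on both.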

3. Model Training and Prediction

In this case, two different algorithms are used to predict and analyze the results: random forest and logistic regression.

Random Forest

The data set is split: 80% is used for model training and 20% is used for prediction. To inspect the trained model, click Model on the left side of the console and select Saved Models. Right click the model and select View Model; the tree model of the random forest is then shown visually as follows.
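An 80/20 split of this kind can be sketched as follows. The function name, the seed, and the use of a random shuffle are assumptions for illustration; the article does not say how the platform's split component selects rows.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows and cut it into training and prediction parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder rows standing in for hourly records.
rows = list(range(10))
train, test = train_test_split(rows)
print(len(train), len(test))  # 8 2
```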

The prediction result is as follows:

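The AUC in the above figure is 0.99, well above 0.9, which indicates that the weather index data used in this article separates heavy haze hours from other hours almost perfectly. AUC measures how well the model ranks haze records above non-haze records rather than the share of correct predictions, so it should not be read directly as an accuracy rate.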
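To make the meaning of AUC concrete, here is a minimal sketch that computes it as the fraction of (positive, negative) pairs in which the positive record receives the higher score, alongside plain accuracy at a 0.5 threshold. The labels and scores are invented for illustration and are not outputs of the models in this article.

```python
def auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def accuracy(labels, scores, threshold=0.5):
    """Share of records whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Hypothetical labels and scores: the ranking is perfect, but every score is below 0.5.
labels = [0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4]
print(auc(labels, scores))       # 1.0
print(accuracy(labels, scores))  # 0.6
```

The example shows that a model can have a perfect AUC and still a modest accuracy at a given threshold, which is why the two figures are reported separately.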

Logistic Regression

A linear model can be obtained by training with the logistic regression algorithm, as shown in the following figure.
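As a rough sketch of what such training involves, the following fits a logistic regression by plain batch gradient descent on a tiny made-up data set. The feature names, the data, the learning rate, and the number of epochs are all hypothetical, and the platform's own optimizer is likely to differ.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Made-up, already normalized rows: (x_strong, x_weak). The label follows x_strong only.
features = [
    (0.0, 0.9), (0.1, 0.2), (0.2, 0.7), (0.3, 0.1), (0.4, 0.6),
    (0.6, 0.3), (0.7, 0.8), (0.8, 0.4), (0.9, 0.5), (1.0, 0.0),
]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
weights, bias = fit_logistic(features, labels)
print(weights, bias)
```

On this toy data the coefficient of x_strong comes out positive and much larger in magnitude than that of x_weak, which is the kind of pattern one looks for when reading the coefficients of a model trained on normalized features.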

The prediction result is as follows:

The AUC in the above figure is 0.98, which is slightly lower than the result computed through the random forest algorithm. Setting aside the effect of parameter tuning on the results, this indicates that the random forest trains better on this data set.

4. Model Evaluation and Analysis

Based on the model and prediction results above, we analyze which air index has the greatest impact on PM2.5.

The logistic regression model generated is shown in the following figure:

After normalization, the larger the absolute value of a logistic regression coefficient, the greater that feature's impact on the result. A positive sign indicates positive correlation and a negative sign indicates negative correlation. In the above figure, PM10 and NO2 have the largest positive coefficients.

PM10 and PM2.5 differ only in particle size, and PM10 includes PM2.5, so this relationship can be ignored.
NO2 (nitrogen dioxide) therefore has the greatest impact on PM2.5.

NO2 emission is often reported as one of the main factors that create PM2.5 pollutants. Our results are consistent with findings from the scientific community, which also indicate that NO2 comes mainly from automobile exhaust emission.

To learn more about Alibaba Cloud Machine Learning Platform for Artificial Intelligence (PAI), visit www.alibabacloud.com/product/machine-learning
