This topic describes the Pearson Correlation Coefficient, which measures the linear correlation between two features. The coefficient ranges from -1 to 1. The larger its absolute value, the stronger the correlation.
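For reference, the standard textbook definition of the sample coefficient is given below. The symbols are introduced here for illustration and do not appear elsewhere in this topic: x_i and y_i are the paired observations of the two features, and \bar{x} and \bar{y} are their means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

A value of r close to 1 or -1 indicates a strong positive or negative linear relationship, and a value close to 0 indicates little linear relationship.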
Scenarios
Variables to which Pearson Correlation Coefficient can be applied must meet the following requirements:
The standard deviation of neither variable is 0.
The variables are continuous and have a linear relationship.
The variables follow a bivariate normal distribution, or a unimodal distribution that resembles a normal distribution.
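The following minimal Python sketch illustrates why the standard deviation requirement matters. It is not how the pearson feature class is implemented. The function name `pearson` is a hypothetical helper that applies the textbook formula and rejects a variable with zero variance, because the denominator would otherwise be 0.

```python
import math


def pearson(x, y):
    """Return the sample Pearson correlation coefficient of two sequences.

    Hypothetical helper for illustration only.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Deviations of each observation from its mean.
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sxx = sum(d * d for d in dx)
    syy = sum(d * d for d in dy)
    # A constant variable has a standard deviation of 0, so r is undefined.
    if sxx == 0 or syy == 0:
        raise ValueError("the standard deviation of a variable is 0")
    sxy = sum(a * b for a, b in zip(dx, dy))
    return sxy / math.sqrt(sxx * syy)


print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```

The check covers only the first requirement. Linearity and the shape of the distribution are usually assessed separately, for example with a scatter plot.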
Pearson Correlation Coefficient is commonly used to measure the linear relationship between two features in a machine learning model. If two features are highly correlated, they carry largely the same information and may be interchangeable. In this case, you can discard one of them to keep the model effective.
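The idea of discarding one of two highly correlated features can be sketched in Python as follows. This is an illustration under stated assumptions, not part of the product: the helper names `pearson` and `redundant_features`, the sample data, and the 0.9 threshold are all hypothetical, and a suitable threshold depends on the model and data.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Sample Pearson correlation coefficient (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def redundant_features(columns, threshold=0.9):
    """Return the features to drop.

    For each pair whose |r| is at least `threshold` (an assumed cutoff),
    the feature that appears later in `columns` is marked as redundant.
    """
    dropped = []
    for a, b in combinations(columns, 2):
        if a in dropped or b in dropped:
            continue  # One of the pair has already been discarded.
        if abs(pearson(columns[a], columns[b])) >= threshold:
            dropped.append(b)
    return dropped


# Made-up sample data: the first two columns are the same quantity in different units.
data = {
    "length_min": [60, 120, 180, 240, 300],
    "length_hours": [1.0, 2.0, 3.0, 4.0, 5.0],
    "day_of_week": [3, 1, 4, 1, 5],
}
print(redundant_features(data))
```

Keeping the first feature of each pair is an arbitrary choice made for this sketch. In practice, the feature to keep is often chosen based on data quality or interpretability.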
Syntax
CREATE FEATURE feature_name WITH ( feature_class = '', x_cols = '', parameters=()) AS (SELECT select_expr [, select_expr] ... FROM table_reference)
Parameter description:
Parameter | Description | Example |
feature_name | The name of the feature. | pearson_001 |
feature_class | The type of the feature. Set the value to pearson. | pearson |
x_cols | The columns (variables) for which the coefficient is computed. Each value must be a floating-point number or an integer. Separate multiple columns with commas (,). | dx1,dx2 |
parameters | Custom parameters for creating the feature. The example in this topic sets null_strategy and categorical_feature. | categorical_feature='dx3' |
select_expr | The name of the column used to create the feature. | dx4 |
table_reference | The name of the table containing the column used to create the feature. | airlines_test_1000 |
Example
/*polar4ai*/CREATE FEATURE pearson_001 WITH ( feature_class = 'pearson', x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length', parameters=(null_strategy='mean', categorical_feature='Airline,Flight,AirportFrom,AirportTo,DayOfWeek')) AS (SELECT * FROM airlines_test_1000);