The Linear Regression component is used to analyze the linear relationship between a dependent variable and one or more independent variables.
Configure the component
You can use one of the following methods to configure the Linear Regression component.
Method 1: Configure the component on the pipeline page
You can configure the parameters of the Linear Regression component on the pipeline page of Machine Learning Designer of Machine Learning Platform for AI (PAI). Machine Learning Designer was formerly known as Machine Learning Studio. The following table describes the parameters.
Tab | Parameter | Description |
Fields Setting | Feature Columns | The feature columns that are selected from the input table for training. |
Fields Setting | Label Column | The label column. Columns of the DOUBLE and BIGINT types are supported. |
Fields Setting | Use Sparse Format | Specifies whether the input data is in the sparse format. Data in the sparse format is represented as key-value pairs. |
Fields Setting | KV Pair Delimiter | The delimiter that is used to separate key-value pairs. Commas (,) are used by default. |
Fields Setting | KV Delimiter | The delimiter that is used to separate keys and values. Colons (:) are used by default. |
Parameters Setting | Maximum Iterations | The maximum number of iterations performed by the algorithm. |
Parameters Setting | Minimum Likelihood Deviance | The algorithm is terminated if the difference of log-likelihood between two iterations is less than the value specified by this parameter. |
Parameters Setting | Specifies the regularization type | The regularization type. Valid values: L1, L2, and None. |
Parameters Setting | Regularization Coefficient | The regularization coefficient. This parameter is invalid if the Specifies the regularization type parameter is set to None. |
Parameters Setting | Generate Model Evaluation Table | Specifies whether to generate the model evaluation table. The metrics include R-Squared, adjusted R-Squared, AIC, degree of freedom, residual standard deviation, and residual deviation. |
Parameters Setting | Regression Coefficient Evaluation | Specifies whether to evaluate the regression coefficients. The metrics include the t value and p value, and the confidence interval is [2.5%,97.5%]. This parameter is valid only if Generate Model Evaluation Table is selected. |
Tuning | Number of Computing Cores | The number of cores. By default, the system determines the value. |
Tuning | Memory Size per Core | The memory size of each core. By default, the system determines the value. |
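The interaction between Maximum Iterations, Minimum Likelihood Deviance, and Regularization Coefficient can be illustrated with a small Python sketch. This is not the optimizer that PAI uses, which is not described here. It assumes plain gradient descent on a squared-error objective as a stand-in for the log-likelihood, and the names fit_line, l2, and learning_rate are hypothetical.

```python
def fit_line(xs, ys, max_iter=100, epsilon=1e-6, l2=0.0, learning_rate=0.1):
    """Fit y = w * x + b by gradient descent; return (w, b, iterations_used).

    Illustrative stand-in only: max_iter plays the role of Maximum Iterations,
    epsilon the role of Minimum Likelihood Deviance, and l2 the role of an
    L2 Regularization Coefficient. learning_rate is a hypothetical extra knob.
    """
    n = len(xs)

    def objective(w, b):
        # Mean squared error (halved) plus an L2 penalty on the slope.
        sse = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys))
        return sse / (2 * n) + 0.5 * l2 * w * w

    w, b = 0.0, 0.0
    previous = objective(w, b)
    for iteration in range(1, max_iter + 1):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n + l2 * w
        grad_b = sum(residuals) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        current = objective(w, b)
        # Stop early once the objective changes by less than epsilon.
        if abs(previous - current) < epsilon:
            return w, b, iteration
        previous = current
    # The iteration cap was reached before the change dropped below epsilon.
    return w, b, max_iter


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b, used = fit_line(xs, ys, max_iter=10000, epsilon=1e-12)
print(round(w, 3), round(b, 3), used)
```

In this sketch, a smaller epsilon or a larger max_iter lets the loop run longer, and a positive l2 pulls the slope toward zero.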
Method 2: Use PAI commands
Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name linearregression
-project algo_public
-DinputTableName=lm_test_input
-DfeatureColNames=x
-DlabelColName=y
-DmodelName=lm_test_input_model_out;
Parameter | Required | Description | Default value |
inputTableName | Yes | The name of the input table. | N/A |
modelName | Yes | The name of the output model. | N/A |
outputTableName | No | The name of the output model evaluation table. This parameter is required if the enableFitGoodness parameter is set to true. | N/A |
labelColName | Yes | The label column. This parameter specifies the dependent variable. The columns of the DOUBLE and BIGINT types are supported. You can select only one column. | N/A |
featureColNames | Yes | The feature columns. This parameter specifies the independent variables. If data in the input table is in the dense format, the columns of the DOUBLE and BIGINT types are supported. If the input data is in the sparse format, only the columns of the STRING type are supported. | N/A |
inputTablePartitions | No | The partitions that are selected from the input table for training. | N/A |
enableSparse | No | Specifies whether data in the input table is in the sparse format. Valid values: true and false. | false |
itemDelimiter | No | The delimiter that is used to separate key-value pairs. This parameter is valid if the enableSparse parameter is set to true. | , |
kvDelimiter | No | The delimiter that is used to separate keys and values. This parameter is valid if the enableSparse parameter is set to true. | : |
maxIter | No | The maximum number of iterations performed by the algorithm. | 100 |
epsilon | No | The minimum likelihood error. The algorithm is terminated if the difference of log-likelihood between two iterations is less than the value specified by this parameter. | 0.000001 |
regularizedType | No | The regularization type. Valid values: l1, l2, and None. | None |
regularizedLevel | No | The regularization coefficient. This parameter is invalid if the regularizedType parameter is set to None. | 1 |
enableFitGoodness | No | Specifies whether to generate the model evaluation table. The metrics include R-Squared, adjusted R-Squared, AIC, degree of freedom, residual standard deviation, and residual deviation. Valid values: true and false. | false |
enableCoefficientEstimate | No | Specifies whether to evaluate the regression coefficient. The metrics include the t value and p value, and the confidence interval is [2.5%,97.5%]. This parameter is valid if the enableFitGoodness parameter is set to true. Valid values: true and false. | false |
lifecycle | No | The lifecycle of the output model evaluation table. | -1 |
coreNum | No | The number of cores used in computing. | Determined by the system |
memSizePerCore | No | The memory size of each core. Valid values: 1024 to 20 × 1024. Unit: MB. | Determined by the system |
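The constraints in the preceding table can be checked before a command is submitted. The following Python sketch is a hypothetical client-side helper and is not part of PAI. The function names validate_params and render_command are assumptions, and the checks only restate the rules listed in the table.

```python
def validate_params(params):
    """Return a list of problems found in a dict of parameter values."""
    problems = []
    for required in ("inputTableName", "modelName", "labelColName", "featureColNames"):
        if not params.get(required):
            problems.append(f"{required} is required")
    fit_goodness = params.get("enableFitGoodness", False)
    if fit_goodness and not params.get("outputTableName"):
        problems.append("outputTableName is required if enableFitGoodness is true")
    if params.get("enableCoefficientEstimate", False) and not fit_goodness:
        problems.append("enableCoefficientEstimate is valid only if enableFitGoodness is true")
    if str(params.get("regularizedType", "None")) not in ("l1", "l2", "None"):
        problems.append("regularizedType must be l1, l2, or None")
    mem = params.get("memSizePerCore")
    if mem is not None and not 1024 <= mem <= 20 * 1024:
        problems.append("memSizePerCore must be between 1024 and 20480 (MB)")
    return problems


def render_command(params):
    """Render the parameters as a single-line PAI command string."""
    parts = ["PAI -name linearregression", "-project algo_public"]
    for key, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"-D{key}={value}")
    return " ".join(parts) + ";"


params = {
    "inputTableName": "lm_test_input",
    "labelColName": "y",
    "featureColNames": "x1,x2",
    "modelName": "lm_test_input_model_out",
    "outputTableName": "lm_test_input_conf_out",
    "enableCoefficientEstimate": True,
    "enableFitGoodness": True,
    "lifecycle": 1,
}
print(validate_params(params))
print(render_command(params))
```

The rendered string can be pasted into the SQL Script component. The helper does not contact PAI and cannot confirm that the tables or columns exist.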
Example
Execute the following SQL statements to generate test data:
drop table if exists lm_test_input;
create table lm_test_input as
select * from (
  select 10 as y, 1.84 as x1, 1 as x2, '0:1.84 1:1' as sparsecol1
  union all
  select 20 as y, 2.13 as x1, 0 as x2, '0:2.13' as sparsecol1
  union all
  select 30 as y, 3.89 as x1, 0 as x2, '0:3.89' as sparsecol1
  union all
  select 40 as y, 4.19 as x1, 0 as x2, '0:4.19' as sparsecol1
  union all
  select 50 as y, 5.76 as x1, 0 as x2, '0:5.76' as sparsecol1
  union all
  select 60 as y, 6.68 as x1, 2 as x2, '0:6.68 1:2' as sparsecol1
  union all
  select 70 as y, 7.58 as x1, 0 as x2, '0:7.58' as sparsecol1
  union all
  select 80 as y, 8.01 as x1, 0 as x2, '0:8.01' as sparsecol1
  union all
  select 90 as y, 9.02 as x1, 3 as x2, '0:9.02 1:3' as sparsecol1
  union all
  select 100 as y, 10.56 as x1, 0 as x2, '0:10.56' as sparsecol1
) tmp;
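The sparsecol1 column in the test data holds the same features as x1 and x2 in key-value form. The following Python sketch shows one way to read such values. It assumes that keys are zero-based feature indexes and that a missing key means 0, which matches the dense columns in the test data. The function names are hypothetical, and PAI's own parser may behave differently. The sample rows separate key-value pairs with a space, whereas the documented default for itemDelimiter is a comma, so the space is passed explicitly here.

```python
def parse_sparse(cell, item_delimiter=",", kv_delimiter=":"):
    """Parse a sparse cell such as '0:1.84 1:1' into {index: value}."""
    features = {}
    for item in cell.split(item_delimiter):
        item = item.strip()
        if not item:
            continue  # tolerate repeated or trailing delimiters
        key, value = item.split(kv_delimiter, 1)
        features[int(key)] = float(value)
    return features


def to_dense(features, width):
    """Expand {index: value} into a list of length width, filling gaps with 0.0."""
    return [features.get(i, 0.0) for i in range(width)]


row = parse_sparse("0:6.68 1:2", item_delimiter=" ")
print(row)               # feature 0 corresponds to x1, feature 1 to x2
print(to_dense(row, 2))
print(to_dense(parse_sparse("0:2.13", item_delimiter=" "), 2))
```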
Run the following PAI command to submit the parameters configured for the Linear Regression component:
PAI -name linearregression
-project algo_public
-DinputTableName=lm_test_input
-DlabelColName=y
-DfeatureColNames=x1,x2
-DmodelName=lm_test_input_model_out
-DoutputTableName=lm_test_input_conf_out
-DenableCoefficientEstimate=true
-DenableFitGoodness=true
-Dlifecycle=1;
Run the following PAI command to submit the parameters configured for the Prediction component:
PAI -name prediction
-project algo_public
-DmodelName=lm_test_input_model_out
-DinputTableName=lm_test_input
-DoutputTableName=lm_test_input_predict_out
-DappendColNames=y;
View the generated model evaluation table lm_test_input_conf_out.
+----------------------+---------------------+---------------------+--------+------------------------------------------+-------------+
| colname              | value               | tscore              | pvalue | confidenceinterval                       | p           |
+----------------------+---------------------+---------------------+--------+------------------------------------------+-------------+
| Intercept            | -6.42378496687763   | -2.2725755951390028 | 0.06   | {"2.5%": -11.964027, "97.5%": -0.883543} | coefficient |
| x1                   | 10.260063429838898  | 23.270944360826963  | 0.0    | {"2.5%": 9.395908, "97.5%": 11.124219}   | coefficient |
| x2                   | 0.35374498323846265 | 0.2949247320997519  | 0.81   | {"2.5%": -1.997160, "97.5%": 2.704650}   | coefficient |
| rsquared             | 0.9879675667384592  | NULL                | NULL   | NULL                                     | goodness    |
| adjusted_rsquared    | 0.9845297286637332  | NULL                | NULL   | NULL                                     | goodness    |
| aic                  | 59.331109494251805  | NULL                | NULL   | NULL                                     | goodness    |
| degree_of_freedom    | 7.0                 | NULL                | NULL   | NULL                                     | goodness    |
| standardErr_residual | 3.765777749448906   | NULL                | NULL   | NULL                                     | goodness    |
| deviance             | 99.26757440771128   | NULL                | NULL   | NULL                                     | goodness    |
+----------------------+---------------------+---------------------+--------+------------------------------------------+-------------+
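The coefficient rows can be cross-checked outside PAI. The following Python sketch solves the ordinary least squares normal equations for the test data with plain Gaussian elimination. It is an independent check under the assumption that no regularization is applied (the default), not a description of how the component computes its result. The function names solve and ols are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ols(rows, ys):
    """Return [intercept, coef_1, ...] that minimizes the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


# x1, x2, and y copied from the lm_test_input test data.
features = [(1.84, 1), (2.13, 0), (3.89, 0), (4.19, 0), (5.76, 0),
            (6.68, 2), (7.58, 0), (8.01, 0), (9.02, 3), (10.56, 0)]
labels = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

intercept, coef_x1, coef_x2 = ols(features, labels)
print(intercept, coef_x1, coef_x2)
```

The printed values should agree with the Intercept, x1, and x2 rows to several decimal places.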
View the prediction result table lm_test_input_predict_out.
+-----+-------------------+--------------------+--------------------------+
| y   | prediction_result | prediction_score   | prediction_detail        |
+-----+-------------------+--------------------+--------------------------+
| 10  | NULL              | 12.808476727264404 | {"y": 12.8084767272644}  |
| 20  | NULL              | 15.43015013867922  | {"y": 15.43015013867922} |
| 30  | NULL              | 33.48786177519568  | {"y": 33.48786177519568} |
| 40  | NULL              | 36.565880804147355 | {"y": 36.56588080414735} |
| 50  | NULL              | 52.674180388994415 | {"y": 52.67418038899442} |
| 60  | NULL              | 62.82092871092313  | {"y": 62.82092871092313} |
| 70  | NULL              | 71.34749583130122  | {"y": 71.34749583130122} |
| 80  | NULL              | 75.75932310613193  | {"y": 75.75932310613193} |
| 90  | NULL              | 87.1832221199846   | {"y": 87.18322211998461} |
| 100 | NULL              | 101.92248485222113 | {"y": 101.9224848522211} |
+-----+-------------------+--------------------+--------------------------+
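The goodness rows of lm_test_input_conf_out can be recomputed from the y and prediction_score columns. The following Python sketch uses the textbook definitions of these metrics. The AIC expression is the Gaussian log-likelihood form that counts the intercept and the error variance as parameters. It happens to reproduce the value in the evaluation table, but the exact definition used by the component is not stated here, so treat it as an assumption. The function name goodness_of_fit is hypothetical.

```python
import math


def goodness_of_fit(y, pred, n_features):
    """Compute fit metrics from labels and predictions (textbook definitions)."""
    n = len(y)
    mean_y = sum(y) / n
    sse = sum((a - p) ** 2 for a, p in zip(y, pred))  # residual sum of squares
    sst = sum((a - mean_y) ** 2 for a in y)           # total sum of squares
    dof = n - n_features - 1                          # residual degrees of freedom
    r2 = 1 - sse / sst
    return {
        "rsquared": r2,
        "adjusted_rsquared": 1 - (1 - r2) * (n - 1) / dof,
        # Assumed AIC form: -2 * log-likelihood + 2 * (n_features + 2).
        "aic": n * math.log(2 * math.pi * sse / n) + n + 2 * (n_features + 2),
        "degree_of_freedom": float(dof),
        "standardErr_residual": math.sqrt(sse / dof),
        "deviance": sse,
    }


# y and prediction_score copied from lm_test_input_predict_out.
y = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
pred = [12.808476727264404, 15.43015013867922, 33.48786177519568,
        36.565880804147355, 52.674180388994415, 62.82092871092313,
        71.34749583130122, 75.75932310613193, 87.1832221199846,
        101.92248485222113]

for name, value in goodness_of_fit(y, pred, n_features=2).items():
    print(name, value)
```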