Linear Regression

The Linear Regression component is used to analyze the linear relationship between a dependent variable and one or more independent variables.

Configure the component

You can use one of the following methods to configure the Linear Regression component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Linear Regression component on the pipeline page of Machine Learning Designer of Machine Learning Platform for AI (PAI). Machine Learning Designer was formerly known as Machine Learning Studio. The following table describes the parameters.

Tab	Parameter	Description
Fields Setting	Feature Columns	The feature columns that are selected from the input table for training.
	Label Column	The label column. The columns of the DOUBLE and BIGINT types are supported.
	Use Sparse Format	Specifies whether the input data is in the sparse format. Data in the sparse format is presented by using key-value pairs.
	KV Pair Delimiter	The delimiter that is used to separate key-value pairs. Commas (,) are used by default.
	KV Delimiter	The delimiter that is used to separate keys and values. Colons (:) are used by default.
Parameters Setting	Maximum Iterations	The maximum number of iterations performed by the algorithm.
	Minimum Likelihood Deviance	The algorithm is terminated if the difference in log-likelihood between two consecutive iterations is less than the value specified by this parameter.
	Regularization Type	The regularization type. Valid values: L1, L2, and None.
	Regularization Coefficient	The regularization coefficient. This parameter does not take effect if the Regularization Type parameter is set to None.
	Generate Model Evaluation Table	Specifies whether to generate the model evaluation table. The metrics include R-Squared, adjusted R-Squared, AIC, degree of freedom, residual standard deviation, and residual deviance.
	Regression Coefficient Evaluation	Specifies whether to evaluate the regression coefficients. The metrics include the t value and p value, and the confidence interval is [2.5%,97.5%]. This parameter is valid only if Generate Model Evaluation Table is selected.
Tuning	Number of Computing Cores	The number of cores. By default, the system determines the value.
	Memory Size per Core	The memory size of each core. By default, the system determines the value.
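
To make the sparse format described in the table above more concrete, the following minimal Python sketch parses one sparse cell into key-value pairs. The function name parse_sparse_row is hypothetical, and treating keys as integer feature indexes and values as floating-point numbers is an assumption made for illustration. This is not the component's internal parser.

```python
def parse_sparse_row(text, item_delimiter=",", kv_delimiter=":"):
    """Parse one sparse-format cell into a {feature_index: value} dict.

    item_delimiter separates key-value pairs (KV Pair Delimiter).
    kv_delimiter separates a key from its value (KV Delimiter).
    Integer keys and float values are assumptions of this sketch.
    """
    features = {}
    for pair in text.split(item_delimiter):
        pair = pair.strip()
        if not pair:
            continue  # skip empty fragments, for example after a trailing delimiter
        key, value = pair.split(kv_delimiter, 1)
        features[int(key)] = float(value)
    return features


# Illustrative cell that uses the default delimiters.
print(parse_sparse_row("0:1.84,1:1"))
```

If the pairs in a table are separated by a different character, such as a space, pass that character as item_delimiter.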

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

PAI -name linearregression
    -project algo_public
    -DinputTableName=lm_test_input
    -DfeatureColNames=x
    -DlabelColName=y
    -DmodelName=lm_test_input_model_out;

Parameter	Required	Description	Default value
inputTableName	Yes	The name of the input table.	N/A
modelName	Yes	The name of the output model.	N/A
outputTableName	No	The name of the output model evaluation table. This parameter is required if the enableFitGoodness parameter is set to true.	N/A
labelColName	Yes	The label column. This parameter specifies the dependent variable. The columns of the DOUBLE and BIGINT types are supported. You can select only one column.	N/A
featureColNames	Yes	The feature columns. This parameter specifies the independent variables. If data in the input table is in the dense format, the columns of the DOUBLE and BIGINT types are supported. If the input data is in the sparse format, only the columns of the STRING type are supported.	N/A
inputTablePartitions	No	The partitions that are selected from the input table for training.	N/A
enableSparse	No	Specifies whether data in the input table is in the sparse format. Valid values: true and false.	false
itemDelimiter	No	The delimiter that is used to separate key-value pairs. This parameter is valid if the enableSparse parameter is set to true.	,
kvDelimiter	No	The delimiter that is used to separate keys and values. This parameter is valid if the enableSparse parameter is set to true.	:
maxIter	No	The maximum number of iterations performed by the algorithm.	100
epsilon	No	The minimum likelihood error. The algorithm is terminated if the difference in log-likelihood between two consecutive iterations is less than the value specified by this parameter.	0.000001
regularizedType	No	The regularization type. Valid values: l1, l2, and None.	None
regularizedLevel	No	The regularization coefficient. This parameter is invalid if the regularizedType parameter is set to None.	1
enableFitGoodness	No	Specifies whether to generate the model evaluation table. The metrics include R-Squared, adjusted R-Squared, AIC, degree of freedom, residual standard deviation, and residual deviance. Valid values: true and false.	false
enableCoefficientEstimate	No	Specifies whether to evaluate the regression coefficient. The metrics include the t value and p value, and the confidence interval is [2.5%,97.5%]. This parameter is valid if the enableFitGoodness parameter is set to true. Valid values: true and false.	false
lifecycle	No	The lifecycle of the output model evaluation table.	-1
coreNum	No	The number of cores used in computing.	Determined by the system
memSizePerCore	No	The memory size of each core. Valid values: 1024 to 20480 (20 × 1024). Unit: MB.	Determined by the system
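
The way maxIter and epsilon interact can be sketched with a small Python example. The optimizer that the component actually uses is not described here, so the sketch substitutes plain gradient descent on a one-feature model and uses the halved mean squared error as a stand-in for the log-likelihood (for Gaussian errors with a fixed variance, the log-likelihood is a negative multiple of the squared error plus a constant). The function fit_line and the learning_rate argument are hypothetical.

```python
def fit_line(xs, ys, max_iter=100, epsilon=1e-6, learning_rate=0.05):
    """Fit y = w * x + b by gradient descent.

    Stops when the loss changes by less than epsilon between two
    consecutive iterations, or after max_iter iterations.
    Returns (w, b, iterations_run).
    """
    n = len(xs)
    w, b = 0.0, 0.0

    def loss(w, b):
        # Halved mean squared error, used here in place of the log-likelihood.
        return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)

    previous = loss(w, b)
    for iteration in range(1, max_iter + 1):
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        current = loss(w, b)
        if abs(previous - current) < epsilon:
            return w, b, iteration  # converged before reaching max_iter
        previous = current
    return w, b, max_iter  # iteration budget exhausted


# Illustrative data that lies exactly on y = 2x.
w, b, iterations = fit_line([1, 2, 3, 4], [2, 4, 6, 8], max_iter=5000)
print(w, b, iterations)
```

A smaller epsilon or a larger maxIter lets the fit run longer and usually yields coefficients closer to the optimum, at the cost of more computation.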

Example

Execute the following SQL statements to generate test data:

  drop table if exists lm_test_input;
  create table lm_test_input as
  select
    *
  from
  (
    select 10 as y, 1.84 as x1, 1 as x2, '0:1.84 1:1' as sparsecol1
      union all
    select 20 as y, 2.13 as x1, 0 as x2, '0:2.13' as sparsecol1
      union all
    select 30 as y, 3.89 as x1, 0 as x2, '0:3.89' as sparsecol1
      union all
    select 40 as y, 4.19 as x1, 0 as x2, '0:4.19' as sparsecol1
      union all
    select 50 as y, 5.76 as x1, 0 as x2, '0:5.76' as sparsecol1
      union all
    select 60 as y, 6.68 as x1, 2 as x2, '0:6.68 1:2' as sparsecol1
      union all
    select 70 as y, 7.58 as x1, 0 as x2, '0:7.58' as sparsecol1
      union all
    select 80 as y, 8.01 as x1, 0 as x2, '0:8.01' as sparsecol1
      union all
    select 90 as y, 9.02 as x1, 3 as x2, '0:9.02 1:3' as sparsecol1
      union all
    select 100 as y, 10.56 as x1, 0 as x2, '0:10.56' as sparsecol1
  ) tmp;

Run the following PAI command to submit the parameters configured for the Linear Regression component:

PAI -name linearregression
    -project algo_public
    -DinputTableName=lm_test_input
    -DlabelColName=y
    -DfeatureColNames=x1,x2
    -DmodelName=lm_test_input_model_out
    -DoutputTableName=lm_test_input_conf_out
    -DenableCoefficientEstimate=true
    -DenableFitGoodness=true
    -Dlifecycle=1;

Run the following PAI command to submit the parameters configured for the Prediction component:

PAI -name prediction
    -project algo_public
    -DmodelName=lm_test_input_model_out
    -DinputTableName=lm_test_input
    -DoutputTableName=lm_test_input_predict_out
    -DappendColNames=y;

View the generated model evaluation table lm_test_input_conf_out.

+----------------------+---------------------+---------------------+--------+------------------------------------------+-------------+
| colname              | value               | tscore              | pvalue | confidenceinterval                       | p           |
+----------------------+---------------------+---------------------+--------+------------------------------------------+-------------+
| Intercept            | -6.42378496687763   | -2.2725755951390028 | 0.06   | {"2.5%": -11.964027, "97.5%": -0.883543} | coefficient |
| x1                   | 10.260063429838898  | 23.270944360826963  | 0.0    | {"2.5%": 9.395908, "97.5%": 11.124219}   | coefficient |
| x2                   | 0.35374498323846265 | 0.2949247320997519  | 0.81   | {"2.5%": -1.997160, "97.5%": 2.704650}   | coefficient |
| rsquared             | 0.9879675667384592  | NULL                | NULL   | NULL                                     | goodness    |
| adjusted_rsquared    | 0.9845297286637332  | NULL                | NULL   | NULL                                     | goodness    |
| aic                  | 59.331109494251805  | NULL                | NULL   | NULL                                     | goodness    |
| degree_of_freedom    | 7.0                 | NULL                | NULL   | NULL                                     | goodness    |
| standardErr_residual | 3.765777749448906   | NULL                | NULL   | NULL                                     | goodness    |
| deviance             | 99.26757440771128   | NULL                | NULL   | NULL                                     | goodness    |
+----------------------+---------------------+---------------------+--------+------------------------------------------+-------------+
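
The goodness metrics in this table are related to each other, and the following Python sketch cross-checks them by using textbook least-squares formulas. The values in the table agree with these formulas, but whether the component computes the metrics in exactly this way is an assumption. In particular, the parameter count k used for AIC and the factor 1.96 used for the confidence interval are inferred from the numbers in the table, not stated in the parameter descriptions.

```python
import math

# Values copied from the model evaluation table above.
n_samples = 10                      # rows in lm_test_input
n_features = 2                      # x1 and x2
r_squared = 0.9879675667384592
deviance = 99.26757440771128        # residual sum of squares
dof = n_samples - n_features - 1    # 7, the degree_of_freedom row

# Adjusted R-Squared penalizes R-Squared for the number of features.
adjusted_r_squared = 1 - (1 - r_squared) * (n_samples - 1) / dof

# Residual standard deviation from the deviance and the degrees of freedom.
residual_std_err = math.sqrt(deviance / dof)

# Gaussian AIC. Assumption: k counts the intercept, both slopes,
# and the error variance.
k = n_features + 2
aic = n_samples * (math.log(2 * math.pi * deviance / n_samples) + 1) + 2 * k


def confidence_interval(value, tscore, z=1.96):
    """Rebuild the [2.5%, 97.5%] interval from a coefficient and its t value.

    The standard error is value / tscore. Assumption: z = 1.96.
    """
    std_err = abs(value / tscore)
    return value - z * std_err, value + z * std_err


print(adjusted_r_squared, residual_std_err, aic)
print(confidence_interval(10.260063429838898, 23.270944360826963))
```

Compare the printed values with the adjusted_rsquared, standardErr_residual, and aic rows, and with the confidenceinterval cell of the x1 row.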

View the prediction result table lm_test_input_predict_out:

+------------+-------------------+--------------------+--------------------------+
| y          | prediction_result | prediction_score   | prediction_detail        |
+------------+-------------------+--------------------+--------------------------+
| 10         | NULL              | 12.808476727264404 | {"y": 12.8084767272644}  |
| 20         | NULL              | 15.43015013867922  | {"y": 15.43015013867922} |
| 30         | NULL              | 33.48786177519568  | {"y": 33.48786177519568} |
| 40         | NULL              | 36.565880804147355 | {"y": 36.56588080414735} |
| 50         | NULL              | 52.674180388994415 | {"y": 52.67418038899442} |
| 60         | NULL              | 62.82092871092313  | {"y": 62.82092871092313} |
| 70         | NULL              | 71.34749583130122  | {"y": 71.34749583130122} |
| 80         | NULL              | 75.75932310613193  | {"y": 75.75932310613193} |
| 90         | NULL              | 87.1832221199846   | {"y": 87.18322211998461} |
| 100        | NULL              | 101.92248485222113 | {"y": 101.9224848522211} |
+------------+-------------------+--------------------+--------------------------+
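
Each prediction_score value is the intercept plus the weighted sum of the feature values. The following Python sketch reproduces the first two rows from the coefficients in lm_test_input_conf_out. The predict function is a hypothetical helper for illustration and is not part of the Prediction component.

```python
# Coefficients copied from the model evaluation table lm_test_input_conf_out.
INTERCEPT = -6.42378496687763
COEFFICIENTS = {"x1": 10.260063429838898, "x2": 0.35374498323846265}


def predict(row):
    """Return intercept + sum(coefficient * feature value) for one input row."""
    return INTERCEPT + sum(COEFFICIENTS[name] * row[name] for name in COEFFICIENTS)


# First two rows of lm_test_input.
print(predict({"x1": 1.84, "x2": 1}))
print(predict({"x1": 2.13, "x2": 0}))
```

The printed values can be compared with the prediction_score column for the rows where y is 10 and 20.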