Platform for AI: Linear Regression

Last Updated: May 17, 2024

The Linear Regression component models the linear relationship between a dependent variable and one or more independent variables.
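
For reference, the relationship that a linear regression fits is usually written as follows. The symbols are the conventional textbook ones and are an assumption of this note, not names used by the component:

```latex
y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon
```

Here y corresponds to the label column, x_1 through x_p to the feature columns, \beta_0 to the intercept, \beta_1 through \beta_p to the regression coefficients, and \varepsilon to the residual.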

Configure the component

You can use one of the following methods to configure the Linear Regression component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Linear Regression component on the pipeline page of Machine Learning Designer (formerly known as Machine Learning Studio) in Machine Learning Platform for AI (PAI). The following table describes the parameters.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Feature Columns | The feature columns that are selected from the input table for training. |
| Fields Setting | Label Column | The label column. Columns of the DOUBLE and BIGINT types are supported. |
| Fields Setting | Use Sparse Format | Specifies whether the input data is in the sparse format. Data in the sparse format is represented as key-value pairs. |
| Fields Setting | KV Pair Delimiter | The delimiter that is used to separate key-value pairs. Commas (,) are used by default. |
| Fields Setting | KV Delimiter | The delimiter that is used to separate keys and values. Colons (:) are used by default. |
| Parameters Setting | Maximum Iterations | The maximum number of iterations performed by the algorithm. |
| Parameters Setting | Minimum Likelihood Deviance | The algorithm terminates if the difference in log-likelihood between two consecutive iterations is less than the value of this parameter. |
| Parameters Setting | Specifies the regularization type | The regularization type. Valid values: L1, L2, and None. |
| Parameters Setting | Regularization Coefficient | The regularization coefficient. This parameter does not take effect if the Specifies the regularization type parameter is set to None. |
| Parameters Setting | Generate Model Evaluation Table | Specifies whether to generate the model evaluation table. The metrics include R-Squared, adjusted R-Squared, AIC, degrees of freedom, residual standard deviation, and residual deviance. |
| Parameters Setting | Regression Coefficient Evaluation | Specifies whether to evaluate the regression coefficients. The metrics include the t value, the p value, and the confidence interval [2.5%, 97.5%]. This parameter takes effect only if Generate Model Evaluation Table is selected. |
| Tuning | Number of Computing Cores | The number of cores. By default, the system determines the value. |
| Tuning | Memory Size per Core | The memory size of each core. By default, the system determines the value. |
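
To make the sparse format concrete, the following Python sketch parses one sparse cell into index-value pairs by using the two delimiters described above. It is an illustration only: the function names are hypothetical, and it assumes that keys are integer feature indexes and that indexes missing from a cell stand for 0, which the parameter descriptions do not state explicitly.

```python
def parse_sparse_row(text, item_delimiter=",", kv_delimiter=":"):
    """Parse one sparse-format cell such as "0:1.84,1:1" into {index: value}."""
    features = {}
    for item in text.split(item_delimiter):
        item = item.strip()
        if not item:
            continue  # tolerate empty segments such as a trailing delimiter
        key, value = item.split(kv_delimiter, 1)
        features[int(key)] = float(value)  # assumption: keys are integer indexes
    return features


def to_dense(features, width):
    """Expand {index: value} into a fixed-width list (assumption: missing means 0.0)."""
    return [features.get(i, 0.0) for i in range(width)]


row = parse_sparse_row("0:1.84,1:1")
print(row)               # {0: 1.84, 1: 1.0}
print(to_dense(row, 2))  # [1.84, 1.0]
```

If the key-value pairs in a column are separated by a character other than a comma, pass that character as item_delimiter, for example parse_sparse_row("0:6.68 1:2", item_delimiter=" ").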

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

PAI -name linearregression
    -project algo_public
    -DinputTableName=lm_test_input
    -DfeatureColNames=x
    -DlabelColName=y
    -DmodelName=lm_test_input_model_out;

| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input table. | N/A |
| modelName | Yes | The name of the output model. | N/A |
| outputTableName | No | The name of the output model evaluation table. This parameter is required if the enableFitGoodness parameter is set to true. | N/A |
| labelColName | Yes | The label column. This parameter specifies the dependent variable. Columns of the DOUBLE and BIGINT types are supported. You can select only one column. | N/A |
| featureColNames | Yes | The feature columns. This parameter specifies the independent variables. If the input data is in the dense format, columns of the DOUBLE and BIGINT types are supported. If the input data is in the sparse format, only columns of the STRING type are supported. | N/A |
| inputTablePartitions | No | The partitions that are selected from the input table for training. | N/A |
| enableSparse | No | Specifies whether data in the input table is in the sparse format. Valid values: true and false. | false |
| itemDelimiter | No | The delimiter that is used to separate key-value pairs. This parameter takes effect only if the enableSparse parameter is set to true. | , |
| kvDelimiter | No | The delimiter that is used to separate keys and values. This parameter takes effect only if the enableSparse parameter is set to true. | : |
| maxIter | No | The maximum number of iterations performed by the algorithm. | 100 |
| epsilon | No | The minimum likelihood error. The algorithm terminates if the difference in log-likelihood between two consecutive iterations is less than the value of this parameter. | 0.000001 |
| regularizedType | No | The regularization type. Valid values: l1, l2, and None. | None |
| regularizedLevel | No | The regularization coefficient. This parameter does not take effect if the regularizedType parameter is set to None. | 1 |
| enableFitGoodness | No | Specifies whether to generate the model evaluation table. The metrics include R-Squared, adjusted R-Squared, AIC, degrees of freedom, residual standard deviation, and residual deviance. Valid values: true and false. | false |
| enableCoefficientEstimate | No | Specifies whether to evaluate the regression coefficients. The metrics include the t value, the p value, and the confidence interval [2.5%, 97.5%]. This parameter takes effect only if the enableFitGoodness parameter is set to true. Valid values: true and false. | false |
| lifecycle | No | The lifecycle of the output model evaluation table. | -1 |
| coreNum | No | The number of cores used in computing. | Determined by the system |
| memSizePerCore | No | The memory size of each core. Valid values: 1024 to 20480 (20 × 1024). Unit: MB. | Determined by the system |
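
The dependencies between these parameters can be checked before a command is submitted. The following Python sketch assembles a linearregression command string and enforces only the rules stated in the table. The helper name build_linear_regression_command is hypothetical and is not part of PAI. Raising an error when enableCoefficientEstimate is set without enableFitGoodness is a design choice of this sketch; the table only says that the parameter does not take effect in that case.

```python
def build_linear_regression_command(params):
    """Return a PAI command string for the linearregression component.

    `params` maps parameter names from the table (without the -D prefix) to values.
    Hypothetical helper: it only checks rules that the parameter table states.
    """
    required = ("inputTableName", "modelName", "labelColName", "featureColNames")
    missing = [name for name in required if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))

    def is_true(name):
        # Accept both Python booleans and the strings "true"/"false".
        return str(params.get(name, "false")).lower() == "true"

    if is_true("enableFitGoodness") and "outputTableName" not in params:
        raise ValueError("outputTableName is required when enableFitGoodness is true")
    if is_true("enableCoefficientEstimate") and not is_true("enableFitGoodness"):
        raise ValueError("enableCoefficientEstimate needs enableFitGoodness=true")

    mem = params.get("memSizePerCore")
    if mem is not None and not 1024 <= int(mem) <= 20 * 1024:
        raise ValueError("memSizePerCore must be between 1024 and 20480 (MB)")

    lines = ["PAI -name linearregression", "    -project algo_public"]
    for name, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"    -D{name}={value}")
    return "\n".join(lines) + ";"


print(build_linear_regression_command({
    "inputTableName": "lm_test_input",
    "labelColName": "y",
    "featureColNames": "x1,x2",
    "modelName": "lm_test_input_model_out",
    "outputTableName": "lm_test_input_conf_out",
    "enableCoefficientEstimate": True,
    "enableFitGoodness": True,
    "lifecycle": 1,
}))
```

The printed string can be pasted into an SQL Script component in the same way as a handwritten command.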

Example

  1. Execute the following SQL statements to generate test data:

      drop table if exists lm_test_input;
      create table lm_test_input as
      select
        *
      from
      (
        select 10 as y, 1.84 as x1, 1 as x2, '0:1.84 1:1' as sparsecol1
          union all
        select 20 as y, 2.13 as x1, 0 as x2, '0:2.13' as sparsecol1
          union all
        select 30 as y, 3.89 as x1, 0 as x2, '0:3.89' as sparsecol1
          union all
        select 40 as y, 4.19 as x1, 0 as x2, '0:4.19' as sparsecol1
          union all
        select 50 as y, 5.76 as x1, 0 as x2, '0:5.76' as sparsecol1
          union all
        select 60 as y, 6.68 as x1, 2 as x2, '0:6.68 1:2' as sparsecol1
          union all
        select 70 as y, 7.58 as x1, 0 as x2, '0:7.58' as sparsecol1
          union all
        select 80 as y, 8.01 as x1, 0 as x2, '0:8.01' as sparsecol1
          union all
        select 90 as y, 9.02 as x1, 3 as x2, '0:9.02 1:3' as sparsecol1
          union all
        select 100 as y, 10.56 as x1, 0 as x2, '0:10.56' as sparsecol1
      ) tmp;
  2. Run the following PAI command to train the linear regression model and generate the model evaluation table:

    PAI -name linearregression
        -project algo_public
        -DinputTableName=lm_test_input
        -DlabelColName=y
        -DfeatureColNames=x1,x2
        -DmodelName=lm_test_input_model_out
        -DoutputTableName=lm_test_input_conf_out
        -DenableCoefficientEstimate=true
        -DenableFitGoodness=true
        -Dlifecycle=1;
  3. Run the following PAI command to run the Prediction component on the trained model:

    pai -name prediction
        -project algo_public
        -DmodelName=lm_test_input_model_out
        -DinputTableName=lm_test_input
        -DoutputTableName=lm_test_input_predict_out
        -DappendColNames=y;
  4. View the generated model evaluation table lm_test_input_conf_out:

    +----------------------+---------------------+---------------------+--------+------------------------------------------+-------------+
    | colname              | value               | tscore              | pvalue | confidenceinterval                       | p           |
    +----------------------+---------------------+---------------------+--------+------------------------------------------+-------------+
    | Intercept            | -6.42378496687763   | -2.2725755951390028 | 0.06   | {"2.5%": -11.964027, "97.5%": -0.883543} | coefficient |
    | x1                   | 10.260063429838898  | 23.270944360826963  | 0.0    | {"2.5%": 9.395908, "97.5%": 11.124219}   | coefficient |
    | x2                   | 0.35374498323846265 | 0.2949247320997519  | 0.81   | {"2.5%": -1.997160, "97.5%": 2.704650}   | coefficient |
    | rsquared             | 0.9879675667384592  | NULL                | NULL   | NULL                                     | goodness    |
    | adjusted_rsquared    | 0.9845297286637332  | NULL                | NULL   | NULL                                     | goodness    |
    | aic                  | 59.331109494251805  | NULL                | NULL   | NULL                                     | goodness    |
    | degree_of_freedom    | 7.0                 | NULL                | NULL   | NULL                                     | goodness    |
    | standardErr_residual | 3.765777749448906   | NULL                | NULL   | NULL                                     | goodness    |
    | deviance             | 99.26757440771128   | NULL                | NULL   | NULL                                     | goodness    |
    +----------------------+---------------------+---------------------+--------+------------------------------------------+-------------+
  5. View the prediction result table lm_test_input_predict_out:

    +-----+-------------------+--------------------+--------------------------+
    | y   | prediction_result | prediction_score   | prediction_detail        |
    +-----+-------------------+--------------------+--------------------------+
    | 10  | NULL              | 12.808476727264404 | {"y": 12.8084767272644}  |
    | 20  | NULL              | 15.43015013867922  | {"y": 15.43015013867922} |
    | 30  | NULL              | 33.48786177519568  | {"y": 33.48786177519568} |
    | 40  | NULL              | 36.565880804147355 | {"y": 36.56588080414735} |
    | 50  | NULL              | 52.674180388994415 | {"y": 52.67418038899442} |
    | 60  | NULL              | 62.82092871092313  | {"y": 62.82092871092313} |
    | 70  | NULL              | 71.34749583130122  | {"y": 71.34749583130122} |
    | 80  | NULL              | 75.75932310613193  | {"y": 75.75932310613193} |
    | 90  | NULL              | 87.1832221199846   | {"y": 87.18322211998461} |
    | 100 | NULL              | 101.92248485222113 | {"y": 101.9224848522211} |
    +-----+-------------------+--------------------+--------------------------+
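
As a cross-check of the numbers in steps 4 and 5, the following standard-library Python sketch fits the same ten rows with ordinary least squares by solving the normal equations. This is a swapped-in textbook technique, not the component's own solver, which is iterative (see maxIter and epsilon). With regularization turned off, the two approaches appear to agree closely on this data: the coefficients match the table to within about 1e-5. The AIC formula in the code is an inference that happens to reproduce the reported value, not a documented definition. All function names are hypothetical.

```python
import math

# Rows of lm_test_input from step 1: (y, x1, x2).
ROWS = [
    (10, 1.84, 1), (20, 2.13, 0), (30, 3.89, 0), (40, 4.19, 0), (50, 5.76, 0),
    (60, 6.68, 2), (70, 7.58, 0), (80, 8.01, 0), (90, 9.02, 3), (100, 10.56, 0),
]


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(rows):
    """Fit y = b0 + b1*x1 + b2*x2 via the normal equations (X^T X) b = X^T y."""
    design = [[1.0, x1, x2] for _, x1, x2 in rows]
    ys = [float(y) for y, _, _ in rows]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


def goodness(rows, coef):
    """Compute goodness-of-fit metrics named like the rows of the evaluation table."""
    n, p = len(rows), len(coef) - 1
    preds = [coef[0] + coef[1] * x1 + coef[2] * x2 for _, x1, x2 in rows]
    ys = [y for y, _, _ in rows]
    mean_y = sum(ys) / n
    rss = sum((y - f) ** 2 for y, f in zip(ys, preds))  # residual sum of squares
    tss = sum((y - mean_y) ** 2 for y in ys)            # total sum of squares
    r2 = 1 - rss / tss
    dof = n - p - 1
    return {
        "rsquared": r2,
        "adjusted_rsquared": 1 - (1 - r2) * (n - 1) / dof,
        # Assumption: Gaussian AIC with p + 2 estimated parameters
        # (coefficients, intercept, and error variance).
        "aic": n * (math.log(2 * math.pi) + math.log(rss / n) + 1) + 2 * (p + 2),
        "degree_of_freedom": dof,
        "standardErr_residual": math.sqrt(rss / dof),
        "deviance": rss,
    }


coef = fit_ols(ROWS)
print([round(c, 4) for c in coef])  # [-6.4238, 10.2601, 0.3537]
for name, value in goodness(ROWS, coef).items():
    print(name, value)
```

The order of the returned coefficients is intercept, x1, x2. Multiplying them into a row of the input table gives the prediction_score for that row, for example -6.4238 + 10.2601 × 1.84 + 0.3537 × 1 ≈ 12.81 for the first row.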