Platform for AI: PS Linear Regression

Last Updated: Jun 06, 2024

Linear regression is a model that is used to analyze the linear relationship between a dependent variable and multiple independent variables. Parameter Servers (PSs) are used in large-scale online and offline training tasks. The PS Linear Regression component can support large-scale linear training tasks for hundreds of billions of samples and billions of features.
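As a rough illustration of what such a model computes, the following minimal Python sketch scores one sample as a bias plus a weighted sum of its feature values. The function name `predict`, the weights, and the sample are hypothetical and are not taken from the component; features are keyed by feature ID so that only non-zero entries need to be stored.

```python
from typing import Dict


def predict(features: Dict[int, float], weights: Dict[int, float], bias: float = 0.0) -> float:
    """Return bias + sum of weight * value over the stored features of one sample."""
    return bias + sum(weights.get(fid, 0.0) * value for fid, value in features.items())


# Hypothetical weights and one sparse sample, both keyed by feature ID.
example_weights = {1: 0.5, 2: -1.0, 4: 0.25}
example_sample = {1: 2.0, 4: 4.0, 5: 3.0}  # feature 5 has no weight, so it contributes 0

print(predict(example_sample, example_weights, bias=0.1))
```

Training amounts to choosing the weights and bias so that these scores are close to the label values in the input table.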

Component parameters

You can use one of the following methods to configure the PS Linear Regression component of Platform for AI (PAI).

Method 1: Configure the component in the PAI console

You can configure the parameters of the PS Linear Regression component in Machine Learning Designer. The following table describes the parameters.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Feature Columns | The feature columns that are selected from the input table for training. |
| Fields Setting | Label Column | The label column that is selected from the input table. Columns of the DOUBLE and BIGINT types are supported. |
| Fields Setting | Use Sparse Format | Specifies whether the input data is in the sparse format. Input data in the sparse format is represented as key-value pairs. |
| Fields Setting | KV Pair Delimiter | The delimiter that is used to separate key-value pairs. By default, spaces are used. |
| Fields Setting | KV Delimiter | The delimiter that is used to separate keys and values if the input table is a sparse table. By default, colons (:) are used. |
| Parameters Setting | L1 weight | The L1 regularization coefficient. A larger value results in a model with fewer non-zero elements. If overfitting occurs, increase the parameter value. |
| Parameters Setting | L2 weight | The L2 regularization coefficient. A larger value results in model parameters with smaller absolute values. If overfitting occurs, increase the parameter value. |
| Parameters Setting | Maximum Iterations | The maximum number of iterations performed by the algorithm. If you set this parameter to 0, the number of iterations is unlimited. |
| Parameters Setting | Minimum Convergence Deviance | The termination condition of the algorithm, expressed as a convergence threshold. |
| Parameters Setting | Largest Feature ID | The largest feature ID or feature dimension. The value of this parameter can be greater than the actual value. If you do not specify this parameter, the system automatically runs an SQL task to calculate the largest feature ID or feature dimension. |
| Tuning | Cores | The number of cores. By default, the system determines the value. |
| Tuning | Memory Size per Core | The memory size of each core. By default, the system determines the value. |
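The effect of the L1 and L2 weights described above can be illustrated with a small Python sketch. It applies the standard proximal (shrinkage) steps for an L1 penalty and for an L2 penalty of the form (l2 / 2) * w^2 to a hypothetical weight vector. This is a generic illustration of regularization, not the optimizer that the PS Linear Regression component uses, and all names and values in it are assumptions.

```python
from typing import List


def soft_threshold(w: float, l1: float) -> float:
    """Proximal step for an L1 penalty: move w toward zero by l1, clamping at zero."""
    if w > l1:
        return w - l1
    if w < -l1:
        return w + l1
    return 0.0


def l2_shrink(w: float, l2: float) -> float:
    """Proximal step for a penalty of (l2 / 2) * w**2: scale w toward zero."""
    return w / (1.0 + l2)


def count_nonzero(weights: List[float], l1: float) -> int:
    """Count the weights that remain non-zero after L1 shrinkage."""
    return sum(1 for w in weights if soft_threshold(w, l1) != 0.0)


# Hypothetical weights, chosen only for illustration.
weights = [0.8, -0.3, 0.05, -1.5]
for l1 in (0.0, 0.1, 0.5, 1.0):
    print(f"l1={l1}: non-zero weights = {count_nonzero(weights, l1)}")
print("l2 shrink of 1.0 with l2=1.0:", l2_shrink(1.0, 1.0))
```

With these sample weights, the number of non-zero weights drops from 4 to 1 as the L1 weight grows from 0 to 1. The L2 step never produces exact zeros; it only reduces the magnitude of each weight.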

Method 2: Configure the component by using PAI commands

You can use the SQL Script component to run PAI commands. For more information, see SQL Script. The following sample commands train a model and then run a prediction. The table after the commands describes the parameters of the training command.

-- Training
PAI -name ps_linearregression
    -project algo_public
    -DinputTableName="lm_test_input"
    -DmodelName="linear_regression_model"
    -DlabelColName="label"
    -DfeatureColNames="features"
    -Dl1Weight=1.0
    -Dl2Weight=0.0
    -DmaxIter=100
    -Depsilon=1e-6
    -DenableSparse=true;
-- Prediction: drop the output table first so that the command can recreate it
drop table if exists linear_regression_predict;
PAI -name prediction
    -DmodelName="linear_regression_model"
    -DoutputTableName="linear_regression_predict"
    -DinputTableName="lm_test_input"
    -DappendColNames="label,features"
    -DfeatureColNames="features"
    -DenableSparse=true;

| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input table. | N/A |
| modelName | Yes | The name of the output model. | N/A |
| outputTableName | No | The name of the output model evaluation table. This parameter is required if you set the enableFitGoodness parameter to true. | N/A |
| labelColName | Yes | The label column that is selected from the input table. Columns of the DOUBLE and BIGINT types are supported. | N/A |
| featureColNames | Yes | The feature columns that are selected from the input table for training. If data in the input table is in the dense format, columns of the DOUBLE and BIGINT types are supported. If data in the input table is in the sparse format, only columns of the STRING type are supported. | N/A |
| inputTablePartitions | No | The partitions that are selected from the input table for training. | N/A |
| enableSparse | No | Specifies whether the input data is in the sparse format. Valid values: true and false. | false |
| itemDelimiter | No | The delimiter that is used to separate key-value pairs. This parameter is valid only if you set the enableSparse parameter to true. | Space |
| kvDelimiter | No | The delimiter that is used to separate keys and values. This parameter is valid only if you set the enableSparse parameter to true. | Colon (:) |
| enableModelIo | No | Specifies whether the model is generated as an offline model. If you set the enableModelIo parameter to false, the model is generated in a MaxCompute table. Valid values: true and false. | true |
| maxIter | No | The maximum number of iterations performed by the algorithm. The value of this parameter must be a non-negative integer. | 100 |
| epsilon | No | The termination condition of the algorithm, expressed as a convergence threshold. Valid values: [0,1]. | 0.000001 |
| l1Weight | No | The L1 regularization coefficient. A larger value results in a model with fewer non-zero elements. If overfitting occurs, increase the parameter value. | 1.0 |
| l2Weight | No | The L2 regularization coefficient. A larger value results in model parameters with smaller absolute values. If overfitting occurs, increase the parameter value. | 0 |
| modelSize | No | The largest feature ID or feature dimension. The value of this parameter can be greater than the actual value. If you do not specify this parameter, the system automatically runs an SQL task to calculate the largest feature ID or feature dimension. The value of this parameter must be a non-negative integer. | 0 |
| coreNum | No | The number of cores used in computing. | Specified by the system |
| memSizePerCore | No | The memory size of each core. Unit: MB. | Specified by the system |
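If you generate the training command from a script, the constraints in the table above can be checked before the command is submitted. The following Python sketch is one possible way to do this. The helper names `validate_params`, `format_value`, and `build_train_command` are hypothetical and are not part of PAI, and the trailing semicolon assumes that the command is run as a statement in an SQL script.

```python
from typing import Dict, Union

ParamValue = Union[str, int, float, bool]
REQUIRED = ("inputTableName", "modelName", "labelColName", "featureColNames")


def _is_non_negative_int(value: object) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value >= 0


def validate_params(params: Dict[str, ParamValue]) -> None:
    """Check the required parameters and value ranges listed in the parameter table."""
    for name in REQUIRED:
        if name not in params:
            raise ValueError(f"missing required parameter: {name}")
    for name in ("maxIter", "modelSize"):
        if name in params and not _is_non_negative_int(params[name]):
            raise ValueError(f"{name} must be a non-negative integer")
    if "epsilon" in params and not 0.0 <= float(params["epsilon"]) <= 1.0:
        raise ValueError("epsilon must be in the range [0, 1]")


def format_value(value: ParamValue) -> str:
    """Render a Python value the way the sample commands write it."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, str):
        return f'"{value}"'
    return str(value)


def build_train_command(params: Dict[str, ParamValue]) -> str:
    """Return the text of a ps_linearregression training command (hypothetical helper)."""
    validate_params(params)
    lines = ["PAI -name ps_linearregression", "    -project algo_public"]
    lines += [f"    -D{name}={format_value(value)}" for name, value in params.items()]
    return "\n".join(lines) + ";"  # assumption: statements end with a semicolon


print(build_train_command({
    "inputTableName": "lm_test_input",
    "modelName": "linear_regression_model",
    "labelColName": "label",
    "featureColNames": "features",
    "maxIter": 100,
    "enableSparse": True,
}))
```

The sketch only builds the command text. It does not submit anything to MaxCompute, and it checks only the constraints that the table states explicitly.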

Examples

  1. Execute the following SQL statements to generate input data by using the SQL Script component. In this example, input data in the key-value format is generated.

    drop table if exists lm_test_input;
    create table lm_test_input as
    select
    *
    from
    (
    select cast(2 as BIGINT) as label, '1:0.55 2:-0.15 3:0.82 4:-0.99 5:0.17' as features
        union all
    select cast(1 as BIGINT) as label, '1:-1.26 2:1.36 3:-0.13 4:-2.82 5:-0.41' as features
        union all
    select cast(1 as BIGINT) as label, '1:-0.77 2:0.91 3:-0.23 4:-4.46 5:0.91' as features
        union all
    select cast(2 as BIGINT) as label, '1:0.86 2:-0.22 3:-0.46 4:0.08 5:-0.60' as features
        union all
    select cast(1 as BIGINT) as label, '1:-0.76 2:0.89 3:1.02 4:-0.78 5:-0.86' as features
        union all
    select cast(1 as BIGINT) as label, '1:2.22 2:-0.46 3:0.49 4:0.31 5:-1.84' as features
        union all
    select cast(0 as BIGINT) as label, '1:-1.21 2:0.09 3:0.23 4:2.04 5:0.30' as features
        union all
    select cast(1 as BIGINT) as label, '1:2.17 2:-0.45 3:-1.22 4:-0.48 5:-1.41' as features
        union all
    select cast(0 as BIGINT) as label, '1:-0.40 2:0.63 3:0.56 4:0.74 5:-1.44' as features
        union all
    select cast(1 as BIGINT) as label, '1:0.17 2:0.49 3:-1.50 4:-2.20 5:-0.35' as features
    ) tmp;

    The generated input data is shown in the following figure.

    Note

    If the input data is in the key-value format, the feature IDs must be positive integers, and the feature values must be real numbers. If the data type of the feature IDs is STRING, you must use the serialization component to serialize the input data. If the feature values are categorical strings, you must perform feature discretization to process the features.

  2. Create a pipeline as shown in the following figure. For more information, see Algorithm modeling.

  3. Configure the component parameters.

    1. On the canvas, click the Read Table-1 component. On the Select Table tab in the right pane, set Table Name to lm_test_input.

    2. Configure the parameters of the PS Linear Regression component. The following table describes the parameters. Use the default values for other parameters.

      | Tab | Parameter | Description |
      | --- | --- | --- |
      | Fields Setting | Use Sparse Format | Set the parameter to true. |
      | Fields Setting | Feature Columns | Select the features column. |
      | Fields Setting | Label Column | Select the label column. |
      | Tuning | Cores | Set the parameter to 3. |
      | Tuning | Memory Size per Core | Set the parameter to 1024. Unit: MB. |

    3. Configure the parameters listed in the following table for the Prediction component. Retain the default values of the parameters that are not listed in the table.

      | Tab | Parameter | Description |
      | --- | --- | --- |
      | Fields Setting | Feature Columns | Select the features column. |
      | Fields Setting | Reserved Columns | Select the label and features columns. |
      | Fields Setting | Sparse Matrix | Select Sparse Matrix. |
      | Fields Setting | KV Delimiter | Set the value to a colon (:). |
      | Fields Setting | KV Pair Delimiter | Leave this parameter empty, which specifies that a space is used as the delimiter. |

  4. Click the run icon on the canvas to run the pipeline.

  5. After you run the pipeline, right-click the Prediction-1 component and choose View Data > Prediction Result Output Port.

    Figure: PS Linear Regression prediction results
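The note in step 1 requires that feature IDs in key-value data are positive integers and that feature values are real numbers. Before you upload data, a quick local check along the following lines may help. This Python sketch is an assumption about how such a check could look: the function names `parse_kv_row` and `largest_feature_id` are hypothetical, and the default delimiters match the component defaults (a space between pairs, a colon between key and value). The largest feature ID that it reports corresponds to the value you could pass as the Largest Feature ID (modelSize) parameter.

```python
from typing import Dict, Iterable


def parse_kv_row(row: str, item_delimiter: str = " ", kv_delimiter: str = ":") -> Dict[int, float]:
    """Parse one sparse row such as '1:0.55 2:-0.15' into {feature_id: value}."""
    features: Dict[int, float] = {}
    for item in row.split(item_delimiter):
        if not item:  # skip empty pieces caused by repeated delimiters
            continue
        key, sep, value = item.partition(kv_delimiter)
        if not sep:
            raise ValueError(f"missing key-value delimiter in {item!r}")
        feature_id = int(key)  # raises ValueError for non-integer IDs
        if feature_id <= 0:
            raise ValueError(f"feature ID must be a positive integer: {key!r}")
        features[feature_id] = float(value)  # raises ValueError for non-numeric values
    return features


def largest_feature_id(rows: Iterable[str]) -> int:
    """Return the largest feature ID across all rows, or 0 if there are no features."""
    return max((max(parse_kv_row(row), default=0) for row in rows), default=0)


# Two rows copied from the example input table.
rows = [
    "1:0.55 2:-0.15 3:0.82 4:-0.99 5:0.17",
    "1:-1.26 2:1.36 3:-0.13 4:-2.82 5:-0.41",
]
print(parse_kv_row(rows[0]))
print(largest_feature_id(rows))
```

Rows with string feature IDs or categorical values fail this check with a `ValueError`, which signals that the serialization or feature discretization step mentioned in the note is still needed.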