Linear regression is a model that analyzes the linear relationship between a dependent variable and multiple independent variables. The parameter server (PS) architecture is used for large-scale online and offline training tasks. The PS Linear Regression component supports large-scale linear training tasks that involve hundreds of billions of samples and billions of features.
Component parameters
You can use one of the following methods to configure the PS Linear Regression component of Platform for AI (PAI).
Method 1: Configure the component in the PAI console
You can configure the parameters of the PS Linear Regression component in Machine Learning Designer. The following table describes the parameters.
Tab | Parameter | Description |
Fields Setting | Feature Columns | The feature columns that are selected from the input table for training. |
Fields Setting | Label Column | The label column that is selected from the input table. Columns of the DOUBLE and BIGINT types are supported. |
Fields Setting | Use Sparse Format | Specifies whether the input data is in the sparse format. Input data in the sparse format is represented as key-value pairs. |
Fields Setting | KV Pair Delimiter | The delimiter that is used to separate key-value pairs. By default, a space is used. |
Fields Setting | KV Delimiter | The delimiter that is used to separate keys and values if the input table is a sparse table. By default, a colon (:) is used. |
Parameters Setting | L1 weight | The L1 regularization coefficient. A larger value specifies that the model has fewer non-zero elements. If overfitting occurs, increase the parameter value. |
Parameters Setting | L2 weight | The L2 regularization coefficient. A larger value specifies that the absolute values of the model parameters are smaller. If overfitting occurs, increase the parameter value. |
Parameters Setting | Maximum Iterations | The maximum number of iterations performed by the algorithm. If you set this parameter to 0, the number of iterations is unlimited. |
Parameters Setting | Minimum Convergence Deviance | The condition for algorithm termination. |
Parameters Setting | Largest Feature ID | The largest feature ID or feature dimension. The value of this parameter can be greater than the actual value. If you do not specify this parameter, the system automatically runs an SQL task to calculate the largest feature ID or feature dimension. |
Tuning | Cores | The number of cores. By default, the system determines the value. |
Tuning | Memory Size per Core | The memory size of each core. By default, the system determines the value. |
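To make the effect of the L1 weight and L2 weight parameters more concrete, the following minimal Python sketch solves a one-dimensional penalized least-squares problem in closed form. It is an illustration only, not the component's implementation. The objective 0.5*(w - z)^2 + l1*|w| + 0.5*l2*w^2 is an assumption, and the function names soft_threshold and penalized_weight are hypothetical names chosen for this example.

```python
def soft_threshold(z: float, l1: float) -> float:
    """Shrink z toward zero by l1; values inside [-l1, l1] become exactly 0."""
    if z > l1:
        return z - l1
    if z < -l1:
        return z + l1
    return 0.0


def penalized_weight(z: float, l1: float = 0.0, l2: float = 0.0) -> float:
    """Minimizer of 0.5*(w - z)**2 + l1*abs(w) + 0.5*l2*w**2.

    This objective is an assumed, simplified stand-in for a regularized
    regression loss; z plays the role of the unpenalized coefficient.
    """
    return soft_threshold(z, l1) / (1.0 + l2)


# Hypothetical unpenalized coefficients, used only for illustration.
unpenalized = [0.8, -0.3, 0.05, -1.5]

# L1 only: small coefficients are set to exactly zero.
l1_only = [penalized_weight(z, l1=0.5) for z in unpenalized]

# L2 only: every coefficient shrinks, but none becomes zero.
l2_only = [penalized_weight(z, l2=1.0) for z in unpenalized]

print(l1_only)
print(l2_only)
```

In this sketch, an L1 weight of 0.5 removes the two coefficients whose magnitude is below 0.5, while an L2 weight of 1.0 halves every coefficient and keeps all of them non-zero. This matches the qualitative behavior that the parameter descriptions state.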
Method 2: Configure the component by using PAI commands
You can use the SQL Script component to run PAI commands. For more information, see SQL Script. The following sample commands train a model and then generate predictions. The table that follows the commands describes the parameters that are used in PAI commands.
# Training
PAI -name ps_linearregression
-project algo_public
-DinputTableName="lm_test_input"
-DmodelName="linear_regression_model"
-DlabelColName="label"
-DfeatureColNames="features"
-Dl1Weight=1.0
-Dl2Weight=0.0
-DmaxIter=100
-Depsilon=1e-6
-DenableSparse=true
# Prediction
drop table if exists linear_regression_predict;
PAI -name prediction
-DmodelName="linear_regression_model"
-DoutputTableName="linear_regression_predict"
-DinputTableName="lm_test_input"
-DappendColNames="label,features"
-DfeatureColNames="features"
-DenableSparse=true
Parameter | Required | Description | Default value |
inputTableName | Yes | The name of the input table. | N/A |
modelName | Yes | The name of the output model. | N/A |
outputTableName | No | The name of the output model evaluation table. This parameter is required if you set the enableFitGoodness parameter to true. | N/A |
labelColName | Yes | The label column that is selected from the input table. Columns of the DOUBLE and BIGINT types are supported. | N/A |
featureColNames | Yes | The feature columns that are selected from the input table for training. If data in the input table is in the dense format, columns of the DOUBLE and BIGINT types are supported. If data in the input table is in the sparse format, only columns of the STRING type are supported. | N/A |
inputTablePartitions | No | The partitions that are selected from the input table for training. | N/A |
enableSparse | No | Specifies whether the input data is in the sparse format. Valid values: true and false. | false |
itemDelimiter | No | The delimiter that is used to separate key-value pairs. This parameter is valid only if you set the enableSparse parameter to true. | Space |
kvDelimiter | No | The delimiter that is used to separate keys and values. This parameter is valid only if you set the enableSparse parameter to true. | Colon (:) |
enableModelIo | No | Specifies whether the model is generated as an offline model. If you set the enableModelIo parameter to false, the model is generated in a MaxCompute table. Valid values: true and false. | true |
maxIter | No | The maximum number of iterations performed by the algorithm. The value of this parameter must be a non-negative integer. | 100 |
epsilon | No | The conditions for algorithm termination. Valid values: [0,1]. | 0.000001 |
l1Weight | No | The L1 regularization coefficient. A larger value specifies that the model has fewer non-zero elements. If overfitting occurs, increase the parameter value. | 1.0 |
l2Weight | No | The L2 regularization coefficient. A larger value specifies that the absolute values of the model parameters are smaller. If overfitting occurs, increase the parameter value. | 0 |
modelSize | No | The largest feature ID or feature dimension. The value of this parameter can be greater than the actual value. If you do not specify this parameter, the system automatically runs an SQL task to calculate the largest feature ID or feature dimension. The value of this parameter must be a non-negative integer. | 0 |
coreNum | No | The number of cores used in computing. | Specified by the system |
memSizePerCore | No | The memory size of each core. Unit: MB. | Specified by the system |
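When many parameter combinations are tried, it can be convenient to assemble the command text programmatically. The following Python sketch renders a command string in the same shape as the sample training command in this topic. The helper build_pai_command is hypothetical and is not part of PAI. It only produces text, which you would still paste into or submit through the SQL Script component.

```python
def build_pai_command(name: str, project: str, params: dict) -> str:
    """Render a PAI command string from a parameter dictionary.

    Strings are double-quoted, booleans become true/false, and numbers are
    written as-is, mirroring the sample commands in this topic.
    """
    parts = [f"PAI -name {name}", f"-project {project}"]
    for key, value in params.items():
        if isinstance(value, bool):       # check bool first: bool is a subclass of int
            rendered = "true" if value else "false"
        elif isinstance(value, str):
            rendered = f'"{value}"'
        else:
            rendered = str(value)
        parts.append(f"-D{key}={rendered}")
    # One parameter per line, indented under the first line.
    return "\n    ".join(parts)


train_command = build_pai_command(
    "ps_linearregression",
    "algo_public",
    {
        "inputTableName": "lm_test_input",
        "modelName": "linear_regression_model",
        "labelColName": "label",
        "featureColNames": "features",
        "l1Weight": 1.0,
        "l2Weight": 0.0,
        "maxIter": 100,
        "enableSparse": True,
    },
)
print(train_command)
```

The sketch omits parameters that are left at their default values, such as epsilon, and does no validation of parameter names or value ranges.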
Examples
Execute the following SQL statements to generate input data by using the SQL Script component. In this example, input data in the key-value format is generated.
drop table if exists lm_test_input;
create table lm_test_input as
select * from
(
  select cast(2 as BIGINT) as label, '1:0.55 2:-0.15 3:0.82 4:-0.99 5:0.17' as features
  union all
  select cast(1 as BIGINT) as label, '1:-1.26 2:1.36 3:-0.13 4:-2.82 5:-0.41' as features
  union all
  select cast(1 as BIGINT) as label, '1:-0.77 2:0.91 3:-0.23 4:-4.46 5:0.91' as features
  union all
  select cast(2 as BIGINT) as label, '1:0.86 2:-0.22 3:-0.46 4:0.08 5:-0.60' as features
  union all
  select cast(1 as BIGINT) as label, '1:-0.76 2:0.89 3:1.02 4:-0.78 5:-0.86' as features
  union all
  select cast(1 as BIGINT) as label, '1:2.22 2:-0.46 3:0.49 4:0.31 5:-1.84' as features
  union all
  select cast(0 as BIGINT) as label, '1:-1.21 2:0.09 3:0.23 4:2.04 5:0.30' as features
  union all
  select cast(1 as BIGINT) as label, '1:2.17 2:-0.45 3:-1.22 4:-0.48 5:-1.41' as features
  union all
  select cast(0 as BIGINT) as label, '1:-0.40 2:0.63 3:0.56 4:0.74 5:-1.44' as features
  union all
  select cast(1 as BIGINT) as label, '1:0.17 2:0.49 3:-1.50 4:-2.20 5:-0.35' as features
) tmp;
The generated input data is shown in the following figure.
Note: If the input data is in the key-value format, the feature IDs must be positive integers, and the feature values must be real numbers. If the data type of the feature IDs is STRING, you must use the serialization component to serialize the input data. If the feature values are categorical strings, you must perform feature discretization to process the features.
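The key-value constraints in the note can be checked locally before data is uploaded. The following Python sketch parses a feature string with the default delimiters (a space between pairs and a colon between key and value), rejects feature IDs that are not positive integers and values that are not numeric, and reports the largest feature ID. The function names parse_sparse_row and largest_feature_id are hypothetical, and the sketch is independent of how the component itself reads data.

```python
def parse_sparse_row(row: str, item_delimiter: str = " ", kv_delimiter: str = ":") -> dict:
    """Parse one key-value feature string into {feature_id: value}."""
    features = {}
    for item in row.split(item_delimiter):
        if not item:                          # skip empty pieces from repeated delimiters
            continue
        key, sep, value = item.partition(kv_delimiter)
        if not sep:
            raise ValueError(f"missing {kv_delimiter!r} in item {item!r}")
        feature_id = int(key)                 # raises ValueError for non-integer IDs
        if feature_id <= 0:
            raise ValueError(f"feature ID must be a positive integer: {key!r}")
        features[feature_id] = float(value)   # raises ValueError for non-numeric values
    return features


def largest_feature_id(rows) -> int:
    """Return the largest feature ID that appears in any row."""
    return max(max(parse_sparse_row(row)) for row in rows)


# Two rows copied from the sample input table in this topic.
rows = [
    "1:0.55 2:-0.15 3:0.82 4:-0.99 5:0.17",
    "1:-1.26 2:1.36 3:-0.13 4:-2.82 5:-0.41",
]
parsed = parse_sparse_row(rows[0])
max_id = largest_feature_id(rows)
print(parsed)
print(max_id)
```

For the sample rows, the largest feature ID is 5, which is the minimum that the Largest Feature ID (modelSize) parameter must cover if you set it manually.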
Create a pipeline as shown in the following figure. For more information, see Algorithm modeling.
Configure the component parameters.
On the canvas, click the Read Table-1 component. On the Select Table tab in the right pane, set Table Name to lm_test_input.
Configure the parameters of the PS Linear Regression component. The following table describes the parameters. Use the default values for other parameters.
Tab | Parameter | Description |
Fields Setting | Use Sparse Format | Set the parameter to true. |
Fields Setting | Feature Columns | Select the features column. |
Fields Setting | Label Column | Select the label column. |
Tuning | Cores | Set the parameter to 3. |
Tuning | Memory Size per Core | Set the parameter to 1024. Unit: MB. |
Configure the parameters listed in the following table for the Prediction component. Retain the default values of the parameters that are not listed in the table.
Tab | Parameter | Description |
Fields Setting | Feature Columns | Select the features column. |
Fields Setting | Reserved Columns | Select the label and features columns. |
Fields Setting | Sparse Matrix | Select Sparse Matrix. |
Fields Setting | KV Delimiter | Set the value to a colon (:). |
Fields Setting | KV Pair Delimiter | Leave this parameter empty, which specifies that a space is used as the delimiter. |
Click the icon on the canvas to run the pipeline.
After you run the pipeline, right-click the Prediction-1 component and choose
.