PS Linear Regression

Linear regression models the linear relationship between a dependent variable and multiple independent variables. A parameter server (PS) is designed for large-scale online and offline training tasks. The PS Linear Regression component supports large-scale linear training tasks with hundreds of billions of samples and billions of features.
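
To make the linear relationship concrete, the following minimal sketch computes a prediction as a bias plus a weighted sum of feature values. The function name, the weights, and the bias are hypothetical values chosen for illustration. They are not output of the PS Linear Regression component.

```python
def predict(weights, bias, features):
    """Return bias + sum(weights[j] * x_j) over the features that are present.

    weights and features are dicts keyed by feature ID, so features that are
    absent from a sparse row contribute nothing to the sum.
    """
    return bias + sum(weights.get(j, 0.0) * x for j, x in features.items())


# Hypothetical coefficients and one hypothetical sparse sample.
weights = {1: 2.0, 3: -1.0}
bias = 0.5
features = {1: 1.0, 2: 4.0, 3: 3.0}

y_hat = predict(weights, bias, features)
print(y_hat)
```

Feature 2 has no learned weight in this sketch, so it is treated as having a coefficient of zero.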

Component parameters

You can use one of the following methods to configure the PS Linear Regression component of Platform for AI (PAI).

Method 1: Configure the component in the PAI console

You can configure the parameters of the PS Linear Regression component in Machine Learning Designer. The following table describes the parameters.

Tab	Parameter	Description
Fields Setting	Feature Columns	The feature columns that are selected from the input table for training.
	Label Column	The label column that is selected from the input table. Columns of the DOUBLE and BIGINT types are supported.
	Use Sparse Format	Specifies whether the input data is in the sparse format. Input data in the sparse format is displayed as key-value pairs.
	KV Pair Delimiter	The delimiter that is used to separate key-value pairs. By default, spaces are used.
	KV Delimiter	The delimiter that is used to separate keys and values if the input table is a sparse table. By default, colons (:) are used.
Parameters Setting	L1 weight	The L1 regularization coefficient. A larger value results in a model with fewer non-zero elements. If overfitting occurs, increase the parameter value.
	L2 weight	The L2 regularization coefficient. A larger value results in model parameters with smaller absolute values. If overfitting occurs, increase the parameter value.
	Maximum Iterations	The maximum number of iterations performed by the algorithm. If you set this parameter to 0, the number of iterations is unlimited.
	Minimum Convergence Deviance	The convergence threshold that is used as the condition for terminating the algorithm.
	Largest Feature ID	The largest feature ID or feature dimension. The value of this parameter can be greater than the actual value. If you do not specify this parameter, the system automatically runs an SQL task to calculate the largest feature ID or feature dimension.
Tuning	Cores	The number of cores. By default, the system determines the value.
	Memory Size per Core	The memory size of each core. By default, the system determines the value.
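
The sparse format and its two delimiters can be illustrated with a short parser. This is a minimal sketch, assuming the default delimiters described in the table (a space between key-value pairs and a colon between a key and its value). The function name is hypothetical, and the component's own parsing logic is not documented here.

```python
def parse_sparse_row(row, item_delimiter=" ", kv_delimiter=":"):
    """Parse one sparse row such as '1:0.55 2:-0.15' into {feature_id: value}."""
    features = {}
    for pair in row.split(item_delimiter):
        if not pair:
            continue  # skip empty pieces produced by repeated delimiters
        key, value = pair.split(kv_delimiter, 1)
        features[int(key)] = float(value)
    return features


row = "1:0.55 2:-0.15 3:0.82 4:-0.99 5:0.17"
parsed = parse_sparse_row(row)
print(parsed)
```

Passing other delimiters, for example `parse_sparse_row("7=1.5;9=2", ";", "=")`, mirrors what the KV Pair Delimiter and KV Delimiter parameters control.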

Method 2: Configure the component by using PAI commands

You can use the SQL Script component to run PAI commands. For more information, see SQL Script. The following sample commands train a model and then run a prediction. The table after the commands describes the parameters.

-- Training
PAI -name ps_linearregression
    -project algo_public
    -DinputTableName="lm_test_input"
    -DmodelName="linear_regression_model"
    -DlabelColName="label"
    -DfeatureColNames="features"
    -Dl1Weight=1.0
    -Dl2Weight=0.0
    -DmaxIter=100
    -Depsilon=1e-6
    -DenableSparse=true;
-- Prediction
drop table if exists linear_regression_predict;
PAI -name prediction
    -DmodelName="linear_regression_model"
    -DoutputTableName="linear_regression_predict"
    -DinputTableName="lm_test_input"
    -DappendColNames="label,features"
    -DfeatureColNames="features"
    -DenableSparse=true;
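
If you generate such commands from a script, a small helper can assemble the command text from a parameter dictionary. The following is a hedged sketch: `build_pai_command` is a hypothetical helper that only formats text in the style of the sample commands. It does not call any PAI or MaxCompute API, and the generated text still has to be submitted through the SQL Script component or another client.

```python
def build_pai_command(name, params, project="algo_public"):
    """Format a PAI command string from a dict of -D parameters.

    Strings are wrapped in double quotes, booleans become true/false, and
    numbers are written as-is, mirroring the sample commands.
    """
    lines = [f"PAI -name {name}", f"    -project {project}"]
    for key, value in params.items():
        if isinstance(value, bool):
            rendered = "true" if value else "false"
        elif isinstance(value, (int, float)):
            rendered = str(value)
        else:
            rendered = f'"{value}"'
        lines.append(f"    -D{key}={rendered}")
    return "\n".join(lines) + ";"


command = build_pai_command(
    "ps_linearregression",
    {
        "inputTableName": "lm_test_input",
        "modelName": "linear_regression_model",
        "labelColName": "label",
        "featureColNames": "features",
        "l1Weight": 1.0,
        "maxIter": 100,
        "enableSparse": True,
    },
)
print(command)
```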

Parameter	Required	Description	Default value
inputTableName	Yes	The name of the input table.	N/A
modelName	Yes	The name of the output model.	N/A
outputTableName	No	The name of the output model evaluation table. This parameter is required if you set the enableFitGoodness parameter to true.	N/A
labelColName	Yes	The label column that is selected from the input table. Columns of the DOUBLE and BIGINT types are supported.	N/A
featureColNames	Yes	The feature columns that are selected from the input table for training. If data in the input table is in the dense format, columns of the DOUBLE and BIGINT types are supported. If data in the input table is in the sparse format, only columns of the STRING type are supported.	N/A
inputTablePartitions	No	The partitions that are selected from the input table for training.	N/A
enableSparse	No	Specifies whether the input data is in the sparse format. Valid values: true and false.	false
itemDelimiter	No	The delimiter that is used to separate key-value pairs. This parameter is valid only if you set the enableSparse parameter to true.	Space
kvDelimiter	No	The delimiter that is used to separate keys and values. This parameter is valid only if you set the enableSparse parameter to true.	Colon (:)
enableModelIo	No	Specifies whether the model is generated as an offline model. If you set the enableModelIo parameter to false, the model is generated in a MaxCompute table. Valid values: true and false.	true
maxIter	No	The maximum number of iterations performed by the algorithm. The value of this parameter must be a non-negative integer. If you set this parameter to 0, the number of iterations is unlimited.	100
epsilon	No	The convergence threshold that is used as the condition for terminating the algorithm. Valid values: [0,1].	0.000001
l1Weight	No	The L1 regularization coefficient. A larger value results in a model with fewer non-zero elements. If overfitting occurs, increase the parameter value.	1.0
l2Weight	No	The L2 regularization coefficient. A larger value results in model parameters with smaller absolute values. If overfitting occurs, increase the parameter value.	0
modelSize	No	The largest feature ID or feature dimension. The value of this parameter can be greater than the actual value. If you do not specify this parameter, the system automatically runs an SQL task to calculate the largest feature ID or feature dimension. The value of this parameter must be a non-negative integer.	0
coreNum	No	The number of cores used in computing.	Specified by the system
memSizePerCore	No	The memory size of each core. Unit: MB.	Specified by the system
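
To show how l1Weight, l2Weight, maxIter, and epsilon interact, the following sketch trains a small dense model with proximal gradient descent. This is an illustration under stated assumptions, not the optimizer that the PS Linear Regression component uses: the objective (mean squared error plus L1 and L2 penalties), the stopping rule (absolute change of the objective below epsilon), the `step` size, and all function names are assumptions made for this example.

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the proximal step for L1)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def objective(rows, labels, w, b, l1_weight, l2_weight):
    """Assumed objective: MSE/2 + l1 * |w|_1 + (l2 / 2) * |w|_2^2."""
    n = len(rows)
    error = sum((b + dot(w, x) - y) ** 2 for x, y in zip(rows, labels)) / (2 * n)
    l1 = l1_weight * sum(abs(wj) for wj in w)
    l2 = 0.5 * l2_weight * sum(wj * wj for wj in w)
    return error + l1 + l2


def train(rows, labels, l1_weight=1.0, l2_weight=0.0,
          max_iter=100, epsilon=1e-6, step=0.1):
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    previous = objective(rows, labels, w, b, l1_weight, l2_weight)
    iteration = 0
    # max_iter == 0 means no iteration limit, as described for maxIter.
    while max_iter == 0 or iteration < max_iter:
        iteration += 1
        residuals = [b + dot(w, x) - y for x, y in zip(rows, labels)]
        grad_b = sum(residuals) / n
        for j in range(d):
            grad_j = sum(r * x[j] for r, x in zip(residuals, rows)) / n
            grad_j += l2_weight * w[j]
            # Gradient step, then L1 shrinkage that can set w[j] to exactly 0.
            w[j] = soft_threshold(w[j] - step * grad_j, step * l1_weight)
        b -= step * grad_b
        current = objective(rows, labels, w, b, l1_weight, l2_weight)
        if abs(previous - current) < epsilon:
            break  # converged according to the assumed stopping rule
        previous = current
    return w, b, iteration


rows = [[-1.0, 0.5], [0.0, -0.5], [1.0, 0.5], [2.0, -0.5]]
labels = [-1.0, 1.0, 3.0, 5.0]  # generated as 2 * x0 + 1

w_plain, b_plain, _ = train(rows, labels, l1_weight=0.0,
                            max_iter=5000, epsilon=1e-12)
w_sparse, b_sparse, _ = train(rows, labels, l1_weight=100.0)
print(w_plain, b_plain)
print(w_sparse, b_sparse)
```

Without regularization, the sketch recovers coefficients close to the generating rule. With a very large L1 weight, every coefficient is shrunk to exactly zero, which is the "fewer non-zero elements" effect described for l1Weight.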

Examples

Use the SQL Script component to execute the following SQL statements, which generate input data in the key-value format.

drop table if exists lm_test_input;
create table lm_test_input as
select
*
from
(
select cast(2 as BIGINT) as label, '1:0.55 2:-0.15 3:0.82 4:-0.99 5:0.17' as features
    union all
select cast(1 as BIGINT) as label, '1:-1.26 2:1.36 3:-0.13 4:-2.82 5:-0.41' as features
    union all
select cast(1 as BIGINT) as label, '1:-0.77 2:0.91 3:-0.23 4:-4.46 5:0.91' as features
    union all
select cast(2 as BIGINT) as label, '1:0.86 2:-0.22 3:-0.46 4:0.08 5:-0.60' as features
    union all
select cast(1 as BIGINT) as label, '1:-0.76 2:0.89 3:1.02 4:-0.78 5:-0.86' as features
    union all
select cast(1 as BIGINT) as label, '1:2.22 2:-0.46 3:0.49 4:0.31 5:-1.84' as features
    union all
select cast(0 as BIGINT) as label, '1:-1.21 2:0.09 3:0.23 4:2.04 5:0.30' as features
    union all
select cast(1 as BIGINT) as label, '1:2.17 2:-0.45 3:-1.22 4:-0.48 5:-1.41' as features
    union all
select cast(0 as BIGINT) as label, '1:-0.40 2:0.63 3:0.56 4:0.74 5:-1.44' as features
    union all
select cast(1 as BIGINT) as label, '1:0.17 2:0.49 3:-1.50 4:-2.20 5:-0.35' as features
) tmp;

The generated input data is shown in the following figure.

Note

If the input data is in the key-value format, the feature IDs must be positive integers, and the feature values must be real numbers. If the data type of the feature IDs is STRING, you must use the serialization component to serialize the input data. If the feature values are categorical strings, you must perform feature discretization to process the features.
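
The requirements in the preceding note can be checked before training. The following minimal sketch validates key-value rows and returns the largest feature ID, which is the quantity that the Largest Feature ID (modelSize) parameter describes. The function name is hypothetical, and the check assumes the default space and colon delimiters. It is not part of PAI.

```python
import math


def max_feature_id(rows, item_delimiter=" ", kv_delimiter=":"):
    """Validate key-value rows and return the largest feature ID found.

    Raises ValueError if a feature ID is not a positive integer or if a
    feature value is not a finite real number.
    """
    largest = 0
    for row in rows:
        for pair in row.split(item_delimiter):
            if not pair:
                continue  # ignore empty pieces from repeated delimiters
            key, value = pair.split(kv_delimiter, 1)
            feature_id = int(key)  # fails for string IDs such as 'age'
            if feature_id <= 0:
                raise ValueError(f"feature ID must be positive: {key!r}")
            number = float(value)  # fails for categorical strings
            if not math.isfinite(number):
                raise ValueError(f"feature value must be finite: {value!r}")
            largest = max(largest, feature_id)
    return largest


sample_rows = [
    "1:0.55 2:-0.15 3:0.82 4:-0.99 5:0.17",
    "1:-1.26 2:1.36 3:-0.13 4:-2.82 5:-0.41",
]
largest = max_feature_id(sample_rows)
print(largest)
```

A row such as `"age:31"` or `"0:1.0"` raises `ValueError` in this sketch, which signals that serialization or feature discretization is needed first.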

Create a pipeline as shown in the following figure. For more information, see Algorithm modeling.

Configure the component parameters.

On the canvas, click the Read Table-1 component. On the Select Table tab in the right pane, set Table Name to lm_test_input.

Configure the parameters of the PS Linear Regression component. The following table describes the parameters. Use the default values for other parameters.

Tab	Parameter	Description
Fields Setting	Use Sparse Format	Set the parameter to true.
	Feature Columns	Select the features column.
	Label Column	Select the label column.
Tuning	Cores	Set the parameter to 3.
	Memory Size per Core	Set the parameter to 1024. Unit: MB.

Configure the parameters listed in the following table for the Prediction component. Retain the default values of the parameters that are not listed in the table.

Tab	Parameter	Description
Fields Setting	Feature Columns	Select the features column.
	Reserved Columns	Select the label and features columns.
	Sparse Matrix	Select Sparse Matrix.
	KV Delimiter	Set the value to a colon (:).
	KV Pair Delimiter	Leave this parameter empty, which specifies that a space is used as a delimiter.

Click the Run icon on the canvas to run the pipeline.
After you run the pipeline, right-click the Prediction-1 component and choose View Data > Prediction Result Output Port.