PolarDB:LightGBM algorithm

Last Updated:Jun 06, 2024

This topic describes the light gradient boosting machine (LightGBM) algorithm.

Background information

LightGBM is a distributed gradient boosting framework based on decision tree algorithms. It is designed to be fast, efficient, memory-friendly, and accurate, and it supports parallel, large-scale data processing. LightGBM reduces the amount of memory occupied by data, lowers communication costs, improves the efficiency of multi-node elastic parallel query (ePQ), and achieves near-linear speedup in data computing.

Scenarios

LightGBM is an algorithm framework that includes gradient-boosted decision tree (GBDT) models, random forests, and linear models such as logistic regression. LightGBM is primarily used in scenarios such as binary classification, multiclass classification, and ranking.

For example, most personalized commodity recommendation scenarios require click-through prediction models. Historical user behaviors such as clicks, non-clicks, and purchases can be used as training data to predict the probability that a user clicks or purchases. The following features are extracted from user behaviors and user attributes:

  • Categorical features: string values, such as gender (male or female) or commodity category (clothing, toys, electronics, and so on).

  • Numerical features: integer or floating-point values, such as user activity or commodity price.

Parameters

The following parameters are specified in the model_parameter parameter of the CREATE MODEL statement that is used to create a model. You can configure them based on your business requirements.

boosting_type

The type of the weak learner. Valid values:

  • gbdt (default): uses the gradient-boosted decision tree model.

  • gblinear: uses the linear model.

  • rf: uses the random forest model.

  • dart: uses the dropout technique to drop a subset of trees during training and help prevent overfitting.

  • goss: uses the Gradient-based One-Side Sampling (GOSS) algorithm. This type is fast, but may cause underfitting.

Note

When you specify a value for this parameter, enclose the value in single quotation marks ('). Example: boosting_type='gbdt'
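For instance, a CREATE MODEL statement that sets several of the parameters described below might look like the following sketch. The model name, column names, and table name are placeholders, not values from this topic:

```sql
/*polar4ai*/
CREATE MODEL my_gbm WITH
(model_class = 'lightgbm',
 x_cols = 'feature1,feature2',
 y_cols = 'label',
 -- string-valued parameters are quoted; numeric ones are not
 model_parameter = (boosting_type='gbdt',
                    n_estimators=200,
                    learning_rate=0.05,
                    max_depth=6))
AS (SELECT * FROM my_schema.my_table);
```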

n_estimators

The number of iterations. The value must be an integer. Default value: 100.

loss

The learning task and the learning objectives of the task. Valid values:

  • binary (default): binary classification.

  • regression: regression with the L2 loss (squared error).

  • regression_l1: regression with the L1 loss (absolute error).

  • multiclass: multiclass classification.

num_leaves

The number of leaves. The value must be an integer. Default value: 128.

max_depth

The maximum depth of the tree. The value must be an integer. Default value: 7.

Note

If this parameter is set to -1, the depth of the tree is not limited. Larger values can increase precision but may cause overfitting, so we recommend that you specify this parameter with caution.

learning_rate

The learning rate. The value must be a floating-point number. Default value: 0.06.

max_leaf_nodes

The maximum number of leaf nodes in the tree. The value can be an integer or be left empty. By default, this parameter is left empty, which indicates that the number of leaf nodes in the tree is not limited.

min_samples_leaf

The minimum number of sample leaf nodes in the tree. The value must be an integer. Default value: 20.

Note

If a leaf node contains fewer samples than this value, the node is pruned together with its sibling nodes.

subsample

The ratio of training samples to all samples. The value must be a floating-point number. Valid values: 0 to 1. Default value: 1.

Note

If the value of this parameter is less than 1, only this proportion of the samples is randomly selected for training.

max_features

The ratio of training features to all features. The value must be a floating-point number. Valid values: 0 to 1. Default value: 1.

random_state

The random number seed. The value must be an integer. Default value: 1.

Note

Different seed values can affect how the tree is constructed and how the training data is partitioned.

model_type

The storage type of the model. Valid values:

  • pkl (default): PKL file.

  • pmml: PMML file. This format exposes tree-related information, such as the tree structure.

n_jobs

The number of threads used for training. The value must be an integer. Default value: 4.

Note

Increasing the number of threads can improve training speed.

is_unbalance

Specifies whether to increase the weight of the category with a small number of samples to address sample imbalance. Valid values:

  • False (default): does not increase the weight of the category with a small number of samples.

  • True: increases the weight of the category with a small number of samples.

categorical_feature

The categorical features. The value must be a string array. In most cases, LightGBM infers the data types and configures the categorical_feature parameter automatically. You can also configure this parameter manually.

For example, if the categorical_feature parameter is set to 'AirportTo,DayOfWeek', the AirportTo and DayOfWeek columns are treated as two categorical features.
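Building on the airlines example in the Examples section of this topic, a sketch that explicitly marks two columns as categorical might look like this (the model name airline_gbm_cat is a placeholder):

```sql
/*polar4ai*/
CREATE MODEL airline_gbm_cat WITH
(model_class = 'lightgbm',
 x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
 y_cols = 'Delay',
 -- treat AirportTo and DayOfWeek as categorical rather than numeric
 model_parameter = (boosting_type='gbdt',
                    categorical_feature='AirportTo,DayOfWeek'))
AS (SELECT * FROM db4ai.airlines);
```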

automl

Specifies whether to enable the automatic parameter tuning feature. Valid values:

  • False (default): The automatic parameter tuning feature is disabled.

  • True: The automatic parameter tuning feature is enabled. After the feature is enabled, the early stopping technique is used by default to stop iterating when the objective specified by the loss parameter no longer improves.

automl_train_tag

The tag that identifies rows in the training set.

automl_test_tag

The tag that identifies rows in the test set.

automl_column

The name of the column that the automatic parameter tuning feature uses to distinguish the training set from the test set. You must configure the automl_column and automl_test_tag parameters. The amount of data tagged by automl_train_tag must be 4 to 9 times the amount tagged by automl_test_tag.

Note

After you configure the automl_column parameter, automatic search for the optimal parameter combination is enabled. In this case, you can add the automl_ prefix to parameters such as learning_rate and subsample to specify candidate values for the search. Example:

automl_column='automl_column',automl_train_tag='train',automl_test_tag='test',automl_learning_rate='0.05,0.04,0.03,0.01',automl_subsample='0.6,0.5'

In this example, the optimal parameters are searched from the learning rate candidates 0.05, 0.04, 0.03, and 0.01 and the subsample candidates 0.6 and 0.5.
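Putting this together with the airlines example, an end-to-end automatic tuning statement might be sketched as follows. The model name and the data_split column are assumptions: your training table must contain a column that tags each row as belonging to the training or test set:

```sql
/*polar4ai*/
CREATE MODEL airline_gbm_automl WITH
(model_class = 'lightgbm',
 x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
 y_cols = 'Delay',
 -- data_split is a hypothetical column holding 'train' or 'test' per row
 model_parameter = (automl=True,
                    automl_column='data_split',
                    automl_train_tag='train',
                    automl_test_tag='test',
                    automl_learning_rate='0.05,0.04,0.03,0.01',
                    automl_subsample='0.6,0.5'))
AS (SELECT * FROM db4ai.airlines);
```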

Examples

Create a model and an offline training task.

/*polar4ai*/
CREATE MODEL airline_gbm WITH
(model_class = 'lightgbm',
x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
y_cols='Delay',model_parameter=(boosting_type='gbdt'))
AS (SELECT * FROM db4ai.airlines);

Evaluate the model.

/*polar4ai*/
SELECT Airline FROM EVALUATE(MODEL airline_gbm, 
SELECT * FROM db4ai.airlines LIMIT 20) WITH 
(x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',y_cols='Delay',metrics='acc');

Use the model for prediction.

/*polar4ai*/
SELECT Airline FROM PREDICT(MODEL airline_gbm,
SELECT * FROM db4ai.airlines LIMIT 20) WITH
(x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length');