This topic describes the light gradient boosting machine (LightGBM) algorithm.
Background information
LightGBM is a distributed gradient boosting framework based on decision tree algorithms. LightGBM is designed to be fast, efficient, memory-saving, and accurate, and it supports parallel and large-scale data processing. LightGBM can reduce the amount of memory occupied by data, lower communication costs, improve the efficiency of multi-node elastic parallel query (ePQ), and achieve near-linear speedup in data computing.
Scenarios
LightGBM is an algorithm framework that includes GBDT models, random forests, and logistic regression models. LightGBM is primarily used in scenarios such as binary classification, multiclass classification, and ranking.
For example, most personalized commodity recommendation scenarios require click prediction models. Historical user behaviors, such as clicks, non-clicks, and purchases, can be used as training data to predict the probability that a user clicks or purchases a commodity. The following types of features are extracted from user behavior and user attributes, as illustrated by the sketch after this list:
Categorical features: string values, such as gender (male or female) or commodity category (clothing, toys, or electronics).
Numerical features: integer or floating-point values, such as user activity or commodity prices.
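For concreteness, the following is a minimal sketch of a training table for such a scenario. The user_behavior table and all of its columns are hypothetical illustrations rather than part of any product schema; they only show how categorical and numerical features map to column types.
-- Hypothetical training table for click prediction.
-- gender and category are categorical (string) features;
-- user_activity and price are numerical features;
-- is_click is the label to predict.
CREATE TABLE user_behavior (
    user_id       BIGINT,
    gender        VARCHAR(8),    -- categorical feature: 'male' or 'female'
    category      VARCHAR(32),   -- categorical feature: 'clothing', 'toys', 'electronics'
    user_activity FLOAT,         -- numerical feature
    price         FLOAT,         -- numerical feature
    is_click      INT            -- label: 1 = clicked, 0 = not clicked
);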
Parameters
The values of the parameters described in the following table are the same as those of the model_parameter parameter in the CREATE MODEL statement that is used to create a model. You can configure the parameters based on your business requirements.
Parameter | Description |
boosting_type | The type of the weak learner. Valid values include 'gbdt' (gradient boosting decision tree) and 'rf' (random forest). Note: When you specify a value for this parameter, enclose the value in single quotation marks ('). Example: boosting_type='gbdt'. |
n_estimators | The number of iterations. The value must be an integer. Default value: 100. |
loss | The learning task and the learning objective of the task. Valid values: |
num_leaves | The number of leaves. The value must be an integer. Default value: 128. |
max_depth | The maximum depth of the tree. The value must be an integer. Default value: 7. Note: A value of -1 indicates that the depth of the tree is not limited. A greater value can provide higher precision but may cause overfitting. We recommend that you specify this parameter with caution. |
learning_rate | The learning rate. The value must be a floating-point number. Default value: 0.06. |
max_leaf_nodes | The maximum number of leaf nodes in the tree. The value can be an integer or be left empty. By default, this parameter is left empty, which indicates that the number of leaf nodes in the tree is not limited. |
min_samples_leaf | The minimum number of samples that a leaf node must contain. The value must be an integer. Default value: 20. Note: If a leaf node contains fewer samples than this value, the node is pruned together with its sibling nodes. |
subsample | The ratio of training samples to all samples. The value must be a floating-point number. Valid values: 0 to 1. Default value: 1. Note: If the value is less than 1, only this proportion of the samples is used for training. |
max_features | The ratio of training features to all features. The value must be a floating-point number. Valid values: 0 to 1. Default value: 1. |
random_state | The random number seed. The value must be an integer. Default value: 1. Note: Different values affect how the tree is constructed and how the training data is split. |
model_type | The storage type of the model. Valid values: |
n_jobs | The number of threads used for training. The value must be an integer. Default value: 4. Note: A larger number of threads can accelerate training. |
is_unbalance | Specifies whether to increase the weight of the class that has fewer samples to address sample imbalance. Valid values: true and false. |
categorical_feature | The categorical features. The value must be a string array. In most cases, LightGBM determines the data type and automatically configures the categorical_feature parameter. You can also configure this parameter manually. For example, if the |
automl | Specifies whether to enable the automatic parameter tuning feature. Valid values: true and false. |
automl_train_tag | The tag that identifies the training set. |
automl_test_tag | The tag that identifies the test set. |
automl_column | The name of the column that specifies whether a row belongs to the training set or the development set when the automatic parameter tuning feature is used. You must configure the automl_train_tag and automl_test_tag parameters together with this parameter. Note: After you configure this parameter, the optimal parameters are obtained from the learning rate candidates 0.05, 0.04, 0.03, and 0.01 and the subsample candidates 0.6 and 0.5. |
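To show how these parameters are passed in practice, the following sketch combines several of the documented parameters in a single model_parameter clause. The parameter values are arbitrary illustrations rather than recommendations, and the statement assumes the db4ai.airlines table used in the Examples section below.
/*polar4ai*/
CREATE MODEL airline_gbm_tuned WITH
(model_class = 'lightgbm',
x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
y_cols = 'Delay',
model_parameter = (boosting_type='gbdt',
n_estimators=200,
learning_rate=0.05,
max_depth=7,
num_leaves=128))
AS (SELECT * FROM db4ai.airlines);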
Examples
Create a model and an offline training task.
/*polar4ai*/
CREATE MODEL airline_gbm WITH
(model_class = 'lightgbm',
x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
y_cols = 'Delay', model_parameter = (boosting_type='gbdt'))
AS (SELECT * FROM db4ai.airlines);
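If automatic type detection does not treat a column as categorical, the categorical_feature parameter can be set explicitly. The following is a sketch only: it assumes that categorical_feature accepts the same comma-separated string format as x_cols, which is an assumption rather than a documented fact.
/*polar4ai*/
CREATE MODEL airline_gbm_cat WITH
(model_class = 'lightgbm',
x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
y_cols = 'Delay',
model_parameter = (boosting_type='gbdt',
categorical_feature='Airline,AirportFrom,AirportTo'))
AS (SELECT * FROM db4ai.airlines);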
Evaluate the model.
/*polar4ai*/
SELECT Airline FROM EVALUATE(MODEL airline_gbm,
SELECT * FROM db4ai.airlines LIMIT 20) WITH
(x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',y_cols='Delay',metrics='acc');
Use the model for prediction.
/*polar4ai*/
SELECT Airline FROM PREDICT(MODEL airline_gbm,
SELECT * FROM db4ai.airlines LIMIT 20) WITH
(x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length');
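The input to PREDICT does not have to be a LIMIT query; any SELECT that returns the feature columns should work. The following sketch filters the input rows instead. The DayOfWeek = 1 condition is an assumed example filter, not a value guaranteed to be meaningful in db4ai.airlines.
/*polar4ai*/
SELECT Airline FROM PREDICT(MODEL airline_gbm,
SELECT * FROM db4ai.airlines WHERE DayOfWeek = 1 LIMIT 20) WITH
(x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length');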