This topic describes the light gradient boosting machine (LightGBM) algorithm.
Background information
LightGBM is a distributed gradient boosting framework based on decision tree algorithms. LightGBM is designed as a fast, efficient, low-memory, and high-accuracy tool that supports parallel and large-scale data processing. LightGBM can reduce the amount of memory occupied by data, lower communication costs, improve the efficiency of multi-node elastic parallel query (ePQ), and achieve linear speedup in data computing.
Scenarios
LightGBM is an algorithm framework that includes GBDT models, random forests, and logistic regression models. LightGBM is primarily used in scenarios such as binary classification, multiclass classification, and ranking.
For example, most personalized commodity recommendation scenarios require click prediction models. Click prediction models predict user clicks or purchases based on historical user behavior data such as clicks, non-clicks, and purchases. The following types of features are extracted from user behavior and user attributes:
- Categorical features: string values, such as gender (male or female) or commodity category (clothing, toys, or electronics).
- Numerical features: integer or floating-point values, such as user activity or commodity prices.
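As an illustration of the scenario above, a click prediction model could be created with a statement like the following sketch. The syntax mirrors the CREATE MODEL example later in this topic, but the db4ai.user_behavior table and its columns (gender, category, price, activity, click) are hypothetical placeholders, not objects that exist in your database:

```sql
/*polar4ai*/CREATE MODEL click_model WITH
(model_class = 'lightgbm',
 x_cols = 'gender,category,price,activity',  -- categorical and numerical features
 y_cols = 'click',                           -- label: clicked or not
 model_parameter = (boosting_type='gbdt'))
AS (SELECT * FROM db4ai.user_behavior);
```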
Parameters
The values of the parameters described in the following table are the same as those of the model_parameter parameter in the CREATE MODEL statement that is used to create a model. You can configure the parameters based on your business requirements.
| Parameter | Description |
| --- | --- |
| boosting_type | The type of the weak learner, such as 'gbdt'. Note: When you specify a value for this parameter, enclose the value in single quotation marks ('). Example: boosting_type='gbdt'. |
| n_estimators | The number of iterations. The value must be an integer. Default value: 100. |
| loss | The learning task and the learning objective of the task. |
| num_leaves | The number of leaves. The value must be an integer. Default value: 128. |
| max_depth | The maximum depth of the tree. The value must be an integer. Default value: 7. Note: A value of -1 indicates that the tree depth is not limited. A greater value provides higher precision but may cause overfitting. We recommend that you specify this parameter with caution. |
| learning_rate | The learning rate. The value must be a floating-point number. Default value: 0.06. |
| max_leaf_nodes | The maximum number of leaf nodes in the tree. The value must be an integer or be left empty. By default, this parameter is left empty, which indicates that the number of leaf nodes in the tree is not limited. |
| min_samples_leaf | The minimum number of samples required at a leaf node. The value must be an integer. Default value: 20. Note: If a node contains fewer samples than this value, the node is pruned together with its sibling nodes. |
| subsample | The ratio of the samples that are used for model creation to all samples. The value must be a floating-point number. Valid values: 0 to 1. Default value: 1. Note: If the value of this parameter is less than 1, only the specified proportion of the samples is used for model creation. |
| max_features | The ratio of the features that are used for model creation to all features. The value must be a floating-point number. Valid values: 0 to 1. Default value: 1. |
| random_state | The random number seed. The value must be an integer. Default value: 1. Note: Different values may affect how the tree is constructed and how the model-creating data is split. |
| model_type | The storage type of the model. |
| n_jobs | The number of threads that are used to create the model. The value must be an integer. Default value: 4. Note: The more threads are used, the faster the model is created. |
| is_unbalance | Specifies whether to increase the weight of the class that has fewer samples to address sample imbalance. |
| categorical_feature | The categorical features. The value must be a string array. In most cases, LightGBM determines the data types and automatically configures the categorical_feature parameter. You can also configure this parameter manually. |
| automl | Specifies whether to enable the automatic parameter tuning feature. |
| automl_train_tag | The model creation label. |
| automl_test_tag | The model testing label. |
| automl_column | The name of the column in the model creation set or development set that is used by the automatic parameter tuning feature. Note: After you configure this parameter, the optimal parameters are obtained from the learning rate candidates 0.05, 0.04, 0.03, and 0.01 and the sample sampling candidates 0.6 and 0.5. |
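The parameters above are passed together in the model_parameter clause of the CREATE MODEL statement. The following sketch combines several of the documented parameters, each set to its documented default, and reuses the db4ai.airlines table from the Examples section; the model name airline_gbm_tuned and the assumption that these parameters can be combined in a single statement are illustrative:

```sql
/*polar4ai*/CREATE MODEL airline_gbm_tuned WITH
(model_class = 'lightgbm',
 x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
 y_cols = 'Delay',
 model_parameter = (boosting_type='gbdt',  -- weak-learner type, quoted
                    n_estimators=100,      -- number of iterations
                    learning_rate=0.06,    -- default learning rate
                    num_leaves=128,        -- default number of leaves
                    max_depth=7))          -- default maximum tree depth
AS (SELECT * FROM db4ai.airlines);
```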
Examples
Create a LightGBM model:

```sql
/*polar4ai*/CREATE MODEL airline_gbm WITH
(model_class = 'lightgbm',
 x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
 y_cols = 'Delay',
 model_parameter = (boosting_type='gbdt'))
AS (SELECT * FROM db4ai.airlines);
```
Evaluate the model:

```sql
/*polar4ai*/SELECT Airline FROM EVALUATE(MODEL airline_gbm,
SELECT * FROM db4ai.airlines LIMIT 20) WITH
(x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length',
 y_cols = 'Delay', metrics = 'acc');
```
Use the model for prediction:

```sql
/*polar4ai*/SELECT Airline FROM PREDICT(MODEL airline_gbm,
SELECT * FROM db4ai.airlines LIMIT 20) WITH
(x_cols = 'Airline,Flight,AirportFrom,AirportTo,DayOfWeek,Time,Length');
```