AnalyticDB for PostgreSQL V7.0 provides an in-database AI/ML feature that integrates Hugging Face models for AI and the XGBoost, LightGBM, CatBoost, Linfa, and scikit-learn algorithms for machine learning. You can use this feature to perform AI or machine learning training, fine-tuning, deployment, and inference on data in tables or views. This topic describes how to use machine learning to perform training, deployment, and inference in AnalyticDB for PostgreSQL V7.0.
Training
You can use the pgml.train user-defined function (UDF) to perform training. This operation is supported only on AnalyticDB for PostgreSQL coordinator nodes.
Syntax
CREATE FUNCTION pgml.train(
"project_name" TEXT,
"task" TEXT DEFAULT NULL,
"relation_name" TEXT DEFAULT NULL,
"y_column_name" TEXT DEFAULT NULL,
"algorithm" pgml.Algorithm DEFAULT 'linear',
"hyperparams" jsonb DEFAULT '{}',
"search" pgml.Search DEFAULT NULL,
"search_params" jsonb DEFAULT '{}',
"search_args" jsonb DEFAULT '{}',
"test_size" real DEFAULT 0.25,
"test_sampling" pgml.Sampling DEFAULT 'stratified',
"runtime" pgml.Runtime DEFAULT NULL,
"automatic_deploy" bool DEFAULT true,
"materialize_snapshot" bool DEFAULT false,
"preprocess" jsonb DEFAULT '{}'
) RETURNS TABLE (
"project" TEXT,
"task" TEXT,
"algorithm" TEXT,
"deployed" bool
)
EXECUTE ON COORDINATOR
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'train_wrapper';
Parameters
Parameter | Description | Example |
project_name | The name of the project. | 'Breast Cancer Detection' |
task | The name of the task. | 'classification' |
relation_name | The name of the training table. You can specify the name of a table in a schema other than the current schema in the SCHEMA.TABLE format. | 'pgml.breast_cancer' |
y_column_name | The name of the target column in the training table. | 'malignant' |
algorithm | The name of the algorithm. For information about the supported algorithms, see the pgml.algorithm enumeration type table in this topic. | 'xgboost' |
hyperparams | The hyperparameters that you want to pass to the algorithm for training. The value of this parameter is in the JSON format. | { "n_estimators": 25, "nthread": 4 } |
search | The search method that you want to use to obtain the optimal set of hyperparameters. For information about the supported search methods, see the pgml.search enumeration type table in this topic. | 'grid' |
search_params | The hyperparameters and corresponding ranges of hyperparameter values that you want to use in the hyperparameter search to obtain the optimal set of hyperparameters. For more information, see the official documentation of the model algorithm that you use, such as scikit-learn and XGBoost Documentation. | { "max_depth": [1, 2, 3, 4] } |
search_args | The supplementary parameters of the search method. For more information, see the official documentation of the model algorithm that you use, such as scikit-learn and XGBoost Documentation. | { "n_iter": 10} |
test_size | The ratio of data that you want to use as the test set. | 0.25 |
test_sampling | The sampling method that you want to use to retrieve data from the test set. For information about the supported sampling methods, see the pgml.sampling enumeration type table in this topic. | 'random' |
runtime | The runtime environment in which the algorithm is implemented. Valid values: Python and Rust. | 'python' |
automatic_deploy | Specifies whether to enable automatic deployment. | true |
materialize_snapshot | Specifies whether to enable materialization for deployment. | true |
preprocess | The preprocessing steps that you want to perform. | {"col_name": {"impute": "mean", "scale": "standard"}} |
If you do not specify the number of cores that you want to use for training, the task uses a single core by default. You can use the hyperparams parameter to specify the hyperparameters that you want to pass to the algorithm for training. For more information, see the parameter documentation of the training model. For example, you can use the nthread parameter to specify the number of cores that you want to use for training in the XGBoost model.
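The following statement sketches such a training call, passing the n_estimators and nthread hyperparameters to XGBoost. The project name, training table, and target column are hypothetical and must exist in your database.
SELECT * FROM pgml.train(
    'Breast Cancer Detection',                         -- hypothetical project name
    'classification',
    'pgml.breast_cancer',                              -- hypothetical training table
    'malignant',                                       -- hypothetical target column
    algorithm => 'xgboost',
    hyperparams => '{ "n_estimators": 25, "nthread": 4 }'
);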
Custom enumeration types of the pgml extension
The following tables describe the algorithms, sampling methods, and runtime environments supported by the pgml.algorithm, pgml.sampling, and pgml.runtime custom enumeration types.
The following table describes the algorithms supported by the pgml.algorithm custom enumeration type.
Algorithm | Description |
linear | The linear model that is used for regression analysis. |
xgboost | The optimized distributed gradient boosting library. |
xgboost_random_forest | The XGBoost version that is specially tuned for the random forest algorithm. |
svm | The Support Vector Machine (SVM) algorithm that is used for classification and regression. |
lasso | The lasso regression algorithm that uses L1 regularization for regression analysis. |
elastic_net | The elastic net regression algorithm that combines Lasso (L1) and Ridge (L2) regularization. |
ridge | The ridge regression algorithm that uses L2 regularization to prevent overfitting. |
kmeans | The k-means clustering algorithm that partitions data into a specified number of clusters. |
dbscan | The density-based clustering algorithm that can identify clusters of various shapes. |
knn | The k-nearest neighbors (KNN) algorithm that is used for regression. |
random_forest | The ensemble learning algorithm that is used for classification and regression. |
least_angle | The least angle regression (LARS) algorithm that is compatible with the lasso algorithm. |
lasso_least_angle | The lasso least angle regression algorithm that combines features of LARS and lasso. |
orthogonal_matching_pursuit | The greedy algorithm that is used for sparse signal recovery. |
bayesian_ridge | The Bayesian ridge regression algorithm that estimates regression parameters by using the prior distribution. |
automatic_relevance_determination | The Bayesian regression algorithm that uses automatic relevance determination (ARD) priors. |
stochastic_gradient_descent | The simple and efficient optimization algorithm that is used for regression and classification. |
perceptron | The linear model algorithm that is used for binary classification. |
passive_aggressive | The online learning algorithm that is suitable for large-scale learning. |
ransac | The random sample consensus algorithm that is used for robust linear regression. |
theil_sen | The Theil-Sen estimator that is used for robust linear regression. |
huber | The Huber regression algorithm that is robust to outliers. |
quantile | The quantile regression algorithm that predicts conditional quantiles such as the median value. |
kernel_ridge | The kernel ridge regression (KRR) algorithm that uses kernel tricks to learn nonlinear relationships. |
gaussian_process | The Gaussian process algorithm that is used for probabilistic regression and classification. |
nu_svm | The Nu-SVM variant of the SVM algorithm that is used for classification and regression. |
ada_boost | The boosting algorithm that is used to resolve classification issues. |
bagging | The ensemble algorithm that is used to reduce overfitting. |
extra_trees | The extra trees algorithm that is an extension of the random forest algorithm. |
gradient_boosting_trees | The boosting algorithm that is used to resolve regression and classification issues. |
hist_gradient_boosting | The gradient boosting algorithm that is suitable for large datasets. |
linear_svm | The linear SVM that is suitable for resolving classification issues. |
lightgbm | The gradient boosting framework that is developed by Microsoft for large-scale data. |
affinity_propagation | The message-passing clustering algorithm that is based on the similarity between data points. |
birch | The clustering algorithm that is suitable for large datasets. |
feature_agglomeration | The agglomerative clustering algorithm that is applied to features to reduce dimensionality. |
mini_batch_kmeans | The variant of k-means clustering that is suitable for large or online datasets. |
mean_shift | The centroid-based clustering algorithm that can identify clusters of any shape. |
optics | The density-based clustering algorithm that can identify clusters of various densities. |
spectral | The spectral clustering algorithm that uses the spectral properties of data. |
spectral_bi | The spectral bi-clustering algorithm that simultaneously clusters data of two dimensions, such as rows and columns. |
spectral_co | The spectral co-clustering algorithm that simultaneously clusters data of two dimensions. |
catboost | The gradient boosting algorithm that is used for resolving classification and regression issues. |
pca | The principal component analysis (PCA) algorithm that is used to reduce the dimensionality of data. |
The following table describes the sampling methods supported by the pgml.sampling custom enumeration type.
Sampling method | Description |
random | Performs random sampling on test data. |
last | Performs a sequential scan to retrieve the tail end of test data. |
stratified | Performs stratified sampling on test data. |
The following table describes the runtime environments supported by the pgml.runtime custom enumeration type.
Runtime environment | Description |
Python | The Python runtime environment. |
Rust | The Rust runtime environment. |
Hyperparameter search
You can use hyperparameter search and cross-validation to find the optimal set of hyperparameters for model training. The training parameters related to hyperparameter search include search, search_params, and search_args.
The following table describes the search methods supported by the pgml.search custom enumeration type.
Search method | Description |
grid | Uses the Cartesian product to train every possible combination of the search hyperparameters specified by the search_params parameter. |
random | Randomly selects parameter combinations from the search hyperparameters specified by the search_params parameter for up to n_iter sampling iterations based on the search_args parameter. |
The search_params parameter specifies the hyperparameters and corresponding ranges of hyperparameter values that you want to use in the hyperparameter search to obtain the optimal set of hyperparameters. For more information about search hyperparameters, see the official documentation of the model algorithm that you use, such as the scikit-learn and XGBoost documentation.
The following table describes the supplementary parameters supported by the search_args parameter.
Parameter in the JSONB format | Description |
'{ "n_iter": K }' | The number of iterations that you want the model to perform for hyperparameter search. This parameter is applicable when you set the pgml.search custom enumeration type to random. |
'{ "cv": K }' | The number of folds into which you want to split the dataset for k-fold cross-validation. |
Example
Perform a grid search to train a model by evaluating combinations of the search hyperparameters specified by the search_params parameter. The total number of combinations is calculated by using the following formula: len(max_depth) × len(n_estimators) × k-fold = 3 × 3 × 3 = 27.
Note: For regression tasks, the R2 score is used to select the optimal set of hyperparameters by default. For classification tasks, the F1 score is used to select the optimal set of hyperparameters by default.
SELECT * FROM pgml.train(
    'Handwritten Digit Image Classifier',
    'classification',
    'pgml.digits',
    'target',
    algorithm => 'xgboost',
    search => 'grid',
    search_params => '{ "max_depth": [1, 3, 5], "n_estimators": [20, 80, 160] }',
    search_args => '{ "cv": 3 }'
);
Preprocessing
You can specify the preprocess parameter in a training function to preprocess data. The preprocessor can configure all columns of the training data in a centralized manner. The pgml extension supports the following data preprocessing types.
After training, the information about the preprocessor is saved in the pgml.snapshots metadata table. When you subsequently invoke the pgml.predict() function, the same preprocessing steps that you defined during training are automatically applied to the new prediction data.
Enumeration type encoding
Converts enumeration types into numeric types.
Encoding method | Description |
native | Converts a non-text type into a 32-bit floating-point type. This method is the default method. |
target | Converts categorical variables into numeric values by replacing each category with the average value of the target variable for that category. |
one_hot | Converts enumeration types into the one-hot encoding format. |
ordinal | Converts categorical data into numeric values by assigning each category a unique integer based on the index specified in a predefined ordinal array. By default, NULL values are assigned an index of 0. |
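The following statement sketches how categorical encoding might be specified in the preprocess parameter. The project, table, and column names are hypothetical, and the "encode" key with a nested ordinal array is an assumption based on the encoding methods described above.
SELECT * FROM pgml.train(
    'Encoding Example',                                -- hypothetical project name
    'classification',
    'pgml.example_orders',                             -- hypothetical training table
    'label',                                           -- hypothetical target column
    preprocess => '{
        "color": {"encode": "one_hot"},
        "size":  {"encode": {"ordinal": ["small", "medium", "large"]}}
    }'
);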
Missing value imputation
Replaces NULL or NaN values with numeric values.
Imputation method | Description |
error | Returns an error when NULL or NaN values are detected during training or inference. This method is the default method. |
mean | Replaces NULL or NaN values with the average value in the training data. |
median | Replaces NULL or NaN values with the median value in the training data. |
mode | Replaces NULL or NaN values with the most frequently occurring value in the training data. |
min | Replaces NULL or NaN values with the minimum value in the training data. |
max | Replaces NULL or NaN values with the maximum value in the training data. |
zero | Replaces NULL or NaN values with 0. |
Scaling
Scales all variables to a standard range to help improve the generalization of the model.
Scaling method | Description |
preserve | Does not perform scaling. This method is the default method. |
standard | Standardizes features by removing the average value and scaling features to unit variance. |
min-max | Scales features to the range between the specified minimum and maximum values. In most cases, the range is between 0 and 1. |
max-abs | Scales data by dividing each feature by the maximum absolute value, which ensures that the maximum absolute value of each feature in the training set is 1. |
robust | Scales data based on the range defined by the first and third quartiles of statistics. Compared with other methods, this method is more robust in handling outliers. |
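The following statement sketches how imputation and scaling might be combined for multiple columns in a single training call, using the same JSON form as the preprocess example in the parameter table. The project, table, and column names are hypothetical.
SELECT * FROM pgml.train(
    'Preprocessing Example',                           -- hypothetical project name
    'regression',
    'pgml.house_prices',                               -- hypothetical training table
    'price',                                           -- hypothetical target column
    algorithm => 'xgboost',
    preprocess => '{
        "bedrooms": {"impute": "mean", "scale": "standard"},
        "lot_size": {"impute": "median", "scale": "standard"}
    }'
);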
Deployment
During deployment, the training project information, preprocessing methods, and model parameters are stored in metadata tables. For subsequent inference, you only need to specify the project name or model ID, which simplifies the inference process.
The following table describes the deployment policies supported by the pgml.strategy custom enumeration type.
Deployment policy | Description |
best_score | Deploys the model that achieves the best metrics. |
most_recent | Deploys the most recently trained model and ignores the information about metrics. |
rollback | Rolls back to the previous deployment of the current model. |
The automatic_deploy parameter is used to perform deployment during training. The default value of this parameter is true. When this parameter is enabled, the optimal model is automatically selected for deployment across multiple training runs. For classification tasks, the model that has the highest F1 score is selected by default. For regression tasks, the model that has the highest R2 score is selected by default. You can also use the following UDFs to manually deploy a trained model. A usage example follows the function definitions.
Forcibly deploy a model specified by the model_id parameter.
CREATE FUNCTION pgml."deploy"(
    "model_id" bigint
) RETURNS TABLE (
    "project" TEXT,
    "strategy" TEXT,
    "algorithm" TEXT
)
STRICT
EXECUTE ON COORDINATOR
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'deploy_model_wrapper';
Deploy a model that matches the specified algorithm and deployment policy.
CREATE FUNCTION pgml."deploy"(
    "project_name" TEXT,
    "strategy" pgml.Strategy,
    "algorithm" pgml.Algorithm DEFAULT NULL
) RETURNS TABLE (
    "project" TEXT,
    "strategy" TEXT,
    "algorithm" TEXT
)
EXECUTE ON COORDINATOR
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'deploy_strategy_wrapper';
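The following statements sketch how these deployment UDFs might be called. The model ID and project name are hypothetical and must refer to an existing trained model or project.
-- Forcibly deploy the model whose ID is 12 (hypothetical model ID).
SELECT * FROM pgml.deploy(12);
-- Deploy the best-scoring xgboost model of a hypothetical project.
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'best_score', 'xgboost');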
Inference
Perform inference on a single data record
You can use the predict() function to perform inference on a single data record based on the project name or model ID. The data type of the return value is FLOAT4. For classification tasks, the return value is the predicted class or category to which the input data belongs. For regression tasks, the return value is the numeric value predicted based on the input features. The following function definitions show how to perform inference on a single record in different scenarios. A usage example follows the definitions.
The model identified based on a project name is the most recently deployed model for that project.
Perform inference on the data rows of a table based on the project name.
CREATE FUNCTION pgml."predict"(
    "project_name" TEXT,
    "row" anyelement
) RETURNS real
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_row_wrapper';
Perform inference on the data rows of a table based on the model ID.
CREATE FUNCTION pgml."predict"(
    "model_id" bigint,
    "row" anyelement
) RETURNS real
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_model_row_wrapper';
Perform inference on FLOAT4 features based on the project name.
CREATE FUNCTION pgml."predict"(
    "project_name" TEXT,
    "features" real[]
) RETURNS real
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_f32_wrapper';
Perform inference on FLOAT4 features based on the model ID.
CREATE FUNCTION pgml."predict"(
    "model_id" bigint,
    "features" real[]
) RETURNS real
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_model_wrapper';
Perform inference on BIGINT features based on the project name.
CREATE FUNCTION pgml."predict"(
    "project_name" TEXT,
    "features" bigint[]
) RETURNS real
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_i64_wrapper';
Perform inference on INT features based on the project name.
CREATE FUNCTION pgml."predict"(
    "project_name" TEXT,
    "features" INT[]
) RETURNS real
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_i32_wrapper';
Perform inference on SMALLINT features based on the project name.
CREATE FUNCTION pgml."predict"(
    "project_name" TEXT,
    "features" smallint[]
) RETURNS real
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_i16_wrapper';
Perform inference on FLOAT8 features based on the project name.
CREATE FUNCTION pgml."predict"(
    "project_name" TEXT,
    "features" double precision[]
) RETURNS real
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_f64_wrapper';
Perform inference on BOOL features based on the project name.
CREATE FUNCTION pgml."predict"(
    "project_name" TEXT,
    "features" bool[]
) RETURNS real /* f32 */
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_bool_wrapper';
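The following statements sketch typical predict() calls. The project name, table, column order, and feature values are hypothetical and assume that a model has already been trained and deployed for the project.
-- Predict from an explicit FLOAT4 feature array based on the project name.
-- The array must contain the feature values in the same order as the training columns.
SELECT pgml.predict('Breast Cancer Detection', ARRAY[1.0, 2.0, 3.0]::real[]) AS prediction;
-- Predict for every row of a table by passing each row as a composite value.
-- pgml.breast_cancer is a hypothetical table whose columns match the training data.
SELECT pgml.predict('Breast Cancer Detection', t) AS prediction
FROM pgml.breast_cancer AS t;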
Perform inference on multiple data records at a time
You can perform inference on multiple sets of FLOAT4 features based on the project name or model ID by using the predict_batch(), predict_proba(), or predict_joint() function. The multiple sets of features must be flattened into a one-dimensional array. The return value is an array of FLOAT4 values.
predict_batch
In most cases, you can use the predict_batch() function to perform inference on multiple data records at a time. The following function definitions show the supported scenarios. A usage example follows the definitions.
Perform inference on multiple sets of FLOAT4 features based on the model ID.
CREATE FUNCTION pgml."predict_batch"(
    "model_id" bigint,     /* i64 */
    "features" real[]      /* alloc::vec::Vec<f32> */
) RETURNS real[]           /* alloc::vec::Vec<f32> */
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_model_batch_wrapper';
Perform inference on multiple sets of FLOAT4 features based on the project name.
CREATE FUNCTION pgml."predict_batch"(
    "project_name" TEXT,   /* &str */
    "features" real[]      /* alloc::vec::Vec<f32> */
) RETURNS SETOF real       /* f32 */
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_batch_wrapper';
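A minimal usage sketch for predict_batch(). The project name and feature values are hypothetical; the flattened array below is assumed to contain two records with two features each, and the project-name variant returns one prediction per record as a set of rows.
SELECT pgml.predict_batch(
    'Breast Cancer Detection',          -- hypothetical project name
    ARRAY[1.0, 2.0,                     -- features of the first record
          3.0, 4.0]::real[]             -- features of the second record
) AS prediction;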
predict_proba
If you want confidence probabilities to be returned, we recommend that you use the predict_proba() function. The predict_proba() function is used in a manner similar to the predict_batch() function, except that it returns the confidence probability of each class. A usage example follows the definitions.
Perform inference on multiple sets of FLOAT4 features based on the model ID.
CREATE FUNCTION pgml."predict_proba"(
    "model_id" bigint,     /* i64 */
    "features" real[]      /* alloc::vec::Vec<f32> */
) RETURNS real[]           /* alloc::vec::Vec<f32> */
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_model_proba_wrapper';
Perform inference on multiple sets of FLOAT4 features based on the project name.
CREATE FUNCTION pgml."predict_proba"(
    "project_name" TEXT,   /* &str */
    "features" real[]      /* alloc::vec::Vec<f32> */
) RETURNS real[]           /* alloc::vec::Vec<f32> */
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_proba_wrapper';
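A sketch of a predict_proba() call under the same hypothetical setup. For a classification project, the returned FLOAT4 array is expected to contain one confidence probability per class for each record.
SELECT pgml.predict_proba(
    'Breast Cancer Detection',          -- hypothetical classification project
    ARRAY[1.0, 2.0,                     -- features of the first record
          3.0, 4.0]::real[]             -- features of the second record
) AS probabilities;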
predict_joint
To perform inference on multi-label columns, use the predict_joint() function. This function does not support classification tasks. The predict_joint() function returns multi-dimensional prediction results for each set of features. A usage example follows the definitions.
Perform inference on multiple sets of FLOAT4 features based on the model ID.
CREATE FUNCTION pgml."predict_joint"(
    "model_id" bigint,     /* i64 */
    "features" real[]      /* alloc::vec::Vec<f32> */
) RETURNS real[]           /* alloc::vec::Vec<f32> */
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_model_joint_wrapper';
Perform inference on multiple sets of FLOAT4 features based on the project name.
CREATE FUNCTION pgml."predict_joint"(
    "project_name" TEXT,   /* &str */
    "features" real[]      /* alloc::vec::Vec<f32> */
) RETURNS real[]           /* alloc::vec::Vec<f32> */
IMMUTABLE STRICT PARALLEL SAFE
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'predict_joint_wrapper';
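A sketch of a predict_joint() call. The project name and features are hypothetical, and the project is assumed to have been trained on a regression task with multiple label columns, so the returned array contains one predicted value per label for each set of features.
SELECT pgml.predict_joint(
    'Air Quality Forecast',             -- hypothetical multi-label regression project
    ARRAY[1.0, 2.0, 3.0]::real[]        -- one record with three features
) AS predictions;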
Examples
The following examples show complete SQL syntax that is used to perform classification, regression, clustering, dimensionality reduction, and preprocessing.