
AnalyticDB: Use machine learning

Last Updated: Nov 13, 2024

AnalyticDB for PostgreSQL V7.0 provides the in-database AI/ML feature that integrates Hugging Face models for AI, and the XGBoost, LightGBM, CatBoost, Linfa, and all scikit-learn algorithms for machine learning. You can use this feature to perform AI or machine learning training, fine-tuning, deployment, and inference on data in tables or views. This topic describes how to use machine learning to perform training, deployment, and inference in AnalyticDB for PostgreSQL V7.0.

Training

The pgml extension provides a user-defined function (UDF) named pgml.train that you can call to perform training. This operation is supported only on AnalyticDB for PostgreSQL coordinator nodes.

Syntax

CREATE FUNCTION pgml.train(
    "project_name" TEXT,
    "task" TEXT DEFAULT NULL,
    "relation_name" TEXT DEFAULT NULL,
    "y_column_name" TEXT DEFAULT NULL,
    "algorithm" pgml.Algorithm DEFAULT 'linear',
    "hyperparams" jsonb DEFAULT '{}',
    "search" pgml.Search DEFAULT NULL,
    "search_params" jsonb DEFAULT '{}',
    "search_args" jsonb DEFAULT '{}',
    "test_size" real DEFAULT 0.25,
    "test_sampling" pgml.Sampling DEFAULT 'stratified',
    "runtime" pgml.Runtime DEFAULT NULL,
    "automatic_deploy" bool DEFAULT true,
    "materialize_snapshot" bool DEFAULT false, 
    "preprocess" jsonb DEFAULT '{}'
) RETURNS TABLE (
    "project" TEXT,
    "task" TEXT,
    "algorithm" TEXT,
    "deployed" bool
)
EXECUTE ON COORDINATOR
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'train_wrapper';

Parameters

Parameter | Description | Example
--- | --- | ---
project_name | The name of the project. | 'Breast Cancer Detection'
task | The name of the task, such as 'classification', 'regression', 'clustering', or 'decomposition'. | 'classification'
relation_name | The name of the training table. You can specify the name of a table in a schema other than the current schema in the SCHEMA.TABLE format. | 'pgml.breast_cancer'
y_column_name | The name of the label (target) column in the training table. | 'malignant'
algorithm | The name of the algorithm. For information about the supported algorithms, see the pgml.algorithm enumeration type table in this topic. | 'xgboost'
hyperparams | The hyperparameters that you want to pass to the algorithm for training, in the JSON format. | {"n_estimators": 25, "nthread": 4}
search | The search method that you want to use to obtain the optimal set of hyperparameters. For information about the supported search methods, see the pgml.search enumeration type table in this topic. | 'grid'
search_params | The hyperparameters and the corresponding value ranges to explore during the hyperparameter search. For more information, see the official documentation of the algorithm that you use, such as the scikit-learn or XGBoost documentation. | {"max_depth": [1, 2, 3, 4]}
search_args | The supplementary parameters of the search method. For more information, see the official documentation of the algorithm that you use, such as the scikit-learn or XGBoost documentation. | {"n_iter": 10}
test_size | The ratio of data that you want to use as the test set. | 0.25
test_sampling | The sampling method that you want to use to retrieve data for the test set. For information about the supported sampling methods, see the pgml.sampling enumeration type table in this topic. | 'random'
runtime | The runtime environment in which the algorithm is implemented. Valid values: Python and Rust. | 'python'
automatic_deploy | Specifies whether to enable automatic deployment. | true
materialize_snapshot | Specifies whether to materialize the training data snapshot. | true
preprocess | The preprocessing steps that you want to perform. | {"col_name": {"impute": "mean", "scale": "standard"}}

Note

If you do not specify the number of cores that you want to use for training, the task uses a single core by default. You can use the hyperparams parameter to specify the hyperparameters that you want to pass to the algorithm for training. For more information, see the parameter documentation of the training model. For example, you can use the nthread parameter to specify the number of cores that you want to use for training in the XGBoost model.
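The following example shows how hyperparameters are passed through the hyperparams parameter. It is a minimal sketch that assumes the pgml.breast_cancer demo table used in the Examples section of this topic; the n_estimators and nthread values are illustrative only.

SELECT * FROM pgml.train(
    project_name  => 'Breast Cancer Detection',
    task          => 'classification',
    relation_name => 'pgml.breast_cancer',
    y_column_name => 'malignant',
    algorithm     => 'xgboost',
    hyperparams   => '{"n_estimators": 25, "nthread": 4}'  -- nthread sets the number of training cores for XGBoost
);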

Custom enumeration types of the pgml extension

The following tables describe the algorithms, sampling methods, and runtime environments supported by the pgml.algorithm, pgml.sampling, and pgml.runtime custom enumeration types.

  • The following table describes the algorithms supported by the pgml.algorithm custom enumeration type.

    Algorithm | Description
    --- | ---
    linear | The linear model that is used for regression analysis.
    xgboost | The optimized distributed gradient boosting library.
    xgboost_random_forest | The XGBoost version that is specially tuned for the random forest algorithm.
    svm | The Support Vector Machine (SVM) algorithm that is used for classification and regression.
    lasso | The algorithm that is used for regression analysis in statistics and machine learning.
    elastic_net | The elastic net regression algorithm that combines Lasso (L1) and Ridge (L2) regularization.
    ridge | The ridge regression algorithm that uses L2 regularization to prevent overfitting.
    kmeans | The clustering algorithm.
    dbscan | The density-based clustering algorithm that can identify clusters of various shapes.
    knn | The k-nearest neighbors (KNN) algorithm that is used for regression.
    random_forest | The ensemble learning algorithm that is used for classification and regression.
    least_angle | The least angle regression (LARS) algorithm that is compatible with the lasso algorithm.
    lasso_least_angle | The lasso least angle regression algorithm that combines features of LARS and lasso.
    orthogonal_matching_pursuit | The greedy algorithm that is used for sparse signal recovery.
    bayesian_ridge | The Bayesian ridge regression algorithm that estimates regression parameters by using the prior distribution.
    automatic_relevance_determination | The Bayesian regression algorithm.
    stochastic_gradient_descent | The simple and efficient optimization algorithm that is used for regression and classification.
    perceptron | The linear model algorithm that is used for binary classification.
    passive_aggressive | The online learning algorithm that is suitable for large-scale learning.
    ransac | The random sample consensus algorithm that is used for robust linear regression.
    theil_sen | The Theil-Sen estimator that is used for robust linear regression.
    huber | The Huber regression that is robust to outliers.
    quantile | The quantile regression algorithm that predicts conditional quantiles such as the median value.
    kernel_ridge | The kernel ridge regression (KRR) algorithm that uses kernel tricks to learn nonlinear relationships.
    gaussian_process | The Gaussian process algorithm that is used for probabilistic regression and classification.
    nu_svm | The algorithm that is used for classification and regression. This algorithm is a variant of the SVM algorithm.
    ada_boost | The boosting algorithm that is used to resolve classification issues.
    bagging | The ensemble algorithm that is used to reduce overfitting.
    extra_trees | The extra trees algorithm that is an extension of the random forest algorithm.
    gradient_boosting_trees | The boosting algorithm that is used to resolve regression and classification issues.
    hist_gradient_boosting | The gradient boosting algorithm that is suitable for large datasets.
    linear_svm | The linear SVM that is suitable for resolving classification issues.
    lightgbm | The gradient boosting framework that is developed by Microsoft for large-scale data.
    affinity_propagation | The message-passing clustering algorithm that is based on the similarity between data points.
    birch | The clustering algorithm that is suitable for large datasets.
    feature_agglomeration | The clustering algorithm that is used to select features.
    mini_batch_kmeans | The algorithm that is suitable for large or online datasets. This algorithm is a variant of k-means clustering.
    mean_shift | The centroid-based clustering algorithm that can identify any shape of clusters.
    optics | The algorithm that can identify clusters of various densities.
    spectral | The spectral clustering algorithm that uses the spectral properties of data.
    spectral_bi | The spectral bi-clustering algorithm that simultaneously clusters data of two dimensions, such as rows and columns.
    spectral_co | The spectral co-clustering algorithm that simultaneously clusters data of two dimensions.
    catboost | The gradient boosting algorithm that is used for resolving classification and regression issues.
    pca | The algorithm that is used to reduce the dimensionality of data.
  • The following table describes the sampling methods supported by the pgml.sampling custom enumeration type.

    Sampling method | Description
    --- | ---
    random | Performs random sampling on test data.
    last | Performs a sequential scan to retrieve the tail end of test data.
    stratified | Performs stratified sampling on test data.

  • The following table describes the runtime environments supported by the pgml.runtime custom enumeration type.

    Runtime environment | Description
    --- | ---
    Python | The Python runtime environment.
    Rust | The Rust runtime environment.
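The following sketch shows how the enumeration values fit together in one training call. It assumes the pgml.diabetes demo table from the Regression example in this topic; the specific parameter values are illustrative only.

SELECT * FROM pgml.train(
    'Diabetes Progression',
    'regression',
    'pgml.diabetes',
    'target',
    algorithm     => 'xgboost',   -- a pgml.algorithm value
    test_size     => 0.2,         -- 20% of the rows form the test set
    test_sampling => 'random',    -- a pgml.sampling value
    runtime       => 'rust'       -- a pgml.runtime value
);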

Hyperparameter search

You can use hyperparameter search and cross-validation to find the optimal set of hyperparameters for model training. The training parameters related to hyperparameter search include search, search_params, and search_args.

  • The following table describes the search methods supported by the pgml.search custom enumeration type.

    Search method | Description
    --- | ---
    grid | Uses the Cartesian product to train every possible combination of the search hyperparameters specified by the search_params parameter.
    random | Randomly selects parameter combinations from the search hyperparameters specified by the search_params parameter for up to n_iter sampling iterations based on the search_args parameter.

  • The search_params parameter specifies the hyperparameters and corresponding ranges of hyperparameter values that you want to use in the hyperparameter search to obtain the optimal set of hyperparameters. For more information about search hyperparameters, see the official documentation of the model algorithm that you use, such as scikit-learn and XGBoost Documentation.

  • The following table describes the supplementary parameters supported by the search_args parameter.

    Parameter in the JSONB format | Description
    --- | ---
    '{ "n_iter": K }' | The number of iterations that you want the model to perform for hyperparameter search. This parameter is applicable when you set the pgml.search custom enumeration type to random.
    '{ "cv": K }' | The number of folds into which you want to split the dataset for k-fold cross-validation.

    Example

    Perform grid search to train a model by evaluating every combination of the search hyperparameters specified by the search_params parameter. Each combination is trained once per cross-validation fold, so the total number of training runs is len(max_depth) × len(n_estimators) × cv = 3 × 3 × 3 = 27.

    Note
    • For regression tasks, the R2 score is used to select the optimal set of hyperparameters by default.

    • For classification tasks, the F1 score is used to select the optimal set of hyperparameters by default.

    SELECT * FROM pgml.train(
        'Handwritten Digit Image Classifier', 
        'classification',
        'pgml.digits',
        'target',
        algorithm => 'xgboost', 
        search => 'grid',
        search_params => '{
            "max_depth": [1, 3, 5], 
            "n_estimators": [20, 80, 160]
        }',
        search_args => '{ "cv": 3 }'
    );
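
    For comparison, the following sketch uses random search: instead of evaluating every combination, up to n_iter combinations are sampled from the ranges in search_params. The specific parameter values are illustrative only.

    SELECT * FROM pgml.train(
        'Handwritten Digit Image Classifier',
        'classification',
        'pgml.digits',
        'target',
        algorithm     => 'xgboost',
        search        => 'random',
        search_params => '{
            "max_depth": [1, 3, 5, 7],
            "n_estimators": [20, 40, 80, 160]
        }',
        search_args   => '{ "n_iter": 10, "cv": 3 }'  -- sample 10 combinations, 3-fold cross-validation
    );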

Preprocessing

You can specify the preprocess parameter in the training function to preprocess data. The preprocessor configuration covers all columns of the training data in one place. The pgml extension supports the following data preprocessing types.

Note

After training, the information about the preprocessor is saved in the pgml.snapshots metadata table. When you subsequently invoke the pgml.predict() function, the same preprocessing steps that you defined during training are automatically applied to the new prediction data.
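To verify what was stored, you can query the metadata table directly. This is a minimal sketch; the pgml.snapshots table is described above, and the id column used for ordering is an assumption that may differ between versions.

SELECT *
FROM pgml.snapshots
ORDER BY id DESC   -- id is assumed to identify the snapshot; newest first
LIMIT 5;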

Enumeration type encoding

Converts enumeration types into numeric types.

Encoding method | Description
--- | ---
native | Converts a non-text type into a 32-bit floating-point type. This method is the default method.
target | Converts categorical variables into numeric values by replacing each category with the average value of the target variable for that category.
one_hot | Converts enumeration types into the one-hot encoding format.
ordinal | Converts categorical data into numeric values by assigning each category a unique integer based on the index specified in a predefined ordinal array. By default, NULL values are assigned an index of 0.

Missing value imputation

Replaces NULL or NaN values with numeric values.

Imputation method | Description
--- | ---
error | Returns an error when NULL or NaN values are detected during training or inference. This method is the default method.
mean | Replaces NULL or NaN values with the average value in the training data.
median | Replaces NULL or NaN values with the median value in the training data.
mode | Replaces NULL or NaN values with the most frequently occurring value in the training data.
min | Replaces NULL or NaN values with the minimum value in the training data.
max | Replaces NULL or NaN values with the maximum value in the training data.
zero | Replaces NULL or NaN values with 0.

Scaling

Scales all variables to a standard range to help improve the generalization of the model.

Scaling method | Description
--- | ---
preserve | Does not perform scaling. This method is the default method.
standard | Standardizes features by removing the average value and scaling features to unit variance.
min-max | Scales features to the range between the specified minimum and maximum values. In most cases, the range is between 0 and 1.
max-abs | Scales data by dividing each feature by the maximum absolute value, which ensures that the maximum absolute value of each feature in the training set is 1.
robust | Scales data based on the range defined by the first and third quartiles of statistics. Compared with other methods, this method is more robust in handling outliers.
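The following sketch combines the three preprocessing types on a per-column basis. The project, table, and column names ('Preprocessing Demo', 'pgml.demo_table', 'category', 'amount') are hypothetical; see the Preprocessing example at the end of this topic for a runnable end-to-end version.

-- Encode the categorical "category" column with target encoding;
-- impute missing values in the numeric "amount" column with the mean and standardize its scale.
SELECT * FROM pgml.train(
    project_name  => 'Preprocessing Demo',   -- hypothetical project
    task          => 'regression',
    relation_name => 'pgml.demo_table',      -- hypothetical table
    y_column_name => 'target',
    algorithm     => 'xgboost',
    preprocess    => '{
        "category": {"encode": "target"},
        "amount":   {"impute": "mean", "scale": "standard"}
    }'
);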

Deployment

During deployment, the training project information, preprocessing methods, and model parameters are stored in metadata tables. For subsequent inference, you only need to specify the project name or model ID, which simplifies the inference process.

The following table describes the deployment policies supported by the pgml.strategy custom enumeration type.

Deployment policy | Description
--- | ---
best_score | Deploys the model that achieves the best metrics.
most_recent | Deploys the most recently trained model and ignores the information about metrics.
rollback | Rolls back to the previous deployment of the current model.

The automatic_deploy parameter controls whether deployment is performed automatically during training. The default value of this parameter is true. When a project is trained multiple times, the optimal model is automatically selected for deployment: for classification tasks, the model that has the highest F1 score is selected by default, and for regression tasks, the model that has the highest R2 score is selected by default. You can also use the following UDFs to manually deploy a trained model.

  • Forcibly deploy a model specified by the model_id parameter.

    CREATE FUNCTION pgml."deploy"(
            "model_id" bigint
    ) RETURNS TABLE (
            "project" TEXT,
            "strategy" TEXT,
            "algorithm" TEXT
    )
    STRICT EXECUTE ON COORDINATOR
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'deploy_model_wrapper';
  • Deploy a model that matches the specified algorithm and deployment policy.

    CREATE FUNCTION pgml."deploy"(
            "project_name" TEXT,
            "strategy" pgml.Strategy,
            "algorithm" pgml.Algorithm DEFAULT NULL
    ) RETURNS TABLE (
            "project" TEXT,
            "strategy" TEXT,
            "algorithm" TEXT
    )
    EXECUTE ON COORDINATOR
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'deploy_strategy_wrapper';
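
The following sketch shows both manual deployment paths. The model ID (42) is a placeholder; replace it with an ID from the pgml.models metadata table. The project name and algorithm match the Classification example later in this topic.

-- Forcibly deploy a specific model by ID (42 is a placeholder).
SELECT * FROM pgml.deploy(42);

-- Deploy the best-scoring xgboost model of a project.
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'best_score', 'xgboost');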

Inference

Perform inference on a single data record

You can use the predict() function to perform inference on a single data record based on the project name or model ID. The return value is of the FLOAT4 type: for classification tasks, it is the predicted class or category of the input data; for regression tasks, it is the numeric value predicted from the input features. The following UDF signatures cover the different scenarios, and a usage example follows the list.

Note

The model identified based on a project name is the most recently deployed model for that project.

  • Perform inference on the data rows of a table based on the project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "row" anyelement
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_row_wrapper';
  • Perform inference on the data rows of a table based on the model ID.

    CREATE FUNCTION pgml."predict"(
            "model_id" bigint,
            "row" anyelement
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_row_wrapper';
  • Perform inference on FLOAT4 features based on the project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" real[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_f32_wrapper';
  • Perform inference on FLOAT4 features based on the model ID.

    CREATE FUNCTION pgml."predict"(
            "model_id" bigint,
            "features" real[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_wrapper';
  • Perform inference on BIGINT features based on the project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" bigint[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_i64_wrapper';
  • Perform inference on INT features based on the project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" INT[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_i32_wrapper';
  • Perform inference on SMALLINT features based on the project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" smallint[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_i16_wrapper';
  • Perform inference on FLOAT8 features based on the project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" double precision[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_f64_wrapper';
  • Perform inference on BOOL features based on the project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" bool[]
    ) RETURNS real /* f32 */
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_bool_wrapper';
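
The following sketch shows the two most common call patterns, based on the Diabetes Progression project from the Regression example later in this topic: passing a table row as a composite value, and passing the features as a real[] array (the literal feature values are placeholders).

-- Predict from table rows by passing a composite (tuple) of the feature columns.
SELECT target,
       pgml.predict('Diabetes Progression', (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)) AS prediction
FROM pgml.diabetes
LIMIT 5;

-- Predict a single record from a real[] feature array (placeholder values).
SELECT pgml.predict(
    'Diabetes Progression',
    ARRAY[0.04, 0.05, 0.06, 0.02, -0.04, -0.03, -0.04, 0.00, 0.02, -0.02]::real[]
) AS prediction;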

Perform inference on multiple data records at a time

You can perform inference on multiple sets of FLOAT4 features based on the project name or model ID by using the predict_batch(), predict_proba(), or predict_joint() function.

Note

The multiple sets of features must be flattened into a one-dimensional array. The return value is an array of FLOAT4 values.

predict_batch

In most cases, you can use the predict_batch() function to perform inference on multiple records at a time. The following UDF signatures cover the different scenarios, and a usage example follows the list.

  • Perform inference on the data rows of a table based on the model ID.

    CREATE FUNCTION pgml."predict_batch"(
        "model_id" bigint, /* i64 */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_batch_wrapper';
  • Perform inference on the data rows of a table based on the project name.

    CREATE FUNCTION pgml."predict_batch"(
        "project_name" TEXT, /* &str */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS SETOF real /* f32 */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_batch_wrapper';
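
The following sketch performs batch inference for the Diabetes Progression example. Two records with 10 features each are flattened into a single one-dimensional real[] (the feature values are placeholders), and one prediction is returned per record.

SELECT pgml.predict_batch(
    'Diabetes Progression',
    ARRAY[
        0.04,  0.05,  0.06,  0.02, -0.04, -0.03, -0.04,  0.00,  0.02, -0.02,   -- record 1
       -0.01, -0.04, -0.05, -0.03, -0.01, -0.02,  0.07, -0.04, -0.07, -0.09    -- record 2
    ]::real[]
) AS prediction;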

predict_proba

If you want confidence probabilities to be returned, we recommend that you use the predict_proba() function. The predict_proba() function can be used in a manner similar to the predict_batch function except that the predict_proba() function returns the confidence probability of each class.

  • Perform inference on the data rows of a table based on the model ID.

    CREATE FUNCTION pgml."predict_proba"(
        "model_id" bigint, /* i64 */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_proba_wrapper';
  • Perform inference on the data rows of a table based on the project name.

    CREATE FUNCTION pgml."predict_proba"(
        "project_name" TEXT, /* &str */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_proba_wrapper';
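
The following is a hedged sketch that assumes the 'Handwritten Digit Image Classifier' project from the hyperparameter search example above, and that the image column of pgml.digits can be cast to real[]; it returns the confidence probability of each digit class for every image.

SELECT target,
       pgml.predict_proba('Handwritten Digit Image Classifier', image::real[]) AS class_probabilities
FROM pgml.digits
LIMIT 10;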

predict_joint

To perform inference on multi-label columns, use the predict_joint() function. This function does not support classification tasks. The predict_joint() function returns multi-dimensional prediction results for each set of features.

  • Perform inference on multiple sets of FLOAT4 features based on the model ID.

     CREATE FUNCTION pgml."predict_joint"(
        "model_id" bigint, /* i64 */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_joint_wrapper';
    
  • Perform inference on multiple sets of FLOAT4 features based on the project name.

    CREATE FUNCTION pgml."predict_joint"(
        "project_name" TEXT, /* &str */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_joint_wrapper';
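
The following is a hedged sketch only: 'Multi Output Regression' is a hypothetical project whose model predicts several target values per record, and the feature values are placeholders. The returned real[] contains one value per target dimension.

SELECT pgml.predict_joint(
    'Multi Output Regression',            -- hypothetical multi-target project
    ARRAY[0.1, 0.2, 0.3, 0.4]::real[]     -- placeholder feature values
) AS predictions;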

Examples

The following examples show complete SQL syntax that is used to perform classification, regression, clustering, dimensionality reduction, and preprocessing.

Classification

Example

-- Exit on error (psql)
-- \set ON_ERROR_STOP true
\timing on

SELECT pgml.load_dataset('breast_cancer');

-- view the dataset
SELECT * FROM pgml.breast_cancer LIMIT 10;

-- train a simple model to classify the data
SELECT * FROM pgml.train('Breast Cancer Detection', 'classification', 'pgml.breast_cancer', 'malignant');

-- check out the predictions
SELECT malignant, pgml.predict(
    'Breast Cancer Detection', 
    (
        "mean radius", 
        "mean texture", 
        "mean perimeter", 
        "mean area",
        "mean smoothness",
        "mean compactness",
        "mean concavity",
        "mean concave points",
        "mean symmetry",
        "mean fractal dimension",
        "radius error",
        "texture error",
        "perimeter error",
        "area error",
        "smoothness error",
        "compactness error",
        "concavity error",
        "concave points error",
        "symmetry error",
        "fractal dimension error",
        "worst radius",
        "worst texture",
        "worst perimeter",
        "worst area",
        "worst smoothness",
        "worst compactness",
        "worst concavity",
        "worst concave points",
        "worst symmetry",
        "worst fractal dimension"
    )
) AS prediction
FROM pgml.breast_cancer
LIMIT 10;

-- view raw class probabilities
SELECT malignant, pgml.predict_proba(
    'Breast Cancer Detection',
    ARRAY[
        "mean radius",
        "mean texture",
        "mean perimeter",
        "mean area",
        "mean smoothness",
        "mean compactness",
        "mean concavity",
        "mean concave points",
        "mean symmetry",
        "mean fractal dimension",
        "radius error",
        "texture error",
        "perimeter error",
        "area error",
        "smoothness error",
        "compactness error",
        "concavity error",
        "concave points error",
        "symmetry error",
        "fractal dimension error",
        "worst radius",
        "worst texture",
        "worst perimeter",
        "worst area",
        "worst smoothness",
        "worst compactness",
        "worst concavity",
        "worst concave points",
        "worst symmetry",
        "worst fractal dimension"
    ]
) AS prediction
FROM pgml.breast_cancer
LIMIT 10;

--
-- After a project has been trained, omitted parameters will be reused from previous training runs
-- In these examples we'll reuse the training data snapshots from the initial call.
--

-- linear models
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'ridge');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'stochastic_gradient_descent');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'perceptron');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'passive_aggressive');

-- support vector machines
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'svm');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'nu_svm');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'linear_svm');

-- ensembles
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'ada_boost');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'bagging');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'extra_trees', hyperparams => '{"n_estimators": 10}');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'gradient_boosting_trees', hyperparams => '{"n_estimators": 10}');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'random_forest', hyperparams => '{"n_estimators": 10}');

-- other
-- Gaussian Process is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'gaussian_process', hyperparams => '{"max_iter_predict": 100, "warm_start": true}');

-- Gradient Boosting
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'xgboost', hyperparams => '{"n_estimators": 10}');
-- SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'xgboost_random_forest', hyperparams => '{"n_estimators": 10}');
-- SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'lightgbm', hyperparams => '{"n_estimators": 1}');
-- Histogram Gradient Boosting is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'hist_gradient_boosting', hyperparams => '{"max_iter": 2}');


-- check out all that hard work
SELECT trained_models.* FROM pgml.trained_models 
JOIN pgml.models on models.id = trained_models.id
ORDER BY models.metrics->>'f1' DESC LIMIT 5;

-- deploy the random_forest model for prediction use
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'most_recent', 'random_forest');
-- check out that throughput
SELECT * FROM pgml.deployed_models ORDER BY deployed_at DESC LIMIT 5;

-- do a hyperparam search on your favorite algorithm
SELECT pgml.train(
    'Breast Cancer Detection', 
    algorithm => 'xgboost', 
    search => 'grid', 
    search_params => '{
        "n_estimators": [2, 4],
        "max_depth": [1, 2, 3]
    }'
);

-- deploy the "best" model for prediction use
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'best_score');
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'most_recent');
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'rollback');
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'best_score', 'svm');

-- check out the improved predictions
SELECT malignant, pgml.predict(
    'Breast Cancer Detection', 
    (
        "mean radius", 
        "mean texture", 
        "mean perimeter", 
        "mean area",
        "mean smoothness",
        "mean compactness",
        "mean concavity",
        "mean concave points",
        "mean symmetry",
        "mean fractal dimension",
        "radius error",
        "texture error",
        "perimeter error",
        "area error",
        "smoothness error",
        "compactness error",
        "concavity error",
        "concave points error",
        "symmetry error",
        "fractal dimension error",
        "worst radius",
        "worst texture",
        "worst perimeter",
        "worst area",
        "worst smoothness",
        "worst compactness",
        "worst concavity",
        "worst concave points",
        "worst symmetry",
        "worst fractal dimension"
    )
) AS prediction
FROM pgml.breast_cancer 
LIMIT 10;

Regression

Example

SELECT pgml.load_dataset('diabetes');

-- view the dataset
SELECT * FROM pgml.diabetes LIMIT 10;

-- train a simple model on the data
SELECT * FROM pgml.train('Diabetes Progression', 'regression', 'pgml.diabetes', 'target');

-- check out the predictions
SELECT target, pgml.predict('Diabetes Progression', (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)) AS prediction
FROM pgml.diabetes 
LIMIT 10;

-- Check predictions against a specific model id
SELECT model_id, target, pgml.predict(model_id, (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)) AS prediction
FROM pgml.diabetes
CROSS JOIN LATERAL (
    SELECT pgml.models.id AS model_id FROM pgml.models
    INNER JOIN pgml.projects
    ON pgml.models.project_id = pgml.projects.id
    WHERE pgml.projects.name = 'Diabetes Progression'
    LIMIT 1
) models
LIMIT 10;

--
-- After a project has been trained, omitted parameters will be reused from previous training runs
-- In these examples we'll reuse the training data snapshots from the initial call.
--

-- linear models
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'ridge');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'lasso');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'elastic_net');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'least_angle');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'lasso_least_angle');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'orthogonal_matching_pursuit');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'bayesian_ridge');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'automatic_relevance_determination');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'stochastic_gradient_descent');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'passive_aggressive');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'ransac');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'theil_sen', hyperparams => '{"max_iter": 10, "max_subpopulation": 100}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'huber');
-- Quantile Regression too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'quantile');

-- support vector machines
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'svm', hyperparams => '{"max_iter": 100}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'nu_svm', hyperparams => '{"max_iter": 10}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear_svm', hyperparams => '{"max_iter": 100}');

-- ensembles
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'ada_boost', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'bagging', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'extra_trees', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'gradient_boosting_trees', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'random_forest', hyperparams => '{"n_estimators": 5}');

-- other
-- Kernel Ridge is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'kernel_ridge');
-- Gaussian Process is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'gaussian_process');

-- gradient boosting
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost', hyperparams => '{"n_estimators": 10}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'catboost', hyperparams => '{"n_estimators": 10}');
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost_random_forest', hyperparams => '{"n_estimators": 10}');
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'lightgbm', hyperparams => '{"n_estimators": 1}');
-- Histogram Gradient Boosting is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'hist_gradient_boosting', hyperparams => '{"max_iter": 10}');

-- runtimes
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear', runtime => 'python');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear', runtime => 'rust');

--SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost', runtime => 'python', hyperparams => '{"n_estimators": 1}'); -- too slow
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost', runtime => 'rust', hyperparams => '{"n_estimators": 10}');

-- check out all that hard work
SELECT trained_models.* FROM pgml.trained_models 
JOIN pgml.models on models.id = trained_models.id
ORDER BY models.metrics->>'mean_squared_error' DESC LIMIT 5;

-- deploy the random_forest model for prediction use
SELECT * FROM pgml.deploy('Diabetes Progression', 'most_recent', 'random_forest');
-- check out that throughput
SELECT * FROM pgml.deployed_models ORDER BY deployed_at DESC LIMIT 5;

-- do a hyperparam search on your favorite algorithm
SELECT pgml.train(
    'Diabetes Progression', 
    algorithm => 'xgboost',
    hyperparams => '{"eval_metric": "rmse"}'::JSONB,
    search => 'grid', 
    search_params => '{
        "max_depth": [1, 2], 
        "n_estimators": [20, 40]
    }'
);

-- deploy the "best" model for prediction use
SELECT * FROM pgml.deploy('Diabetes Progression', 'best_score');
SELECT * FROM pgml.deploy('Diabetes Progression', 'most_recent');
SELECT * FROM pgml.deploy('Diabetes Progression', 'rollback');
SELECT * FROM pgml.deploy('Diabetes Progression', 'best_score', 'svm');

-- check out the improved predictions
SELECT target, pgml.predict('Diabetes Progression', (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)) AS prediction
FROM pgml.diabetes 
LIMIT 10;

Clustering

Example

-- This example trains models on the sklearn digits dataset
-- which is a copy of the test set of the UCI ML hand-written digits datasets
-- https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
--
-- This demonstrates using a table with a single array feature column
-- for clustering. You could do something similar with a vector column
--

-- Exit on error (psql)
-- \set ON_ERROR_STOP true
\timing on

SELECT pgml.load_dataset('digits');

-- create an unlabeled table of the images for unsupervised learning
CREATE VIEW pgml.digit_vectors AS
SELECT image FROM pgml.digits;

-- view the dataset
SELECT left(image::text, 40) || ',...}' FROM pgml.digit_vectors LIMIT 10;

-- train a simple model to classify the data
SELECT * FROM pgml.train('Handwritten Digit Clusters', 'clustering', 'pgml.digit_vectors', hyperparams => '{"n_clusters": 10}');

-- check out the predictions
SELECT target, pgml.predict('Handwritten Digit Clusters', image) AS prediction
FROM pgml.digits
LIMIT 10;

SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'affinity_propagation');
SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'birch', hyperparams => '{"n_clusters": 10}');
SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'kmeans', hyperparams => '{"n_clusters": 10}');
SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'mini_batch_kmeans', hyperparams => '{"n_clusters": 10}');

Dimensionality reduction

Example

-- This example reduces the dimensionality of images in the sklearn digits dataset
-- which is a copy of the test set of the UCI ML hand-written digits datasets
-- https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
--
-- This demonstrates using a table with a single array feature column
-- for decomposition to reduce dimensionality.
--
-- Exit on error (psql)
-- \set ON_ERROR_STOP true
\timing on

SELECT pgml.load_dataset('digits');

-- view the dataset
SELECT left(image::text, 40) || ',...}', target FROM pgml.digits LIMIT 10;

-- create a view of just the vectors for decomposition, without any labels
CREATE VIEW digit_vectors AS
SELECT image FROM pgml.digits;

SELECT * FROM pgml.train('Handwritten Digits Reduction', 'decomposition', 'digit_vectors');

-- check out the decomposed vectors
SELECT target, pgml.decompose('Handwritten Digits Reduction', image) AS pca
FROM pgml.digits
LIMIT 10;

--
-- After a project has been trained, omitted parameters will be reused from previous training runs
-- In these examples we'll reuse the training data snapshots from the initial call.
--

-- We can reduce the image vectors from 64 dimensions to 3 components
SELECT * FROM pgml.train('Handwritten Digits Reduction', hyperparams => '{"n_components": 3}');

-- check out the reduced vectors
SELECT target, pgml.decompose('Handwritten Digits Reduction', image) AS pca
FROM pgml.digits
LIMIT 10;

-- check out all that hard work
SELECT trained_models.* FROM pgml.trained_models
                                 JOIN pgml.models on models.id = trained_models.id
ORDER BY models.metrics->>'cumulative_explained_variance' DESC LIMIT 5;

-- deploy the PCA model for prediction use
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'most_recent', 'pca');
-- check out that throughput
SELECT * FROM pgml.deployed_models ORDER BY deployed_at DESC LIMIT 5;

-- deploy the "best" model for prediction use
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'best_score');
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'most_recent');
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'rollback');
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'best_score', 'pca');

-- check out the improved predictions
SELECT target, pgml.predict('Handwritten Digits Reduction', image) AS prediction
FROM pgml.digits
LIMIT 10;

Preprocessing

Example

-- load the diamonds dataset, that contains text categorical variables
SELECT pgml.load_dataset('jdxcosta/diamonds');

-- view the data
SELECT * FROM pgml."jdxcosta/diamonds" LIMIT 10;

-- drop the "Unnamed: 0" column, since it's not useful for training (you could create a view instead)
ALTER TABLE pgml."jdxcosta/diamonds" DROP COLUMN "Unnamed: 0";

-- train a model using preprocessors to scale the numeric variables, and target encode the categoricals
SELECT pgml.train(
       project_name => 'Diamond prices',
       task => 'regression',
       relation_name => 'pgml.jdxcosta/diamonds',
       y_column_name => 'price',
       algorithm => 'lightgbm',
       preprocess => '{
                      "carat": {"scale": "standard"},
                      "depth": {"scale": "standard"},
                      "table": {"scale": "standard"},
                      "cut": {"encode": "target", "scale": "standard"},
                      "color": {"encode": "target", "scale": "standard"},
                      "clarity": {"encode": "target", "scale": "standard"}
                  }'
);

-- run some predictions, notice we're passing a heterogeneous row (tuple) as input, rather than a homogeneous ARRAY[].
SELECT price, pgml.predict('Diamond prices', (carat, cut, color, clarity, depth, "table", x, y, z)) AS prediction
FROM pgml."jdxcosta/diamonds"
LIMIT 10;

-- This is a difficult dataset for most algorithms, which makes it a good challenge for preprocessing and additional
-- feature engineering. What's next?