
AnalyticDB: Use machine learning

Last Updated: Nov 13, 2024

AnalyticDB for PostgreSQL V7.0 provides the in-database AI/ML feature that integrates Hugging Face models for AI, and the XGBoost, LightGBM, CatBoost, Linfa, and all scikit-learn algorithms for machine learning. You can use this feature to perform AI or machine learning training, fine-tuning, deployment, and inference on data in tables or views. This topic describes how to use machine learning to perform training, deployment, and inference in AnalyticDB for PostgreSQL V7.0.

Training

The pgml extension provides a user-defined function (UDF) named pgml.train that you can call to perform training. This operation is supported only on AnalyticDB for PostgreSQL coordinator nodes.

Syntax

CREATE FUNCTION pgml.train(
    "project_name" TEXT,
    "task" TEXT DEFAULT NULL,
    "relation_name" TEXT DEFAULT NULL,
    "y_column_name" TEXT DEFAULT NULL,
    "algorithm" pgml.Algorithm DEFAULT 'linear',
    "hyperparams" jsonb DEFAULT '{}',
    "search" pgml.Search DEFAULT NULL,
    "search_params" jsonb DEFAULT '{}',
    "search_args" jsonb DEFAULT '{}',
    "test_size" real DEFAULT 0.25,
    "test_sampling" pgml.Sampling DEFAULT 'stratified',
    "runtime" pgml.Runtime DEFAULT NULL,
    "automatic_deploy" bool DEFAULT true,
    "materialize_snapshot" bool DEFAULT false, 
    "preprocess" jsonb DEFAULT '{}'
) RETURNS TABLE (
    "project" TEXT,
    "task" TEXT,
    "algorithm" TEXT,
    "deployed" bool
)
EXECUTE ON COORDINATOR
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'train_wrapper';

Parameters

Parameter | Description | Example
--- | --- | ---
project_name | The name of the project. | 'Breast Cancer Detection'
task | The name of the task, such as 'classification', 'regression', 'clustering', or 'decomposition'. | 'classification'
relation_name | The name of the training table. You can specify the name of a table in a schema other than the current schema in the SCHEMA.TABLE format. | 'pgml.breast_cancer'
y_column_name | The name of the label (target) column in the training table. | 'malignant'
algorithm | The name of the algorithm. For information about the supported algorithms, see the pgml.algorithm enumeration type table in this topic. | 'xgboost'
hyperparams | The hyperparameters that you want to pass to the algorithm for training, in the JSON format. | {"n_estimators": 25, "nthread": 4}
search | The search method that you want to use to obtain the optimal set of hyperparameters. For information about the supported search methods, see the pgml.search enumeration type table in this topic. | 'grid'
search_params | The hyperparameters and the corresponding value ranges to explore during the hyperparameter search. For more information, see the official documentation of the algorithm that you use, such as the scikit-learn or XGBoost documentation. | {"max_depth": [1, 2, 3, 4]}
search_args | The supplementary parameters of the search method. For more information, see the official documentation of the algorithm that you use, such as the scikit-learn or XGBoost documentation. | {"n_iter": 10}
test_size | The ratio of data that you want to use as the test set. | 0.25
test_sampling | The sampling method that you want to use to retrieve data for the test set. For information about the supported sampling methods, see the pgml.sampling enumeration type table in this topic. | 'random'
runtime | The runtime environment in which the algorithm is implemented. Valid values: Python and Rust. | 'python'
automatic_deploy | Specifies whether to enable automatic deployment. | true
materialize_snapshot | Specifies whether to materialize the training data snapshot. | true
preprocess | The preprocessing steps that you want to perform. | {"col_name": {"impute": "mean", "scale": "standard"}}

Note

If you do not specify the number of cores that you want to use for training, the task uses a single core by default. You can use the hyperparams parameter to specify the hyperparameters that you want to pass to the algorithm for training. For more information, see the parameter documentation of the training model. For example, you can use the nthread parameter to specify the number of cores that you want to use for training in the XGBoost model.
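The following example shows how hyperparameters are passed through the hyperparams parameter. It is a minimal sketch that assumes the pgml.breast_cancer demo table used in the Examples section of this topic; the n_estimators and nthread values are illustrative only.

SELECT * FROM pgml.train(
    project_name  => 'Breast Cancer Detection',
    task          => 'classification',
    relation_name => 'pgml.breast_cancer',
    y_column_name => 'malignant',
    algorithm     => 'xgboost',
    hyperparams   => '{"n_estimators": 25, "nthread": 4}'  -- nthread sets the number of training cores for XGBoost
);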

Custom enumeration types of the pgml extension

The following tables describe the algorithms, sampling methods, and runtime environments supported by the pgml.algorithm, pgml.sampling, and pgml.runtime custom enumeration types.

  • The following table describes the algorithms supported by the pgml.algorithm custom enumeration type.

    Algorithm | Description
    --- | ---
    linear | The linear model that is used for regression analysis.
    xgboost | The optimized distributed gradient boosting library.
    xgboost_random_forest | The XGBoost version that is specially tuned for the random forest algorithm.
    svm | The Support Vector Machine (SVM) algorithm that is used for classification and regression.
    lasso | The algorithm that is used for regression analysis in statistics and machine learning.
    elastic_net | The elastic net regression algorithm that combines Lasso (L1) and Ridge (L2) regularization.
    ridge | The ridge regression algorithm that uses L2 regularization to prevent overfitting.
    kmeans | The clustering algorithm.
    dbscan | The density-based clustering algorithm that can identify clusters of various shapes.
    knn | The k-nearest neighbors (KNN) algorithm that is used for regression.
    random_forest | The ensemble learning algorithm that is used for classification and regression.
    least_angle | The least angle regression (LARS) algorithm that is compatible with the lasso algorithm.
    lasso_least_angle | The lasso least angle regression algorithm that combines features of LARS and lasso.
    orthogonal_matching_pursuit | The greedy algorithm that is used for sparse signal recovery.
    bayesian_ridge | The Bayesian ridge regression algorithm that estimates regression parameters by using the prior distribution.
    automatic_relevance_determination | The Bayesian regression algorithm.
    stochastic_gradient_descent | The simple and efficient optimization algorithm that is used for regression and classification.
    perceptron | The linear model algorithm that is used for binary classification.
    passive_aggressive | The online learning algorithm that is suitable for large-scale learning.
    ransac | The random sample consensus algorithm that is used for robust linear regression.
    theil_sen | The Theil-Sen estimator that is used for robust linear regression.
    huber | The Huber regression that is robust to outliers.
    quantile | The quantile regression algorithm that predicts conditional quantiles such as the median value.
    kernel_ridge | The kernel ridge regression (KRR) algorithm that uses kernel tricks to learn nonlinear relationships.
    gaussian_process | The Gaussian process algorithm that is used for probabilistic regression and classification.
    nu_svm | The algorithm that is used for classification and regression. This algorithm is a variant of the SVM algorithm.
    ada_boost | The boosting algorithm that is used to resolve classification issues.
    bagging | The ensemble algorithm that is used to reduce overfitting.
    extra_trees | The extra trees algorithm that is an extension of the random forest algorithm.
    gradient_boosting_trees | The boosting algorithm that is used to resolve regression and classification issues.
    hist_gradient_boosting | The gradient boosting algorithm that is suitable for large datasets.
    linear_svm | The linear SVM that is suitable for resolving classification issues.
    lightgbm | The gradient boosting framework that is developed by Microsoft for large-scale data.
    affinity_propagation | The message-passing clustering algorithm that is based on the similarity between data points.
    birch | The clustering algorithm that is suitable for large datasets.
    feature_agglomeration | The clustering algorithm that is used to select features.
    mini_batch_kmeans | The algorithm that is suitable for large or online datasets. This algorithm is a variant of k-means clustering.
    mean_shift | The centroid-based clustering algorithm that can identify any shape of clusters.
    optics | The algorithm that can identify clusters of various densities.
    spectral | The spectral clustering algorithm that uses the spectral properties of data.
    spectral_bi | The spectral bi-clustering algorithm that simultaneously clusters data of two dimensions, such as rows and columns.
    spectral_co | The spectral co-clustering algorithm that simultaneously clusters data of two dimensions.
    catboost | The gradient boosting algorithm that is used for resolving classification and regression issues.
    pca | The algorithm that is used to reduce the dimensionality of data.
  • The following table describes the sampling methods supported by the pgml.sampling custom enumeration type.

    Sampling method | Description
    --- | ---
    random | Performs random sampling on test data.
    last | Performs a sequential scan to retrieve the tail end of test data.
    stratified | Performs stratified sampling on test data.

  • The following table describes the runtime environments supported by the pgml.runtime custom enumeration type.

    Runtime environment | Description
    --- | ---
    Python | The Python runtime environment.
    Rust | The Rust runtime environment.
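The following sketch shows how the enumeration values fit together in one training call. It assumes the pgml.diabetes demo table from the Regression example in this topic; the specific parameter values are illustrative only.

SELECT * FROM pgml.train(
    'Diabetes Progression',
    'regression',
    'pgml.diabetes',
    'target',
    algorithm     => 'xgboost',   -- a pgml.algorithm value
    test_size     => 0.2,         -- 20% of the rows form the test set
    test_sampling => 'random',    -- a pgml.sampling value
    runtime       => 'rust'       -- a pgml.runtime value
);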

Hyperparameter search

You can use hyperparameter search and cross-validation to find the optimal set of hyperparameters for model training. The training parameters related to hyperparameter search include search, search_params, and search_args.

  • The following table describes the search methods supported by the pgml.search custom enumeration type.

    Search method | Description
    --- | ---
    grid | Uses the Cartesian product to train every possible combination of the search hyperparameters specified by the search_params parameter.
    random | Randomly selects parameter combinations from the search hyperparameters specified by the search_params parameter for up to n_iter sampling iterations based on the search_args parameter.

  • The search_params parameter specifies the hyperparameters and corresponding ranges of hyperparameter values that you want to use in the hyperparameter search to obtain the optimal set of hyperparameters. For more information about search hyperparameters, see the official documentation of the model algorithm that you use, such as scikit-learn and XGBoost Documentation.

  • The following table describes the supplementary parameters supported by the search_args parameter.

    Parameter in the JSONB format | Description
    --- | ---
    '{ "n_iter": K }' | The number of iterations that you want the model to perform for hyperparameter search. This parameter is applicable when you set the pgml.search custom enumeration type to random.
    '{ "cv": K }' | The number of folds into which you want to split the dataset for k-fold cross-validation.

    Example

    Perform grid search to train a model by evaluating every combination of the search hyperparameters specified by the search_params parameter. Each combination is trained once per cross-validation fold, so the total number of training runs is len(max_depth) × len(n_estimators) × cv = 3 × 3 × 3 = 27.

    Note
    • For regression tasks, the R2 score is used to select the optimal set of hyperparameters by default.

    • For classification tasks, the F1 score is used to select the optimal set of hyperparameters by default.

    SELECT * FROM pgml.train(
        'Handwritten Digit Image Classifier', 
        'classification',
        'pgml.digits',
        'target',
        algorithm => 'xgboost', 
        search => 'grid',
        search_params => '{
            "max_depth": [1, 3, 5], 
            "n_estimators": [20, 80, 160]
        }',
        search_args => '{ "cv": 3 }'
    );
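
    For comparison, the following sketch uses random search: instead of evaluating every combination, up to n_iter combinations are sampled from the ranges in search_params. The specific parameter values are illustrative only.

    SELECT * FROM pgml.train(
        'Handwritten Digit Image Classifier',
        'classification',
        'pgml.digits',
        'target',
        algorithm     => 'xgboost',
        search        => 'random',
        search_params => '{
            "max_depth": [1, 3, 5, 7],
            "n_estimators": [20, 40, 80, 160]
        }',
        search_args   => '{ "n_iter": 10, "cv": 3 }'  -- sample 10 combinations, 3-fold cross-validation
    );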

Preprocessing

You can specify the preprocess parameter in the training function to preprocess data. The preprocessor configuration covers all columns of the training data in one place. The pgml extension supports the following data preprocessing types.

Note

After training, the information about the preprocessor is saved in the pgml.snapshots metadata table. When you subsequently invoke the pgml.predict() function, the same preprocessing steps that you defined during training are automatically applied to the new prediction data.
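To verify what was stored, you can query the metadata table directly. This is a minimal sketch; the pgml.snapshots table is described above, and the id column used for ordering is an assumption that may differ between versions.

SELECT *
FROM pgml.snapshots
ORDER BY id DESC   -- id is assumed to identify the snapshot; newest first
LIMIT 5;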

Enumeration type encoding

Converts enumeration types into numeric types.

Encoding method | Description
--- | ---
native | Converts a non-text type into a 32-bit floating-point type. This method is the default method.
target | Converts categorical variables into numeric values by replacing each category with the average value of the target variable for that category.
one_hot | Converts enumeration types into the one-hot encoding format.
ordinal | Converts categorical data into numeric values by assigning each category a unique integer based on the index specified in a predefined ordinal array. By default, NULL values are assigned an index of 0.

Missing value imputation

Replaces NULL or NaN values with numeric values.

Imputation method | Description
--- | ---
error | Returns an error when NULL or NaN values are detected during training or inference. This method is the default method.
mean | Replaces NULL or NaN values with the average value in the training data.
median | Replaces NULL or NaN values with the median value in the training data.
mode | Replaces NULL or NaN values with the most frequently occurring value in the training data.
min | Replaces NULL or NaN values with the minimum value in the training data.
max | Replaces NULL or NaN values with the maximum value in the training data.
zero | Replaces NULL or NaN values with 0.

Scaling

Scales all variables to a standard range to help improve the generalization of the model.

Scaling method | Description
--- | ---
preserve | Does not perform scaling. This method is the default method.
standard | Standardizes features by removing the average value and scaling features to unit variance.
min-max | Scales features to the range between the specified minimum and maximum values. In most cases, the range is between 0 and 1.
max-abs | Scales data by dividing each feature by the maximum absolute value, which ensures that the maximum absolute value of each feature in the training set is 1.
robust | Scales data based on the range defined by the first and third quartiles of statistics. Compared with other methods, this method is more robust in handling outliers.
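The following sketch combines the three preprocessing types on a per-column basis. The project, table, and column names ('Preprocessing Demo', 'pgml.demo_table', 'category', 'amount') are hypothetical; see the Preprocessing example at the end of this topic for a runnable end-to-end version.

-- Encode the categorical "category" column with target encoding;
-- impute missing values in the numeric "amount" column with the mean and standardize its scale.
SELECT * FROM pgml.train(
    project_name  => 'Preprocessing Demo',   -- hypothetical project
    task          => 'regression',
    relation_name => 'pgml.demo_table',      -- hypothetical table
    y_column_name => 'target',
    algorithm     => 'xgboost',
    preprocess    => '{
        "category": {"encode": "target"},
        "amount":   {"impute": "mean", "scale": "standard"}
    }'
);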

Deployment

During deployment, the training project information, preprocessing methods, and model parameters are stored in metadata tables. For subsequent inference, you only need to specify the project name or model ID, which simplifies the inference process.

The following table describes the deployment policies supported by the pgml.strategy custom enumeration type.

Deployment policy | Description
--- | ---
best_score | Deploys the model that achieves the best metrics.
most_recent | Deploys the most recently trained model and ignores the information about metrics.
rollback | Rolls back to the previous deployment of the current model.

The automatic_deploy parameter controls whether deployment is performed automatically during training. The default value of this parameter is true. When a project is trained multiple times, the optimal model is automatically selected for deployment: for classification tasks, the model that has the highest F1 score is selected by default, and for regression tasks, the model that has the highest R2 score is selected by default. You can also use the following UDFs to manually deploy a trained model.

  • Forcibly deploy a model specified by the model_id parameter.

    CREATE FUNCTION pgml."deploy"(
            "model_id" bigint
    ) RETURNS TABLE (
            "project" TEXT,
            "strategy" TEXT,
            "algorithm" TEXT
    )
    STRICT EXECUTE ON COORDINATOR
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'deploy_model_wrapper';
  • Deploy a model that matches the specified algorithm and deployment policy.

    CREATE FUNCTION pgml."deploy"(
            "project_name" TEXT,
            "strategy" pgml.Strategy,
            "algorithm" pgml.Algorithm DEFAULT NULL
    ) RETURNS TABLE (
            "project" TEXT,
            "strategy" TEXT,
            "algorithm" TEXT
    )
    EXECUTE ON COORDINATOR
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'deploy_strategy_wrapper';
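
The following sketch shows both manual deployment paths. The model ID (42) is a placeholder; replace it with an ID from the pgml.models metadata table. The project name and algorithm match the Classification example later in this topic.

-- Forcibly deploy a specific model by ID (42 is a placeholder).
SELECT * FROM pgml.deploy(42);

-- Deploy the best-scoring xgboost model of a project.
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'best_score', 'xgboost');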

Inference

Perform inference on a single data record

You can use the predict() function to perform inference on a single data record based on the project name or model ID. The return value is of the FLOAT4 type: for classification tasks, it is the predicted class or category of the input data; for regression tasks, it is the numeric value predicted from the input features. The following UDF signatures cover the different scenarios, and a usage example follows the list.

Note

The model identified based on a project name is the most recently deployed model for that project.

  • Perform inference on the data rows of a table based on the project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "row" anyelement
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_row_wrapper';
  • Perform inference on the data rows of a table based on the model ID.

    CREATE FUNCTION pgml."predict"(
            "model_id" bigint,
            "row" anyelement
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_row_wrapper';
  • Perform inference on FLOAT4 features based on the project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" real[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_f32_wrapper';
  • Perform inference on FLOAT4 features based on the model ID.

    CREATE FUNCTION pgml."predict"(
            "model_id" bigint,
            "features" real[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_wrapper';
  • Perform inference on BIGINT features based on the project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" bigint[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_i64_wrapper';
  • Perform inference on INT features based on the project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" INT[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_i32_wrapper';
  • Perform inference on SMALLINT features based on the project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" smallint[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_i16_wrapper';
  • Perform inference on FLOAT8 features based on the project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" double precision[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_f64_wrapper';
  • Perform inference on BOOL features based on the project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" bool[]
    ) RETURNS real /* f32 */
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_bool_wrapper';
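
The following sketch shows the two most common call patterns, based on the Diabetes Progression project from the Regression example later in this topic: passing a table row as a composite value, and passing the features as a real[] array (the literal feature values are placeholders).

-- Predict from table rows by passing a composite (tuple) of the feature columns.
SELECT target,
       pgml.predict('Diabetes Progression', (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)) AS prediction
FROM pgml.diabetes
LIMIT 5;

-- Predict a single record from a real[] feature array (placeholder values).
SELECT pgml.predict(
    'Diabetes Progression',
    ARRAY[0.04, 0.05, 0.06, 0.02, -0.04, -0.03, -0.04, 0.00, 0.02, -0.02]::real[]
) AS prediction;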

Perform inference on multiple data records at a time

You can perform inference on multiple sets of FLOAT4 features based on the project name or model ID by using the predict_batch(), predict_proba(), or predict_joint() function.

Note

The multiple sets of features must be flattened into a one-dimensional array. The return value is an array of FLOAT4 values.

predict_batch

In most cases, you can use the predict_batch() function to perform inference on multiple records at a time. The following UDF signatures cover the different scenarios, and a usage example follows the list.

  • Perform inference on the data rows of a table based on the model ID.

    CREATE FUNCTION pgml."predict_batch"(
        "model_id" bigint, /* i64 */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_batch_wrapper';
  • Perform inference on the data rows of a table based on the project name.

    CREATE FUNCTION pgml."predict_batch"(
        "project_name" TEXT, /* &str */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS SETOF real /* f32 */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_batch_wrapper';
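
The following sketch performs batch inference for the Diabetes Progression example. Two records with 10 features each are flattened into a single one-dimensional real[] (the feature values are placeholders), and one prediction is returned per record.

SELECT pgml.predict_batch(
    'Diabetes Progression',
    ARRAY[
        0.04,  0.05,  0.06,  0.02, -0.04, -0.03, -0.04,  0.00,  0.02, -0.02,   -- record 1
       -0.01, -0.04, -0.05, -0.03, -0.01, -0.02,  0.07, -0.04, -0.07, -0.09    -- record 2
    ]::real[]
) AS prediction;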

predict_proba

If you want confidence probabilities to be returned, we recommend that you use the predict_proba() function. The predict_proba() function can be used in a manner similar to the predict_batch function except that the predict_proba() function returns the confidence probability of each class.

  • Perform inference on the data rows of a table based on the model ID.

    CREATE FUNCTION pgml."predict_proba"(
        "model_id" bigint, /* i64 */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_proba_wrapper';
  • Perform inference on the data rows of a table based on the project name.

    CREATE FUNCTION pgml."predict_proba"(
        "project_name" TEXT, /* &str */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_proba_wrapper';
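
The following is a hedged sketch that assumes the 'Handwritten Digit Image Classifier' project from the hyperparameter search example above, and that the image column of pgml.digits can be cast to real[]; it returns the confidence probability of each digit class for every image.

SELECT target,
       pgml.predict_proba('Handwritten Digit Image Classifier', image::real[]) AS class_probabilities
FROM pgml.digits
LIMIT 10;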

predict_joint

To perform inference on multi-label columns, use the predict_joint() function. This function does not support classification tasks. The predict_joint() function returns multi-dimensional prediction results for each set of features.

  • Perform inference on multiple sets of FLOAT4 features based on the model ID.

     CREATE FUNCTION pgml."predict_joint"(
        "model_id" bigint, /* i64 */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_joint_wrapper';
    
  • Perform inference on multiple sets of FLOAT4 features based on the project name.

    CREATE FUNCTION pgml."predict_joint"(
        "project_name" TEXT, /* &str */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_joint_wrapper';
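
The following is a hedged sketch only: 'Multi Output Regression' is a hypothetical project whose model predicts several target values per record, and the feature values are placeholders. The returned real[] contains one value per target dimension.

SELECT pgml.predict_joint(
    'Multi Output Regression',            -- hypothetical multi-target project
    ARRAY[0.1, 0.2, 0.3, 0.4]::real[]     -- placeholder feature values
) AS predictions;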

Examples

The following examples show complete SQL syntax that is used to perform classification, regression, clustering, dimensionality reduction, and preprocessing.

Classification

Example

-- Exit on error (psql)
-- \set ON_ERROR_STOP true
\timing on

SELECT pgml.load_dataset('breast_cancer');

-- view the dataset
SELECT * FROM pgml.breast_cancer LIMIT 10;

-- train a simple model to classify the data
SELECT * FROM pgml.train('Breast Cancer Detection', 'classification', 'pgml.breast_cancer', 'malignant');

-- check out the predictions
SELECT malignant, pgml.predict(
    'Breast Cancer Detection', 
    (
        "mean radius", 
        "mean texture", 
        "mean perimeter", 
        "mean area",
        "mean smoothness",
        "mean compactness",
        "mean concavity",
        "mean concave points",
        "mean symmetry",
        "mean fractal dimension",
        "radius error",
        "texture error",
        "perimeter error",
        "area error",
        "smoothness error",
        "compactness error",
        "concavity error",
        "concave points error",
        "symmetry error",
        "fractal dimension error",
        "worst radius",
        "worst texture",
        "worst perimeter",
        "worst area",
        "worst smoothness",
        "worst compactness",
        "worst concavity",
        "worst concave points",
        "worst symmetry",
        "worst fractal dimension"
    )
) AS prediction
FROM pgml.breast_cancer
LIMIT 10;

-- view raw class probabilities
SELECT malignant, pgml.predict_proba(
    'Breast Cancer Detection',
    ARRAY[
        "mean radius",
        "mean texture",
        "mean perimeter",
        "mean area",
        "mean smoothness",
        "mean compactness",
        "mean concavity",
        "mean concave points",
        "mean symmetry",
        "mean fractal dimension",
        "radius error",
        "texture error",
        "perimeter error",
        "area error",
        "smoothness error",
        "compactness error",
        "concavity error",
        "concave points error",
        "symmetry error",
        "fractal dimension error",
        "worst radius",
        "worst texture",
        "worst perimeter",
        "worst area",
        "worst smoothness",
        "worst compactness",
        "worst concavity",
        "worst concave points",
        "worst symmetry",
        "worst fractal dimension"
    ]
) AS prediction
FROM pgml.breast_cancer
LIMIT 10;

--
-- After a project has been trained, omitted parameters will be reused from previous training runs
-- In these examples we'll reuse the training data snapshots from the initial call.
--

-- linear models
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'ridge');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'stochastic_gradient_descent');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'perceptron');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'passive_aggressive');

-- support vector machines
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'svm');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'nu_svm');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'linear_svm');

-- ensembles
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'ada_boost');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'bagging');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'extra_trees', hyperparams => '{"n_estimators": 10}');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'gradient_boosting_trees', hyperparams => '{"n_estimators": 10}');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'random_forest', hyperparams => '{"n_estimators": 10}');

-- other
-- Gaussian Process is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'gaussian_process', hyperparams => '{"max_iter_predict": 100, "warm_start": true}');

-- Gradient Boosting
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'xgboost', hyperparams => '{"n_estimators": 10}');
-- SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'xgboost_random_forest', hyperparams => '{"n_estimators": 10}');
-- SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'lightgbm', hyperparams => '{"n_estimators": 1}');
-- Histogram Gradient Boosting is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'hist_gradient_boosting', hyperparams => '{"max_iter": 2}');


-- check out all that hard work
SELECT trained_models.* FROM pgml.trained_models 
JOIN pgml.models on models.id = trained_models.id
ORDER BY models.metrics->>'f1' DESC LIMIT 5;

-- deploy the random_forest model for prediction use
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'most_recent', 'random_forest');
-- check out that throughput
SELECT * FROM pgml.deployed_models ORDER BY deployed_at DESC LIMIT 5;

-- do a hyperparam search on your favorite algorithm
SELECT pgml.train(
    'Breast Cancer Detection', 
    algorithm => 'xgboost', 
    search => 'grid', 
    search_params => '{
        "n_estimators": [2, 4],
        "max_depth": [1, 2, 3]
    }'
);

-- deploy the "best" model for prediction use
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'best_score');
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'most_recent');
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'rollback');
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'best_score', 'svm');

-- check out the improved predictions
SELECT malignant, pgml.predict(
    'Breast Cancer Detection', 
    (
        "mean radius", 
        "mean texture", 
        "mean perimeter", 
        "mean area",
        "mean smoothness",
        "mean compactness",
        "mean concavity",
        "mean concave points",
        "mean symmetry",
        "mean fractal dimension",
        "radius error",
        "texture error",
        "perimeter error",
        "area error",
        "smoothness error",
        "compactness error",
        "concavity error",
        "concave points error",
        "symmetry error",
        "fractal dimension error",
        "worst radius",
        "worst texture",
        "worst perimeter",
        "worst area",
        "worst smoothness",
        "worst compactness",
        "worst concavity",
        "worst concave points",
        "worst symmetry",
        "worst fractal dimension"
    )
) AS prediction
FROM pgml.breast_cancer 
LIMIT 10;

Regression

Example

SELECT pgml.load_dataset('diabetes');

-- view the dataset
SELECT * FROM pgml.diabetes LIMIT 10;

-- train a simple model on the data
SELECT * FROM pgml.train('Diabetes Progression', 'regression', 'pgml.diabetes', 'target');

-- check out the predictions
SELECT target, pgml.predict('Diabetes Progression', (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)) AS prediction
FROM pgml.diabetes 
LIMIT 10;

-- Check predictions against a specific model id
SELECT model_id, target, pgml.predict(model_id, (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)) AS prediction
FROM pgml.diabetes
CROSS JOIN LATERAL (
    SELECT pgml.models.id AS model_id FROM pgml.models
    INNER JOIN pgml.projects
    ON pgml.models.project_id = pgml.projects.id
    WHERE pgml.projects.name = 'Diabetes Progression'
    LIMIT 1
) models
LIMIT 10;

--
-- After a project has been trained, omitted parameters will be reused from previous training runs
-- In these examples we'll reuse the training data snapshots from the initial call.
--

-- linear models
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'ridge');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'lasso');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'elastic_net');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'least_angle');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'lasso_least_angle');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'orthogonal_matching_pursuit');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'bayesian_ridge');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'automatic_relevance_determination');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'stochastic_gradient_descent');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'passive_aggressive');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'ransac');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'theil_sen', hyperparams => '{"max_iter": 10, "max_subpopulation": 100}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'huber');
-- Quantile Regression too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'quantile');

-- support vector machines
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'svm', hyperparams => '{"max_iter": 100}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'nu_svm', hyperparams => '{"max_iter": 10}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear_svm', hyperparams => '{"max_iter": 100}');

-- ensembles
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'ada_boost', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'bagging', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'extra_trees', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'gradient_boosting_trees', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'random_forest', hyperparams => '{"n_estimators": 5}');

-- other
-- Kernel Ridge is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'kernel_ridge');
-- Gaussian Process is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'gaussian_process');

-- gradient boosting
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost', hyperparams => '{"n_estimators": 10}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'catboost', hyperparams => '{"n_estimators": 10}');
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost_random_forest', hyperparams => '{"n_estimators": 10}');
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'lightgbm', hyperparams => '{"n_estimators": 1}');
-- Histogram Gradient Boosting is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'hist_gradient_boosting', hyperparams => '{"max_iter": 10}');

-- runtimes
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear', runtime => 'python');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear', runtime => 'rust');

--SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost', runtime => 'python', hyperparams => '{"n_estimators": 1}'); -- too slow
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost', runtime => 'rust', hyperparams => '{"n_estimators": 10}');

-- check out all that hard work
SELECT trained_models.* FROM pgml.trained_models 
JOIN pgml.models on models.id = trained_models.id
ORDER BY models.metrics->>'mean_squared_error' DESC LIMIT 5;

-- deploy the random_forest model for prediction use
SELECT * FROM pgml.deploy('Diabetes Progression', 'most_recent', 'random_forest');
-- check out that throughput
SELECT * FROM pgml.deployed_models ORDER BY deployed_at DESC LIMIT 5;

-- do a hyperparam search on your favorite algorithm
SELECT pgml.train(
    'Diabetes Progression', 
    algorithm => 'xgboost',
    hyperparams => '{"eval_metric": "rmse"}'::JSONB,
    search => 'grid', 
    search_params => '{
        "max_depth": [1, 2], 
        "n_estimators": [20, 40]
    }'
);

-- deploy the "best" model for prediction use
SELECT * FROM pgml.deploy('Diabetes Progression', 'best_score');
SELECT * FROM pgml.deploy('Diabetes Progression', 'most_recent');
SELECT * FROM pgml.deploy('Diabetes Progression', 'rollback');
SELECT * FROM pgml.deploy('Diabetes Progression', 'best_score', 'svm');

-- check out the improved predictions
SELECT target, pgml.predict('Diabetes Progression', (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)) AS prediction
FROM pgml.diabetes 
LIMIT 10;

Clustering

Example

-- This example trains models on the sklearn digits dataset
-- which is a copy of the test set of the UCI ML hand-written digits datasets
-- https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
--
-- This demonstrates using a table with a single array feature column
-- for clustering. You could do something similar with a vector column
--

-- Exit on error (psql)
-- \set ON_ERROR_STOP true
\timing on

SELECT pgml.load_dataset('digits');

-- create an unlabeled table of the images for unsupervised learning
CREATE VIEW pgml.digit_vectors AS
SELECT image FROM pgml.digits;

-- view the dataset
SELECT left(image::text, 40) || ',...}' FROM pgml.digit_vectors LIMIT 10;

-- train a simple model to classify the data
SELECT * FROM pgml.train('Handwritten Digit Clusters', 'clustering', 'pgml.digit_vectors', hyperparams => '{"n_clusters": 10}');

-- check out the predictions
SELECT target, pgml.predict('Handwritten Digit Clusters', image) AS prediction
FROM pgml.digits
LIMIT 10;

SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'affinity_propagation');
SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'birch', hyperparams => '{"n_clusters": 10}');
SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'kmeans', hyperparams => '{"n_clusters": 10}');
SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'mini_batch_kmeans', hyperparams => '{"n_clusters": 10}');

Dimensionality reduction

Example

-- This example reduces the dimensionality of images in the sklearn digits dataset
-- which is a copy of the test set of the UCI ML hand-written digits datasets
-- https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
--
-- This demonstrates using a table with a single array feature column
-- for decomposition to reduce dimensionality.
--
-- Exit on error (psql)
-- \set ON_ERROR_STOP true
\timing on

SELECT pgml.load_dataset('digits');

-- view the dataset
SELECT left(image::text, 40) || ',...}', target FROM pgml.digits LIMIT 10;

-- create a view of just the vectors for decomposition, without any labels
CREATE VIEW digit_vectors AS
SELECT image FROM pgml.digits;

SELECT * FROM pgml.train('Handwritten Digits Reduction', 'decomposition', 'digit_vectors');

-- check out the decomposed vectors
SELECT target, pgml.decompose('Handwritten Digits Reduction', image) AS pca
FROM pgml.digits
LIMIT 10;

--
-- After a project has been trained, omitted parameters will be reused from previous training runs
-- In these examples we'll reuse the training data snapshots from the initial call.
--

-- We can reduce the image vectors from 64 dimensions to 3 components
SELECT * FROM pgml.train('Handwritten Digits Reduction', hyperparams => '{"n_components": 3}');

-- check out the reduced vectors
SELECT target, pgml.decompose('Handwritten Digits Reduction', image) AS pca
FROM pgml.digits
LIMIT 10;

-- check out all that hard work
SELECT trained_models.* FROM pgml.trained_models
                                 JOIN pgml.models on models.id = trained_models.id
ORDER BY models.metrics->>'cumulative_explained_variance' DESC LIMIT 5;

-- deploy the PCA model for prediction use
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'most_recent', 'pca');
-- check out that throughput
SELECT * FROM pgml.deployed_models ORDER BY deployed_at DESC LIMIT 5;

-- deploy the "best" model for prediction use
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'best_score');
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'most_recent');
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'rollback');
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'best_score', 'pca');

-- check out the improved predictions
SELECT target, pgml.predict('Handwritten Digits Reduction', image) AS prediction
FROM pgml.digits
LIMIT 10;

Preprocessing

Example

-- load the diamonds dataset, that contains text categorical variables
SELECT pgml.load_dataset('jdxcosta/diamonds');

-- view the data
SELECT * FROM pgml."jdxcosta/diamonds" LIMIT 10;

-- drop the "Unnamed: 0" column, since it's not useful for training (you could create a view instead)
ALTER TABLE pgml."jdxcosta/diamonds" DROP COLUMN "Unnamed: 0";

-- train a model using preprocessors to scale the numeric variables, and target encode the categoricals
SELECT pgml.train(
       project_name => 'Diamond prices',
       task => 'regression',
       relation_name => 'pgml.jdxcosta/diamonds',
       y_column_name => 'price',
       algorithm => 'lightgbm',
       preprocess => '{
                      "carat": {"scale": "standard"},
                      "depth": {"scale": "standard"},
                      "table": {"scale": "standard"},
                      "cut": {"encode": "target", "scale": "standard"},
                      "color": {"encode": "target", "scale": "standard"},
                      "clarity": {"encode": "target", "scale": "standard"}
                  }'
);

-- run some predictions, notice we're passing a heterogeneous row (tuple) as input, rather than a homogeneous ARRAY[].
SELECT price, pgml.predict('Diamond prices', (carat, cut, color, clarity, depth, "table", x, y, z)) AS prediction
FROM pgml."jdxcosta/diamonds"
LIMIT 10;

-- This is a difficult dataset for most algorithms, which makes it a good challenge for preprocessing and additional
-- feature engineering. What's next?