All Products
Search
Document Center

AnalyticDB:In-database AI/ML

Last Updated:Dec 04, 2024

AnalyticDB for PostgreSQL V7.0 provides the in-database AI/ML feature. You can process data by using the algorithms and models provided by the feature to reduce data transfer costs. The in-database AI/ML framework is compatible with the open source PostgresML community and performs significant optimization in functionality, performance, and usability to implement training, fine-tuning, deployment, and inference of algorithm models based on GPUs and CPUs. The in-database AI/ML framework is implemented based on the pgml extension. The pgml extension is one of the components in the open source PostgresML community and integrates classic machine learning algorithms, such as XGBoost, LightGBM, and scikit-learn.

Limits

  • Only AnalyticDB for PostgreSQL V7.0 instances of V7.1.1.0 or later support the in-database AI/ML feature.

  • The AnalyticDB for PostgreSQL instances must be in elastic storage mode.

  • The pgml extension must be installed.

    Note

    The pgml extension does not support GUI-based installation. To install the pgml extension, submit a ticket. To uninstall the pgml extension, submit a ticket.

Introduction to metadata

The in-database AI/ML framework in AnalyticDB for PostgreSQL V7.0 is implemented based on the pgml extension. After you install the pgml extension for an AnalyticDB for PostgreSQL instance, the system automatically creates a schema named pgml. The following table describes the metadata tables contained in the pgml schema.

Metadata table

Description

projects

Records information about projects in training tasks.

models

Records information about model training.

files

Records information about stored model files.

snapshots

Records information about the snapshots of training datasets.

logs

Records information about training logs.

deployments

Records information about model deployment.

When a model training task is initiated, training information is automatically written to the preceding metadata tables.

Note

For information about the custom types of the pgml extension in the metadata tables, such as task, runtime, and sampling, see Use machine learning.

projects

The projects table records the project ID, project name, task type, creation time, and update time of training tasks. Table schema and indexes:

                                         Table "pgml.projects"
   Column   |            Type             | Collation | Nullable |               Default                
------------+-----------------------------+-----------+----------+--------------------------------------
 id         | bigint                      |           | not null | nextval('projects_id_seq'::regclass)
 name       | text                        |           | not null | 
 task       | task                        |           | not null | 
 created_at | timestamp without time zone |           | not null | clock_timestamp()
 updated_at | timestamp without time zone |           | not null | clock_timestamp()
Indexes:
    "projects_pkey" PRIMARY KEY, btree (id)
    "projects_name_idx" btree (name)
Triggers:
    projects_auto_updated_at BEFORE UPDATE ON projects FOR EACH ROW EXECUTE FUNCTION set_updated_at()
    trigger_before_insert_pgml_projects BEFORE INSERT ON projects FOR EACH ROW EXECUTE FUNCTION trigger_check_pgml_projects()
Distributed Replicated

models

The models table records the parameters specified for model training and other associated information, such as project ID and snapshot ID. Table schema and indexes:

                                           Table "pgml.models"
    Column     |            Type             | Collation | Nullable |              Default               
---------------+-----------------------------+-----------+----------+------------------------------------
 id            | bigint                      |           | not null | nextval('models_id_seq'::regclass)
 project_id    | bigint                      |           | not null | 
 snapshot_id   | bigint                      |           |          | 
 num_features  | integer                     |           | not null | 
 algorithm     | text                        |           | not null | 
 runtime       | runtime                     |           |          | 'python'::runtime
 hyperparams   | jsonb                       |           | not null | 
 status        | text                        |           | not null | 
 metrics       | jsonb                       |           |          | 
 search        | text                        |           |          | 
 search_params | jsonb                       |           | not null | 
 search_args   | jsonb                       |           | not null | 
 created_at    | timestamp without time zone |           | not null | clock_timestamp()
 updated_at    | timestamp without time zone |           | not null | clock_timestamp()
Indexes:
    "models_pkey" PRIMARY KEY, btree (id)
    "models_project_id_idx" btree (project_id)
    "models_snapshot_id_idx" btree (snapshot_id)
Triggers:
    models_auto_updated_at BEFORE UPDATE ON models FOR EACH ROW EXECUTE FUNCTION set_updated_at()
    trigger_before_insert_pgml_models BEFORE INSERT ON models FOR EACH ROW EXECUTE FUNCTION trigger_check_pgml_models_fk()
Distributed Replicated

files

The files table records post-training model files in the data column in the binary format. Each file is divided into multiple splits of 100 MB each. Table schema and indexes:

                                         Table "pgml.files"
   Column   |            Type             | Collation | Nullable |              Default              
------------+-----------------------------+-----------+----------+-----------------------------------
 id         | bigint                      |           | not null | nextval('files_id_seq'::regclass)
 model_id   | bigint                      |           | not null | 
 path       | text                        |           | not null | 
 part       | integer                     |           | not null | 
 created_at | timestamp without time zone |           | not null | clock_timestamp()
 updated_at | timestamp without time zone |           | not null | clock_timestamp()
 data       | bytea                       |           | not null | 
Indexes:
    "files_pkey" PRIMARY KEY, btree (id)
    "files_model_id_path_part_idx" btree (model_id, path, part)
Triggers:
    files_auto_updated_at BEFORE UPDATE ON files FOR EACH ROW EXECUTE FUNCTION set_updated_at()
    trigger_before_insert_pgml_files BEFORE INSERT ON files FOR EACH ROW EXECUTE FUNCTION trigger_check_pgml_files()
Distributed Replicated

snapshots

The snapshots table records information about the snapshots of training datasets, such as the table name and test set division information. Table schema and indexes:

                                           Table "pgml.snapshots"
    Column     |            Type             | Collation | Nullable |                Default                
---------------+-----------------------------+-----------+----------+---------------------------------------
 id            | bigint                      |           | not null | nextval('snapshots_id_seq'::regclass)
 relation_name | text                        |           | not null | 
 y_column_name | text[]                      |           |          | 
 test_size     | real                        |           | not null | 
 test_sampling | sampling                    |           | not null | 
 status        | text                        |           | not null | 
 columns       | jsonb                       |           |          | 
 analysis      | jsonb                       |           |          | 
 created_at    | timestamp without time zone |           | not null | clock_timestamp()
 updated_at    | timestamp without time zone |           | not null | clock_timestamp()
 materialized  | boolean                     |           |          | false
Indexes:
    "snapshots_pkey" PRIMARY KEY, btree (id)
Triggers:
    snapshots_auto_updated_at BEFORE UPDATE ON snapshots FOR EACH ROW EXECUTE FUNCTION set_updated_at()
Distributed Replicated

logs

The logs table records information about training logs. A training task may contain multiple training logs. You can view the logs based on the values of the created_at column in ascending order. Table schema and indexes:

                                         Table "pgml.logs"
   Column   |            Type             | Collation | Nullable |             Default              
------------+-----------------------------+-----------+----------+----------------------------------
 id         | integer                     |           | not null | nextval('logs_id_seq'::regclass)
 model_id   | bigint                      |           |          | 
 project_id | bigint                      |           |          | 
 created_at | timestamp without time zone |           |          | CURRENT_TIMESTAMP
 logs       | jsonb                       |           |          | 
Indexes:
    "logs_pkey" PRIMARY KEY, btree (id)
Distributed Replicated

deployments

The deployments table records information about model deployment, such as the project ID, deployment ID, and model ID. Table schema and indexes:

                                         Table "pgml.deployments"
   Column   |            Type             | Collation | Nullable |                 Default                 
------------+-----------------------------+-----------+----------+-----------------------------------------
 id         | bigint                      |           | not null | nextval('deployments_id_seq'::regclass)
 project_id | bigint                      |           | not null | 
 model_id   | bigint                      |           | not null | 
 strategy   | strategy                    |           | not null | 
 created_at | timestamp without time zone |           | not null | clock_timestamp()
Indexes:
    "deployments_pkey" PRIMARY KEY, btree (id)
    "deployments_model_id_created_at_idx" btree (model_id)
    "deployments_project_id_created_at_idx" btree (project_id)
Triggers:
    deployments_auto_updated_at BEFORE UPDATE ON deployments FOR EACH ROW EXECUTE FUNCTION set_updated_at()
    trigger_before_insert_pgml_deployments BEFORE INSERT ON deployments FOR EACH ROW EXECUTE FUNCTION trigger_check_pgml_deployments_fk()
Distributed Replicated