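AnalyticDB PostgreSQL 7.0 supports In-Database AI/ML, which runs data processing and model computation directly inside the database and so reduces the cost of moving data between systems. The feature is implemented by the pgml extension, which is compatible with the interface of the open source PostgresML community and has been optimized for performance, functionality, and ease of use. It supports model training, fine-tuning, deployment, and inference with GPU or CPU acceleration. It integrates mainstream machine learning libraries such as XGBoost, LightGBM, and scikit-learn to help enterprises build intelligent analytics applications efficiently.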
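Prerequisites
The instance is an AnalyticDB PostgreSQL 7.0 instance with kernel version V7.1.1.0 or later.
The instance resource type is elastic storage mode.
The pgml extension is installed.
Note: pgml cannot currently be installed from the console. To install it, submit a ticket and the support staff will assist with the installation. To uninstall the extension, likewise submit a ticket.
The pgml extension cannot currently be installed or used on the Economy Edition of AnalyticDB for PostgreSQL 7.0.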
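Metadata overview
The In-Database AI/ML architecture in AnalyticDB PostgreSQL 7.0 is built on the pgml extension. After the pgml extension is installed on an eligible version, the system automatically creates a schema named pgml. The schema contains the following metadata tables.
Metadata table | Description |
projects | Project information for training tasks. |
models | Information about trained models. |
files | Storage information for model files. |
snapshots | Snapshots of the datasets used for training. |
logs | Log information output during training. |
deployments | Deployment information for trained models. |
When a training task is started, the training information is automatically written to the metadata tables above.
For the pgml custom types used in the metadata tables (such as task, runtime, and sampling), see Machine learning.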
projects
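The projects table records the project ID, project name, task type, creation time, and update time of each training task. The table structure and indexes are as follows.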
Table "pgml.projects"
Column | Type | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+--------------------------------------
id | bigint | | not null | nextval('projects_id_seq'::regclass)
name | text | | not null |
task | task | | not null |
created_at | timestamp without time zone | | not null | clock_timestamp()
updated_at | timestamp without time zone | | not null | clock_timestamp()
Indexes:
"projects_pkey" PRIMARY KEY, btree (id)
"projects_name_idx" btree (name)
Triggers:
projects_auto_updated_at BEFORE UPDATE ON projects FOR EACH ROW EXECUTE FUNCTION set_updated_at()
trigger_before_insert_pgml_projects BEFORE INSERT ON projects FOR EACH ROW EXECUTE FUNCTION trigger_check_pgml_projects()
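Distributed Replicated
models
The models table records the parameters specified for model training, along with the associated project ID and snapshot ID. The table structure and indexes are as follows.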
Table "pgml.models"
Column | Type | Collation | Nullable | Default
---------------+-----------------------------+-----------+----------+------------------------------------
id | bigint | | not null | nextval('models_id_seq'::regclass)
project_id | bigint | | not null |
snapshot_id | bigint | | |
num_features | integer | | not null |
algorithm | text | | not null |
runtime | runtime | | | 'python'::runtime
hyperparams | jsonb | | not null |
status | text | | not null |
metrics | jsonb | | |
search | text | | |
search_params | jsonb | | not null |
search_args | jsonb | | not null |
created_at | timestamp without time zone | | not null | clock_timestamp()
updated_at | timestamp without time zone | | not null | clock_timestamp()
Indexes:
"models_pkey" PRIMARY KEY, btree (id)
"models_project_id_idx" btree (project_id)
"models_snapshot_id_idx" btree (snapshot_id)
Triggers:
models_auto_updated_at BEFORE UPDATE ON models FOR EACH ROW EXECUTE FUNCTION set_updated_at()
trigger_before_insert_pgml_models BEFORE INSERT ON models FOR EACH ROW EXECUTE FUNCTION trigger_check_pgml_models_fk()
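Distributed Replicated
files
After training ends, each file in the model directory is stored in binary form in the data column of the files table. The binary stream of each file is split into slices of 100 MB each. The table structure and indexes are as follows.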
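The slicing behaviour can be illustrated with a minimal Python sketch. It is not the extension's actual implementation. The function names, the zero-based part numbering, and the reading of 100 MB as 100 * 1024 * 1024 bytes are all assumptions made for illustration.

```python
from typing import Iterator, List, Tuple

# Assumption: "100 MB" is read as 100 * 1024 * 1024 bytes.
CHUNK_SIZE = 100 * 1024 * 1024


def split_into_parts(data: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, bytes]]:
    """Yield (part, chunk) pairs; numbering parts from 0 is an assumption."""
    for part, start in enumerate(range(0, len(data), chunk_size)):
        yield part, data[start:start + chunk_size]


def reassemble(rows: List[Tuple[int, bytes]]) -> bytes:
    """Rebuild the original byte stream by concatenating chunks in part order."""
    return b"".join(chunk for _, chunk in sorted(rows, key=lambda row: row[0]))


# Small chunk size so the behaviour is visible without a large file.
rows = list(split_into_parts(b"abcdefghij", chunk_size=4))
restored = reassemble(list(reversed(rows)))  # row order does not matter
```

Each (part, chunk) pair corresponds to one row for a given model_id and path, with the chunk held in the data column. Reading a file back therefore means ordering its rows by part. The definition of the files table is listed next.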
Table "pgml.files"
Column | Type | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+-----------------------------------
id | bigint | | not null | nextval('files_id_seq'::regclass)
model_id | bigint | | not null |
path | text | | not null |
part | integer | | not null |
created_at | timestamp without time zone | | not null | clock_timestamp()
updated_at | timestamp without time zone | | not null | clock_timestamp()
data | bytea | | not null |
Indexes:
"files_pkey" PRIMARY KEY, btree (id)
"files_model_id_path_part_idx" btree (model_id, path, part)
Triggers:
files_auto_updated_at BEFORE UPDATE ON files FOR EACH ROW EXECUTE FUNCTION set_updated_at()
trigger_before_insert_pgml_files BEFORE INSERT ON files FOR EACH ROW EXECUTE FUNCTION trigger_check_pgml_files()
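Distributed Replicated
snapshots
The snapshots table records snapshot information about the dataset used for training, such as the source table name and the test set split settings. The table structure and indexes are as follows.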
Table "pgml.snapshots"
Column | Type | Collation | Nullable | Default
---------------+-----------------------------+-----------+----------+---------------------------------------
id | bigint | | not null | nextval('snapshots_id_seq'::regclass)
relation_name | text | | not null |
y_column_name | text[] | | |
test_size | real | | not null |
test_sampling | sampling | | not null |
status | text | | not null |
columns | jsonb | | |
analysis | jsonb | | |
created_at | timestamp without time zone | | not null | clock_timestamp()
updated_at | timestamp without time zone | | not null | clock_timestamp()
materialized | boolean | | | false
Indexes:
"snapshots_pkey" PRIMARY KEY, btree (id)
Triggers:
snapshots_auto_updated_at BEFORE UPDATE ON snapshots FOR EACH ROW EXECUTE FUNCTION set_updated_at()
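Distributed Replicated
logs
The logs table records the information output during training. A single training task may produce multiple log entries. To read them in sequence, sort by the created_at column in ascending order. The table structure and indexes are as follows.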
Table "pgml.logs"
Column | Type | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+----------------------------------
id | integer | | not null | nextval('logs_id_seq'::regclass)
model_id | bigint | | |
project_id | bigint | | |
created_at | timestamp without time zone | | | CURRENT_TIMESTAMP
logs | jsonb | | |
Indexes:
"logs_pkey" PRIMARY KEY, btree (id)
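Distributed Replicated
deployments
When a model needs to be deployed, the system creates a deployment record that links the project ID, deployment ID, and model ID. The deployments table also records the deployment strategy. The table structure and indexes are as follows.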
Table "pgml.deployments"
Column | Type | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+-----------------------------------------
id | bigint | | not null | nextval('deployments_id_seq'::regclass)
project_id | bigint | | not null |
model_id | bigint | | not null |
strategy | strategy | | not null |
created_at | timestamp without time zone | | not null | clock_timestamp()
Indexes:
"deployments_pkey" PRIMARY KEY, btree (id)
"deployments_model_id_created_at_idx" btree (model_id)
"deployments_project_id_created_at_idx" btree (project_id)
Triggers:
deployments_auto_updated_at BEFORE UPDATE ON deployments FOR EACH ROW EXECUTE FUNCTION set_updated_at()
trigger_before_insert_pgml_deployments BEFORE INSERT ON deployments FOR EACH ROW EXECUTE FUNCTION trigger_check_pgml_deployments_fk()
Distributed Replicated
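The metadata tables reference one another through the project_id and model_id columns, so deployment history can be traced with a join. The following is a minimal sketch rather than an official query. SQLite is used only as a stand-in so that the example runs anywhere. The tables are simplified copies containing only the columns used, and all sample rows, including the project name and strategy value, are made up for illustration.

```python
import sqlite3

# SQLite is only a stand-in (an assumption for runnability); on a real
# instance the same SELECT would be run against the pgml schema directly.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS pgml")

# Simplified copies of three metadata tables, keeping only the columns used below.
conn.executescript("""
CREATE TABLE pgml.projects (id INTEGER PRIMARY KEY, name TEXT NOT NULL, task TEXT NOT NULL);
CREATE TABLE pgml.models (id INTEGER PRIMARY KEY, project_id INTEGER NOT NULL,
                          algorithm TEXT NOT NULL);
CREATE TABLE pgml.deployments (id INTEGER PRIMARY KEY, project_id INTEGER NOT NULL,
                               model_id INTEGER NOT NULL, strategy TEXT NOT NULL,
                               created_at TEXT NOT NULL);
-- Illustrative rows only; every value here is hypothetical.
INSERT INTO pgml.projects VALUES (1, 'demo_project', 'classification');
INSERT INTO pgml.models VALUES (10, 1, 'xgboost'), (11, 1, 'lightgbm');
INSERT INTO pgml.deployments VALUES
    (100, 1, 10, 'most_recent', '2024-01-01 10:00:00'),
    (101, 1, 11, 'most_recent', '2024-01-02 10:00:00');
""")

# List deployments with their project and model, newest first.
DEPLOYMENT_HISTORY_SQL = """
SELECT p.name, d.id, m.id, m.algorithm, d.strategy, d.created_at
FROM pgml.deployments AS d
JOIN pgml.models AS m ON m.id = d.model_id
JOIN pgml.projects AS p ON p.id = d.project_id
ORDER BY d.created_at DESC
"""

rows = conn.execute(DEPLOYMENT_HISTORY_SQL).fetchall()
for row in rows:
    print(row)
```

The SELECT statement is the part intended to carry over. It uses only columns that appear in the table definitions above, and the first row returned is the most recently created deployment.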