
AnalyticDB: Create a document collection

Last Updated: Jul 18, 2025

This topic describes how to create a document collection to store chunks and vector data within AnalyticDB for PostgreSQL.

Sample code

# The following import and the get_client() helper, ADBPG_INSTANCE_REGION, and
# ADBPG_INSTANCE_ID variables are assumed to come from the SDK initialization
# examples for AnalyticDB for PostgreSQL.
from alibabacloud_gpdb20160503 import models as gpdb_20160503_models


def create_document_collection(account,
                               account_password,
                               namespace,
                               collection,
                               metadata: str = None,
                               full_text_retrieval_fields: str = None,
                               parser: str = None,
                               embedding_model: str = None,
                               metrics: str = None,
                               hnsw_m: int = None,
                               pq_enable: int = None,
                               external_storage: int = None):
    request = gpdb_20160503_models.CreateDocumentCollectionRequest(
        region_id=ADBPG_INSTANCE_REGION,
        dbinstance_id=ADBPG_INSTANCE_ID,
        manager_account=account,
        manager_account_password=account_password,
        namespace=namespace,
        collection=collection,
        metadata=metadata,
        full_text_retrieval_fields=full_text_retrieval_fields,
        parser=parser,
        embedding_model=embedding_model,
        metrics=metrics,
        hnsw_m=hnsw_m,
        pq_enable=pq_enable,
        external_storage=external_storage
    )
    response = get_client().create_document_collection(request)
    print(f"create_document_collection response code: {response.status_code}, body:{response.body}")


if __name__ == '__main__':
    metadata = '{"title":"text", "page":"int"}'
    full_text_retrieval_fields = "title"
    embedding_model = "m3e-small"
    create_document_collection("testacc", "Test1234", "ns1", "dc1", 
                               metadata=metadata, full_text_retrieval_fields=full_text_retrieval_fields, 
                               embedding_model=embedding_model)


# output: body:
# {
#    "Message":"success",
#    "RequestId":"7BC35B66-5F49-1E79-A153-8D26576C4A3E",
#    "Status":"success"
# }

You can invoke the create_document_collection() function to create a document collection in an AnalyticDB for PostgreSQL instance. The parameters for create_document_collection are described below:

  • account: the privileged account of the AnalyticDB for PostgreSQL instance.

  • account_password: the password of the privileged account.

  • namespace: the name of the namespace in which you want to create the document collection.

  • collection: the name of the document collection that you want to create.

  • metadata: the metadata, specified as a custom map structure that contains one or more fields as key-value pairs. Each key specifies a field name, and the corresponding value specifies the field type.

  • full_text_retrieval_fields: the custom full-text search fields, separated by commas (,). Each field must be a key defined in the metadata parameter.

  • parser: the tokenizer. This parameter is a full-text search parameter. Default value: zh_cn. For more information, see Full-text search.

  • embedding_model: the embedding model. For more information, see Embedding models.

  • metrics: the distance or similarity metric algorithm. This parameter is a vector index parameter. For more information, see Vector indexes.

  • hnsw_m: the maximum number of neighbors for the Hierarchical Navigable Small World (HNSW) algorithm. This parameter is a vector index parameter. Valid values: 1 to 1000. For more information, see Vector indexes.

  • pq_enable: specifies whether to enable the product quantization (PQ) feature for index acceleration and dimensionality reduction. This parameter is a vector index parameter. For more information, see Vector indexes.

  • external_storage: specifies whether to use the memory mapping technology to create HNSW indexes. This parameter is a vector index parameter. For more information, see Vector indexes.

    Important

    Only AnalyticDB for PostgreSQL V6.0 supports the external_storage parameter.
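Because full_text_retrieval_fields must reference keys defined in metadata, and hnsw_m accepts only values from 1 to 1000, it can help to validate these parameters locally before calling the API. The following standalone sketch (the helper name validate_collection_params is illustrative and not part of the SDK) checks both constraints:

```python
import json


def validate_collection_params(metadata: str,
                               full_text_retrieval_fields: str = None,
                               hnsw_m: int = None) -> None:
    """Check parameter consistency before calling CreateDocumentCollection.

    Raises ValueError if a constraint described above is violated.
    """
    fields = json.loads(metadata)  # e.g. '{"title":"text", "page":"int"}'
    if full_text_retrieval_fields:
        for name in full_text_retrieval_fields.split(","):
            if name.strip() not in fields:
                raise ValueError(f"full-text field {name!r} is not a metadata key")
    if hnsw_m is not None and not 1 <= hnsw_m <= 1000:
        raise ValueError("hnsw_m must be in the range 1 to 1000")


validate_collection_params('{"title":"text", "page":"int"}', "title")  # passes
```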

View database changes

After the code invocation is successful, you can log on to the DMS console to view the schema of the created table named ns1.dc1.

| Field | Type | Source | Description |
| --- | --- | --- | --- |
| id | text | Fixed field | The UUID of the chunk. This field is the primary key. |
| vector | real[] | Fixed field | The array of vector data. The length of the vector is the same as the number of dimensions of the specified embedding model. |
| doc_name | text | Fixed field | The name of the document. |
| content | text | Fixed field | The content of the chunk. Chunks are obtained after the document is loaded and split. |
| loader_metadata | json | Fixed field | The metadata of the document that is parsed by the document loader. |
| to_tsvector | TSVECTOR | Fixed field | The full-text search fields, which are the fields specified by the full_text_retrieval_fields parameter plus the default content field. In this example, full-text search is performed on the content and title fields. |
| title | text | Metadata definition | Specify a custom value. |
| page | int | Metadata definition | Specify a custom value. |

Full-text search

To improve search accuracy, AnalyticDB for PostgreSQL supports full-text search in addition to vector similarity search. Full-text search can be used together with vector similarity search to implement two-way retrieval.

Define full-text search fields

Before you use full-text search, you must specify which metadata fields are used as full-text search fields. The content field is always used for full-text search by default. You can also specify custom metadata fields.

Tokenization

When you create a document collection, you can specify the parser parameter to select a tokenizer. In most cases, you can use the default value zh_cn. If you have special requirements for word segmentation, contact Alibaba Cloud technical support.

When you insert data, the tokenizer segments the data in the specified fields for full-text search based on the delimiter and saves the data to the to_tsvector field for subsequent full-text search.
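As an illustration only (the real segmentation is performed server-side by the configured parser), the content that feeds the to_tsvector field can be pictured as the tokens of the content field plus any metadata fields listed in full_text_retrieval_fields. The helper name below is hypothetical:

```python
def build_tsvector_preview(content: str, metadata: dict,
                           full_text_fields: list) -> set:
    """Toy whitespace tokenizer that mimics which fields feed to_tsvector.

    The actual tokenization is done by the parser (for example zh_cn) inside
    AnalyticDB for PostgreSQL; this sketch only shows the field selection.
    """
    parts = [content] + [str(metadata[f]) for f in full_text_fields
                         if f in metadata]
    return {token.lower() for part in parts for token in part.split()}


tokens = build_tsvector_preview("AnalyticDB stores chunks",
                                {"title": "Quick Start", "page": 3},
                                ["title"])
```

Only the content field and the title metadata field contribute tokens; the page field is not a full-text search field and is excluded.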

Embedding models

AnalyticDB for PostgreSQL supports the following embedding models:

| embedding_model | Number of dimensions | Description |
| --- | --- | --- |
| m3e-small | 512 | Sourced from moka-ai/m3e-small. This model supports only Chinese. |
| m3e-base | 768 | Sourced from moka-ai/m3e-base. This model supports Chinese and English. |
| text2vec | 1024 | Sourced from GanymedeNil/text2vec-large-chinese. This model supports Chinese and English. |
| text-embedding-v1 | 1536 | Sourced from the text embedding model in Alibaba Cloud Model Studio. This model supports Chinese and English. |
| text-embedding-v2 | 1536 | An upgraded version of text-embedding-v1. |
| clip-vit-b-32 (multimodal) | 512 | An open source multimodal model that supports images. |
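Because the length of the vector field must match the dimension count of the chosen model, a small lookup based on the table above can catch mismatches before data is inserted. This snippet is a local convenience sketch, not part of the SDK:

```python
# Dimension counts as listed in the table above; verify against the
# current product documentation before relying on them.
EMBEDDING_DIMENSIONS = {
    "m3e-small": 512,
    "m3e-base": 768,
    "text2vec": 1024,
    "text-embedding-v1": 1536,
    "text-embedding-v2": 1536,
    "clip-vit-b-32": 512,  # listed as "clip-vit-b-32 (multimodal)" above
}


def check_vector(model: str, vector: list) -> None:
    """Raise ValueError if the vector length does not match the model."""
    expected = EMBEDDING_DIMENSIONS[model]
    if len(vector) != expected:
        raise ValueError(
            f"{model} expects {expected} dimensions, got {len(vector)}")


check_vector("m3e-small", [0.0] * 512)  # passes
```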


Vector indexes

The following parameters are available for vector indexes:

  • metrics: the distance or similarity metric algorithm. Valid values:

      • l2: uses the squared Euclidean distance function to create indexes. This metric is suitable for image similarity search scenarios.

      • ip: uses the negative inner product distance function to create indexes. This metric replaces cosine similarity after vector normalization.

      • cosine: uses the cosine distance function to create indexes. This metric is suitable for text similarity search scenarios.

  • hnsw_m: the maximum number of neighbors for the HNSW algorithm. Valid values: 1 to 1000. The AnalyticDB for PostgreSQL API automatically sets a suitable value based on the vector dimension.

  • pq_enable: specifies whether to enable the product quantization (PQ) feature for index acceleration and dimensionality reduction. Valid values:

      • 0: no.

      • 1: yes.

    The PQ feature uses existing vector data as samples for training. If the collection contains fewer than 500,000 rows, we recommend that you do not enable this feature.

  • external_storage: specifies whether to use the memory mapping technology to create HNSW indexes. Valid values:

      • 0: uses segmented paging storage to create indexes. This method uses the shared buffer of PostgreSQL for caching and supports the delete and update operations.

      • 1: uses the memory mapping technology to create indexes. This method does not support the delete or update operations.

Important

Only AnalyticDB for PostgreSQL V6.0 supports the external_storage parameter.
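The guidance above (skip PQ below 500,000 rows; external_storage is V6.0-only and disables delete and update) can be folded into a small helper when you script collection creation. This function is an illustrative sketch, not part of the SDK, and the 500,000-row threshold is the recommendation stated above:

```python
def choose_index_options(row_count: int, instance_major_version: int,
                         read_only_workload: bool = False) -> dict:
    """Suggest vector index parameters from the guidance above.

    - pq_enable: enable PQ only when enough rows exist to train it
      (at least 500,000, per the recommendation above).
    - external_storage: memory mapping is supported only on V6.0 and
      disables delete/update, so suggest it only for read-only workloads.
    """
    options = {"pq_enable": 1 if row_count >= 500_000 else 0}
    if instance_major_version == 6 and read_only_workload:
        options["external_storage"] = 1
    else:
        options["external_storage"] = 0
    return options
```

The resulting dictionary can be unpacked into the create_document_collection() call shown earlier, for example with `**choose_index_options(...)`.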