This topic describes how to create a document collection, which is used to store chunks and vector data.
Sample code
# The gpdb_20160503_models module is provided by the alibabacloud_gpdb20160503 SDK package.
from alibabacloud_gpdb20160503 import models as gpdb_20160503_models

# ADBPG_INSTANCE_REGION, ADBPG_INSTANCE_ID, and get_client() are assumed to be
# defined in the initialization code of the sample project.

def create_document_collection(account,
                               account_password,
                               namespace,
                               collection,
                               metadata: str = None,
                               full_text_retrieval_fields: str = None,
                               parser: str = None,
                               embedding_model: str = None,
                               metrics: str = None,
                               hnsw_m: int = None,
                               pq_enable: int = None,
                               external_storage: int = None):
    request = gpdb_20160503_models.CreateDocumentCollectionRequest(
        region_id=ADBPG_INSTANCE_REGION,
        dbinstance_id=ADBPG_INSTANCE_ID,
        manager_account=account,
        manager_account_password=account_password,
        namespace=namespace,
        collection=collection,
        metadata=metadata,
        full_text_retrieval_fields=full_text_retrieval_fields,
        parser=parser,
        embedding_model=embedding_model,
        metrics=metrics,
        hnsw_m=hnsw_m,
        pq_enable=pq_enable,
        external_storage=external_storage
    )
    response = get_client().create_document_collection(request)
    print(f"create_document_collection response code: {response.status_code}, body: {response.body}")


if __name__ == '__main__':
    metadata = '{"title":"text", "page":"int"}'
    full_text_retrieval_fields = "title"
    embedding_model = "m3e-small"
    create_document_collection("testacc", "Test1234", "ns1", "dc1",
                               metadata=metadata,
                               full_text_retrieval_fields=full_text_retrieval_fields,
                               embedding_model=embedding_model)
    # output: body:
    # {
    #   "Message": "success",
    #   "RequestId": "7BC35B66-5F49-1E79-A153-8D26576C4A3E",
    #   "Status": "success"
    # }
You can invoke the create_document_collection() function to create a document collection in an AnalyticDB for PostgreSQL instance. Parameters:
account: the privileged account of the AnalyticDB for PostgreSQL instance.
account_password: the password of the privileged account.
namespace: the name of the namespace in which you want to create the document collection.
collection: the name of the document collection that you want to create.
metadata: the custom metadata, specified as a map structure in which each key is a field name and each value is the field type.
full_text_retrieval_fields: the custom full-text search fields, separated by commas (,). The fields must be keys defined in metadata.
parser: the analyzer. This parameter is a full-text search parameter. Default value: zh_cn. For more information, see the "Introduction to full-text search" section of this topic.
embedding_model: the embedding model. For more information, see the "Introduction to embedding models" section of this topic.
metrics: the distance or similarity metric algorithm. This parameter is a vector index parameter. For more information, see the "Vector indexes" section of this topic.
hnsw_m: the maximum number of neighbors for the Hierarchical Navigable Small World (HNSW) algorithm. This parameter is a vector index parameter. Valid values: 1 to 1000. For more information, see the "Vector indexes" section of this topic.
pq_enable: specifies whether to enable the product quantization (PQ) feature for index acceleration and dimensionality reduction. This parameter is a vector index parameter. For more information, see the "Vector indexes" section of this topic.
external_storage: specifies whether to use the memory mapping technology to create HNSW indexes. This parameter is a vector index parameter. For more information, see the "Vector indexes" section of this topic.
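Only the account, password, namespace, and collection name are required; the remaining parameters are optional and default to None in the sample function. The following call is a minimal sketch that relies on those defaults. The collection name dc_minimal is a placeholder.

# Minimal sketch: create a collection with only the required arguments.
# "dc_minimal" is a placeholder collection name.
create_document_collection("testacc", "Test1234", "ns1", "dc_minimal")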
View database changes
After you run the preceding code, you can log on to the Data Management (DMS) console to view the schema of the created table named ns1.dc1.
Field | Type | Source | Description |
id | TEXT | Fixed field | The UUID of the chunk. This field is the primary key. |
vector | REAL[] | Fixed field | The array of vector data. The length of vector data is the same as the number of vector dimensions of the specified embedding model. |
doc_name | TEXT | Fixed field | The name of the document. |
content | TEXT | Fixed field | The content of the chunk. Chunks are obtained after the document is loaded and split. |
loader_metadata | JSON | Fixed field | The metadata of the document that is parsed by the document loader. |
to_tsvector | TSVECTOR | Fixed field | The full-text search fields, which are the fields specified by the full_text_retrieval_fields parameter. The content field is a default field. In this example, full-text search is performed on the content and title fields. |
title | TEXT | Metadata definition | A custom field defined by the metadata parameter. |
page | INT | Metadata definition | A custom field defined by the metadata parameter. |
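If you prefer to inspect the schema programmatically instead of in the DMS console, a sketch like the following lists the columns of ns1.dc1. It assumes the psycopg2 package and a direct database connection; the endpoint, database name, and credentials are placeholders that are not part of the sample above.

import psycopg2

# Assumed connection details; replace with your own instance endpoint and credentials.
conn = psycopg2.connect(host="<instance-endpoint>", port=5432,
                        dbname="<database>", user="testacc", password="Test1234")
with conn.cursor() as cur:
    # Query the catalog for the columns of the ns1.dc1 table created above.
    cur.execute("""
        SELECT column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = 'ns1' AND table_name = 'dc1'
        ORDER BY ordinal_position
    """)
    for column_name, data_type in cur.fetchall():
        print(column_name, data_type)
conn.close()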
Introduction to full-text search
To improve search accuracy, AnalyticDB for PostgreSQL supports full-text search in addition to vector similarity search. Full-text search can be used together with vector similarity search to implement two-way retrieval.
Define full-text search fields
Before you use full-text search, you must specify which metadata fields are used as full-text search fields. By default, the content field is used as a full-text search field. You can also specify other custom metadata fields.
Specify an analyzer
When you create a document collection, you can use the parser parameter to specify the analyzer. In most cases, the default zh_cn analyzer is sufficient. If you have special requirements for word segmentation, contact Alibaba Cloud technical support.
When you insert data, the analyzer segments the values of the specified full-text search fields and saves the results to the to_tsvector field for subsequent full-text searches.
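For example, the following sketch explicitly sets the parser parameter when creating a collection. The collection name dc_fulltext is a placeholder; the other values reuse the sample above.

# Sketch: explicitly specify the default zh_cn analyzer for full-text search.
# "dc_fulltext" is a placeholder collection name.
create_document_collection("testacc", "Test1234", "ns1", "dc_fulltext",
                           metadata='{"title":"text", "page":"int"}',
                           full_text_retrieval_fields="title",
                           parser="zh_cn")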
Introduction to embedding models
The following table describes the supported embedding models.
Embedding model | Number of dimensions | Description |
m3e-small | 512 | Sourced from moka-ai/m3e-small. This model supports Chinese but not English. |
m3e-base | 768 | Sourced from moka-ai/m3e-base. This model supports Chinese and English. |
text2vec | 1024 | Sourced from GanymedeNil/text2vec-large-chinese. This model supports Chinese and English. |
text-embedding-v1 | 1536 | Sourced from Text Embedding of Alibaba Cloud Model Studio. This model supports Chinese and English. |
text-embedding-v2 | 1536 | An upgraded version of text-embedding-v1. |
clip-vit-b-32 (multimodal) | 512 | An open source multimodal model that supports text and images. |
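The embedding model determines the length of the vector field in the resulting table. For example, if you create a collection with text-embedding-v1, the vector field is a REAL[] array of length 1536. The following sketch reuses the function defined above; dc_embedding is a placeholder collection name.

# Sketch: create a collection that uses the 1,536-dimension text-embedding-v1 model.
# "dc_embedding" is a placeholder collection name.
create_document_collection("testacc", "Test1234", "ns1", "dc_embedding",
                           metadata='{"title":"text", "page":"int"}',
                           embedding_model="text-embedding-v1")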
Vector indexes
The following table describes the parameters that you can configure for vector indexes.
Parameter | Description |
metrics | The distance or similarity metric algorithm. Valid values: l2: Euclidean distance. ip: inner product (dot product). cosine: cosine similarity. This is the default value. |
hnsw_m | The maximum number of neighbors for the HNSW algorithm. Valid values: 1 to 1000. If you do not specify this parameter, the AnalyticDB for PostgreSQL API automatically sets a value based on the number of vector dimensions. |
pq_enable | Specifies whether to enable the product quantization (PQ) feature for index acceleration and dimensionality reduction. Valid values: 1 (default): enables the PQ feature. 0: disables the PQ feature. The PQ feature uses existing vector data as samples for training. If your collection contains less than 500,000 rows of data, we recommend that you do not enable the PQ feature. |
external_storage | Specifies whether to use the memory mapping technology to create HNSW indexes. Valid values: 0 (default): uses memory to create HNSW indexes. 1: uses the memory mapping technology to create HNSW indexes. |