This topic describes how to create a document collection to store chunks and vector data within AnalyticDB for PostgreSQL.
Sample code
# Assumption: get_client(), ADBPG_INSTANCE_REGION, and ADBPG_INSTANCE_ID
# are defined in your client initialization code.
from alibabacloud_gpdb20160503 import models as gpdb_20160503_models


def create_document_collection(account,
                               account_password,
                               namespace,
                               collection,
                               metadata: str = None,
                               full_text_retrieval_fields: str = None,
                               parser: str = None,
                               embedding_model: str = None,
                               metrics: str = None,
                               hnsw_m: int = None,
                               pq_enable: int = None,
                               external_storage: int = None):
    # Build the request; optional parameters left as None are not set.
    request = gpdb_20160503_models.CreateDocumentCollectionRequest(
        region_id=ADBPG_INSTANCE_REGION,
        dbinstance_id=ADBPG_INSTANCE_ID,
        manager_account=account,
        manager_account_password=account_password,
        namespace=namespace,
        collection=collection,
        metadata=metadata,
        full_text_retrieval_fields=full_text_retrieval_fields,
        parser=parser,
        embedding_model=embedding_model,
        metrics=metrics,
        hnsw_m=hnsw_m,
        pq_enable=pq_enable,
        external_storage=external_storage
    )
    response = get_client().create_document_collection(request)
    print(f"create_document_collection response code: {response.status_code}, body:{response.body}")


if __name__ == '__main__':
    metadata = '{"title":"text", "page":"int"}'
    full_text_retrieval_fields = "title"
    embedding_model = "m3e-small"
    create_document_collection("testacc", "Test1234", "ns1", "dc1",
                               metadata=metadata, full_text_retrieval_fields=full_text_retrieval_fields,
                               embedding_model=embedding_model)

# output: body:
# {
#    "Message":"success",
#    "RequestId":"7BC35B66-5F49-1E79-A153-8D26576C4A3E",
#    "Status":"success"
# }

Call the create_document_collection() function to create a document collection in an AnalyticDB for PostgreSQL instance. The function takes the following parameters:
account: the privileged account of the AnalyticDB for PostgreSQL instance.
account_password: the password of the privileged account.
namespace: the name of the namespace in which you want to create the document collection.
collection: the name of the document collection that you want to create.
metadata: the custom metadata, specified as a JSON string of a map that contains one or more key-value pairs. Each key specifies a field name and each value specifies the field type.
full_text_retrieval_fields: the custom full-text search fields separated by commas (,). The fields must be the keys of metadata.
parser: the tokenizer. This parameter is a full-text search parameter. Default value: zh_cn. For more information, see Full-text search.
embedding_model: the embedding model. For more information, see Embedding models.
metrics: the distance or similarity metric algorithm. This parameter is a vector index parameter. For more information, see Vector indexes.
hnsw_m: the maximum number of neighbors for the Hierarchical Navigable Small World (HNSW) algorithm. This parameter is a vector index parameter. Valid values: 1 to 1000. For more information, see Vector indexes.
pq_enable: specifies whether to enable the product quantization (PQ) feature for index acceleration and dimensionality reduction. This parameter is a vector index parameter. For more information, see Vector indexes.
external_storage: specifies whether to use the memory mapping technology to create HNSW indexes. This parameter is a vector index parameter. For more information, see Vector indexes.
Important: Only AnalyticDB for PostgreSQL V6.0 supports the external_storage parameter.
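The constraints in the parameter list above can be checked on the client side before the request is sent. The following is a minimal sketch, not part of the SDK: validate_collection_params is a hypothetical helper, and it assumes that whitespace around the commas in full_text_retrieval_fields can be ignored. The hnsw_m value of 64 in the usage line is only an illustrative value inside the documented range.

```python
import json


def validate_collection_params(metadata=None, full_text_retrieval_fields=None, hnsw_m=None):
    """Hypothetical helper: raise ValueError if the arguments break the documented constraints."""
    # metadata is a JSON string that maps field names to field types.
    fields = json.loads(metadata) if metadata else {}
    if not isinstance(fields, dict):
        raise ValueError("metadata must be a JSON object of field name to field type")

    # Every full-text search field must be a key of metadata.
    if full_text_retrieval_fields:
        for name in full_text_retrieval_fields.split(","):
            name = name.strip()  # assumption: surrounding whitespace is not significant
            if name not in fields:
                raise ValueError(f"full-text field {name!r} is not a key of metadata")

    # hnsw_m must fall in the documented range of 1 to 1000.
    if hnsw_m is not None and not 1 <= hnsw_m <= 1000:
        raise ValueError("hnsw_m must be between 1 and 1000")


# Passes silently for the values used in the sample code.
validate_collection_params('{"title":"text", "page":"int"}', "title", hnsw_m=64)
```

Running a check like this before create_document_collection() surfaces mistakes locally instead of as an API error.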
View database changes
After the call succeeds, you can log on to the DMS console to view the schema of the created table, which is named ns1.dc1.
Field | Type | Source | Description |
id | text | Fixed field | The UUID of the chunk. This field is the primary key. |
vector | real[] | Fixed field | The array of vector data. The length of vector data is the same as the number of vector dimensions of the specified embedding model. |
doc_name | text | Fixed field | The name of the document. |
content | text | Fixed field | The content of the chunk. Chunks are obtained after the document is loaded and split. |
loader_metadata | json | Fixed field | The metadata of the document that is parsed by the document loader. |
to_tsvector | TSVECTOR | Fixed field | The full-text search fields, which are the fields specified by the full_text_retrieval_fields parameter. The content field is a default field. In this example, full-text search is performed on the content and title fields. |
title | text | Metadata definition | Specify a custom value. |
page | int | Metadata definition | Specify a custom value. |
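To make the relationship between the metadata parameter and the resulting table concrete, the sketch below lists the columns you would expect for a given metadata string. It only mirrors the table above: expected_columns is a hypothetical helper, and the column order it returns is an assumption, not something the service guarantees.

```python
import json

# Fixed fields copied from the table above, in the order listed there.
FIXED_COLUMNS = [
    ("id", "text"),
    ("vector", "real[]"),
    ("doc_name", "text"),
    ("content", "text"),
    ("loader_metadata", "json"),
    ("to_tsvector", "TSVECTOR"),
]


def expected_columns(metadata=None):
    """Hypothetical helper: return (name, type) pairs, fixed fields first, then metadata fields."""
    custom = json.loads(metadata) if metadata else {}
    return FIXED_COLUMNS + list(custom.items())


# Prints one "name: type" line per expected column of ns1.dc1.
for name, col_type in expected_columns('{"title":"text", "page":"int"}'):
    print(f"{name}: {col_type}")
```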
Full-text search
To improve search accuracy, AnalyticDB for PostgreSQL supports full-text search in addition to vector similarity search. Full-text search can be used together with vector similarity search to implement two-way retrieval.
Define full-text search fields
Before you use full-text search, you must specify which metadata fields are used as full-text search fields. By default, the content field of the document collection is always used for full-text search. You can also specify other custom metadata fields.
Tokenization
When you create a document collection, you can use the parser parameter to specify the tokenizer. In most cases, you can use the default value of zh_cn. If you have special requirements for word segmentation characters, contact Alibaba Cloud technical support.
When you insert data, the tokenizer segments the data in the specified fields for full-text search based on the delimiter and saves the data to the to_tsvector field for subsequent full-text search.
Embedding models
AnalyticDB for PostgreSQL supports the following embedding models:
embedding_model | Number of dimensions | Description |
m3e-small | 512 | Sourced from moka-ai/m3e-small. This model supports Chinese only, not English. |
m3e-base | 768 | Sourced from moka-ai/m3e-base. This model supports Chinese and English. |
text2vec | 1024 | Sourced from GanymedeNil/text2vec-large-chinese. This model supports Chinese and English. |
text-embedding-v1 | 1536 | Sourced from the text embedding model in Alibaba Cloud Model Studio. This model supports Chinese and English. |
text-embedding-v2 | 1536 | An upgraded version of text-embedding-v1. |
clip-vit-b-32 (multimodal) | 512 | An open source multimodal model that supports images. |
Custom embedding models are not supported.
For more information about supported models, see CreateDocumentCollection - Create a document collection.
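Because the length of the vector field must match the number of dimensions of the chosen model, it can help to keep the table above as a lookup on the client side. The following is a sketch under stated assumptions: EMBEDDING_DIMENSIONS and check_vector_length are hypothetical names, and the key clip-vit-b-32 assumes that "(multimodal)" in the table is a label and not part of the model identifier.

```python
# Dimensions copied from the embedding model table above.
EMBEDDING_DIMENSIONS = {
    "m3e-small": 512,
    "m3e-base": 768,
    "text2vec": 1024,
    "text-embedding-v1": 1536,
    "text-embedding-v2": 1536,
    "clip-vit-b-32": 512,  # assumption: identifier without the "(multimodal)" label
}


def check_vector_length(embedding_model, vector):
    """Hypothetical helper: return True if the vector length matches the model's dimensions."""
    if embedding_model not in EMBEDDING_DIMENSIONS:
        # Custom embedding models are not supported.
        raise ValueError(f"unsupported embedding model: {embedding_model!r}")
    return len(vector) == EMBEDDING_DIMENSIONS[embedding_model]


# A 512-dimensional vector fits m3e-small but not m3e-base.
print(check_vector_length("m3e-small", [0.0] * 512))
print(check_vector_length("m3e-base", [0.0] * 512))
```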
Vector indexes
The following table describes the parameters for vector indexes:
Parameter | Description |
metrics | The distance or similarity metric algorithm. Valid values:
|
hnsw_m | The maximum number of neighbors for the HNSW algorithm. The AnalyticDB for PostgreSQL API automatically sets different values based on the vector dimension. |
pq_enable | Specifies whether to enable the PQ feature for index acceleration and dimensionality reduction. Valid values:
The PQ feature uses existing vector sample data for training. If the table contains fewer than 500,000 rows of data, we recommend that you do not specify this parameter. |
external_storage | Specifies whether to use the memory mapping technology to create HNSW indexes. Valid values:
Important: Only AnalyticDB for PostgreSQL V6.0 supports the external_storage parameter. |
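The two recommendations in the table above (leave pq_enable unset below 500,000 rows, and pass external_storage only on V6.0) can be applied when you assemble the optional arguments. The following is a rough sketch, not an SDK feature: vector_index_kwargs, the row_count argument, and the "6.0" version string are all hypothetical, and the values 1 and 64 in the usage example are placeholders rather than documented recommendations.

```python
PQ_MIN_ROWS = 500_000  # threshold from the pq_enable row of the table above


def vector_index_kwargs(row_count, engine_version,
                        pq_enable=None, external_storage=None, **others):
    """Hypothetical helper: drop vector index options that the table advises against."""
    # Keep any other option (for example metrics or hnsw_m) that was actually set.
    kwargs = {key: value for key, value in others.items() if value is not None}

    # PQ trains on existing sample data, so leave it unset for small tables.
    if pq_enable is not None and row_count >= PQ_MIN_ROWS:
        kwargs["pq_enable"] = pq_enable

    # external_storage is supported only on V6.0.
    # Assumption: the engine version is represented as the string "6.0".
    if external_storage is not None and engine_version == "6.0":
        kwargs["external_storage"] = external_storage

    return kwargs


# Placeholder values; pq_enable is dropped because the table has too few rows.
print(vector_index_kwargs(20_000, "6.0", pq_enable=1, external_storage=1, hnsw_m=64))
```

The resulting dictionary can be unpacked into create_document_collection() with the ** operator.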