This topic describes how to create a document collection, which is used to store chunks and vector data.
Sample code
# The gpdb_20160503_models module is provided by the alibabacloud_gpdb20160503 SDK package.
from alibabacloud_gpdb20160503 import models as gpdb_20160503_models

# ADBPG_INSTANCE_REGION, ADBPG_INSTANCE_ID, and get_client() are assumed to be
# defined in the initialization code of the sample project.

def create_document_collection(account,
                               account_password,
                               namespace,
                               collection,
                               metadata: str = None,
                               full_text_retrieval_fields: str = None,
                               parser: str = None,
                               embedding_model: str = None,
                               metrics: str = None,
                               hnsw_m: int = None,
                               pq_enable: int = None,
                               external_storage: int = None):
    request = gpdb_20160503_models.CreateDocumentCollectionRequest(
        region_id=ADBPG_INSTANCE_REGION,
        dbinstance_id=ADBPG_INSTANCE_ID,
        manager_account=account,
        manager_account_password=account_password,
        namespace=namespace,
        collection=collection,
        metadata=metadata,
        full_text_retrieval_fields=full_text_retrieval_fields,
        parser=parser,
        embedding_model=embedding_model,
        metrics=metrics,
        hnsw_m=hnsw_m,
        pq_enable=pq_enable,
        external_storage=external_storage
    )
    response = get_client().create_document_collection(request)
    print(f"create_document_collection response code: {response.status_code}, body: {response.body}")


if __name__ == '__main__':
    metadata = '{"title":"text", "page":"int"}'
    full_text_retrieval_fields = "title"
    embedding_model = "m3e-small"
    create_document_collection("testacc", "Test1234", "ns1", "dc1",
                               metadata=metadata,
                               full_text_retrieval_fields=full_text_retrieval_fields,
                               embedding_model=embedding_model)
    # output: body:
    # {
    #   "Message": "success",
    #   "RequestId": "7BC35B66-5F49-1E79-A153-8D26576C4A3E",
    #   "Status": "success"
    # }
You can invoke the create_document_collection() function to create a document collection in an AnalyticDB for PostgreSQL instance. Parameters:
account: the privileged account of the AnalyticDB for PostgreSQL instance.
account_password: the password of the privileged account.
namespace: the name of the namespace in which you want to create the document collection.
collection: the name of the document collection that you want to create.
metadata: the custom metadata, specified as a map structure in which each key is a field name and each value is the field type.
full_text_retrieval_fields: the custom full-text search fields, separated by commas (,). The fields must be keys defined in metadata.
parser: the analyzer. This parameter is a full-text search parameter. Default value: zh_cn. For more information, see the "Introduction to full-text search" section of this topic.
embedding_model: the embedding model. For more information, see the "Introduction to embedding models" section of this topic.
metrics: the distance or similarity metric algorithm. This parameter is a vector index parameter. For more information, see the "Vector indexes" section of this topic.
hnsw_m: the maximum number of neighbors for the Hierarchical Navigable Small World (HNSW) algorithm. This parameter is a vector index parameter. Valid values: 1 to 1000. For more information, see the "Vector indexes" section of this topic.
pq_enable: specifies whether to enable the product quantization (PQ) feature for index acceleration and dimensionality reduction. This parameter is a vector index parameter. For more information, see the "Vector indexes" section of this topic.
external_storage: specifies whether to use the memory mapping technology to create HNSW indexes. This parameter is a vector index parameter. For more information, see the "Vector indexes" section of this topic.
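Only the account, password, namespace, and collection name are required; the remaining parameters are optional and default to None in the sample function. The following call is a minimal sketch that relies on those defaults. The collection name dc_minimal is a placeholder.

# Minimal sketch: create a collection with only the required arguments.
# "dc_minimal" is a placeholder collection name.
create_document_collection("testacc", "Test1234", "ns1", "dc_minimal")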
View database changes
After you run the preceding code, you can log on to the Data Management (DMS) console to view the schema of the created table named ns1.dc1.
Field | Type | Source | Description |
id | TEXT | Fixed field | The UUID of the chunk. This field is the primary key. |
vector | REAL[] | Fixed field | The array of vector data. The length of vector data is the same as the number of vector dimensions of the specified embedding model. |
doc_name | TEXT | Fixed field | The name of the document. |
content | TEXT | Fixed field | The content of the chunk. Chunks are obtained after the document is loaded and split. |
loader_metadata | JSON | Fixed field | The metadata of the document that is parsed by the document loader. |
to_tsvector | TSVECTOR | Fixed field | The full-text search fields, which are the fields specified by the full_text_retrieval_fields parameter. The content field is a default field. In this example, full-text search is performed on the content and title fields. |
title | TEXT | Metadata definition | A custom field defined by the metadata parameter. |
page | INT | Metadata definition | A custom field defined by the metadata parameter. |
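If you prefer to inspect the schema programmatically instead of in the DMS console, a sketch like the following lists the columns of ns1.dc1. It assumes the psycopg2 package and a direct database connection; the endpoint, database name, and credentials are placeholders that are not part of the sample above.

import psycopg2

# Assumed connection details; replace with your own instance endpoint and credentials.
conn = psycopg2.connect(host="<instance-endpoint>", port=5432,
                        dbname="<database>", user="testacc", password="Test1234")
with conn.cursor() as cur:
    # Query the catalog for the columns of the ns1.dc1 table created above.
    cur.execute("""
        SELECT column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = 'ns1' AND table_name = 'dc1'
        ORDER BY ordinal_position
    """)
    for column_name, data_type in cur.fetchall():
        print(column_name, data_type)
conn.close()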
Introduction to full-text search
To improve search accuracy, AnalyticDB for PostgreSQL supports full-text search in addition to vector similarity search. Full-text search can be used together with vector similarity search to implement two-way retrieval.
Define full-text search fields
Before you use full-text search, you must specify which metadata fields are used as full-text search fields. By default, the content field is used as a full-text search field. You can also specify other custom metadata fields.
Specify an analyzer
When you create a document collection, you can use the parser parameter to specify the analyzer. In most cases, the default zh_cn analyzer is sufficient. If you have special requirements for word segmentation, contact Alibaba Cloud technical support.
When you insert data, the analyzer segments the values of the specified full-text search fields and saves the results to the to_tsvector field for subsequent full-text searches.
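For example, the following sketch explicitly sets the parser parameter when creating a collection. The collection name dc_fulltext is a placeholder; the other values reuse the sample above.

# Sketch: explicitly specify the default zh_cn analyzer for full-text search.
# "dc_fulltext" is a placeholder collection name.
create_document_collection("testacc", "Test1234", "ns1", "dc_fulltext",
                           metadata='{"title":"text", "page":"int"}',
                           full_text_retrieval_fields="title",
                           parser="zh_cn")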
Introduction to embedding models
The following table describes the supported embedding models.
Embedding model | Number of dimensions | Description |
m3e-small | 512 | Sourced from moka-ai/m3e-small. This model supports Chinese but not English. |
m3e-base | 768 | Sourced from moka-ai/m3e-base. This model supports Chinese and English. |
text2vec | 1024 | Sourced from GanymedeNil/text2vec-large-chinese. This model supports Chinese and English. |
text-embedding-v1 | 1536 | Sourced from Text Embedding of Alibaba Cloud Model Studio. This model supports Chinese and English. |
text-embedding-v2 | 1536 | An upgraded version of text-embedding-v1. |
clip-vit-b-32 (multimodal) | 512 | An open source multimodal model that supports text and images. |
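The embedding model determines the length of the vector field in the resulting table. For example, if you create a collection with text-embedding-v1, the vector field is a REAL[] array of length 1536. The following sketch reuses the function defined above; dc_embedding is a placeholder collection name.

# Sketch: create a collection that uses the 1,536-dimension text-embedding-v1 model.
# "dc_embedding" is a placeholder collection name.
create_document_collection("testacc", "Test1234", "ns1", "dc_embedding",
                           metadata='{"title":"text", "page":"int"}',
                           embedding_model="text-embedding-v1")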
Vector indexes
The following table describes the parameters that you can configure for vector indexes.
Parameter | Description |
metrics | The distance or similarity metric algorithm. Valid values: l2: Euclidean distance. ip: inner product (dot product). cosine: cosine similarity. This is the default value. |
hnsw_m | The maximum number of neighbors for the HNSW algorithm. Valid values: 1 to 1000. If you do not specify this parameter, the AnalyticDB for PostgreSQL API automatically sets a value based on the number of vector dimensions. |
pq_enable | Specifies whether to enable the product quantization (PQ) feature for index acceleration and dimensionality reduction. Valid values: 1 (default): enables the PQ feature. 0: disables the PQ feature. The PQ feature uses existing vector data as samples for training. If your collection contains less than 500,000 rows of data, we recommend that you do not enable the PQ feature. |
external_storage | Specifies whether to use the memory mapping technology to create HNSW indexes. Valid values: 0 (default): uses memory to create HNSW indexes. 1: uses the memory mapping technology to create HNSW indexes. |