DashText is the sparse vector encoder recommended for DashVector. DashText converts raw text into sparse vectors by using the Best Match 25 (BM25) algorithm, which greatly simplifies the use of the keyword-aware semantic search feature of DashVector.
You need to replace YOUR_API_KEY with your API key and YOUR_CLUSTER_ENDPOINT with the endpoint of your cluster in the sample code for the code to run properly.
This topic describes how to use sparse vectors in a search. For simplicity, the number of dense vector dimensions is set to 4. In actual scenarios, set it as needed. For more information, see Vector introduction.
Step 1. Create a collection that supports sparse vectors
import dashvector
client = dashvector.Client(api_key='YOUR_API_KEY', endpoint='YOUR_CLUSTER_ENDPOINT')
assert client
ret = client.create('hybrid_collection', dimension=4, metric='dotproduct')
assert ret
collection = client.get('hybrid_collection')
assert collection
import com.aliyun.dashvector.DashVectorClient;
import com.aliyun.dashvector.DashVectorCollection;
import com.aliyun.dashvector.models.requests.CreateCollectionRequest;
import com.aliyun.dashvector.models.responses.Response;
import com.aliyun.dashvector.proto.CollectionInfo;
DashVectorClient client =
new DashVectorClient("YOUR_API_KEY", "YOUR_CLUSTER_ENDPOINT");
CreateCollectionRequest request = CreateCollectionRequest.builder()
.name("hybrid_collection")
.dimension(4)
.metric(CollectionInfo.Metric.dotproduct)
.dataType(CollectionInfo.DataType.FLOAT)
.build();
Response<Void> response = client.create(request);
System.out.println(response);
DashVectorCollection collection = client.get("hybrid_collection");
Only collections that use the dot product metric (metric='dotproduct') support sparse vectors.
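The reason is that under the dot product metric the dense and sparse contributions simply add together. The following is an illustrative sketch, not the DashVector implementation, of how a hybrid score decomposes into a dense inner product plus a sparse inner product over shared indices:

```python
# Illustrative only: hybrid similarity under the dot product metric.
# Dense and sparse parts add, because both are inner products over
# disjoint coordinate spaces.
def hybrid_score(dense_q, sparse_q, dense_d, sparse_d):
    dense_part = sum(q * d for q, d in zip(dense_q, dense_d))
    # Sparse vectors are dicts of {index: weight}; only indices present
    # in both vectors contribute to the inner product.
    sparse_part = sum(w * sparse_d.get(i, 0.0) for i, w in sparse_q.items())
    return dense_part + sparse_part

score = hybrid_score(
    [0.1, 0.1, 0.1, 0.1],                  # dense query vector
    {1169440797: 0.29, 2045788977: 0.71},  # sparse query vector (example values)
    [0.1, 0.2, 0.3, 0.4],                  # dense document vector
    {1169440797: 0.89, 2045788977: 0.84},  # sparse document vector (example values)
)
```

A metric such as cosine or Euclidean distance does not decompose this way, which is why only dot product collections can combine both vector types.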
Step 2. Create a sparse vector encoder
Use the built-in encoder
from dashtext import SparseVectorEncoder
encoder = SparseVectorEncoder.default()
import com.aliyun.dashtext.encoder.SparseVectorEncoder;
SparseVectorEncoder encoder = SparseVectorEncoder.getDefaultInstance();
The built-in encoder is trained on the Chinese Wikipedia corpus, and Jieba is used for Chinese text segmentation.
Create an encoder based on your own corpus
from dashtext import SparseVectorEncoder
encoder = SparseVectorEncoder()
# Your own corpus.
corpus = [
"向量检索服务DashVector基于阿里云自研的高效向量引擎Proxima内核,提供具备水平拓展能力的云原生、全托管的向量检索服务",
"DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成",
"从而为包括大模型生态、多模态AI搜索、分子结构分析在内的多种应用场景,提供所需的高效向量检索能力",
"简单灵活、开箱即用的SDK,使用极简代码即可实现向量管理",
"自研向量相似性比对算法,快速高效稳定服务",
"Schema-free设计,通过Schema实现任意条件下的组合过滤查询"
]
# Train the encoder by using your own corpus.
encoder.train(corpus)
import com.aliyun.dashtext.encoder.SparseVectorEncoder;
import java.util.*;
SparseVectorEncoder encoder = new SparseVectorEncoder();
// Your own corpus.
List<String> corpus = Arrays.asList(
"向量检索服务DashVector基于阿里云自研的高效向量引擎Proxima内核,提供具备水平拓展能力的云原生、全托管的向量检索服务",
"DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成",
"从而为包括大模型生态、多模态AI搜索、分子结构分析在内的多种应用场景,提供所需的高效向量检索能力",
"简单灵活、开箱即用的SDK,使用极简代码即可实现向量管理",
"自研向量相似性比对算法,快速高效稳定服务",
"Schema-free设计,通过Schema实现任意条件下的组合过滤查询"
);
// Train the encoder by using your own corpus.
encoder.train(corpus);
The built-in encoder works out of the box without training on your own corpus, which makes it easier to use and gives it better generalization. However, its accuracy may be lower if your corpus contains many domain-specific terms.
An encoder created from your own corpus must be trained on the full corpus in advance, but it provides higher accuracy on that corpus. For more information, see Advanced use.
Select an encoder based on your business requirements. We recommend that you create an encoder from your own corpus if your business involves a large number of terms specific to a certain field.
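To make this trade-off concrete, the following is a deliberately simplified BM25-style encoder. It is illustrative only: dashtext's actual tokenization, term hashing, and weighting differ. It shows why training on your own corpus matters: the IDF table, learned from the corpus, determines which terms receive high weights.

```python
import math
from collections import Counter

class TinyBM25Encoder:
    """Illustrative BM25 sparse encoder; not the dashtext implementation."""

    def __init__(self, k1=1.5, b=0.75):
        self.k1, self.b = k1, b

    def train(self, corpus):
        docs = [doc.split() for doc in corpus]
        self.avgdl = sum(len(d) for d in docs) / len(docs)
        n = len(docs)
        df = Counter(t for d in docs for t in set(d))
        # Classic smoothed IDF: rare terms get larger weights.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def encode_document(self, text):
        tokens = text.split()
        tf, dl = Counter(tokens), len(tokens)
        # Full BM25 term weight: IDF times a saturated term-frequency factor.
        return {
            t: self.idf.get(t, 0.0) * c * (self.k1 + 1)
               / (c + self.k1 * (1 - self.b + self.b * dl / self.avgdl))
            for t, c in tf.items()
        }

    def encode_query(self, text):
        # Queries are short, so only the IDF component is used.
        return {t: self.idf.get(t, 0.0) for t in set(text.split())}
```

With whitespace tokenization this sketch only handles English-like text; dashtext additionally hashes each term to a numeric index and, for Chinese, segments text with Jieba. A term that never appears in the training corpus gets weight 0, which is exactly why a domain-specific corpus improves accuracy for domain-specific terms.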
Step 3. Insert a document containing a sparse vector
from dashvector import Doc
document = "向量检索服务DashVector基于阿里云自研的高效向量引擎Proxima内核,提供具备水平拓展能力的云原生、全托管的向量检索服务。"
doc_sparse_vector = encoder.encode_documents(document)
print(doc_sparse_vector)
# Output based on the built-in encoder:
# {380823393: 0.7262431704356519, 414191989: 0.7262431704356519, 565176162: 0.7262431704356519, 904594806: 0.7262431704356519, 1005505802: 0.7262431704356519, 1169440797: 0.8883757984694465, 1240922502: 0.7262431704356519, 1313971048: 0.7262431704356519, 1317077351: 0.7262431704356519, 1490140460: 0.7262431704356519, 1574737055: 0.7262431704356519, 1760434515: 0.7262431704356519, 2045788977: 0.8414146776926797, 2141666983: 0.7262431704356519, 2509543087: 0.7262431704356519, 3180265193: 0.7262431704356519, 3845702398: 0.7262431704356519, 4106887295: 0.7262431704356519}
collection.insert(Doc(
id='A',
vector=[0.1, 0.2, 0.3, 0.4],
sparse_vector=doc_sparse_vector
))
String document = "向量检索服务DashVector基于阿里云自研的高效向量引擎Proxima内核,提供具备水平拓展能力的云原生、全托管的向量检索服务。";
Map<Long, Float> sparseVector = encoder.encodeDocuments(document);
System.out.println(sparseVector);
// Output based on the built-in encoder:
// {380823393=0.7262431704356519, 414191989=0.7262431704356519, 565176162=0.7262431704356519, 904594806=0.7262431704356519, 1005505802=0.7262431704356519, 1169440797=0.8883757984694465, 1240922502=0.7262431704356519, 1313971048=0.7262431704356519, 1317077351=0.7262431704356519, 1490140460=0.7262431704356519, 1574737055=0.7262431704356519, 1760434515=0.7262431704356519, 2045788977=0.8414146776926797, 2141666983=0.7262431704356519, 2509543087=0.7262431704356519, 3180265193=0.7262431704356519, 3845702398=0.7262431704356519, 4106887295=0.7262431704356519}
Vector vector = Vector.builder().value(Arrays.asList(0.1f, 0.2f, 0.3f, 0.4f)).build();
// Build a Doc object containing a sparse vector.
Doc doc = Doc.builder()
.id("28")
.sparseVector(sparseVector)
.vector(vector)
.build();
// Insert the document containing a sparse vector.
Response<Void> response = collection.insert(InsertDocRequest.builder().doc(doc).build());
Step 4. Perform a keyword-aware semantic search
query = "什么是向量检索服务?"
sparse_vector = encoder.encode_queries(query)
print(sparse_vector)
# Output based on the built-in encoder:
# {1169440797: 0.2947158712590364, 2045788977: 0.7052841287409635}
docs = collection.query(
vector=[0.1, 0.1, 0.1, 0.1],
sparse_vector=sparse_vector
)
String query = "什么是向量检索服务?";
Map<Long, Float> sparseVector = encoder.encodeQueries(query);
System.out.println(sparseVector);
// Output based on the built-in encoder:
// {1169440797=0.2947158712590364, 2045788977=0.7052841287409635}
Vector vector = Vector.builder().value(Arrays.asList(0.1f, 0.2f, 0.3f, 0.4f)).build();
// Build a QueryDocRequest object.
QueryDocRequest request = QueryDocRequest.builder()
.vector(vector)
.sparseVector(sparseVector)
.topk(100)
.includeVector(true)
.build();
Response<List<Doc>> response = collection.query(request);
System.out.println(response);
Step 5. Perform a weighted keyword-aware semantic search
from dashtext import combine_dense_and_sparse
query = "什么是向量检索服务?"
sparse_vector = encoder.encode_queries(query)
# Specify the weight factor.
alpha = 0.7
dense_vector = [0.1, 0.1, 0.1, 0.1]
scaled_dense_vector, scaled_sparse_vector = combine_dense_and_sparse(dense_vector, sparse_vector, alpha)
docs = collection.query(
vector=scaled_dense_vector,
sparse_vector=scaled_sparse_vector
)
String query = "什么是向量检索服务?";
Map<Long, Float> sparseVector = encoder.encodeQueries(query);
System.out.println(sparseVector);
// Output based on the built-in encoder:
// {1169440797=0.2947158712590364, 2045788977=0.7052841287409635}
Vector denseVector = Vector.builder().value(Arrays.asList(0.1f, 0.2f, 0.3f, 0.4f)).build();
// Scale the dense vector by alpha and the sparse vector by (1 - alpha).
final float alpha = 0.7f;
sparseVector.replaceAll((key, value) -> value * (1 - alpha));
denseVector = Vector.builder().value(
denseVector.getValue().stream().map(number -> number.floatValue() * alpha).collect(Collectors.toList())
).build();
// Build a QueryDocRequest object.
QueryDocRequest request = QueryDocRequest.builder()
.vector(denseVector)
.sparseVector(sparseVector)
.topk(100)
.includeVector(true)
.build();
Response<List<Doc>> response = collection.query(request);
System.out.println(response);
The alpha parameter controls the relative weights of the dense and sparse vector distances: if alpha is set to 0.0, only the sparse vector is used for distance measurement; if alpha is set to 1.0, only the dense vector is used.
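The weighting performed by combine_dense_and_sparse can be sketched as follows. This is an assumption based on the alpha semantics described above, not dashtext's exact implementation: the dense vector is scaled by alpha and the sparse vector by (1 - alpha).

```python
def combine(dense_vector, sparse_vector, alpha):
    """Illustrative sketch: scale dense by alpha and sparse by (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0.0, 1.0]")
    scaled_dense = [v * alpha for v in dense_vector]
    scaled_sparse = {k: v * (1 - alpha) for k, v in sparse_vector.items()}
    return scaled_dense, scaled_sparse

# alpha = 0.7 biases the search toward the dense (semantic) component.
dense, sparse = combine([0.1, 0.1, 0.1, 0.1], {1169440797: 0.29}, 0.7)
```

Because the dot product is linear, scaling the query vectors this way scales the dense and sparse score contributions by the same factors, so a single query blends both signals without server-side changes.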
API reference
For more information about the DashText API, see the following resources:
SDK for Python: https://pypi.org/project/dashtext/