DashText is the sparse vector encoder recommended for DashVector. DashText converts raw text into sparse vectors by using the Best Match 25 (BM25) algorithm, which greatly simplifies the use of the keyword-aware semantic search feature of DashVector.
You need to replace YOUR_API_KEY with your API key and YOUR_CLUSTER_ENDPOINT with the endpoint of your cluster in the sample code for the code to run properly.
This topic describes how to use sparse vectors in a search. For simplicity, the number of dense vector dimensions is set to 4. In actual scenarios, set it as needed. For more information, see Vector introduction.
Step 1. Create a collection that supports sparse vectors
import dashvector
client = dashvector.Client(api_key='YOUR_API_KEY', endpoint='YOUR_CLUSTER_ENDPOINT')
assert client
ret = client.create('hybrid_collection', dimension=4, metric='dotproduct')
assert ret
collection = client.get('hybrid_collection')
assert collection
import com.aliyun.dashvector.DashVectorClient;
import com.aliyun.dashvector.DashVectorCollection;
import com.aliyun.dashvector.models.requests.CreateCollectionRequest;
import com.aliyun.dashvector.models.responses.Response;
import com.aliyun.dashvector.proto.CollectionInfo;
DashVectorClient client =
new DashVectorClient("YOUR_API_KEY", "YOUR_CLUSTER_ENDPOINT");
CreateCollectionRequest request = CreateCollectionRequest.builder()
.name("hybrid_collection")
.dimension(4)
.metric(CollectionInfo.Metric.dotproduct)
.dataType(CollectionInfo.DataType.FLOAT)
.build();
Response<Void> response = client.create(request);
System.out.println(response);
DashVectorCollection collection = client.get("hybrid_collection");
Only collections that use the dot product metric (metric='dotproduct') support sparse vectors.
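The reason is that under the dot product metric the dense and sparse contributions simply add together. The following is an illustrative sketch, not the DashVector implementation, of how a hybrid score decomposes into a dense inner product plus a sparse inner product over shared indices:

```python
# Illustrative only: hybrid similarity under the dot product metric.
# Dense and sparse parts add, because both are inner products over
# disjoint coordinate spaces.
def hybrid_score(dense_q, sparse_q, dense_d, sparse_d):
    dense_part = sum(q * d for q, d in zip(dense_q, dense_d))
    # Sparse vectors are dicts of {index: weight}; only indices present
    # in both vectors contribute to the inner product.
    sparse_part = sum(w * sparse_d.get(i, 0.0) for i, w in sparse_q.items())
    return dense_part + sparse_part

score = hybrid_score(
    [0.1, 0.1, 0.1, 0.1],                  # dense query vector
    {1169440797: 0.29, 2045788977: 0.71},  # sparse query vector (example values)
    [0.1, 0.2, 0.3, 0.4],                  # dense document vector
    {1169440797: 0.89, 2045788977: 0.84},  # sparse document vector (example values)
)
```

A metric such as cosine or Euclidean distance does not decompose this way, which is why only dot product collections can combine both vector types.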
Step 2. Create a sparse vector encoder
Use the built-in encoder
from dashtext import SparseVectorEncoder
encoder = SparseVectorEncoder.default()
import com.aliyun.dashtext.encoder.SparseVectorEncoder;
SparseVectorEncoder encoder = SparseVectorEncoder.getDefaultInstance();
The built-in encoder is trained on the Chinese Wikipedia corpus, and Jieba is used for Chinese text segmentation.
Create an encoder based on your own corpus
from dashtext import SparseVectorEncoder
encoder = SparseVectorEncoder()
# Your own corpus.
corpus = [
"向量检索服务DashVector基于阿里云自研的高效向量引擎Proxima内核,提供具备水平拓展能力的云原生、全托管的向量检索服务",
"DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成",
"从而为包括大模型生态、多模态AI搜索、分子结构分析在内的多种应用场景,提供所需的高效向量检索能力",
"简单灵活、开箱即用的SDK,使用极简代码即可实现向量管理",
"自研向量相似性比对算法,快速高效稳定服务",
"Schema-free设计,通过Schema实现任意条件下的组合过滤查询"
]
# Train the encoder by using your own corpus.
encoder.train(corpus)
import com.aliyun.dashtext.encoder.SparseVectorEncoder;
import java.util.*;
SparseVectorEncoder encoder = new SparseVectorEncoder();
// Your own corpus.
List<String> corpus = Arrays.asList(
"向量检索服务DashVector基于阿里云自研的高效向量引擎Proxima内核,提供具备水平拓展能力的云原生、全托管的向量检索服务",
"DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成",
"从而为包括大模型生态、多模态AI搜索、分子结构分析在内的多种应用场景,提供所需的高效向量检索能力",
"简单灵活、开箱即用的SDK,使用极简代码即可实现向量管理",
"自研向量相似性比对算法,快速高效稳定服务",
"Schema-free设计,通过Schema实现任意条件下的组合过滤查询"
);
// Train the encoder by using your own corpus.
encoder.train(corpus);
The built-in encoder works out of the box without training on your own corpus, which makes it easier to use and gives it better generalization. However, its accuracy may be lower if your corpus contains many domain-specific terms.
An encoder created from your own corpus must be trained on the full corpus in advance, but it provides higher accuracy on that corpus. For more information, see Advanced use.
Select an encoder based on your business requirements. We recommend that you create an encoder from your own corpus if your business involves a large number of terms specific to a certain field.
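To make this trade-off concrete, the following is a deliberately simplified BM25-style encoder. It is illustrative only: dashtext's actual tokenization, term hashing, and weighting differ. It shows why training on your own corpus matters: the IDF table, learned from the corpus, determines which terms receive high weights.

```python
import math
from collections import Counter

class TinyBM25Encoder:
    """Illustrative BM25 sparse encoder; not the dashtext implementation."""

    def __init__(self, k1=1.5, b=0.75):
        self.k1, self.b = k1, b

    def train(self, corpus):
        docs = [doc.split() for doc in corpus]
        self.avgdl = sum(len(d) for d in docs) / len(docs)
        n = len(docs)
        df = Counter(t for d in docs for t in set(d))
        # Classic smoothed IDF: rare terms get larger weights.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def encode_document(self, text):
        tokens = text.split()
        tf, dl = Counter(tokens), len(tokens)
        # Full BM25 term weight: IDF times a saturated term-frequency factor.
        return {
            t: self.idf.get(t, 0.0) * c * (self.k1 + 1)
               / (c + self.k1 * (1 - self.b + self.b * dl / self.avgdl))
            for t, c in tf.items()
        }

    def encode_query(self, text):
        # Queries are short, so only the IDF component is used.
        return {t: self.idf.get(t, 0.0) for t in set(text.split())}
```

With whitespace tokenization this sketch only handles English-like text; dashtext additionally hashes each term to a numeric index and, for Chinese, segments text with Jieba. A term that never appears in the training corpus gets weight 0, which is exactly why a domain-specific corpus improves accuracy for domain-specific terms.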
Step 3. Insert a document containing a sparse vector
from dashvector import Doc
document = "向量检索服务DashVector基于阿里云自研的高效向量引擎Proxima内核,提供具备水平拓展能力的云原生、全托管的向量检索服务。"
doc_sparse_vector = encoder.encode_documents(document)
print(doc_sparse_vector)
# Output based on the built-in encoder:
# {380823393: 0.7262431704356519, 414191989: 0.7262431704356519, 565176162: 0.7262431704356519, 904594806: 0.7262431704356519, 1005505802: 0.7262431704356519, 1169440797: 0.8883757984694465, 1240922502: 0.7262431704356519, 1313971048: 0.7262431704356519, 1317077351: 0.7262431704356519, 1490140460: 0.7262431704356519, 1574737055: 0.7262431704356519, 1760434515: 0.7262431704356519, 2045788977: 0.8414146776926797, 2141666983: 0.7262431704356519, 2509543087: 0.7262431704356519, 3180265193: 0.7262431704356519, 3845702398: 0.7262431704356519, 4106887295: 0.7262431704356519}
collection.insert(Doc(
id='A',
vector=[0.1, 0.2, 0.3, 0.4],
sparse_vector=doc_sparse_vector
))
String document = "向量检索服务DashVector基于阿里云自研的高效向量引擎Proxima内核,提供具备水平拓展能力的云原生、全托管的向量检索服务。";
Map<Long, Float> sparseVector = encoder.encodeDocuments(document);
System.out.println(sparseVector);
// Output based on the built-in encoder:
// {380823393=0.7262431704356519, 414191989=0.7262431704356519, 565176162=0.7262431704356519, 904594806=0.7262431704356519, 1005505802=0.7262431704356519, 1169440797=0.8883757984694465, 1240922502=0.7262431704356519, 1313971048=0.7262431704356519, 1317077351=0.7262431704356519, 1490140460=0.7262431704356519, 1574737055=0.7262431704356519, 1760434515=0.7262431704356519, 2045788977=0.8414146776926797, 2141666983=0.7262431704356519, 2509543087=0.7262431704356519, 3180265193=0.7262431704356519, 3845702398=0.7262431704356519, 4106887295=0.7262431704356519}
Vector vector = Vector.builder().value(Arrays.asList(0.1f, 0.2f, 0.3f, 0.4f)).build();
// Build a Doc object containing a sparse vector.
Doc doc = Doc.builder()
.id("28")
.sparseVector(sparseVector)
.vector(vector)
.build();
// Insert the document containing a sparse vector.
Response<Void> response = collection.insert(InsertDocRequest.builder().doc(doc).build());
Step 4. Perform a keyword-aware semantic search
query = "什么是向量检索服务?"
sparse_vector = encoder.encode_queries(query)
print(sparse_vector)
# Output based on the built-in encoder:
# {1169440797: 0.2947158712590364, 2045788977: 0.7052841287409635}
docs = collection.query(
vector=[0.1, 0.1, 0.1, 0.1],
sparse_vector=sparse_vector
)
String query = "什么是向量检索服务?";
Map<Long, Float> sparseVector = encoder.encodeQueries(query);
System.out.println(sparseVector);
// Output based on the built-in encoder:
// {1169440797=0.2947158712590364, 2045788977=0.7052841287409635}
Vector vector = Vector.builder().value(Arrays.asList(0.1f, 0.2f, 0.3f, 0.4f)).build();
// Build a QueryDocRequest object.
QueryDocRequest request = QueryDocRequest.builder()
.vector(vector)
.sparseVector(sparseVector)
.topk(100)
.includeVector(true)
.build();
Response<List<Doc>> response = collection.query(request);
System.out.println(response);
Step 5. Perform a weighted keyword-aware semantic search
from dashtext import combine_dense_and_sparse
query = "什么是向量检索服务?"
sparse_vector = encoder.encode_queries(query)
# Specify the weight factor.
alpha = 0.7
dense_vector = [0.1, 0.1, 0.1, 0.1]
scaled_dense_vector, scaled_sparse_vector = combine_dense_and_sparse(dense_vector, sparse_vector, alpha)
docs = collection.query(
vector=scaled_dense_vector,
sparse_vector=scaled_sparse_vector
)
String query = "什么是向量检索服务?";
Map<Long, Float> sparseVector = encoder.encodeQueries(query);
System.out.println(sparseVector);
// Output based on the built-in encoder:
// {1169440797=0.2947158712590364, 2045788977=0.7052841287409635}
Vector denseVector = Vector.builder().value(Arrays.asList(0.1f, 0.2f, 0.3f, 0.4f)).build();
// Scale the dense vector by alpha and the sparse vector by (1 - alpha).
final float alpha = 0.7f;
sparseVector.replaceAll((key, value) -> value * (1 - alpha));
denseVector = Vector.builder().value(
denseVector.getValue().stream().map(number -> number.floatValue() * alpha).collect(Collectors.toList())
).build();
// Build a QueryDocRequest object.
QueryDocRequest request = QueryDocRequest.builder()
.vector(denseVector)
.sparseVector(sparseVector)
.topk(100)
.includeVector(true)
.build();
Response<List<Doc>> response = collection.query(request);
System.out.println(response);
The alpha parameter controls the relative weights of the dense and sparse vector distances: if alpha is set to 0.0, only the sparse vector is used for distance measurement; if alpha is set to 1.0, only the dense vector is used.
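The weighting performed by combine_dense_and_sparse can be sketched as follows. This is an assumption based on the alpha semantics described above, not dashtext's exact implementation: the dense vector is scaled by alpha and the sparse vector by (1 - alpha).

```python
def combine(dense_vector, sparse_vector, alpha):
    """Illustrative sketch: scale dense by alpha and sparse by (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0.0, 1.0]")
    scaled_dense = [v * alpha for v in dense_vector]
    scaled_sparse = {k: v * (1 - alpha) for k, v in sparse_vector.items()}
    return scaled_dense, scaled_sparse

# alpha = 0.7 biases the search toward the dense (semantic) component.
dense, sparse = combine([0.1, 0.1, 0.1, 0.1], {1169440797: 0.29}, 0.7)
```

Because the dot product is linear, scaling the query vectors this way scales the dense and sparse score contributions by the same factors, so a single query blends both signals without server-side changes.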
API reference
For more information about the DashText API, see the following resources:
SDK for Python: https://pypi.org/project/dashtext/