Vectorize text data by using the vectorization model of Baichuan AI

This topic describes how to vectorize text data by using the vectorization model of Baichuan AI and import the vector data into DashVector for vector search.

Prerequisites

DashVector:
- A cluster is created. For more information, see Create a cluster.
- An API key is obtained. For more information, see Manage API keys.
- The latest version of the DashVector SDK is installed. For more information, see Install DashVector SDK.
Baichuan AI:
- An API key is obtained. For more information, see API introduction.

Vectorization model of Baichuan AI

Overview

Model name	Vector dimensions	Distance metric	Vector data type	Remarks
Baichuan-Text-Embedding	1,024	Cosine	Float32	Maximum number of tokens per input text: 512. If an input text exceeds 512 tokens, the excess is automatically truncated. Maximum number of input texts per request: 16
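
The per-request limit of 16 in the table above means that a longer list of texts has to be split into several calls. The following is a minimal sketch of one way to do that. The names `batched`, `embed_all`, and `MAX_TEXTS_PER_REQUEST` are hypothetical helpers introduced for illustration, and `embed` stands for any function that maps a list of texts to a list of vectors, such as the `generate_embeddings` function in the example later in this topic.

```python
from typing import Callable, Iterator, List, Sequence

# Assumption: taken from the limit of 16 listed in the model table.
MAX_TEXTS_PER_REQUEST = 16


def batched(texts: Sequence[str], size: int = MAX_TEXTS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(texts), size):
        yield list(texts[start:start + size])


def embed_all(texts: Sequence[str],
              embed: Callable[[List[str]], List[List[float]]]) -> List[List[float]]:
    """Call `embed` once per batch and return the vectors in input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts):
        vectors.extend(embed(batch))
    return vectors
```

Passing the embedding function as a parameter keeps the batching logic independent of the HTTP call, so it can be checked with a stub before any API key is used.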

Note

For more information about the vectorization model of Baichuan AI, see Baichuan AI vectorization model.

Example

Note

You must perform the following operations for the code to run properly:

- Replace {your-dashvector-api-key} in the sample code with your DashVector API key.
- Replace {your-dashvector-cluster-endpoint} in the sample code with the endpoint of your DashVector cluster.
- Replace {your-baichuan-api-key} in the sample code with your Baichuan AI API key.
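
As an alternative to pasting keys into the source file, the three values can be read from environment variables. The following is a minimal sketch under that assumption. The variable names `DASHVECTOR_API_KEY`, `DASHVECTOR_ENDPOINT`, and `BAICHUAN_API_KEY` and the helper `load_settings` are hypothetical and are not defined by either service.

```python
import os
from typing import Dict, Mapping

# Hypothetical variable names chosen for this sketch.
REQUIRED_VARS = ("DASHVECTOR_API_KEY", "DASHVECTOR_ENDPOINT", "BAICHUAN_API_KEY")


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the three required values, or raise if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned values can then be passed where the sample code uses the placeholders, for example `api_key=settings['DASHVECTOR_API_KEY']` when the client is created.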

Python

from dashvector import Client
import requests
from typing import List


# Use the vectorization model of Baichuan AI to embed text data into vector data.
def generate_embeddings(texts: List[str]):
    headers = {
      'Content-Type': 'application/json',
      'Authorization': 'Bearer {your-baichuan-api-key}'
    }
    data = {'input': texts, 'model': 'Baichuan-Text-Embedding'}
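    # Use HTTPS so that the API key is not sent in plaintext.
    response = requests.post('https://api.baichuan-ai.com/v1/embeddings', headers=headers, json=data)
    # Raise an exception on HTTP errors instead of failing later on a missing "data" key.
    response.raise_for_status()
    return [record["embedding"] for record in response.json()["data"]]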


# Create a DashVector client.
client = Client(
    api_key='{your-dashvector-api-key}',
    endpoint='{your-dashvector-cluster-endpoint}'
)

# Create a DashVector collection.
rsp = client.create('baichuan-text-embedding', 1024)
assert rsp
collection = client.get('baichuan-text-embedding')
assert collection

# Convert text into a vector and store it in DashVector.
collection.insert(
    ('ID1', generate_embeddings(['Alibaba Cloud DashVector is one of the best vector databases in performance and cost-effectiveness.'])[0])
)

# Perform a vector search.
docs = collection.query(
    generate_embeddings(['The best vector database'])[0]
)
print(docs)
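
When several texts are stored at once, it can help to check the returned vectors before writing them, because the collection above is created with 1,024 dimensions. The following is a minimal sketch of such a check. The helper `build_records` and the constant `EXPECTED_DIMENSION` are hypothetical names introduced for illustration, and `embed` stands for any function with the same shape as `generate_embeddings`.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: matches the dimension passed to client.create() in the sample.
EXPECTED_DIMENSION = 1024

Embedder = Callable[[List[str]], List[List[float]]]


def build_records(ids: Sequence[str],
                  texts: Sequence[str],
                  embed: Embedder) -> List[Tuple[str, List[float]]]:
    """Embed `texts` and pair each vector with its ID after basic validation."""
    if len(ids) != len(texts):
        raise ValueError("ids and texts must have the same length")
    vectors = embed(list(texts))
    if len(vectors) != len(texts):
        raise ValueError("expected one vector per input text")
    for doc_id, vector in zip(ids, vectors):
        if len(vector) != EXPECTED_DIMENSION:
            raise ValueError(
                "vector for %s has %d dimensions, expected %d"
                % (doc_id, len(vector), EXPECTED_DIMENSION)
            )
    return list(zip(ids, vectors))
```

Assuming that `insert` accepts a list of `(id, vector)` tuples in the same way as the single tuple in the sample, the result could be written with `collection.insert(build_records(['ID1', 'ID2'], [text1, text2], generate_embeddings))`, where `text1` and `text2` are your own strings.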