Vector Retrieval Service: DashVector + ModelScope: Implement multi-modal searches

Last Updated: Apr 11, 2024

This topic describes how to implement real-time text-to-image searches by using DashVector and ChineseCLIP, a multi-modal search model that is available on ModelScope. In this example, the Multimodal Understanding and Generation Evaluation (MUGE) dataset is used as the image corpus, and users can search for the most similar images by text.

Overall process

The process consists of the following two steps:

  1. Generate image embeddings and store them in DashVector: Convert the MUGE dataset into high-dimensional vectors by using the embedding API of the ChineseCLIP model and write the vectors into DashVector.

  2. Search for images by text: Obtain the embedding of the query text by using the ChineseCLIP model and search for similar images in DashVector.

Preparations

1. API key

You must obtain a DashVector API key and the endpoint of your cluster in advance. Both are required in the sample code.

2. Environment

In this example, the latest ChineseCLIP Huge (224-pixel resolution) model from ModelScope is used. The model is trained on a large-scale Chinese dataset of approximately 200 million image-text pairs, and performs well at searching images by Chinese text and at extracting image and text embeddings. The following environment dependencies are required, as described on the model details page on ModelScope:

Note

Python 3.7 or later is required.

# Install dashvector.
pip3 install dashvector

# Install modelscope.
# modelscope 0.3.7 or later is required. By default, a version later than 0.3.7 is installed. After the installation, check the installed version.
# You can install modelscope by updating the image or by running the following command:
pip3 install --upgrade modelscope -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
# Install decord separately.
# pip3 install decord
# Install the other dependencies that are required by modelscope.
# pip3 install torch torchvision opencv-python timm librosa fairseq transformers unicodedata2 zhconv rapidfuzz
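
After the installation, you can confirm the modelscope version with a quick check like the following. This is a minimal sketch; it only assumes that the modelscope package exposes its version string:

# Print the installed modelscope version to confirm that it meets the requirement.
import modelscope
print(modelscope.__version__)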

3. Data

In this example, the MUGE validation dataset is used as the image corpus for embedding generation. You can obtain the dataset by using the dataset API of ModelScope.

from modelscope.msdatasets import MsDataset

dataset = MsDataset.load("muge", split="validation")
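
The following sketch shows how you might inspect the loaded dataset. It assumes, as in the indexing code below, that each record stores the original image in the 'image' field:

# Print the number of records and look at the first image.
print(len(dataset))
print(dataset[0]['image'])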

Procedure

Note

In the sample code, you must replace {your-dashvector-api-key} with your API key and {your-dashvector-cluster-endpoint} with the endpoint of your cluster for the code to run properly.

1. Generate image embeddings and store them in DashVector

The MUGE validation dataset contains multi-modal information for 30,588 images. In this example, the embeddings of the original images are extracted by using the ChineseCLIP model and stored in DashVector. The original images are also Base64-encoded and stored in DashVector so that the search results can be displayed later. Sample code:

import torch
from modelscope.utils.constant import Tasks
from modelscope.pipelines import pipeline
from modelscope.msdatasets import MsDataset
from dashvector import Client, Doc, DashVectorException, DashVectorCode
from PIL import Image
import base64
import io


def image2str(image):
    image_byte_arr = io.BytesIO()
    image.save(image_byte_arr, format='PNG')
    image_bytes = image_byte_arr.getvalue()
    return base64.b64encode(image_bytes).decode()


if __name__ == '__main__':
    # Initialize the DashVector client.
    client = Client(
      api_key='{your-dashvector-api-key}',
      endpoint='{your-dashvector-cluster-endpoint}'
    )

    # Create a collection by specifying the collection name and the number of vector dimensions. In the Huge model of ChineseCLIP, the number of vector dimensions is 1,024.
    rsp = client.create('muge_embedding', 1024)
    if not rsp:
        raise DashVectorException(rsp.code, reason=rsp.message)

    # Batch generate and store image embeddings.
    collection = client.get('muge_embedding')
    pipe = pipeline(task=Tasks.multi_modal_embedding,
                    model='damo/multi-modal_clip-vit-huge-patch14_zh', 
                    model_revision='v1.0.0')
    ds = MsDataset.load("muge", split="validation")

    BATCH_COUNT = 10
    TOTAL_DATA_NUM = len(ds)
    print(f"Start indexing muge validation data, total data size: {TOTAL_DATA_NUM}, batch size:{BATCH_COUNT}")
    idx = 0
    while idx < TOTAL_DATA_NUM:
        batch_range = range(idx, idx + BATCH_COUNT) if idx + BATCH_COUNT <= TOTAL_DATA_NUM else range(idx, TOTAL_DATA_NUM)
        images = [ds[i]['image'] for i in batch_range]
        # Generate image embeddings by using the ChineseCLIP model.
        image_embeddings = pipe.forward({'img': images})['img_embedding']
        image_vectors = image_embeddings.detach().cpu().numpy()
        collection.insert(
            [
                Doc(
                    id=str(img_id),
                    vector=img_vec,
                    fields={'png_img': image2str(img)}
                )
                for img_id, img_vec, img in zip(batch_range, image_vectors, images)
            ]
        )
        idx += BATCH_COUNT
    print("Finish indexing muge validation data")

Note

In the preceding code, the model runs on the CPU by default. If the model runs on a GPU, performance improves to varying degrees, depending on the GPU.
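
For example, to run the model on a GPU when one is available, you can pass a device to the pipeline when you create it. This is a minimal sketch: the device argument and its accepted values follow the ModelScope pipeline API, so verify them against the modelscope version that you install.

import torch
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Use the GPU if one is available; otherwise, fall back to the CPU.
# The accepted values of the device argument (such as 'gpu' and 'cpu') depend on the modelscope version.
device = 'gpu' if torch.cuda.is_available() else 'cpu'
pipe = pipeline(task=Tasks.multi_modal_embedding,
                model='damo/multi-modal_clip-vit-huge-patch14_zh',
                model_revision='v1.0.0',
                device=device)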

2. Search for images by text

Obtain the embedding of the query text by using the ChineseCLIP model. Then, search for similar images by using the query API of DashVector. Sample code:

import torch
from modelscope.utils.constant import Tasks
from modelscope.pipelines import pipeline
from modelscope.msdatasets import MsDataset
from dashvector import Client, Doc, DashVectorException
from PIL import Image
import base64
import io


def str2image(image_str):
    image_bytes = base64.b64decode(image_str)
    return Image.open(io.BytesIO(image_bytes))


def multi_modal_search(input_text):
    # Initialize the DashVector client.
    client = Client(
      api_key='{your-dashvector-api-key}',
      endpoint='{your-dashvector-cluster-endpoint}'
    )

    # Obtain the collection that stores the relevant embeddings.
    collection = client.get('muge_embedding')

    # Obtain the embedding of the query text.
    pipe = pipeline(task=Tasks.multi_modal_embedding,
                    model='damo/multi-modal_clip-vit-huge-patch14_zh', model_revision='v1.0.0')
    text_embedding = pipe.forward({'text': input_text})['text_embedding']  # 2D tensor of shape [number of input texts, embedding dimension]
    text_vector = text_embedding.detach().cpu().numpy()[0]

    # Search in DashVector.
    rsp = collection.query(text_vector, topk=3)
    image_list = list()
    for doc in rsp:
        image_str = doc.fields['png_img']
        image_list.append(str2image(image_str))
    return image_list


if __name__ == '__main__':
    text_query = "戴眼镜的狗"  # "A dog wearing glasses"
    
    images = multi_modal_search(text_query)
    for img in images:
        # Note: You may need to install an image viewer on the Linux server for the show() function to work.
        # We recommend that you run the code on a server that supports Jupyter Notebook.
        img.show()
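
If no image viewer is available, a simple alternative is to save the returned images to local files instead of calling show(). This is an illustrative sketch; the file names are arbitrary:

# Save the returned images to PNG files so that they can be viewed later.
for i, img in enumerate(images):
    img.save(f"result_{i}.png")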

After the preceding code is run, the three images that are most similar to the query text are returned and displayed.
