Use TairVector and CLIP to implement high-performance cross-modal image-text retrieval - Tair (Redis® OSS-Compatible)

This topic describes how to use TairVector to implement cross-modal image-text retrieval.

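Background information

The Internet is awash in unstructured information. In this context, models such as the Contrastive Language-Image Pre-Training (CLIP) neural network have emerged to meet the need for cross-modal data retrieval. DAMO Academy provides an open source CLIP model with built-in Text Transformer and ResNet architectures that extracts features from unstructured data and parses them into structured data.

With Tair, you can use CLIP to preprocess unstructured data such as images and documents, store the results in a Tair instance, and then use the approximate nearest neighbor (ANN) search algorithms of TairVector to implement cross-modal image-text retrieval. For more information about TairVector, see "Vector".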

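Solution overview

1. Download images as test data.
   In this example, the following test data is used:
   - Images: more than 7,000 images of pets from an open source dataset of Extreme Mart.
   - Texts: "a dog", "a white dog", and "a running white dog".
2. Connect to a Tair instance. For the implementation, see the get_tair function in the sample code in this topic.
3. Create two vector indexes in the Tair instance, one for the feature vectors of images and one for the feature vectors of texts. For the implementation, see the create_index function in the sample code in this topic.
4. Write the preceding images and texts to the Tair instance.
   In this step, CLIP is used to preprocess the images and texts. Then, the TVS.HSET command of TairVector is run to write the names and feature information of the images and texts to the Tair instance. For the implementation, see the insert_images and upsert_text functions in the sample code in this topic.
5. Perform cross-modal retrieval on the Tair instance.
   - Use a text to retrieve images
     Use CLIP to preprocess the text, and then run the TVS.KNNSEARCH command of TairVector to retrieve from the Tair instance the images that are most similar to what the text describes. For the implementation, see the query_images_by_text function in the sample code in this topic.
   - Use an image to retrieve texts
     Use CLIP to preprocess the image, and then run the TVS.KNNSEARCH command of TairVector to retrieve from the Tair instance the texts that are most similar to what the image shows. For the implementation, see the query_texts_by_image function in the sample code in this topic.
Note
- You do not need to store the text or image that you use as the query in the Tair instance.
- In TVS.KNNSEARCH, you can use the topK parameter to specify the number of results to return. The smaller the value of the distance parameter, the higher the similarity between the query text or image and the retrieved data.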
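
To make the relationship between topK and distance concrete, the following is a minimal sketch of an exact (brute-force) nearest-neighbor search in pure Python. It is not how TairVector works internally: TairVector runs an approximate search over an HNSW index. The helper names l2_normalize and brute_force_knn, the 3-dimensional toy vectors, and the formula distance = 1 - inner product are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length, as the sample code does for CLIP features."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def brute_force_knn(
    index: Dict[str, Sequence[float]],
    query: Sequence[float],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Return up to top_k (key, distance) pairs, smallest distance first.

    ASSUMPTION: distance is computed as 1 - inner product of unit vectors.
    This is an illustrative definition, not a statement about TairVector.
    """
    q = l2_normalize(query)
    scored = []
    for key, vector in index.items():
        v = l2_normalize(vector)
        inner_product = sum(a * b for a, b in zip(q, v))
        scored.append((key, 1.0 - inner_product))
    # A smaller distance means a closer match, so sort in ascending order.
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]


# Toy 3-dimensional "image" vectors (hypothetical data).
# The features in this topic have 1,024 dimensions.
toy_images = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "cat.jpg": [0.1, 0.9, 0.0],
    "bird.jpg": [0.0, 0.1, 0.9],
}

# The query vector stands in for the feature vector of a text such as "a dog".
results = brute_force_knn(toy_images, query=[1.0, 0.2, 0.0], top_k=2)
for key, distance in results:
    print(f"key : {key}, distance : {distance:.4f}")
```

Because every vector is normalized to unit length, the inner product equals the cosine similarity, which is why ranking by ascending distance returns the most similar items first.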

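Sample code

In this example, Python 3.8 is used, and the following dependencies are installed: tair-py, torch, Pillow (Image), matplotlib (pylab and pyplot), and CLIP (imported as cn_clip). You can run the pip3 install tair command to install tair-py.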

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from tair import Tair
from tair.tairvector import DistanceMetric
from tair import ResponseError

from typing import List
import torch
from PIL import Image
import pylab
from matplotlib import pyplot as plt
import os
import cn_clip.clip as clip
from cn_clip.clip import available_models


def get_tair() -> Tair:
    """
    This method is used to connect to a Tair instance. 
    * host: the endpoint that is used to connect to the Tair instance. 
    * port: the port number that is used to connect to the Tair instance. Default value: 6379. 
    * password: the password of the default database account of the Tair instance. If you want to connect to the Tair instance by using a custom database account, you must specify the password in the username:password format. 
    """
    tair: Tair = Tair(
        host="r-8vbehg90y9rlk9****pd.redis.rds.aliyuncs.com",
        port=6379,
        db=0,
        password="D******3",
        decode_responses=True
    )
    return tair


def create_index():
    """
    Create two vector indexes in the Tair instance, one for feature vectors of images and one for feature vectors of texts.
    The vector index for feature vectors of images is named index_images. The vector index for feature vectors of texts is named index_texts. 
    * The vector dimension is 1024. 
    * The inner product formula is used. 
    * The Hierarchical Navigable Small World (HNSW) indexing algorithm is used. 
    """
    ret = tair.tvs_get_index("index_images")
    if ret is None:
        tair.tvs_create_index("index_images", 1024, distance_type="IP",
                              index_type="HNSW")
    ret = tair.tvs_get_index("index_texts")
    if ret is None:
        tair.tvs_create_index("index_texts", 1024, distance_type="IP",
                              index_type="HNSW")


def insert_images(image_dir):
    """
    Specify the directory of the images that you want to store in the Tair instance. This method automatically traverses all images in this directory. 
    Additionally, this method calls the extract_image_features method to use CLIP to preprocess the images, returns the feature vectors of these images, and then stores these feature vectors in the Tair instance. 
    The feature vector of an image is stored with the following information:
    * Vector index name: index_images. 
    * Key: the image path that contains the image name. Example: test/images/boxer_18.jpg. 
    * Feature information: The vector dimension is 1024. 
    """
    file_names = [f for f in os.listdir(image_dir) if (f.endswith('.jpg') or f.endswith('.jpeg'))]
    for file_name in file_names:
        image_feature = extract_image_features(image_dir + "/" + file_name)
        tair.tvs_hset("index_images", image_dir + "/" + file_name, image_feature)


def extract_image_features(img_name):
    """
    This method uses CLIP to preprocess images and returns the feature vectors of these images. The vector dimension is 1024. 
    """
    image_data = Image.open(img_name).convert("RGB")
    infer_data = preprocess(image_data)
    infer_data = infer_data.unsqueeze(0).to("cuda")
    with torch.no_grad():
        image_features = model.encode_image(infer_data)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    return image_features.cpu().numpy()[0]  # shape: (1024,)


def upsert_text(text):
    """
    Specify the texts that you want to store in the Tair instance. This method calls the extract_text_features method to use CLIP to preprocess the texts, returns the feature vectors of these texts, and then stores the feature vectors in the Tair instance. 
    The feature vector of a text is stored with the following information:
    * Vector index name: index_texts. 
    * Key: the text content. Example: a running dog. 
    * Feature information: The vector dimension is 1024. 
    """
    text_features = extract_text_features(text)
    tair.tvs_hset("index_texts", text, text_features)


def extract_text_features(text):
    """
    This method uses CLIP to preprocess texts and returns the feature vectors of these texts. The vector dimension is 1024. 
    """
    text_data = clip.tokenize([text]).to("cuda")
    with torch.no_grad():
        text_features = model.encode_text(text_data)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    return text_features.cpu().numpy()[0]  # shape: (1024,)


def query_images_by_text(text, topK):
    """
    This method uses a text to retrieve images. 
    Specify the text content as the text parameter and the number of results that you want Tair to return as the topK parameter. 
    This method uses CLIP to preprocess the text and runs the TVS.KNNSEARCH command of TairVector to retrieve images that are the most similar to what the text describes from the Tair instance. 
    This method prints the distance values and keys of the retrieved images and displays each image. The smaller the distance value of a retrieved image, the higher the similarity between the specified text and the image. 
    """
    text_feature = extract_text_features(text)
    result = tair.tvs_knnsearch("index_images", topK, text_feature)
    for k, s in result:
        print(f'key : {k}, distance : {s}')
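        # Keys are str when decode_responses=True; decode only if bytes are returned.
        img = Image.open(k.decode('utf-8') if isinstance(k, bytes) else k)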
        plt.imshow(img)
        pylab.show()


def query_texts_by_image(image_path, topK=3):
    """
    This method uses an image to retrieve texts. 
    Specify the number of results that you want Tair to return as the topK parameter value and the image path. 
    This method uses CLIP to preprocess the image and runs the TVS.KNNSEARCH command of TairVector to retrieve texts that are the most similar to what the image shows from the Tair instance. 
    This method prints the distance values and keys of the retrieved texts. The smaller the distance value of a retrieved text, the higher the similarity between the specified image and the text. 
    """
    image_feature = extract_image_features(image_path)
    result = tair.tvs_knnsearch("index_texts", topK, image_feature)
    for k, s in result:
        print(f'text : {k}, distance : {s}')

if __name__ == "__main__":
    # Connect to the Tair instance, then create two vector indexes, one for feature vectors of images and one for feature vectors of texts. 
    tair = get_tair()
    create_index()
    # Load CLIP. 
    model, preprocess = clip.load_from_name("RN50", device="cuda", download_root="./")
    model.eval()
    
    # Specify the directory that contains the sample images. Example: /home/CLIP_Demo. 
    insert_images("/home/CLIP_Demo")
    # Write the following sample texts: "a dog", "a white dog", "a running white dog". 
    upsert_text("a dog")
    upsert_text("a white dog")
    upsert_text("a running white dog")

    # Use the "a running dog" text to retrieve three images that show a running dog. 
    query_images_by_text("a running dog", 3)
    # Specify the path of an image to retrieve texts that describe what the image shows. 
    query_texts_by_image("/home/CLIP_Demo/boxer_18.jpg",3)
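
The insert_images function in the sample matches only lowercase .jpg and .jpeg file names and builds paths by string concatenation. The following sketch shows one possible way to collect image paths more defensively. The helper name list_image_paths and the demo file names are hypothetical and are not part of the sample or of tair-py.

```python
import os
import tempfile
from typing import Iterable, List


def list_image_paths(
    image_dir: str,
    extensions: Iterable[str] = (".jpg", ".jpeg"),
) -> List[str]:
    """Return sorted paths of regular files in image_dir whose extension matches.

    The extension check ignores case, so names such as photo.JPG are included.
    """
    allowed = tuple(ext.lower() for ext in extensions)
    paths = []
    for name in sorted(os.listdir(image_dir)):
        path = os.path.join(image_dir, name)
        # Skip subdirectories and files with other extensions.
        if os.path.isfile(path) and name.lower().endswith(allowed):
            paths.append(path)
    return paths


# Demo on a temporary directory that contains empty placeholder files.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("boxer_18.jpg", "Pug_3.JPG", "notes.txt"):
        open(os.path.join(demo_dir, name), "w").close()
    demo_paths = list_image_paths(demo_dir)
    print([os.path.basename(p) for p in demo_paths])
```

Each returned path could then be passed to extract_image_features and used as the key in tvs_hset. Note that a different path format changes the keys that are stored in index_images.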

Results

Use the "a running dog" text to retrieve three images that show a running dog.

Use the following image to retrieve texts that describe what the image shows.

[Image: a running dog (query image)]

The following output shows the results:

{
  "results":[
    {
      "text":"a running white dog",
      "distance": "0.4052203893661499"
    },
    {
      "text":"a white dog",
      "distance": "0.44666868448257446"
    },
    {
      "text":"a dog",
      "distance": "0.4553511142730713"
    }
  ]
}
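
The distance values in this output are strings. If you collect results in this JSON shape, one way to rank them is to convert each distance to a float before you compare. The following sketch assumes exactly the structure shown above; the variable names are illustrative.

```python
import json

# The result payload from this topic, reproduced as a JSON string.
raw = """
{
  "results": [
    {"text": "a running white dog", "distance": "0.4052203893661499"},
    {"text": "a white dog", "distance": "0.44666868448257446"},
    {"text": "a dog", "distance": "0.4553511142730713"}
  ]
}
"""

payload = json.loads(raw)

# Convert distances to float so that the comparison is numeric, not lexicographic.
ranked = sorted(
    ((item["text"], float(item["distance"])) for item in payload["results"]),
    key=lambda pair: pair[1],
)

# The smallest distance marks the text that is most similar to the query image.
best_text, best_distance = ranked[0]
print(f"best match: {best_text} (distance {best_distance:.4f})")
```

The most specific text, "a running white dog", has the smallest distance and is therefore the closest match for the query image.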

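Summary

Tair is an in-memory database service that uses indexing algorithms such as HNSW to accelerate data retrieval. You can combine CLIP with TairVector to implement cross-modal image-text retrieval.

Tair can be used in scenarios such as product recommendation and image-based writing. In addition, you can replace CLIP with other embedding models to implement cross-modal video-text or audio-text retrieval in Tair.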