Use TairVector to implement approximate queries of molecular geometries - Tair (Redis® OSS-Compatible)

This topic describes how to use TairVector to implement approximate queries of molecular geometries.

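Background information

In AI-enabled drug discovery, vectors are commonly used to represent compounds and drugs and to calculate how similar different compounds or drugs are. This allows researchers to predict and optimize the chemical reactions of compounds and drugs. Accelerating the research and development of new drugs in this way requires fast and accurate approximate queries of molecular geometries.

Compared with traditional vector search services, TairVector stores data in memory and supports real-time index updates to reduce read and write latency. TairVector also provides commands for vector nearest neighbor queries, such as TVS.KNNSEARCH. With these commands, researchers can quickly retrieve the molecular geometries that are most similar to a specific molecular geometry, which avoids the errors and losses that manual calculation can introduce.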
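
As a conceptual illustration of a nearest neighbor query, the following plain-Python sketch ranks stored vectors by their distance to a query vector. It is an exhaustive search written for illustration only; it is not how TairVector implements TVS.KNNSEARCH, which relies on an index such as HNSW. The `hex_to_vector`, `squared_l2`, and `knn_search` helpers and the toy 4-byte fingerprints are hypothetical, and the squared Euclidean distance is just one possible L2-style measure.

```python
from typing import Dict, List, Tuple


def hex_to_vector(hex_fp: str) -> List[int]:
    """Turn a hex-encoded fingerprint into a list of byte values (0-255)."""
    return list(bytes.fromhex(hex_fp))


def squared_l2(a: List[int], b: List[int]) -> int:
    """Squared Euclidean distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_search(index: Dict[str, List[int]], query: List[int], k: int) -> List[Tuple[str, int]]:
    """Return the k keys closest to the query, nearest first."""
    scored = [(key, squared_l2(vec, query)) for key, vec in index.items()]
    scored.sort(key=lambda item: item[1])
    return scored[:k]


# Toy 4-byte fingerprints (hypothetical data for this sketch).
index = {
    "mol-a": hex_to_vector("00ff10a0"),
    "mol-b": hex_to_vector("00ff10a1"),
    "mol-c": hex_to_vector("ff000000"),
}
hits = knn_search(index, hex_to_vector("00ff10a0"), k=2)
print(hits)
```

A distance of 0 means that the stored vector is identical to the query vector; larger values mean less similar fingerprints.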

Solution

The following figure shows the workflow.
[Figure: TairVector molecular structure retrieval workflow]

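Download a dataset of molecular geometries in the Simplified Molecular Input Line Entry System (SMILES or SMI) file format.
In this example, 11,012 rows of data from an open source PubChem dataset are used as test data. Each row contains a molecular formula column and a unique ID column.
Note
In actual use cases, you can write more data to Tair and still retrieve vectors within milliseconds.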
```
CCC1=CN=C2C(C(=O)N(C)C(=O)N2C)/C1=N/c1ccc(OC)cc1OC,168000001
CC(C)CN1C(=O)C2SCCC2N2C(=S)NNC12,168000002
CC1=C[NH+]=C2C(C(=O)N(C)C(=O)N2C)/C1=N/c1cccc(C(F)(F)F)c1,168000003
CC1=CN=C2C(C(=O)N(C)C(=O)N2C)/C1=N/c1cccc(C(F)(F)F)c1,168000004
```
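
Each row above holds a SMILES string and a unique ID separated by a comma. The following is a minimal sketch of how such a row might be split before it is written to Tair. `parse_smi_line` is a hypothetical helper that is not part of the sample code in this topic, and it assumes that the ID never contains a comma.

```python
from typing import Optional, Tuple


def parse_smi_line(line: str) -> Optional[Tuple[str, str]]:
    """Split a "SMILES,ID" row into (smiles, mol_id).

    Returns None for blank rows, header rows, and rows without a comma.
    """
    line = line.strip()
    # Treat any row that contains the word "smiles" as a header row.
    if not line or "smiles" in line:
        return None
    # Split on the last comma only; this assumes the ID itself has no comma.
    smiles, sep, mol_id = line.rpartition(",")
    if not sep or not smiles or not mol_id:
        return None
    return smiles, mol_id


print(parse_smi_line("CC(C)CN1C(=O)C2SCCC2N2C(=S)NNC12,168000002"))
```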
If you download the dataset directly from PubChem, you obtain files in the Structure-Data File (SDF) format. In this case, run the following code to convert the files to the SMI file format.
Sample code
```
import os
import sys
from rdkit import Chem


def converter(file_name):
    """Convert an SDF file to a SMI file with one "SMILES,name" row per molecule."""
    out_name = os.path.splitext(file_name)[0] + ".smi"
    with open(out_name, "w") as out_file:
        # SDMolSupplier yields None for records that RDKit cannot parse.
        for mol in Chem.SDMolSupplier(file_name):
            if mol is None:
                continue
            smi = Chem.MolToSmiles(mol)
            name = mol.GetProp("_Name")
            out_file.write("{},{}\n".format(smi, name))


if __name__ == "__main__":
    converter(sys.argv[1])
```
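Connect to the Tair instance. For the specific implementation, see the get_tair function in the following sample code.
Create a vector index in the Tair instance to store molecular geometries. For the specific implementation, see the create_index function in the following sample code.
Write the molecular geometries that you want to search to the Tair instance. For the specific implementation, see the do_load function in the following sample code.
Use RDKit to extract feature vectors from the specified molecular geometries, and run the TVS.HSET command of TairVector to write the unique ID, feature information, and molecular formula of each molecular geometry to the Tair instance.
Query the molecular geometries that are similar to a specified molecular geometry. For the specific implementation, see the do_search function in the following sample code.
Use RDKit to extract the feature vector from the specified molecular geometry, and run the TVS.KNNSEARCH command of TairVector to query the most similar molecular geometries from a specific index in the Tair instance.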
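
The sample code in this topic uses a 4,096-bit fingerprint (512 * 8) and a vector dimension of 512. The following plain-Python sketch illustrates how a bit fingerprint of that length can be packed into a 512-dimensional vector of byte values. `bits_to_vector` is a hypothetical helper, the set bits are made up, and the bit order within each byte is an assumption chosen for illustration; it is not taken from RDKit or TairVector.

```python
from typing import Iterable, List

N_BITS = 512 * 8  # fingerprint length in bits, as in the sample code
N_DIMS = N_BITS // 8  # one vector dimension per byte


def bits_to_vector(on_bits: Iterable[int], n_bits: int = N_BITS) -> List[int]:
    """Pack the positions of set bits into a list of byte values (0-255).

    Assumption for this sketch: bit i is stored in byte i // 8 at
    position i % 8, counted from the least significant bit.
    """
    packed = bytearray(n_bits // 8)
    for bit in on_bits:
        if not 0 <= bit < n_bits:
            raise ValueError("bit index out of range: {}".format(bit))
        packed[bit // 8] |= 1 << (bit % 8)
    return list(packed)


# Hypothetical fingerprint with four bits set.
vector = bits_to_vector([0, 1, 9, 4095])
print(len(vector), vector[0], vector[1], vector[-1])
```

Because every dimension is a byte, each component of the vector falls in the range 0 to 255, which is why a single differing bit can change the distance between two vectors by a large amount.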

Sample code

This example uses Python 3.8. The numpy, rdkit, tair, and matplotlib dependencies are installed by running the pip install command.

import os
import sys
from tair import Tair
from tair.tairvector import DistanceMetric
from rdkit.Chem import Draw, AllChem
from rdkit import DataStructs, Chem
from rdkit import RDLogger
from concurrent.futures import ThreadPoolExecutor
RDLogger.DisableLog('rdApp.*')


def get_tair() -> Tair:
    """
    Connect to the Tair instance. 
    * host: the endpoint that is used to connect to the Tair instance. 
    * port: the port number that is used to connect to the Tair instance. Default value: 6379. 
    * password: the password of the default database account of the Tair instance. If you want to connect to the Tair instance by using a custom database account, you must specify the password in the username:password format. 
    """
    tair: Tair = Tair(
        host="r-bp1mlxv3xzv6kf****pd.redis.rds.aliyuncs.com",
        port=6379,
        db=0,
        password="Da******3",
    )
    return tair


def create_index():
    """
    Create a vector index to store molecular geometries.
    * In this example, the index is named MOLSEARCH_TEST. 
    * The vector dimension is 512. 
    * The Euclidean distance (L2 norm) measure is used. 
    * The Hierarchical Navigable Small World (HNSW) indexing algorithm is used. 
    """
    ret = tair.tvs_get_index(INDEX_NAME)
    if ret is None:
        tair.tvs_create_index(INDEX_NAME, 512, distance_type=DistanceMetric.L2, index_type="HNSW")
    print("create index done")


def do_load(file_path):
    """
    Specify the path of your dataset of molecular geometries. This method automatically extracts feature vectors from molecular geometries by invoking the smiles_to_vector function and writes the feature vectors to TairVector. 
    This method invokes functions such as parallel_submit_lines, handle_line, smiles_to_vector, and insert_data. 
    Write data about a molecular geometry to TairVector in the following formats:
    * Vector index name: MOLSEARCH_TEST. 
    * Unique ID: the key of the molecular geometry. Example: 168000001. 
    * Feature information: The vector dimension is 512. 
    * smiles: the molecular formula of the molecular geometry. Example: CCC1=CN=C2C(C(=O)N(C)C(=O)N2C)/C1=N/c1ccc(OC)cc1OC. 
    """
    num = 0
    lines = []
    with open(file_path, 'r') as f:
        for line in f:
            if line.find("smiles") >= 0:
                continue
            lines.append(line)
            if len(lines) >= 10:
                parallel_submit_lines(lines)
                num += len(lines)
                lines.clear()
                if num % 10000 == 0:
                    print("load num", num)
    if len(lines) > 0:
        parallel_submit_lines(lines)
    print("load done")


def parallel_submit_lines(lines):
    """
    Call this method for concurrent writes. 
    """
    with ThreadPoolExecutor(len(lines)) as t:
        for line in lines:
            t.submit(handle_line, line=line)


def handle_line(line):
    """
    Write a single molecular geometry. 
    """
    if line.find("smiles") >= 0:
        return
    parts = line.strip().split(',')
    try:
        ids = parts[1]
        smiles = parts[0]
        vec = smiles_to_vector(smiles)
        insert_data(ids, smiles, vec)
    except Exception as result:
        print(result)


def smiles_to_vector(smiles):
    """
    Extract feature vectors from molecular geometries and convert the extracted data from the SMI file format to vectors. 
    """
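    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit returns None when the SMILES string cannot be parsed.
        raise ValueError("invalid SMILES: {}".format(smiles))
    # Morgan fingerprint with radius 2 and 4,096 bits (512 bytes).
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, 512 * 8)
    # The hex text is decoded byte by byte, which yields a 512-dimensional vector.
    hex_fp = DataStructs.BitVectToFPSText(fp)
    vec = list(bytearray.fromhex(hex_fp))
    return vec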


def insert_data(id, smiles, vector):
    """
    Write the vectors of molecular geometries to TairVector. 
    """
    attr = {'smiles': smiles}
    tair.tvs_hset(INDEX_NAME, id, vector, **attr)


def do_search(search_smiles,k):
    """
    Specify the molecular geometry for which you want to perform an approximate query. This method queries and returns k molecular geometries that are the most similar to the specified molecular geometry from a specific index in the Tair instance. 
    This method extracts the feature vector of the specified molecular geometry, runs the TVS.KNNSEARCH command to query the IDs of k molecular geometries that are the most similar to the specified molecular geometry, and then runs the TVS.HMGET command to query the molecular formulas of these molecular geometries. In this example, k is set to 10. 
    """
    vector = smiles_to_vector(search_smiles)
    result = tair.tvs_knnsearch(INDEX_NAME, k, vector)
    print("The following molecular geometries that are the most similar to the specified molecular geometry are returned:")
    for key, value in result:
        similar_smiles = tair.tvs_hmget(INDEX_NAME, key, "smiles")
        print(key, value, similar_smiles)


if __name__ == "__main__":
    # Connect to the Tair instance and create a vector index named MOLSEARCH_TEST. 
    tair = get_tair()
    INDEX_NAME = "MOLSEARCH_TEST"
    create_index()
    # Write sample data. 
    do_load(r"D:\Test\Compound_168000001_168500000.smi")
    # Query 10 molecular geometries that are the most similar to the CCOC(=O)N1CCC(NC(=O)CN2CCN(c3cc(C)cc(C)c3)C(=O)C2=O)CC1 molecular formula from the MOLSEARCH_TEST index. 
    do_search("CCOC(=O)N1CCC(NC(=O)CN2CCN(c3cc(C)cc(C)c3)C(=O)C2=O)CC1",10)

Sample output on success:

create index done
load num 10000
load done
The following molecular geometries that are the most similar to the specified molecular geometry are returned:
b'168000009' 0.0 ['CCOC(=O)N1CCC(NC(=O)CN2CCN(c3cc(C)cc(C)c3)C(=O)C2=O)CC1']
b'168003114' 29534.0 ['Cc1cc(C)cc(N2CCN(CC(=O)NC3CCCC3)C(=O)C2=O)c1']
b'168000210' 60222.0 ['COc1ccc(N2CCN(CC(=O)Nc3cc(C)cc(C)c3)C(=O)C2=O)cc1OC']
b'168001000' 61123.0 ['COc1ccc(N2CCN(CC(=O)Nc3ccc(C)cc3)C(=O)C2=O)cc1OC']
b'168003038' 64524.0 ['CCN1CCN(c2cc(C)cc(C)c2)C(=O)C1=O']
b'168003095' 67591.0 ['O=C(CN1CCN(c2cccc(Cl)c2)C(=O)C1=O)NC1CCCC1']
b'168000396' 70376.0 ['COc1ccc(N2CCN(Cc3ccc(C)cc3)C(=O)C2=O)cc1OC']
b'168002227' 71121.0 ['CCOC(=O)CN1CCN(C2CC2)C(=O)C1=O']
b'168000441' 73197.0 ['Cc1cc(C)cc(NC(=O)CN2CCN(c3ccc(F)c(F)c3)C(=O)C2=O)c1']
b'168000561' 73269.0 ['Cc1cc(C)cc(N2CCN(CC(=O)Nc3ccc(C)cc3C)C(=O)C2=O)c1']
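
In the output above, each key is returned as a bytes object, and the first hit has a distance of 0.0 because its SMILES string matches the query. The following sketch shows one way such rows might be post-processed. `tidy_hits` is a hypothetical helper, and its input is assumed to be a list of (key, distance) pairs like the ones that do_search iterates over.

```python
from typing import Iterable, List, Tuple, Union


def tidy_hits(hits: Iterable[Tuple[Union[bytes, str], float]],
              drop_exact: bool = False) -> List[Tuple[str, float]]:
    """Decode bytes keys and optionally drop hits with a distance of 0."""
    tidy = []
    for key, distance in hits:
        if isinstance(key, bytes):
            key = key.decode("utf-8")
        if drop_exact and distance == 0:
            # Skip exact matches, such as the query molecule itself.
            continue
        tidy.append((key, float(distance)))
    return tidy


# Two rows copied from the sample output.
sample = [(b"168000009", 0.0), (b"168003114", 29534.0)]
print(tidy_hits(sample, drop_exact=True))
```

Dropping exact matches can be useful when the query molecule is already stored in the index and only its neighbors are of interest.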

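Results

As shown in the following figure, you can convert the SMILES representations of molecular geometries into image objects.
[Figure: retrieval of similar molecular structures]

Sample code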

import numpy
from rdkit.Chem import Draw
from rdkit import Chem
import matplotlib.pyplot as plt

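def to_images(data):
    """Render each SMILES string as a 500 x 500 image and display it."""
    imgs = []
    for smiles in data:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            # Skip strings that RDKit cannot parse.
            continue
        img = Draw.MolToImage(mol, size=(500, 500))
        imgs.append(img)
        plt.imshow(img)
        plt.show()
    return imgs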

if __name__ == "__main__":
    images = to_images(["CCOC(=O)N1CCC(NC(=O)CN2CCN(c3cc(C)cc(C)c3)C(=O)C2=O)CC1"])

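Summary

With TairVector, you can query the molecular geometries that are most similar to a given molecular geometry within milliseconds. The more molecular geometries that are stored in the Tair instance, the more accurate the results of approximate queries become. This helps accelerate the R&D of new drugs.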