Implement approximate query for molecular geometries by using TairVector - Tair (Redis® OSS-Compatible)

This topic describes how to use TairVector to implement approximate query for molecular geometries.

Background information

In AI-enabled drug discovery, compounds and pharmaceuticals are commonly represented as vectors so that the similarity between them can be calculated. This enables researchers to predict and optimize the chemical reactions of different compounds or pharmaceuticals. To accelerate the R&D of new pharmaceuticals, this scenario requires quick and accurate approximate queries for molecular geometries based on these vectors.

Compared with conventional vector retrieval services, TairVector stores data in memory and supports real-time updates of indexes to reduce read and write latencies. Additionally, TairVector provides commands for vector nearest neighbor queries, such as TVS.KNNSEARCH, which allows researchers to quickly retrieve the most similar molecular geometries for a given molecular geometry. This prevents mistakes and losses caused by manual calculation.

Solution

The following figure shows the workflow.

[Figure: TairVector molecular structure retrieval workflow]

Download a dataset of molecular geometries in the Simplified Molecular Input Line Entry System (SMILES or SMI) file format.

In this example, 11,012 rows of data from open source datasets on PubChem are used as test data. Each row contains two comma-separated columns: the SMILES string of the molecule and its unique ID.

Note

In an actual use case, you can write more data to Tair to experience its ability to retrieve vectors within milliseconds.

CCC1=CN=C2C(C(=O)N(C)C(=O)N2C)/C1=N/c1ccc(OC)cc1OC,168000001
CC(C)CN1C(=O)C2SCCC2N2C(=S)NNC12,168000002
CC1=C[NH+]=C2C(C(=O)N(C)C(=O)N2C)/C1=N/c1cccc(C(F)(F)F)c1,168000003
CC1=CN=C2C(C(=O)N(C)C(=O)N2C)/C1=N/c1cccc(C(F)(F)F)c1,168000004
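
The rows above can be read with a few lines of plain Python. The following is a minimal sketch, not part of the original sample: the helper name parse_smi_rows is hypothetical, and it assumes that the unique ID is always the last comma-separated column and that an optional header row starts with the word smiles.

```python
from typing import Iterable, Iterator, Tuple


def parse_smi_rows(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (smiles, compound_id) pairs from 'SMILES,ID' text lines."""
    for raw in lines:
        line = raw.strip()
        # Assumption: skip blank lines and an optional header row.
        if not line or line.lower().startswith("smiles"):
            continue
        # Split on the last comma because the ID is the final column.
        smiles, _, compound_id = line.rpartition(",")
        if not smiles or not compound_id:
            continue  # Malformed row: one of the two columns is missing.
        yield smiles, compound_id


sample = [
    "CC(C)CN1C(=O)C2SCCC2N2C(=S)NNC12,168000002\n",
    "CC1=CN=C2C(C(=O)N(C)C(=O)N2C)/C1=N/c1cccc(C(F)(F)F)c1,168000004\n",
]
rows = list(parse_smi_rows(sample))
print(rows[0])
```

Each pair can then be passed to whatever function computes the feature vector and writes it to Tair.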

If you download datasets directly from PubChem, you obtain a file in the Structure-Data File (SDF) format. In this case, you must run the following code to convert the file to the SMI file format.

Sample code

import sys
from rdkit import Chem

def converter(file_name):
    # Derive the output name by replacing the .sdf extension with .smi.
    outname = file_name.split(".sdf")[0] + ".smi"
    with open(outname, "w") as out_file:
        for mol in Chem.SDMolSupplier(file_name):
            # SDMolSupplier yields None for records that RDKit cannot parse.
            if mol is None:
                continue
            smi = Chem.MolToSmiles(mol)
            name = mol.GetProp("_Name")
            out_file.write("{},{}\n".format(smi, name))

if __name__ == "__main__":
    converter(sys.argv[1])

Connect to your Tair instance. For the specific code implementation, refer to the get_tair function in the following sample code.
Create a vector index in the Tair instance to store molecular geometries. For the specific code implementation, refer to the create_index function in the following sample code.
Write the molecular geometry for which you want to query similar molecular geometries. For the specific code implementation, refer to the do_load function in the following sample code.
RDKit is used to extract the feature vector from the specified molecular geometry, and the TVS.HSET command of TairVector is run to write the unique ID, feature information, and molecular formula of the molecular geometry to your Tair instance.
Query similar molecular geometries for the specified molecular geometry. For the specific code implementation, refer to the do_search function in the following sample code.
RDKit is used to extract the feature vector from the specified molecular geometry, and the TVS.KNNSEARCH command of TairVector is run to query the most similar molecular geometries from a specific index in your Tair instance.
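
To make the last two steps more concrete, the following sketch shows, in plain Python, what a nearest neighbor query by Euclidean distance returns for byte-valued fingerprint vectors. It is only an illustration under stated assumptions: the helper names hex_to_vector and brute_force_knn are hypothetical, the three-byte hex strings are made-up stand-ins for the 512-byte fingerprints used in this topic, and the exhaustive scan is not how TairVector works internally, because TairVector searches an HNSW index.

```python
import math
from typing import Dict, List, Tuple


def hex_to_vector(hex_fp: str) -> List[int]:
    """Turn a hex-encoded bit fingerprint into a list of byte values (0-255)."""
    return list(bytes.fromhex(hex_fp))


def brute_force_knn(
    index: Dict[str, List[int]], query: List[int], k: int
) -> List[Tuple[str, float]]:
    """Return the k keys whose vectors are closest to the query (L2 distance)."""
    scored = [(key, math.dist(vec, query)) for key, vec in index.items()]
    scored.sort(key=lambda item: item[1])  # Smallest distance first.
    return scored[:k]


# Toy index: made-up 3-byte fingerprints instead of real 512-byte ones.
toy_index = {
    "mol-a": hex_to_vector("00ff10"),
    "mol-b": hex_to_vector("00ff11"),
    "mol-c": hex_to_vector("ff0000"),
}
query = hex_to_vector("00ff10")
nearest = brute_force_knn(toy_index, query, k=2)
print(nearest)
```

An exhaustive scan like this compares the query with every stored vector, which is why an index-based search is preferable for large datasets.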

Sample code

In this example, Python 3.8 is used, and the numpy, rdkit, tair, and matplotlib dependencies are installed by using the pip install command.

import os
import sys
from tair import Tair
from tair.tairvector import DistanceMetric
from rdkit.Chem import Draw, AllChem
from rdkit import DataStructs, Chem
from rdkit import RDLogger
from concurrent.futures import ThreadPoolExecutor
RDLogger.DisableLog('rdApp.*')


def get_tair() -> Tair:
    """
    Connect to the Tair instance. 
    * host: the endpoint that is used to connect to the Tair instance. 
    * port: the port number that is used to connect to the Tair instance. Default value: 6379. 
    * password: the password of the default database account of the Tair instance. If you want to connect to the Tair instance by using a custom database account, you must specify the password in the username:password format. 
    """
    tair: Tair = Tair(
        host="r-bp1mlxv3xzv6kf****pd.redis.rds.aliyuncs.com",
        port=6379,
        db=0,
        password="Da******3",
    )
    return tair


def create_index():
    """
    Create a vector index to store molecular geometries.
    * In this example, the index is named MOLSEARCH_TEST. 
    * The vector dimension is 512. 
    * The Euclidean distance (L2 norm) measure is used. 
    * The Hierarchical Navigable Small World (HNSW) indexing algorithm is used. 
    """
    ret = tair.tvs_get_index(INDEX_NAME)
    if ret is None:
        tair.tvs_create_index(INDEX_NAME, 512, distance_type=DistanceMetric.L2, index_type="HNSW")
    print("create index done")


def do_load(file_path):
    """
    Specify the path of your dataset of molecular geometries. This method automatically extracts feature vectors from molecular geometries by invoking the smiles_to_vector function and writes the feature vectors to TairVector. 
    This method invokes functions such as parallel_submit_lines, handle_line, smiles_to_vector, and insert_data. 
    Write data about a molecular geometry to TairVector in the following formats:
    * Vector index name: MOLSEARCH_TEST. 
    * Unique ID: the key of the molecular geometry. Example: 168000001. 
    * Feature information: The vector dimension is 512. 
    * smiles: the molecular formula of the molecular geometry. Example: CCC1=CN=C2C(C(=O)N(C)C(=O)N2C)/C1=N/c1ccc(OC)cc1OC. 
    """
    num = 0
    lines = []
    with open(file_path, 'r') as f:
        for line in f:
            if line.find("smiles") >= 0:
                continue
            lines.append(line)
            if len(lines) >= 10:
                parallel_submit_lines(lines)
                num += len(lines)
                lines.clear()
                if num % 10000 == 0:
                    print("load num", num)
    if len(lines) > 0:
        parallel_submit_lines(lines)
    print("load done")


def parallel_submit_lines(lines):
    """
    Call this method for concurrent writes. 
    """
    with ThreadPoolExecutor(len(lines)) as t:
        for line in lines:
            t.submit(handle_line, line=line)


def handle_line(line):
    """
    Write a single molecular geometry. 
    """
    if line.find("smiles") >= 0:
        return
    parts = line.strip().split(',')
    try:
        ids = parts[1]
        smiles = parts[0]
        vec = smiles_to_vector(smiles)
        insert_data(ids, smiles, vec)
    except Exception as result:
        print(result)


def smiles_to_vector(smiles):
    """
    Extract feature vectors from molecular geometries and convert the extracted data from the SMI file format to vectors. 
    """
    mols = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mols, 2, 512 * 8)
    hex_fp = DataStructs.BitVectToFPSText(fp)
    vec = list(bytearray.fromhex(hex_fp))
    return vec


def insert_data(id, smiles, vector):
    """
    Write the vectors of molecular geometries to TairVector. 
    """
    attr = {'smiles': smiles}
    tair.tvs_hset(INDEX_NAME, id, vector, **attr)


def do_search(search_smiles,k):
    """
    Specify the molecular geometry for which you want to perform an approximate query. This method queries and returns k molecular geometries that are the most similar to the specified molecular geometry from a specific index in the Tair instance. 
    This method extracts the feature vector of the specified molecular geometry, runs the TVS.KNNSEARCH command to query the IDs of k molecular geometries that are the most similar to the specified molecular geometry, and then runs the TVS.HMGET command to query the molecular formulas of these molecular geometries. In this example, k is set to 10. 
    """
    vector = smiles_to_vector(search_smiles)
    result = tair.tvs_knnsearch(INDEX_NAME, k, vector)
    print("The following molecular geometries that are the most similar to the specified molecular geometry are returned:")
    for key, value in result:
        similar_smiles = tair.tvs_hmget(INDEX_NAME, key, "smiles")
        print(key, value, similar_smiles)


if __name__ == "__main__":
    # Connect to the Tair instance and create a vector index named MOLSEARCH_TEST. 
    tair = get_tair()
    INDEX_NAME = "MOLSEARCH_TEST"
    create_index()
    # Write sample data. 
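    # Use a raw string so that the backslashes in the Windows path are not treated as escape sequences.
    do_load(r"D:\Test\Compound_168000001_168500000.smi")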
    # Query 10 molecular geometries that are the most similar to the CCOC(=O)N1CCC(NC(=O)CN2CCN(c3cc(C)cc(C)c3)C(=O)C2=O)CC1 molecular formula from the MOLSEARCH_TEST index. 
    do_search("CCOC(=O)N1CCC(NC(=O)CN2CCN(c3cc(C)cc(C)c3)C(=O)C2=O)CC1",10)

Sample success output:

create index done
load num 10000
load done
The following molecular geometries that are the most similar to the specified molecular geometry are returned:
b'168000009' 0.0 ['CCOC(=O)N1CCC(NC(=O)CN2CCN(c3cc(C)cc(C)c3)C(=O)C2=O)CC1']
b'168003114' 29534.0 ['Cc1cc(C)cc(N2CCN(CC(=O)NC3CCCC3)C(=O)C2=O)c1']
b'168000210' 60222.0 ['COc1ccc(N2CCN(CC(=O)Nc3cc(C)cc(C)c3)C(=O)C2=O)cc1OC']
b'168001000' 61123.0 ['COc1ccc(N2CCN(CC(=O)Nc3ccc(C)cc3)C(=O)C2=O)cc1OC']
b'168003038' 64524.0 ['CCN1CCN(c2cc(C)cc(C)c2)C(=O)C1=O']
b'168003095' 67591.0 ['O=C(CN1CCN(c2cccc(Cl)c2)C(=O)C1=O)NC1CCCC1']
b'168000396' 70376.0 ['COc1ccc(N2CCN(Cc3ccc(C)cc3)C(=O)C2=O)cc1OC']
b'168002227' 71121.0 ['CCOC(=O)CN1CCN(C2CC2)C(=O)C1=O']
b'168000441' 73197.0 ['Cc1cc(C)cc(NC(=O)CN2CCN(c3ccc(F)c(F)c3)C(=O)C2=O)c1']
b'168000561' 73269.0 ['Cc1cc(C)cc(N2CCN(CC(=O)Nc3ccc(C)cc3C)C(=O)C2=O)c1']
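
In the output above, each hit is a pair of a key returned as bytes and a distance, and a distance of 0.0 means that the stored vector is identical to the query vector. The following sketch shows one way to post-process such pairs. The helper name tidy_hits and its skip_exact parameter are hypothetical, and the input values are copied from the sample output rather than fetched from a Tair instance.

```python
from typing import Iterable, List, Tuple, Union


def tidy_hits(
    hits: Iterable[Tuple[Union[bytes, str], float]], skip_exact: bool = True
) -> List[Tuple[str, float]]:
    """Decode byte keys to text and optionally drop zero-distance hits."""
    tidy = []
    for key, distance in hits:
        key_text = key.decode("utf-8") if isinstance(key, bytes) else str(key)
        # A zero distance means the stored vector equals the query vector.
        if skip_exact and distance == 0.0:
            continue
        tidy.append((key_text, float(distance)))
    return tidy


# First three hits from the sample output above.
hits = [(b"168000009", 0.0), (b"168003114", 29534.0), (b"168000210", 60222.0)]
print(tidy_hits(hits))
```

Dropping zero-distance hits is useful when the query molecule is itself part of the dataset, as in this example.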

Results

You can convert the SMILES representations of molecular geometries to image objects, as shown in the following figure.

[Figure: Retrieval of similar molecular structures]

Sample code

from rdkit import Chem
from rdkit.Chem import Draw
import matplotlib.pyplot as plt

def to_images(data):
    """Render each SMILES string as a 500 x 500 image and display it."""
    imgs = []
    for smiles in data:
        mol = Chem.MolFromSmiles(smiles)
        img = Draw.MolToImage(mol, size=(500, 500))
        imgs.append(img)
        plt.imshow(img)
        plt.show()
    return imgs

if __name__ == "__main__":
    images = to_images(["CCOC(=O)N1CCC(NC(=O)CN2CCN(c3cc(C)cc(C)c3)C(=O)C2=O)CC1"])

Summary

TairVector allows you to query the most similar molecular geometries for a molecular geometry within milliseconds. The more molecular geometries your Tair instance stores, the more candidates a query can match against, and the more likely it is to return close matches. This helps you accelerate the R&D of new pharmaceuticals.