Use the RDKit plug-in - ApsaraDB RDS - Alibaba Cloud Documentation Center

This topic describes how to use the RDKit plug-in of ApsaraDB RDS for PostgreSQL to implement functions such as molecular computing and search.

Prerequisites

Your RDS instance runs PostgreSQL 12.

Background information

RDKit supports two data types: the mol data type that is used to describe molecular types, and the fp data type that is used to describe molecular fingerprints. It allows for comparison computing, similarity computing based on the Tanimoto and Dice coefficients, and GiST indexing.

For more information about the SQL statements that are supported by RDKit, visit RDKit SQL.

Precautions

Input and output functions based on the mol data type comply with the simplified molecular input line entry specification (SMILES).
Input and output functions based on the fp data type comply with the bytea format that is used to store binary data.

Create the RDKit plug-in

postgres=# create extension rdkit ;
CREATE EXTENSION

Default parameter settings

postgres=# show rdkit.tanimoto_threshold ;
 rdkit.tanimoto_threshold 
--------------------------
 0.5
(1 row)

postgres=# show rdkit.dice_threshold;
 rdkit.dice_threshold 
----------------------
 0.5
(1 row)

Indexes supported

B-tree and hash indexes are supported for comparison computing operations that are based on the mol and fp data types. Examples:
```
CREATE INDEX molidx ON pgmol (mol);
CREATE INDEX molidx ON pgmol (fp);
```
GiST indexes are supported for the following operations that are based on the mol and fp data types: "mol % mol", "mol # mol", "mol @> mol", "mol <@ mol", "fp % fp", and "fp # fp." Example:
```
CREATE INDEX molidx ON pgmol USING gist (mol);
```

Sample functions

The tanimoto_sml function calculates the degree of similarity based on the Tanimoto coefficient.

postgres=# \df tanimoto_sml
                           List of functions
 Schema |     Name     | Result data type | Argument data types | Type 
--------+--------------+------------------+---------------------+------
 public | tanimoto_sml | double precision | bfp, bfp            | func
 public | tanimoto_sml | double precision | sfp, sfp            | func
(2 rows)

The dice_sml function calculates the degree of similarity based on the Dice coefficient.

postgres=# \df dice_sml
                         List of functions
 Schema |   Name   | Result data type | Argument data types | Type 
--------+----------+------------------+---------------------+------
 public | dice_sml | double precision | bfp, bfp            | func
 public | dice_sml | double precision | sfp, sfp            | func
(2 rows)

If the second argument is a substructure of the first argument, the substruct function returns the TRUE value.

postgres=# \df substruct
                         List of functions
 Schema |   Name    | Result data type | Argument data types | Type 
--------+-----------+------------------+---------------------+------
 public | substruct | boolean          | mol, mol            | func
 public | substruct | boolean          | mol, qmol           | func
 public | substruct | boolean          | reaction, reaction  | func
(3 rows)

Basic operations

mol % mol and fp % fp
If the degree of similarity that is calculated based on the Tanimoto coefficient is less than the value of the rdkit.tanimoto_threshold GUC variable, the TRUE value is returned.
mol # mol and fp # fp
If the degree of similarity that is calculated based on the Dice coefficient is less than the value of the rdkit.dice_threshold GUC variable, the TRUE value is returned.
mol @> mol
If the left operand contains the right operand, the TRUE value is returned.
mol <@ mol
If the right operand contains the left operand, the TRUE value is returned.