This topic describes how to use the RDKit plug-in of ApsaraDB RDS for PostgreSQL to implement functions such as molecular computing and search.
Prerequisites
Background information
RDKit supports two data types: the mol data type that is used to describe molecular types, and the fp data type that is used to describe molecular fingerprints. It allows for comparison computing, similarity computing based on the Tanimoto and Dice coefficients, and GiST indexing.
For more information about the SQL statements that are supported by RDKit, visit RDKit SQL.
Precautions
- Input and output functions based on the mol data type comply with the simplified molecular input line entry specification (SMILES).
- Input and output functions based on the fp data type comply with the bytea format that is used to store binary data.
Create the RDKit plug-in
postgres=# create extension rdkit ;
CREATE EXTENSION
Default parameter settings
postgres=# show rdkit.tanimoto_threshold ;
rdkit.tanimoto_threshold
--------------------------
0.5
(1 row)
postgres=# show rdkit.dice_threshold;
rdkit.dice_threshold
----------------------
0.5
(1 row)
Indexes supported
- B-tree and hash indexes are supported for comparison computing operations that are
based on the mol and fp data types. Examples:
CREATE INDEX molidx ON pgmol (mol); CREATE INDEX molidx ON pgmol (fp);
- GiST indexes are supported for the following operations that are based on the mol
and fp data types: "mol % mol", "mol # mol", "mol @> mol", "mol <@ mol", "fp % fp",
and "fp # fp." Example:
CREATE INDEX molidx ON pgmol USING gist (mol);
Sample functions
- The tanimoto_sml function calculates the degree of similarity based on the Tanimoto
coefficient.
postgres=# \df tanimoto_sml List of functions Schema | Name | Result data type | Argument data types | Type --------+--------------+------------------+---------------------+------ public | tanimoto_sml | double precision | bfp, bfp | func public | tanimoto_sml | double precision | sfp, sfp | func (2 rows)
- The dice_sml function calculates the degree of similarity based on the Dice coefficient.
postgres=# \df dice_sml List of functions Schema | Name | Result data type | Argument data types | Type --------+----------+------------------+---------------------+------ public | dice_sml | double precision | bfp, bfp | func public | dice_sml | double precision | sfp, sfp | func (2 rows)
- If the second argument is a substructure of the first argument, the substruct function
returns the TRUE value.
postgres=# \df substruct List of functions Schema | Name | Result data type | Argument data types | Type --------+-----------+------------------+---------------------+------ public | substruct | boolean | mol, mol | func public | substruct | boolean | mol, qmol | func public | substruct | boolean | reaction, reaction | func (3 rows)
Basic operations
mol % mol
andfp % fp
If the degree of similarity that is calculated based on the Tanimoto coefficient is less than the value of the rdkit.tanimoto_threshold GUC variable, the TRUE value is returned.
mol # mol
andfp # fp
If the degree of similarity that is calculated based on the Dice coefficient is less than the value of the rdkit.dice_threshold GUC variable, the TRUE value is returned.
mol @> mol
If the left operand contains the right operand, the TRUE value is returned.
mol <@ mol
If the right operand contains the left operand, the TRUE value is returned.