Use the pg_trgm extension to perform a text similarity search or a fuzzy search - AnalyticDB

The pg_trgm extension provides functions and operators that can be used to determine the similarity of text and index operators that support fast search for similar strings. You can use database search tools or indexes to accelerate fuzzy searches for text.

Introduction

The pg_trgm extension introduces the trigram concept. A trigram consists of three consecutive characters that are taken from a string. pg_trgm extracts trigrams that consist of three characters. A trigram that has less than three characters contains a prefix of up to two spaces or a suffix of only one space. Example:

postgres=# SELECT show_trgm('abc');
        show_trgm
-------------------------
 {"  a"," ab","abc","bc "}
(1 row)

You can measure the similarity between two strings by counting the number of trigrams that the strings share. This idea is very effective for measuring the similarity of words in many natural languages. This topic describes how to use the pg_trgm extension.

Usage notes

AnalyticDB for PostgreSQL V6.0 instances of V6.3.8.9 or later support the pg_trgm extension.
AnalyticDB for PostgreSQL V7.0 instances of V7.0.2 or later support the pg_trgm extension.

Note

For information about how to view the minor version of an AnalyticDB for PostgreSQL instance, see View the minor engine version.

Install the extension

Before you use the pg_trgm extension to perform text similarity searches in an AnalyticDB for PostgreSQL instance, install the pg_trgm extension on the Extensions page of the instance. For more information, see Install, update, and uninstall extensions.

Use trigram matching to perform a text similarity search

After you install the pg_trgm extension, you can use trigram matching to measure the similarity of text and sort search results in an order from the best match to the worst match. For example, create a test table and insert data into the table.

CREATE TABLE test_trgm (t text);
INSERT INTO test_trgm values('word'), ('This is a pg_trgm test'), ('word test'), ('w0rd'), ('test word');

You can use one of the following methods to perform a text similarity search based on trigram matching:

Method 1: Use the similarity() function and the % operator in the pg_trgm extension. For example, query the similarity between the values of column t and the string word. For more information about the similarity() function and the % operator, see the "Appendixes" section of this topic.
```
SELECT t, similarity(t, 'word') AS sml
  FROM test_trgm
  WHERE t % 'word'
  ORDER BY sml DESC, t;
```
The column values that contain word are displayed in descending order of similarity.
```
     t     | sml
-----------+-----
 word      |   1
 test word | 0.5
 word test | 0.5
(3 rows)
```
Method 2: Use the <-> operator in the pg_trgm extension. For example, query the distance between the values of column t and the string word. The distance is the opposite of similarity. The column values are displayed in order of distance from the nearest to the farthest. For more information about the <-> operator, see the "Appendixes" section of this topic.
```
postgres=# SELECT t, t <-> 'word' AS dist
  FROM test_trgm
  ORDER BY dist LIMIT 10;
           t            | dist
------------------------+------
 word                   |    0
 word test              |  0.5
 test word              |  0.5
 w0rd                   | 0.75
 This is a pg_trgm test |    1
(5 rows)
```

Use indexes to accelerate fuzzy searches

Before the pg_trgm extension became available, the LIKE operator can be used to perform fuzzy searches. However, the LIKE operator does not support indexes and provides poor query performance. The pg_trgm extension provides Generalized Search Tree (GiST) and Generalized Inverted Index (GIN) index operators that allow you to create an index for a text column to accelerate similarity searches. The following example describes how to use the pg_trgm extension and the GIN index to accelerate fuzzy searches:

Create a test table and insert data into the table.

CREATE TABLE test_trgm (t text);
INSERT INTO test_trgm 
SELECT md5(random()::text) FROM generate_series(1,1000000) i;

Use the EXPLAIN ANALYZE statement to perform a fuzzy search on the table. The execution plan shows that sequential scans are performed on the entire table. This results in poor query performance in scenarios that involve large amounts of data.

postgres=# explain analyze  SELECT * FROM test_trgm WHERE t LIKE '%abcd%';
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..259.63 rows=422 width=32) (actual time=137.606..179.674 rows=431 loops=1)
   ->  Seq Scan on test_trgm  (cost=0.00..254.00 rows=141 width=32) (actual time=0.961..136.977 rows=146 loops=1)
         Filter: (t ~~ '%abcd%'::text)
         Rows Removed by Filter: 333458
 Optimizer: Postgres-based planner
 Planning Time: 0.328 ms
   (slice0)    Executor memory: 37K bytes.
   (slice1)    Executor memory: 36K bytes avg x 3 workers, 36K bytes max (seg0).
 Memory used:  128000kB
 Execution Time: 180.533 ms
(10 rows)

Use the gin_trgm_ops index operator in the pg_trgm extension to create a GIN index. For more information about the gin_trgm_ops index operator, see the "Appendixes" section of this topic.

CREATE INDEX trgm_idx ON test_trgm USING GIN (t gin_trgm_ops);

Use the EXPLAIN ANALYZE statement to perform a fuzzy search on the table. The execution plan shows that bitmap index scans are performed. The query performance is significantly improved.

postgres=# explain analyze  SELECT * FROM test_trgm WHERE t LIKE '%abcd%';
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=48.67..882.67 rows=8000 width=32) (actual time=4.353..4.550 rows=431 loops=1)
   ->  Bitmap Heap Scan on test_trgm  (cost=48.67..776.00 rows=2667 width=32) (actual time=1.612..3.091 rows=146 loops=1)
         Recheck Cond: (t ~~ '%abcd%'::text)
         Rows Removed by Index Recheck: 10
         ->  Bitmap Index Scan on trgm_idx  (cost=0.00..48.00 rows=2667 width=0) (actual time=1.536..1.537 rows=163 loops=1)
               Index Cond: (t ~~ '%abcd%'::text)
 Optimizer: Postgres-based planner
 Planning Time: 1.353 ms
   (slice0)    Executor memory: 44K bytes.
   (slice1)    Executor memory: 2438K bytes avg x 3 workers, 2438K bytes max (seg0).
 Memory used:  128000kB
 Execution Time: 5.385 ms
(12 rows)

Appendixes

Grand Unified Configuration (GUC) parameters

pg_trgm.similarity_threshold

Specifies the current similarity threshold that is used by the % operator. The threshold must be between 0 and 1. Default value: 0.3.

pg_trgm.word_similarity_threshold

Specifies the current word similarity threshold that is used by the <% and %> operators. The threshold must be between 0 and 1. Default value: 0.6.

Functions

Function	Type of the return value	Description
similarity(text, text)	real	Returns a number that indicates the similarity between the two strings. The number ranges from 0 to 1. A value of 0 indicates that the two strings are completely dissimilar. A value of 1 indicates that the two strings are identical.
show_trgm(text)	text[]	Returns an array of all trigrams in the string. In actual scenarios, this function is seldom useful except for debugging.
word_similarity(text, text)	real	Returns a number that indicates the greatest similarity between trigrams in two strings. The number ranges from 0 to 1. A value of 0 indicates that trigrams in the two strings are completely dissimilar. A value of 1 indicates that the trigram set in the first string and a continuous extent of an ordered trigram set in the second string are identical.

Operators

Operator	Type of the return value	Description
text % text	boolean	Returns true if the result of the function is greater than the threshold that is specified by the `pg_trgm.similarity_threshold` parameter.
text <% text	boolean	Returns true if the similarity between the trigram set in the first string and a continuous extent of an ordered trigram set in the second string is greater than the threshold that is specified by the `pg_trgm.word_similarity_threshold` parameter.
text %> text	boolean	The commutator of the `<%` operator.
text <-> text	real	Returns the distance between two strings. The distance is one minus the `similarity()` function value.
text <<-> text	real	Returns the distance between two strings. The distance is one minus the `word_similarity()` function value.
text <->> text	real	The commutator of the `<<->` operator.

Index operators

Operator	Description
gist_trgm_ops	Converts text data into a trigram set and uses a GiST index to store the trigram set.
gin_trgm_ops	Converts text data into a trigram set and uses a GIN index to store the trigram set.