The pg_trgm extension provides functions and operators that can be used to determine the similarity of text and index operators that support fast search for similar strings. You can use database search tools or indexes to accelerate fuzzy searches for text.
Introduction
The pg_trgm extension introduces the trigram concept. A trigram consists of three consecutive characters that are taken from a string. pg_trgm extracts trigrams that consist of three characters. A trigram that has less than three characters contains a prefix of up to two spaces or a suffix of only one space. Example:
postgres=# SELECT show_trgm('abc');
show_trgm
-------------------------
{" a"," ab","abc","bc "}
(1 row)
You can measure the similarity between two strings by counting the number of trigrams that the strings share. This idea is very effective for measuring the similarity of words in many natural languages. This topic describes how to use the pg_trgm extension.
Usage notes
AnalyticDB for PostgreSQL V6.0 instances of V6.3.8.9 or later support the pg_trgm extension.
AnalyticDB for PostgreSQL V7.0 instances of V7.0.2 or later support the pg_trgm extension.
For information about how to view the minor version of an AnalyticDB for PostgreSQL instance, see View the minor engine version.
Install the extension
Before you use the pg_trgm extension to perform text similarity searches in an AnalyticDB for PostgreSQL instance, install the pg_trgm extension on the Extensions page of the instance. For more information, see Install, update, and uninstall extensions.
Use trigram matching to perform a text similarity search
After you install the pg_trgm extension, you can use trigram matching to measure the similarity of text and sort search results in an order from the best match to the worst match. For example, create a test table and insert data into the table.
CREATE TABLE test_trgm (t text);
INSERT INTO test_trgm values('word'), ('This is a pg_trgm test'), ('word test'), ('w0rd'), ('test word');
You can use one of the following methods to perform a text similarity search based on trigram matching:
Method 1: Use the similarity() function and the
%
operator in the pg_trgm extension. For example, query the similarity between the values of column t and the stringword
. For more information about the similarity() function and the%
operator, see the "Appendixes" section of this topic.SELECT t, similarity(t, 'word') AS sml FROM test_trgm WHERE t % 'word' ORDER BY sml DESC, t;
The column values that contain
word
are displayed in descending order of similarity.t | sml -----------+----- word | 1 test word | 0.5 word test | 0.5 (3 rows)
Method 2: Use the
<->
operator in the pg_trgm extension. For example, query the distance between the values of column t and the string word. The distance is the opposite of similarity. The column values are displayed in order of distance from the nearest to the farthest. For more information about the<->
operator, see the "Appendixes" section of this topic.postgres=# SELECT t, t <-> 'word' AS dist FROM test_trgm ORDER BY dist LIMIT 10; t | dist ------------------------+------ word | 0 word test | 0.5 test word | 0.5 w0rd | 0.75 This is a pg_trgm test | 1 (5 rows)
Use indexes to accelerate fuzzy searches
Before the pg_trgm extension became available, the LIKE operator can be used to perform fuzzy searches. However, the LIKE operator does not support indexes and provides poor query performance. The pg_trgm extension provides Generalized Search Tree (GiST) and Generalized Inverted Index (GIN) index operators that allow you to create an index for a text column to accelerate similarity searches. The following example describes how to use the pg_trgm extension and the GIN index to accelerate fuzzy searches:
Create a test table and insert data into the table.
CREATE TABLE test_trgm (t text);
INSERT INTO test_trgm
SELECT md5(random()::text) FROM generate_series(1,1000000) i;
Use the EXPLAIN ANALYZE statement to perform a fuzzy search on the table. The execution plan shows that sequential scans are performed on the entire table. This results in poor query performance in scenarios that involve large amounts of data.
postgres=# explain analyze SELECT * FROM test_trgm WHERE t LIKE '%abcd%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Gather Motion 3:1 (slice1; segments: 3) (cost=0.00..259.63 rows=422 width=32) (actual time=137.606..179.674 rows=431 loops=1)
-> Seq Scan on test_trgm (cost=0.00..254.00 rows=141 width=32) (actual time=0.961..136.977 rows=146 loops=1)
Filter: (t ~~ '%abcd%'::text)
Rows Removed by Filter: 333458
Optimizer: Postgres-based planner
Planning Time: 0.328 ms
(slice0) Executor memory: 37K bytes.
(slice1) Executor memory: 36K bytes avg x 3 workers, 36K bytes max (seg0).
Memory used: 128000kB
Execution Time: 180.533 ms
(10 rows)
Use the gin_trgm_ops
index operator in the pg_trgm extension to create a GIN index. For more information about the gin_trgm_ops
index operator, see the "Appendixes" section of this topic.
CREATE INDEX trgm_idx ON test_trgm USING GIN (t gin_trgm_ops);
Use the EXPLAIN ANALYZE statement to perform a fuzzy search on the table. The execution plan shows that bitmap index scans are performed. The query performance is significantly improved.
postgres=# explain analyze SELECT * FROM test_trgm WHERE t LIKE '%abcd%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Gather Motion 3:1 (slice1; segments: 3) (cost=48.67..882.67 rows=8000 width=32) (actual time=4.353..4.550 rows=431 loops=1)
-> Bitmap Heap Scan on test_trgm (cost=48.67..776.00 rows=2667 width=32) (actual time=1.612..3.091 rows=146 loops=1)
Recheck Cond: (t ~~ '%abcd%'::text)
Rows Removed by Index Recheck: 10
-> Bitmap Index Scan on trgm_idx (cost=0.00..48.00 rows=2667 width=0) (actual time=1.536..1.537 rows=163 loops=1)
Index Cond: (t ~~ '%abcd%'::text)
Optimizer: Postgres-based planner
Planning Time: 1.353 ms
(slice0) Executor memory: 44K bytes.
(slice1) Executor memory: 2438K bytes avg x 3 workers, 2438K bytes max (seg0).
Memory used: 128000kB
Execution Time: 5.385 ms
(12 rows)
Appendixes
Grand Unified Configuration (GUC) parameters
pg_trgm.similarity_threshold
Specifies the current similarity threshold that is used by the %
operator. The threshold must be between 0 and 1. Default value: 0.3.
pg_trgm.word_similarity_threshold
Specifies the current word similarity threshold that is used by the <%
and %>
operators. The threshold must be between 0 and 1. Default value: 0.6.
Functions
Function | Type of the return value | Description |
similarity(text, text) | real | Returns a number that indicates the similarity between the two strings. The number ranges from 0 to 1.
|
show_trgm(text) | text[] | Returns an array of all trigrams in the string. In actual scenarios, this function is seldom useful except for debugging. |
word_similarity(text, text) | real | Returns a number that indicates the greatest similarity between trigrams in two strings. The number ranges from 0 to 1.
|
Operators
Operator | Type of the return value | Description |
text % text | boolean | Returns true if the result of the function is greater than the threshold that is specified by the |
text <% text | boolean | Returns true if the similarity between the trigram set in the first string and a continuous extent of an ordered trigram set in the second string is greater than the threshold that is specified by the |
text %> text | boolean | The commutator of the |
text <-> text | real | Returns the distance between two strings. The distance is one minus the |
text <<-> text | real | Returns the distance between two strings. The distance is one minus the |
text <->> text | real | The commutator of the |
Index operators
Operator | Description |
gist_trgm_ops | Converts text data into a trigram set and uses a GiST index to store the trigram set. |
gin_trgm_ops | Converts text data into a trigram set and uses a GIN index to store the trigram set. |