The pg_bigm extension in PolarDB for PostgreSQL creates a 2-gram Generalized Inverted Index (GIN) index that is used to accelerate full-text search.
Prerequisites
A PolarDB for PostgreSQL cluster that runs one of the following engine versions is created:
PostgreSQL 14 (revision version 14.5.2.0 or later)
PostgreSQL 11 (revision version 1.1.28 or later)
You can execute one of the following statements to view the revision version of a PolarDB for PostgreSQL cluster:
PostgreSQL 14
SELECT version();PostgreSQL 11
SHOW polar_version;
Comparison between the pg_bigm extension and the pg_trgm extension
The pg_trgm extension in PolarDB for PostgreSQL uses the 3-gram model to implement full-text search. The pg_bigm extension is developed based on the pg_trgm extension. The following table describes the differences between the two extensions.
Functionality | pg_trgm | pg_bigm |
Phrase matching model | 3-gram | 2-gram |
Index types | GIN and Generalized Search Tree (GiST) | GIN |
Operators |
|
|
Non-alphabet full-text search | Not supported | Supported |
Full-text search by using keywords that contain 1 to 2 characters | Slow | Fast |
Similarity search | Supported | Supported |
Maximum size of an indexed column | 238,609,291 bytes (approximately 228 MB) | 107,374,180 bytes (approximately 102 MB) |
Usage notes
The size of the column on which you create a GIN index cannot exceed 107,374,180 bytes, which is approximately 102 MB. Sample statement:
CREATE TABLE t1 (description text); CREATE INDEX t1_idx ON t1 USING gin (description gin_bigm_ops); INSERT INTO t1 SELECT repeat('A', 107374181);If the data in your PolarDB cluster is not encoded in the ASCII format, we recommend that you change the encoding format to UTF-8. Execute the following statement to query the encoding format of the current database:
SELECT pg_encoding_to_char(encoding) FROM pg_database WHERE datname = current_database();
Basic operations
Create the pg_bigm extension.
CREATE EXTENSION pg_bigm;When you create a GIN index, you must specify the operators provided by the
pg_bigmextension.CREATE TABLE pg_tools (tool text, description text); INSERT INTO pg_tools VALUES ('pg_hint_plan', 'Tool that allows a user to specify an optimizer HINT to PostgreSQL'); INSERT INTO pg_tools VALUES ('pg_dbms_stats', 'Tool that allows a user to stabilize planner statistics in PostgreSQL'); INSERT INTO pg_tools VALUES ('pg_bigm', 'Tool that provides 2-gram full text search capability in PostgreSQL'); INSERT INTO pg_tools VALUES ('pg_trgm', 'Tool that provides 3-gram full text search capability in PostgreSQL'); CREATE INDEX pg_tools_idx ON pg_tools USING gin (description gin_bigm_ops); CREATE INDEX pg_tools_multi_idx ON pg_tools USING gin (tool gin_bigm_ops, description gin_bigm_ops) WITH (FASTUPDATE = off);Perform full-text search.
SELECT * FROM pg_tools WHERE description LIKE '%search%';Sample result:
tool | description ---------+--------------------------------------------------------------------- pg_bigm | Tool that provides 2-gram full text search capability in PostgreSQL pg_trgm | Tool that provides 3-gram full text search capability in PostgreSQL (2 rows)Use the
=%operator to perform similarity search.SELECT tool FROM pg_tools WHERE tool =% 'bigm';Sample result:
tool --------- pg_bigm (1 row)Delete the pg_bigm extension.
DROP EXTENSION pg_bigm;
Basic functions
likequery
Purpose: generates a string that can be identified based on the LIKE keyword.
Request parameters: one request parameter of the STRING type.
Return value: a string that can be identified based on the LIKE keyword.
Implementation:
Add a percent sign (
%) before and after the keyword.Use a backward slash (
\) to escape the percent sign (%).
Examples:
SELECT likequery('pg_bigm has improved the full text search performance by 200%');Sample result:
likequery ------------------------------------------------------------------- %pg\_bigm has improved the full text search performance by 200\%% (1 row)SELECT * FROM pg_tools WHERE description LIKE likequery('search');Sample result:
tool | description ---------+--------------------------------------------------------------------- pg_bigm | Tool that provides 2-gram full text search capability in PostgreSQL pg_trgm | Tool that provides 3-gram full text search capability in PostgreSQL (2 rows)
show_bigm
Purpose: obtains all 2-gram elements of a string.
Request parameters: one request parameter of the STRING type.
Return value: an array that consists of all 2-gram elements of a string.
Implementation:
Add a space before and after the string.
Identify all 2-gram elements in the string.
Example:
SELECT show_bigm('full text search');Sample result:
show_bigm ------------------------------------------------------------------ {" f"," s"," t",ar,ch,ea,ex,fu,"h ","l ",ll,rc,se,"t ",te,ul,xt} (1 row)
bigm_similarity
Purpose: obtains the similarity between two strings.
Request parameters: two request parameters of the STRING type.
Return value: a floating-point number that indicates the similarity between the two strings.
Implementation:
Identify the 2-gram elements that are included in both of the two strings.
The return value ranges from 0 to 1. A value of 0 indicates that the two strings are different. A value of 1 indicates that the two strings are the same.
NoteThis function adds a space before and after each string. Therefore, the similarity between the
ABCstring and theBstring is 0, and the similarity between theABCstring and theAstring is 0.25.This function is case-sensitive. For example, this function determines that the similarity between the
ABCstring and theabcstring is 0.
Examples:
SELECT bigm_similarity('full text search', 'text similarity search');Sample result:
bigm_similarity ----------------- 0.571429 (1 row)SELECT bigm_similarity('ABC', 'A');Sample result:
bigm_similarity ----------------- 0.25 (1 row)SELECT bigm_similarity('ABC', 'B');Sample result:
bigm_similarity ----------------- 0 (1 row)SELECT bigm_similarity('ABC', 'abc');Sample result:
bigm_similarity ----------------- 0 (1 row)
pg_gin_pending_stats
Purpose: obtains the number of pages and the number of tuples in the pending list of a GIN index.
Request parameters: one parameter that specifies the name or OID of the GIN index.
Return value: the number of pages and the number of tuples in the pending list of the GIN index.
NoteIf you set the FASTUPDATE parameter to False for a GIN index, the GIN index does not have a pending list. In this case, this function returns two values, 0 and 0.
Example:
SELECT * FROM pg_gin_pending_stats('pg_tools_idx');Sample result:
pages | tuples -------+-------- 0 | 0 (1 row)
Behavior control parameters
pg_bigm.enable_recheck
This parameter specifies whether to perform a recheck.
NoteWe recommend that you use the default value on. This way, you can obtain accurate query results.
Examples:
Prepare test data.
CREATE TABLE tbl (doc text); INSERT INTO tbl VALUES('He is awaiting trial'); INSERT INTO tbl VALUES('It was a trivial mistake'); CREATE INDEX tbl_idx ON tbl USING gin (doc gin_bigm_ops);Execute the following statements.
Use the default value of the pg_bigm.enable_recheck parameter to perform a recheck.
SET enable_seqscan TO off; EXPLAIN ANALYZE SELECT * FROM tbl WHERE doc LIKE likequery('trial');Sample result:
QUERY PLAN ----------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on tbl (cost=20.00..24.01 rows=1 width=32) (actual time=0.020..0.021 rows=1 loops=1) Recheck Cond: (doc ~~ '%trial%'::text) Rows Removed by Index Recheck: 1 Heap Blocks: exact=1 -> Bitmap Index Scan on tbl_idx (cost=0.00..20.00 rows=1 width=0) (actual time=0.013..0.013 rows=2 loops=1) Index Cond: (doc ~~ '%trial%'::text) Planning Time: 0.117 ms Execution Time: 0.043 ms (8 rows)Execute the following statement:
SELECT * FROM tbl WHERE doc LIKE likequery('trial');Sample result:
doc ---------------------- He is awaiting trial (1 row)Set the pg_bigm.enable_recheck parameter to off to prevent performing a recheck.
SET pg_bigm.enable_recheck = off; SELECT * FROM tbl WHERE doc LIKE likequery('trial');Sample result:
doc -------------------------- He is awaiting trial It was a trivial mistake (2 rows)
pg_bigm.gin_key_limit
This parameter specifies the maximum number of 2-gram elements that can be used to perform full-text search. The default value is 0, which specifies that all 2-gram elements are used.
NoteIf query performance decreases due to the use of all 2-gram elements, you can decrease the value of this parameter.
pg_bigm.similarity_limit
This parameter specifies the threshold for similarity. The tuples between which the similarity exceeds the specified threshold are returned as similarity search results.