pg_bigm extension - PolarDB - Alibaba Cloud Documentation Center

The pg_bigm extension in PolarDB for PostgreSQL creates a 2-gram Generalized Inverted Index (GIN) index that is used to accelerate full-text search.

Prerequisites

A PolarDB for PostgreSQL cluster that runs one of the following engine versions is created:

PostgreSQL 14 (revision version 14.5.2.0 or later)
PostgreSQL 11 (revision version 1.1.28 or later)

Note

You can execute one of the following statements to view the revision version of a PolarDB for PostgreSQL cluster:

PostgreSQL 14
```
SELECT version();
```
PostgreSQL 11
```
SHOW polar_version;
```

Comparison between the pg_bigm extension and the pg_trgm extension

The pg_trgm extension in PolarDB for PostgreSQL uses the 3-gram model to implement full-text search. The pg_bigm extension is developed based on the pg_trgm extension. The following table describes the differences between the two extensions.

Functionality	pg_trgm	pg_bigm
Phrase matching model	3-gram	2-gram
Index types	GIN and Generalized Search Tree (GiST)	GIN
Operators	`LIKE`, `ILIKE`, `~`, and `~*`	`LIKE`
Non-alphabet full-text search	Not supported	Supported
Full-text search by using keywords that contain 1 to 2 characters	Slow	Fast
Similarity search	Supported	Supported
Maximum size of an indexed column	238,609,291 bytes (approximately 228 MB)	107,374,180 bytes (approximately 102 MB)

Usage notes

The size of the column on which you create a GIN index cannot exceed 107,374,180 bytes, which is approximately 102 MB. Sample statement:

CREATE TABLE t1 (description text);

CREATE INDEX t1_idx ON t1 USING gin (description gin_bigm_ops);

INSERT INTO t1 SELECT repeat('A', 107374181);

If the data in your PolarDB cluster is not encoded in the ASCII format, we recommend that you change the encoding format to UTF-8. Execute the following statement to query the encoding format of the current database:
```
SELECT pg_encoding_to_char(encoding)
FROM pg_database
WHERE datname = current_database();
```

Basic operations

Create the pg_bigm extension.
```
CREATE EXTENSION pg_bigm;
```

When you create a GIN index, you must specify the operators provided by the pg_bigm extension.

CREATE TABLE pg_tools (tool text, description text);

INSERT INTO pg_tools VALUES ('pg_hint_plan', 'Tool that allows a user to specify an optimizer HINT to PostgreSQL');
INSERT INTO pg_tools VALUES ('pg_dbms_stats', 'Tool that allows a user to stabilize planner statistics in PostgreSQL');
INSERT INTO pg_tools VALUES ('pg_bigm', 'Tool that provides 2-gram full text search capability in PostgreSQL');
INSERT INTO pg_tools VALUES ('pg_trgm', 'Tool that provides 3-gram full text search capability in PostgreSQL');

CREATE INDEX pg_tools_idx ON pg_tools USING gin (description gin_bigm_ops);
CREATE INDEX pg_tools_multi_idx ON pg_tools USING gin (tool gin_bigm_ops, description gin_bigm_ops) WITH (FASTUPDATE = off);

Perform full-text search.

SELECT * FROM pg_tools WHERE description LIKE '%search%';

Sample result:

  tool   |                             description
---------+---------------------------------------------------------------------
 pg_bigm | Tool that provides 2-gram full text search capability in PostgreSQL
 pg_trgm | Tool that provides 3-gram full text search capability in PostgreSQL
(2 rows)

Use the =% operator to perform similarity search.

SELECT tool FROM pg_tools WHERE tool =% 'bigm';

Sample result:

  tool   
---------
 pg_bigm
(1 row)

Delete the pg_bigm extension.
```
DROP EXTENSION pg_bigm;
```

Basic functions

likequery

Purpose: generates a string that can be identified based on the LIKE keyword.
Request parameters: one request parameter of the STRING type.
Return value: a string that can be identified based on the LIKE keyword.
Implementation:
- Add a percent sign (%) before and after the keyword.
- Use a backward slash (\) to escape the percent sign (%).

Examples:

SELECT likequery('pg_bigm has improved the full text search performance by 200%');

Sample result:

                             likequery
-------------------------------------------------------------------
 %pg\_bigm has improved the full text search performance by 200\%%
(1 row)

SELECT * FROM pg_tools WHERE description LIKE likequery('search');

Sample result:

  tool   |                             description
---------+---------------------------------------------------------------------
 pg_bigm | Tool that provides 2-gram full text search capability in PostgreSQL
 pg_trgm | Tool that provides 3-gram full text search capability in PostgreSQL
(2 rows)

show_bigm
- Purpose: obtains all 2-gram elements of a string.
- Request parameters: one request parameter of the STRING type.
- Return value: an array that consists of all 2-gram elements of a string.
- Implementation:
  - Add a space before and after the string.
  - Identify all 2-gram elements in the string.
- Example:
```
SELECT show_bigm('full text search');
```
  Sample result:
```
                            show_bigm
------------------------------------------------------------------
 {" f"," s"," t",ar,ch,ea,ex,fu,"h ","l ",ll,rc,se,"t ",te,ul,xt}
(1 row)
```
bigm_similarity
- Purpose: obtains the similarity between two strings.
- Request parameters: two request parameters of the STRING type.
- Return value: a floating-point number that indicates the similarity between the two strings.
- Implementation:
  - Identify the 2-gram elements that are included in both of the two strings.
  - The return value ranges from 0 to 1. A value of 0 indicates that the two strings are different. A value of 1 indicates that the two strings are the same.
  Note
  - This function adds a space before and after each string. Therefore, the similarity between the ABC string and the B string is 0, and the similarity between the ABC string and the A string is 0.25.
  - This function is case-sensitive. For example, this function determines that the similarity between the ABC string and the abc string is 0.
- Examples:
  - ```
  SELECT bigm_similarity('full text search', 'text similarity search');
```
  Sample result:
```
   bigm_similarity 
  -----------------
          0.571429
  (1 row)
```
- ```
SELECT bigm_similarity('ABC', 'A');
```
    Sample result:
```
 bigm_similarity
-----------------
            0.25
(1 row)
```
  - ```
  SELECT bigm_similarity('ABC', 'B');
```
  Sample result:
```
   bigm_similarity
  -----------------
                 0
  (1 row)
```
- ```
SELECT bigm_similarity('ABC', 'abc');
```
    Sample result:
```
 bigm_similarity
-----------------
               0
(1 row)
```
pg_gin_pending_stats
- Purpose: obtains the number of pages and the number of tuples in the pending list of a GIN index.
- Request parameters: one parameter that specifies the name or OID of the GIN index.
- Return value: the number of pages and the number of tuples in the pending list of the GIN index.
  Note
  If you set the FASTUPDATE parameter to False for a GIN index, the GIN index does not have a pending list. In this case, this function returns two values, 0 and 0.
- Example:
```
SELECT * FROM pg_gin_pending_stats('pg_tools_idx');
```
  Sample result:
```
 pages | tuples
-------+--------
     0 |      0
(1 row)
```

Behavior control parameters

pg_bigm.enable_recheck

This parameter specifies whether to perform a recheck.

Note

We recommend that you use the default value on. This way, you can obtain accurate query results.

Examples:

Prepare test data.

CREATE TABLE tbl (doc text);

INSERT INTO tbl VALUES('He is awaiting trial');
INSERT INTO tbl VALUES('It was a trivial mistake');

CREATE INDEX tbl_idx ON tbl USING gin (doc gin_bigm_ops);

Execute the following statements.

Use the default value of the pg_bigm.enable_recheck parameter to perform a recheck.

SET enable_seqscan TO off;

EXPLAIN ANALYZE SELECT * FROM tbl WHERE doc LIKE likequery('trial');

Sample result:

                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbl  (cost=20.00..24.01 rows=1 width=32) (actual time=0.020..0.021 rows=1 loops=1)
   Recheck Cond: (doc ~~ '%trial%'::text)
   Rows Removed by Index Recheck: 1
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on tbl_idx  (cost=0.00..20.00 rows=1 width=0) (actual time=0.013..0.013 rows=2 loops=1)
         Index Cond: (doc ~~ '%trial%'::text)
 Planning Time: 0.117 ms
 Execution Time: 0.043 ms
(8 rows)

Execute the following statement:

SELECT * FROM tbl WHERE doc LIKE likequery('trial');

Sample result:

         doc
----------------------
 He is awaiting trial
(1 row)

Set the pg_bigm.enable_recheck parameter to off to prevent performing a recheck.

SET pg_bigm.enable_recheck = off;

SELECT * FROM tbl WHERE doc LIKE likequery('trial');

Sample result:

           doc
--------------------------
 He is awaiting trial
 It was a trivial mistake
(2 rows)

pg_bigm.gin_key_limit
This parameter specifies the maximum number of 2-gram elements that can be used to perform full-text search. The default value is 0, which specifies that all 2-gram elements are used.
Note
If query performance decreases due to the use of all 2-gram elements, you can decrease the value of this parameter.
pg_bigm.similarity_limit
This parameter specifies the threshold for similarity. The tuples between which the similarity exceeds the specified threshold are returned as similarity search results.