All Products
Search
Document Center

PolarDB:pg_bigm

Last Updated:Oct 18, 2024

The pg_bigm extension in PolarDB for PostgreSQL creates a 2-gram Generalized Inverted Index (GIN) index that is used to accelerate full-text search.

Prerequisites

A PolarDB for PostgreSQL cluster that runs one of the following engine versions is created:

  • PostgreSQL 14 (revision version 14.5.2.0 or later)

  • PostgreSQL 11 (revision version 1.1.28 or later)

Note

You can execute one of the following statements to view the revision version of a PolarDB for PostgreSQL cluster:

  • PostgreSQL 14

    SELECT version();
  • PostgreSQL 11

    SHOW polar_version;

Comparison between the pg_bigm extension and the pg_trgm extension

The pg_trgm extension in PolarDB for PostgreSQL uses the 3-gram model to implement full-text search. The pg_bigm extension is developed based on the pg_trgm extension. The following table describes the differences between the two extensions.

Functionality

pg_trgm

pg_bigm

Phrase matching model

3-gram

2-gram

Index types

GIN and Generalized Search Tree (GiST)

GIN

Operators

LIKE, ILIKE, ~, and ~*

LIKE

Non-alphabet full-text search

Not supported

Supported

Full-text search by using keywords that contain 1 to 2 characters

Slow

Fast

Similarity search

Supported

Supported

Maximum size of an indexed column

238,609,291 bytes (approximately 228 MB)

107,374,180 bytes (approximately 102 MB)

Usage notes

  • The size of the column on which you create a GIN index cannot exceed 107,374,180 bytes, which is approximately 102 MB. Sample statement:

    CREATE TABLE t1 (description text);
    
    CREATE INDEX t1_idx ON t1 USING gin (description gin_bigm_ops);
    
    INSERT INTO t1 SELECT repeat('A', 107374181);
  • If the data in your PolarDB cluster is not encoded in the ASCII format, we recommend that you change the encoding format to UTF-8. Execute the following statement to query the encoding format of the current database:

    SELECT pg_encoding_to_char(encoding)
    FROM pg_database
    WHERE datname = current_database();

Basic operations

  • Create the pg_bigm extension.

    CREATE EXTENSION pg_bigm;
  • When you create a GIN index, you must specify the operators provided by the pg_bigm extension.

    CREATE TABLE pg_tools (tool text, description text);
    
    INSERT INTO pg_tools VALUES ('pg_hint_plan', 'Tool that allows a user to specify an optimizer HINT to PostgreSQL');
    INSERT INTO pg_tools VALUES ('pg_dbms_stats', 'Tool that allows a user to stabilize planner statistics in PostgreSQL');
    INSERT INTO pg_tools VALUES ('pg_bigm', 'Tool that provides 2-gram full text search capability in PostgreSQL');
    INSERT INTO pg_tools VALUES ('pg_trgm', 'Tool that provides 3-gram full text search capability in PostgreSQL');
    
    CREATE INDEX pg_tools_idx ON pg_tools USING gin (description gin_bigm_ops);
    CREATE INDEX pg_tools_multi_idx ON pg_tools USING gin (tool gin_bigm_ops, description gin_bigm_ops) WITH (FASTUPDATE = off);
  • Perform full-text search.

    SELECT * FROM pg_tools WHERE description LIKE '%search%';

    Sample result:

      tool   |                             description
    ---------+---------------------------------------------------------------------
     pg_bigm | Tool that provides 2-gram full text search capability in PostgreSQL
     pg_trgm | Tool that provides 3-gram full text search capability in PostgreSQL
    (2 rows)
  • Use the =% operator to perform similarity search.

    SELECT tool FROM pg_tools WHERE tool =% 'bigm';

    Sample result:

      tool   
    ---------
     pg_bigm
    (1 row)
  • Delete the pg_bigm extension.

    DROP EXTENSION pg_bigm;

Basic functions

  • likequery

    • Purpose: generates a string that can be identified based on the LIKE keyword.

    • Request parameters: one request parameter of the STRING type.

    • Return value: a string that can be identified based on the LIKE keyword.

    • Implementation:

      • Add a percent sign (%) before and after the keyword.

      • Use a backward slash (\) to escape the percent sign (%).

    • Examples:

      • SELECT likequery('pg_bigm has improved the full text search performance by 200%');

        Sample result:

                                     likequery
        -------------------------------------------------------------------
         %pg\_bigm has improved the full text search performance by 200\%%
        (1 row)
      • SELECT * FROM pg_tools WHERE description LIKE likequery('search');

        Sample result:

          tool   |                             description
        ---------+---------------------------------------------------------------------
         pg_bigm | Tool that provides 2-gram full text search capability in PostgreSQL
         pg_trgm | Tool that provides 3-gram full text search capability in PostgreSQL
        (2 rows)
  • show_bigm

    • Purpose: obtains all 2-gram elements of a string.

    • Request parameters: one request parameter of the STRING type.

    • Return value: an array that consists of all 2-gram elements of a string.

    • Implementation:

      • Add a space before and after the string.

      • Identify all 2-gram elements in the string.

    • Example:

      SELECT show_bigm('full text search');

      Sample result:

                                  show_bigm
      ------------------------------------------------------------------
       {" f"," s"," t",ar,ch,ea,ex,fu,"h ","l ",ll,rc,se,"t ",te,ul,xt}
      (1 row)
  • bigm_similarity

    • Purpose: obtains the similarity between two strings.

    • Request parameters: two request parameters of the STRING type.

    • Return value: a floating-point number that indicates the similarity between the two strings.

    • Implementation:

      • Identify the 2-gram elements that are included in both of the two strings.

      • The return value ranges from 0 to 1. A value of 0 indicates that the two strings are different. A value of 1 indicates that the two strings are the same.

      Note
      • This function adds a space before and after each string. Therefore, the similarity between the ABC string and the B string is 0, and the similarity between the ABC string and the A string is 0.25.

      • This function is case-sensitive. For example, this function determines that the similarity between the ABC string and the abc string is 0.

    • Examples:

      • SELECT bigm_similarity('full text search', 'text similarity search');

        Sample result:

         bigm_similarity 
        -----------------
                0.571429
        (1 row)
      • SELECT bigm_similarity('ABC', 'A');

        Sample result:

         bigm_similarity
        -----------------
                    0.25
        (1 row)
      • SELECT bigm_similarity('ABC', 'B');

        Sample result:

         bigm_similarity
        -----------------
                       0
        (1 row)
      • SELECT bigm_similarity('ABC', 'abc');

        Sample result:

         bigm_similarity
        -----------------
                       0
        (1 row)
  • pg_gin_pending_stats

    • Purpose: obtains the number of pages and the number of tuples in the pending list of a GIN index.

    • Request parameters: one parameter that specifies the name or OID of the GIN index.

    • Return value: the number of pages and the number of tuples in the pending list of the GIN index.

      Note

      If you set the FASTUPDATE parameter to False for a GIN index, the GIN index does not have a pending list. In this case, this function returns two values, 0 and 0.

    • Example:

      SELECT * FROM pg_gin_pending_stats('pg_tools_idx');

      Sample result:

       pages | tuples
      -------+--------
           0 |      0
      (1 row)

Behavior control parameters

  • pg_bigm.enable_recheck

    This parameter specifies whether to perform a recheck.

    Note

    We recommend that you use the default value on. This way, you can obtain accurate query results.

    Examples:

    1. Prepare test data.

      CREATE TABLE tbl (doc text);
      
      INSERT INTO tbl VALUES('He is awaiting trial');
      INSERT INTO tbl VALUES('It was a trivial mistake');
      
      CREATE INDEX tbl_idx ON tbl USING gin (doc gin_bigm_ops);
    2. Execute the following statements.

      • Use the default value of the pg_bigm.enable_recheck parameter to perform a recheck.

        SET enable_seqscan TO off;
        
        EXPLAIN ANALYZE SELECT * FROM tbl WHERE doc LIKE likequery('trial');

        Sample result:

                                                           QUERY PLAN
        -----------------------------------------------------------------------------------------------------------------
         Bitmap Heap Scan on tbl  (cost=20.00..24.01 rows=1 width=32) (actual time=0.020..0.021 rows=1 loops=1)
           Recheck Cond: (doc ~~ '%trial%'::text)
           Rows Removed by Index Recheck: 1
           Heap Blocks: exact=1
           ->  Bitmap Index Scan on tbl_idx  (cost=0.00..20.00 rows=1 width=0) (actual time=0.013..0.013 rows=2 loops=1)
                 Index Cond: (doc ~~ '%trial%'::text)
         Planning Time: 0.117 ms
         Execution Time: 0.043 ms
        (8 rows)

        Execute the following statement:

        SELECT * FROM tbl WHERE doc LIKE likequery('trial');

        Sample result:

                 doc
        ----------------------
         He is awaiting trial
        (1 row)
      • Set the pg_bigm.enable_recheck parameter to off to prevent performing a recheck.

        SET pg_bigm.enable_recheck = off;
        
        SELECT * FROM tbl WHERE doc LIKE likequery('trial');

        Sample result:

                   doc
        --------------------------
         He is awaiting trial
         It was a trivial mistake
        (2 rows)
  • pg_bigm.gin_key_limit

    This parameter specifies the maximum number of 2-gram elements that can be used to perform full-text search. The default value is 0, which specifies that all 2-gram elements are used.

    Note

    If query performance decreases due to the use of all 2-gram elements, you can decrease the value of this parameter.

  • pg_bigm.similarity_limit

    This parameter specifies the threshold for similarity. The tuples between which the similarity exceeds the specified threshold are returned as similarity search results.