All Products
Search
Document Center

PolarDB:RUM

Last Updated:Oct 27, 2024

The RUM extension expands general inverted indexes (GINs) and implements faster full-text search based on the GIN access method code.

Prerequisites

Your PolarDB for PostgreSQL cluster must meet the following requirements:

  • PostgreSQL 15 (revision version 15.7.1.1 or later).

  • PostgreSQL 14 (revision version 14.5.3.0 or later).

Note

You can execute the following statement to query the minor engine version of your PolarDB for PostgreSQL cluster:

SELECT version();

Background information

GINs support tsvector and tsquery data types to implement full-text search, but encounters the following problems:

  • Slow sorting: The location information of words is required for sorting. However, GINs do not store word locations. Therefore, an additional scan is required to retrieve the location information of words after the index scan.

  • Slow phrase queries: GINs require location information to perform phrase search.

  • Slow timestamp sorting: GINs do not store related information in indexes that contain morphemes. Therefore, an additional scan is required.

The RUM extension is based on GINs and can solve these problems by storing additional information (word location information or timestamp location information) in RUM indexes.

Note

However, the RUM extension requires more time than GIN to construct and insert indexes. This is because the RUM extension generates indexes based on WAL logs and the generated RUM indexes contain more information than the keys that are used for encryption.

Common RUM operators

The RUM extension provides the following operators.

Operator

Data type

Description

tsvector <=> tsquery

float4

Calculates the distance between tsvector and tsquery values.

timestamp <=> timestamp

float8

Calculates the distance between two timestamp values.

timestamp <=| timestamp

float8

Calculates the distance between values that are less than the current timestamp.

timestamp |=> timestamp

float8

Calculates the distance between values that are greater than the current timestamp.

The <=>, <=|, and |=> operators also apply to the following data types:

  • timestamptz

  • int2

  • int4

  • int8

  • float4

  • float8

  • money

  • oid

Use the RUM extension

Create the RUM extension

You can execute the following SQL statement to install the RUM extension.

CREATE EXTENSION rum;
Note

If the extension fails to be created when your cluster meets the version requirements, contact us.

Operator classes

  • The rum_tsvector_ops class stores tsvector phrases with location information, and supports sorting by the <=> operator and prefix search.

    Examples:

    1. Prepare data:

      CREATE TABLE test_rum(t text, a tsvector);
      CREATE TRIGGER tsvectorupdate
          BEFORE UPDATE OR INSERT ON test_rum
          FOR EACH ROW
          EXECUTE PROCEDURE tsvector_update_trigger('a', 'pg_catalog.english', 't');
      INSERT INTO test_rum(t) VALUES ('The situation is most beautiful');
      INSERT INTO test_rum(t) VALUES ('It is a beautiful');
      INSERT INTO test_rum(t) VALUES ('It looks like a beautiful place');
    2. Create RUM indexes:

      CREATE INDEX rumidx ON test_rum USING rum (a rum_tsvector_ops);
    3. Execute the following query statements:

      • Query statement 1:

        SELECT t, a <=> to_tsquery('english', 'beautiful | place') AS rank
        FROM test_rum
        WHERE a @@ to_tsquery('english', 'beautiful | place')
        ORDER BY a <=> to_tsquery('english', 'beautiful | place');

        Sample result:

                        t                |   rank
        ---------------------------------+----------
         It looks like a beautiful place |  8.22467
         The situation is most beautiful | 16.44934
         It is a beautiful               | 16.44934
        (3 rows)
      • Query statement 2:

        SELECT t, a <=> to_tsquery('english', 'place | situation') AS rank
        FROM test_rum
        WHERE a @@ to_tsquery('english', 'place | situation')
        ORDER BY a <=> to_tsquery('english', 'place | situation');

        Sample result:

                        t                |   rank
        ---------------------------------+----------
         The situation is most beautiful | 16.44934
         It looks like a beautiful place | 16.44934
        (2 rows)
  • The rum_tsvector_hash_ops class stores the hash values and location information of tsvector phrases. This class supports sorting by the <=> operator, but not prefix search.

    Note

    The rum_tsvector_hash_ops class supports sorting by the <=>, <=|, and |=> operators. It can be used with the rum_tsvector_addon_ops, rum_tsvector_hash_addon_ops, and rum_anyarray_addon_ops classes.

  • The rum_TYPE_ops class supports the following data types and operators:

    Data types: int2, int4, int8, float4, float8, money, oid, time, timetz, date, interval, macaddr, inet, cidr, text, varchar, char, bytea, bit, varbit, numeric, timestamp, and timestamptz.

    Operators: The <, <=,=, >=, and > operators are applicable to all data types. The <=>, <=|, and |=> operators are applicable to int2, int4, int8, float4, float8, money, oid, timestamp, and timestamptz data types.

  • The rum_tsvector_addon_ops class stores tsvector lexemes and any lexemes supported by the module field.

    Examples:

    1. Prepare data:

      CREATE TABLE tsts (id int, t tsvector, d timestamp);
      \copy tsts from 'external/rum/data/tsts.data'
      CREATE INDEX tsts_idx ON tsts USING rum (t rum_tsvector_addon_ops, d) WITH (attach = 'd', to = 't');
    2. Execute the following statements:

      EXPLAIN (costs off)
      SELECT id, d, d <=> '2016-05-16 14:21:25'
      FROM tsts
      WHERE t @@ 'wr&qh'
      ORDER BY d <=> '2016-05-16 14:21:25'
      LIMIT 5;
      
                                          QUERY PLAN
      -----------------------------------------------------------------------------------
       Limit
         ->  Index Scan using tsts_idx on tsts
               Index Cond: (t @@ '''wr'' & ''qh'''::tsquery)
               Order By: (d <=> '2016-05-16 14:21:25'::timestamp without time zone)
      (4 rows)
    3. Execute the following query statement:

      SELECT id, d, d <=> '2016-05-16 14:21:25'
      FROM tsts
      WHERE t @@ 'wr&qh'
      ORDER BY d <=> '2016-05-16 14:21:25'
      LIMIT 5;

      Sample result:

       id  |             d              |   ?column?
      -----+----------------------------+---------------
       355 | 2016-05-16 14:21:22.326724 |      2.673276
       354 | 2016-05-16 13:21:22.326724 |   3602.673276
       371 | 2016-05-17 06:21:22.326724 |  57597.326724
       406 | 2016-05-18 17:21:22.326724 | 183597.326724
       415 | 2016-05-19 02:21:22.326724 | 215997.326724
      (5 rows)
      Note

      Because posting trees have fixed length right bound and fixed length non-leaf posting items, RUM may be flawed in creating an index using sorting over pass-by-reference additional information.

  • The rum_tsvector_hash_addon_ops class stores the hash values of tsvector lexemes and any lexemes supported by the module field.

  • The rum_tsquery_ops class is applicable to the tsquery data type and stores branches of the query tree in additional information.

    Examples:

    1. Prepare data:

      CREATE TABLE test_array (i int2[]);
      INSERT INTO test_array VALUES ('{}'), ('{0}'), ('{1,2,3,4}'), ('{1,2,3}'), ('{1,2}'), ('{1}');
      CREATE INDEX idx_array ON test_array USING rum (i rum_anyarray_ops);
    2. Execute the following statements:

      SET enable_seqscan TO off;
      EXPLAIN (COSTS OFF)
      SELECT *
      FROM test_array
      WHERE i && '{1}'
      ORDER BY i <=> '{1}' ASC;
      
                      QUERY PLAN
      ------------------------------------------
       Index Scan using idx_array on test_array
         Index Cond: (i && '{1}'::smallint[])
         Order By: (i <=> '{1}'::smallint[])
      (3 rows)
    3. Execute the following query statement:

      SELECT *
      FROM test_array
      WHERE i && '{1}'
      ORDER BY i <=> '{1}' ASC;

      Sample result:

           i
      -----------
       {1}
       {1,2}
       {1,2,3}
       {1,2,3,4}
      (4 rows)
  • The rum_anyarray_ops class stores anyarrray elements with the array length. Supported operators include &&, @>, <@,=, and %. Sorting by the <=> operators is supported.

    Examples:

    1. Prepare data:

      CREATE TABLE test_array (i int2[]);
      INSERT INTO test_array VALUES ('{}'), ('{0}'), ('{1,2,3,4}'), ('{1,2,3}'), ('{1,2}'), ('{1}');
      CREATE INDEX idx_array ON test_array USING rum (i rum_anyarray_ops);
    2. Execute the following statements:

      SET enable_seqscan TO off;
      
      EXPLAIN (COSTS OFF)
      SELECT *
      FROM test_array
      WHERE i && '{1}'
      ORDER BY i <=> '{1}' ASC;
      
                      QUERY PLAN
      ------------------------------------------
       Index Scan using idx_array on test_array
         Index Cond: (i && '{1}'::smallint[])
         Order By: (i <=> '{1}'::smallint[])
      (3 rows)
    3. Execute the following query statement:

      SELECT *
      FROM test_array
      WHERE i && '{1}'
      ORDER BY i <=> '{1}' ASC;

      Sample result:

           i
      -----------
       {1}
       {1,2}
       {1,2,3}
       {1,2,3,4}
      (4 rows)
  • The rum_anyarray_addon_ops class stores anyarrray elements and any elements supported by the module field.

Uninstall the RUM extension

You can execute the following SQL statement to uninstall the RUM extension.

DROP EXTENSION rum;

References

For more information, see rum.