The RUM extension provides an alternative to the built-in Generalized Inverted Index (GIN) in PostgreSQL for full-text searches on large volumes of text. You can use the RUM extension to sort search results by relevance, timestamps, or other fields. Traditional GIN indexes require table lookups to sort results, which can cause performance bottlenecks. The RUM extension avoids these lookups by storing position and other required information directly in the index. This can improve query performance by several times in specific scenarios.
Applicability
PolarDB for PostgreSQL supports the following versions:
PostgreSQL 18 (minor engine version 2.0.18.1.2.0 and later)
PostgreSQL 17 (minor engine version 2.0.17.6.4.0 and later)
PostgreSQL 16 (minor engine version 2.0.16.8.3.0 and later)
PostgreSQL 15 (minor engine version 2.0.15.7.1.1 and later)
PostgreSQL 14 (minor engine version 2.0.14.5.3.0 and later)
NoteYou can view the minor engine version in the console or by running the
SHOW polardb_version;statement. If your version does not meet the requirements, upgrade the minor engine version. For more information, see View the minor engine version.Extension compatibility: The
%operator of the RUM extension conflicts with the%operator of thesmlarextension. Because of this conflict, you cannot create and use both extensions in the same database schema. Before you install the RUM extension, confirm that thesmlarextension is not used in your environment, or install the two extensions in different schemas.
Scenarios
Review the following table to understand the core differences between the RUM extension and the native GIN index. This information can help you choose the best option for your business scenario.
Comparison Dimension | GIN Index (Built-in) | RUM Index | Recommendation |
Core Advantage | Good write performance. Relatively small index size. | High sorting performance. Supports phrase searches and sorting by attached columns. | For scenarios that require sorting of search results, the RUM extension is the preferred choice. |
Sorting Performance | Slow. Requires table lookups to obtain sorting data. Performance drops sharply as the data volume increases. | Fast. Sorts directly within the index. No table lookups are needed. | Use the RUM extension if you frequently sort full-text search results, such as by relevance, time, or price. |
Phrase Search | Slow. Requires table lookups to obtain word position information to verify phrases. | Fast. Word position information is already stored in the index. No table lookups are needed. | Use the RUM extension for scenarios that require high-performance phrase searches. |
Sorting by Attached Columns | Not supported. Cannot store information from extra columns, such as timestamps, in the index. | Supported. You can attach other columns to the index for high-performance custom sorting. | The RUM extension has a significant advantage in scenarios that require sorting by fields such as article publication time or update time. |
Write Performance | Relatively fast. | Relatively slow because a more complex index schema must be maintained. | For write-intensive tables with frequent |
Index Size | Relatively small. | Relatively large because it requires extra space to store position and additional information. | Evaluate your storage costs. If the index size is a major bottleneck, consider using a RUM |
Prefix Search | Supported. | Supported by | If prefix search is a core requirement, avoid using RUM |
The RUM extension trades space for time. It increases the index size and write overhead to significantly improve performance for specific query scenarios, especially sorting. If your core business involves full-text searches with complex sorting, the RUM extension is an ideal choice. If your business is primarily write-intensive or only requires simple text matching, the GIN index is more cost-effective.
Precautions
Write performance and index size: To speed up queries, a RUM index stores extra information, such as word positions. This makes the index size larger than that of a GIN index and increases the index creation overhead during data write and update operations. Therefore, you must carefully evaluate the cost of the RUM extension in write-intensive scenarios where storage space is a concern.
Scenarios that do not support prefix search: Indexes created with the
rum_tsvector_hash_opsorrum_tsvector_hash_addon_opsoperator classes do not support prefix search. This is because they store the hash values of lexemes, not the original text.
Install and uninstall the extension
Install the extension
Run the following command in your database to create the RUM extension.
CREATE EXTENSION rum;Uninstall the extension
If you no longer need the RUM extension, run the following command to uninstall it.
DROP EXTENSION rum;Usage
The RUM extension provides multiple operator classes to support different data types and query scenarios. You can choose the appropriate operator class to create an index based on your specific needs.
An operator class defines a set of operations that a RUM index uses to process a specific data type. This allows the index to correctly store and retrieve data of that type. Each operator class includes a specific set of operators. When you use these supported operators in a WHERE or ORDER BY clause, PostgreSQL can use the RUM index to speed up the query.
Therefore, choosing the correct operator class is key to ensuring the effectiveness of the RUM index. For more information, see Operator Classes and Operator Families.
Operators
Operator | Supported Types | Return Value Type | Description |
A | Left: |
| Returns whether the full-text vector matches the query condition. It performs a distance calculation. |
A | Left: |
| Returns the distance value between the full-text vector and the query condition. A smaller value indicates higher relevance. |
|
| Returns the absolute difference between two values.
| |
A |
|
| Returns |
A |
|
| Returns |
Operator Class
Operator Class | Applicable Data Types | Key Supported Operators | Core Function and Description |
|
|
| Stores the lexemes and their position information from a |
|
|
| Stores the hash values and position information of
|
|
|
| Performs range queries and distance sorting on non-text, non-array data types. |
|
|
| Attaches data from an additional column (such as a Note The data type of the attached column must be supported by a corresponding |
|
|
| Same function as
|
|
|
| Used to index a |
|
|
| Indexes array types. Supports array operations such as contains and overlap, and supports sorting by the distance between arrays. |
|
|
| Attaches data from an additional column to an array index to support more complex query scenarios. Note The data type of the attached column must be supported by a corresponding |
Scenario 1: Accelerate relevance sorting for full-text search results
When you need to sort full-text search results by relevance, you can use a RUM index to avoid the extra sorting overhead that is required by GIN indexes and achieve high-performance sorting.
Prepare the data: First, create a test table.
CREATE TABLE t1( t text, t_vec tsvector GENERATED ALWAYS AS (to_tsvector('pg_catalog.english', t)) STORED ); -- Insert test data INSERT INTO t1(t) VALUES ('The situation is most beautiful'); INSERT INTO t1(t) VALUES ('It is a beautiful'); INSERT INTO t1(t) VALUES ('It looks like a beautiful place');Create a RUM index: Use the
rum_tsvector_opsoperator class to create a RUM index for thetsvectorcolumn.CREATE INDEX t1_t_vec_idx ON t1 USING rum (t_vec rum_tsvector_ops);Run a relevance sorting query: Use the
<=>operator to query and sort. This operator calculates the distance between the query and the text. A smaller distance indicates higher relevance. Therefore, usingORDER BYsorts the results by relevance.SET enable_seqscan TO off; SELECT t, t_vec <=> to_tsquery('english', 'beautiful | place') AS rank FROM t1 WHERE t_vec @@ to_tsquery('english', 'beautiful | place') ORDER BY t_vec <=> to_tsquery('english', 'beautiful | place');The following result is returned:
t | rank ---------------------------------+--------- It looks like a beautiful place | 8.22467 The situation is most beautiful | 16.4493 It is a beautiful | 16.4493
Scenario 2: Accelerate combined sorting of full-text search and an attached column
In scenarios such as log analysis or E-commerce search, you often need to perform a full-text search and sort the results by an additional field, such as a timestamp or price. The RUM add-on feature lets you store information from an attached column in the index. This enables efficient combined queries and sorting.
Prepare the data: Create a table that contains a
tsvectorcolumn and a timestamp column, and then insert sample data.CREATE TABLE tsts (id int, t tsvector, d timestamp); INSERT INTO tsts VALUES (354, to_tsvector('wr qh'), '2016-05-16 14:21:22.326724'), (355, to_tsvector('wr qh'), '2016-05-16 13:21:22.326724'), (356, to_tsvector('ts op'), '2016-05-16 18:21:22.326724'), (358, to_tsvector('ts op'), '2016-05-16 23:21:22.326724'), (371, to_tsvector('wr qh'), '2016-05-17 06:21:22.326724'), (406, to_tsvector('wr qh'), '2016-05-18 17:21:22.326724'), (415, to_tsvector('wr qh'), '2016-05-19 02:21:22.326724');Create a RUM index with an attached column: Use the
rum_tsvector_addon_opsoperator class and specify the attached column and the primary index column in theWITHclause.CREATE INDEX tsts_idx ON tsts USING rum (t rum_tsvector_addon_ops, d) WITH (attach = 'd', to = 't');NoteThe key syntax
WITH (attach = 'd', to = 't')attaches the values of thedcolumn (the attached column, which is a timestamp) to the index entries of thetcolumn (the primary index column of thetsvectortype). This allows the database to use the index on thetcolumn for full-text search and the attacheddcolumn information for efficient sorting in a single index scan. This process avoids table lookups and significantly improves performance.Run a combined sorting query: Query for records that contain specific words and sort them by how close their timestamp is to a target time.
SET enable_seqscan TO off; EXPLAIN (costs off) SELECT id, d, d <=> '2016-05-16 14:21:25' AS distance FROM tsts WHERE t @@ 'wr&qh' ORDER BY d <=> '2016-05-16 14:21:25' LIMIT 5;The following execution plan shows that both sorting and filtering are completed in a single index scan.
QUERY PLAN ------------------------------------------------------------------------------ Limit -> Index Scan using tsts_idx on tsts Index Cond: (t @@ '''wr'' & ''qh'''::tsquery) Order By: (d <=> '2016-05-16 14:21:25'::timestamp without time zone)The following result is returned:
id | d | distance -----+----------------------------+--------------- 354 | 2016-05-16 14:21:22.326724 | 2.673276 355 | 2016-05-16 13:21:22.326724 | 3602.673276 371 | 2016-05-17 06:21:22.326724 | 57597.326724 406 | 2016-05-18 17:21:22.326724 | 183597.326724 415 | 2016-05-19 02:21:22.326724 | 215997.326724
Scenario 3: Accelerate array queries and similarity sorting
For scenarios such as tagging systems or user personas, you need to efficiently query arrays that contain specific elements and sort them by the degree of overlap or similarity.
Prepare the data:
CREATE TABLE test_array (id serial, i int2[]); INSERT INTO test_array(i) VALUES ('{}'), ('{0}'), ('{1,2,3,4}'), ('{1,2,3}'), ('{1,2}'), ('{1}');Create a RUM index for an array: Use the
rum_anyarray_opsoperator class to create an index for the array column.CREATE INDEX idx_array ON test_array USING rum (i rum_anyarray_ops);Run an array query and sort: Query for records that contain the element
1and sort them by similarity to{1}.SELECT * FROM test_array WHERE i && '{1}' -- The '&&' operator indicates array overlap. ORDER BY i <=> '{1}' ASC; -- The '<=>' operator calculates the distance between arrays. A smaller value indicates greater similarity.The following result is returned:
i ----------- {1} {1,2} {1,2,3} {1,2,3,4}
Scenario 4: Inverted index to quickly match query rules
When building systems for user subscriptions or alert rule matching, you need to quickly match new data, such as an article, against many existing query rules, such as user-subscribed keywords. The RUM extension supports creating an index on the tsquery type to perform efficient inverted matching.
Prepare the query rule data:
CREATE TABLE query (id serial, q tsquery, tag text); INSERT INTO query (q, tag) VALUES ('supernova & star', 'sn'), ('black', 'color'), ('big & bang & black & hole', 'bang'), ('spiral & galaxy', 'shape'), ('black & hole', 'color');Create a RUM index for
tsquery:CREATE INDEX query_idx ON query USING rum(q rum_tsquery_ops);Run an inverted match query: Use the
tsvectorof a new article to match all qualifyingtsqueryrules.SELECT * FROM query WHERE to_tsvector('black holes never exists before we think about them') @@ q;The following result is returned:
id | q | tag -----+----------+------- 2 | 'black' | color