By digoal
PolarDB's is a cloud-native database that adopts a computing-storage separation architecture. It offers affordable data storage, efficient scalability, high-speed multi-machine parallel computing capabilities, and fast data search and processing. By integrating computing algorithms, PolarDB drives the generation of business data value and transforms data into productivity.
This article describes how open source PolarDB uses rum to implement efficient searches and sorting.
The test environment used in this article is macos + docker. For more information on deploying PolarDB, please see:
When designing a search function, it is crucial to consider that users often have multiple results that meet the search criteria and require sorting based on other conditions. In a typical database, achieving such searches can only be done through full table scanning, resulting in extremely poor performance. A few queries can consume the CPU, memory, and I/O resources of the database, impacting business operations. Database administrators are often unfairly blamed, when in reality, it is the database itself that lacks efficient algorithms and storage structures.
According to the theory of intergenerational transfer, we are consuming the future's creativity, and the outdated approach of brute force (heap machines) for problem-solving must be replaced with more efficient and energy-saving methods.
This is precisely what the article aims to introduce today. The performance can be easily improved by up to 70 times (even greater for large datasets).
rum utilizes an inverted structure but includes additional storage for addon values within elements. Each matching element entry sorts and arranges its ctid(s) based on the addon column value.
Additionally, since equivalent matching of multi-value columns is common, rum offers a variant where elements can be hashed for storage, thereby improving index storage efficiency. After hashing, only equivalent matching of elements is supported, similar to hash indexes.
1. rum storage:
1.1 ops
k,v structure: ele,ctid(s)
1.2 addon_ops
k,v structure: ele,kv(s)
kv(s) structure: addon,ctid(s) (construct btree based on addon values)
1.3 hash_ops
k,v structure: ele_hash:ctid(s)
1.4 hash_addon_ops
k,v structure: ele_hash:kv(s)
kv(s) structure: addon:ctid(s) (construct btree based on addon values)
Hash ops and hash addon ops do not support searching for ele by prefix, because elements have been converted into hash values and stored in the index, and can only be used for equivalent matching.
Hash ops supports sorting by distance (<=>
). The distance calculation relies on the proportion of elements matched by equivalent values. It should be noted that sorting and searching by direction, such as prefix search and prefix sort (<=| and |=>
), are not supported.
2. sort compute:
2.1 ele key sort:
ele distance : dist_fun(ctid , $)
2.2 addon ele sort:
addon distance : dist_fun(ctid , $)
Sorting by ele or by addon after matching the ele criteria
Hash ops can be sorted by addon (prefix supports <=>, <=| and |=> ), because the addon column is not hashed.
git clone --depth 1 https://github.com/postgrespro/rum
cd rum
USE_PGXS=1 make
USE_PGXS=1 make install
USE_PGXS=1 make installcheck
[postgres@aa25c5be9681 rum]$ USE_PGXS=1 make installcheck
/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/pgxs/src/makefiles/../../src/test/regress/pg_regress --inputdir=./ --bindir='/home/postgres/tmp_basedir_polardb_pg_1100_bld/bin' --dbname=contrib_regression security rum rum_validate rum_hash ruminv timestamp orderby orderby_hash altorder altorder_hash limits int2 int4 int8 float4 float8 money oid time timetz date interval macaddr inet cidr text varchar char bytea bit varbit numeric rum_weight array
(using postmaster on 127.0.0.1, default port)
============== dropping database "contrib_regression" ==============
NOTICE: database "contrib_regression" does not exist, skipping
DROP DATABASE
============== creating database "contrib_regression" ==============
CREATE DATABASE
ALTER DATABASE
============== running regression test queries ==============
test security ... ok
test rum ... ok
test rum_validate ... ok
test rum_hash ... ok
test ruminv ... ok
test timestamp ... ok
test orderby ... ok
test orderby_hash ... ok
test altorder ... ok
test altorder_hash ... ok
test limits ... ok
test int2 ... ok
test int4 ... ok
test int8 ... ok
test float4 ... ok
test float8 ... ok
test money ... ok
test oid ... ok
test time ... ok
test timetz ... ok
test date ... ok
test interval ... ok
test macaddr ... ok
test inet ... ok
test cidr ... ok
test text ... ok
test varchar ... ok
test char ... ok
test bytea ... ok
test bit ... ok
test varbit ... ok
test numeric ... ok
test rum_weight ... ok
test array ... ok
===========================================================
All 34 tests passed.
POLARDB:
All 34 tests, 0 tests in ignore, 0 tests in polar ignore.
===========================================================
psql
create extension rum;
Profile business and search business:
• Tag matching + weight sorting
• Label matching + time sorting
For example, article search and algorithm example:
• Pay attention to the relevance of the content of the article, and return the first 10 articles in the order of publication time.
• Pay attention to the relevance of the content of the article, and return the first 10 articles in the order of the weight of the article (such as the weight calculated by advertising fee).
1. Create a test table
create table tbl (id int, info tsvector, weight float4);
2. Write test data
insert into tbl select id, to_tsvector('hello i am tom lane, i love postgresql'), random()*100 from generate_series(1,100000) id;
insert into tbl select id, to_tsvector('hello i am digoal, i love polardb at aliyun at china.'), random()*100 from generate_series(1,2000) id;
3. Create a rum index and attach the weight to the ctid(s) of the rum index.
create index on tbl using rum (info rum_tsvector_hash_addon_ops, weight) with (attach = 'weight', to = 'info');
4. Use rum to search and sort efficiently
4.1 Match the string. Is the greater the weight, the better?
select *, weight <=| '100000'::float4 from tbl where info @@ 'digoal&polardb' ORDER BY weight <=| '100000'::float4 limit 10;
id | info | weight | ?column?
------+-----------------------------------------------------------------+---------+------------------
1078 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 99.9273 | 99900.0727005005
877 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 99.8128 | 99900.187171936
118 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 99.8049 | 99900.1951217651
881 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 99.4699 | 99900.5300979614
1257 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 99.4317 | 99900.5682678223
459 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 99.3815 | 99900.6185073853
1306 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 99.3271 | 99900.6729354858
300 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 99.2863 | 99900.713722229
313 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 99.2731 | 99900.7268676758
618 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 99.2371 | 99900.7629318237
(10 rows)
select *, weight <=| '100000'::float4 from tbl where info @@ 'digoal' ORDER BY weight <=| '100000'::float4 limit 10;
id | info | weight | ?column?
------+-----------------------------------------------------------------+---------+------------------
1078 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 99.9273 | 99900.0727005005
877 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 99.8128 | 99900.187171936
118 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 99.8049 | 99900.1951217651
881 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 99.4699 | 99900.5300979614
1257 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 99.4317 | 99900.5682678223
459 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 99.3815 | 99900.6185073853
1306 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 99.3271 | 99900.6729354858
300 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 99.2863 | 99900.713722229
313 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 99.2731 | 99900.7268676758
618 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 99.2371 | 99900.7629318237
(10 rows)
postgres=# explain (analyze,verbose,timing,costs,buffers) select *, weight <=| '100000'::float4 from tbl where info @@ 'digoal&polardb' ORDER BY weight <=| '100000'::float4 limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=24.00..40.06 rows=3 width=48) (actual time=2.081..2.097 rows=10 loops=1)
Output: id, info, weight, ((weight <=| '100000'::real))
Buffers: shared hit=21
-> Index Scan using tbl_info_weight_idx on public.tbl (cost=24.00..40.06 rows=3 width=48) (actual time=2.079..2.093 rows=10 loops=1)
Output: id, info, weight, (weight <=| '100000'::real)
Index Cond: (tbl.info @@ '''digoal'' & ''polardb'''::tsquery)
Order By: (tbl.weight <=| '100000'::real)
Buffers: shared hit=21
Planning Time: 0.160 ms
Execution Time: 2.149 ms
(10 rows)
postgres=# explain (analyze,verbose,timing,costs,buffers) select *, weight <=| '100000'::float4 from tbl where info @@ 'digoal' ORDER BY weight <=| '100000'::float4 limit 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=12.00..35.81 rows=10 width=89) (actual time=1.598..1.616 rows=10 loops=1)
Output: id, info, weight, ((weight <=| '100000'::real))
Buffers: shared hit=15
-> Index Scan using tbl_info_weight_idx on public.tbl (cost=12.00..4869.90 rows=2040 width=89) (actual time=1.596..1.612 rows=10 loops=1)
Output: id, info, weight, (weight <=| '100000'::real)
Index Cond: (tbl.info @@ '''digoal'''::tsquery)
Order By: (tbl.weight <=| '100000'::real)
Buffers: shared hit=15
Planning Time: 0.104 ms
Execution Time: 1.655 ms
(10 rows)
4.2 Is the smaller the weight, the better?
select *, weight |=> '-1'::float4 from tbl where info @@ 'digoal' ORDER BY weight |=> '-1'::float4 limit 10;
id | info | weight | ?column?
------+-----------------------------------------------------------------+-----------+------------------
554 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 0.0363963 | 1.03639627248049
192 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 0.0421133 | 1.04211333394051
757 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 0.124864 | 1.12486390769482
1855 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 0.125145 | 1.1251448392868
191 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 0.134997 | 1.1349972486496
60 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 0.179037 | 1.1790367513895
1580 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 0.21992 | 1.21991994976997
1432 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 0.244062 | 1.24406225979328
719 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 0.244155 | 1.2441546022892
81 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 0.329849 | 1.32984939217567
(10 rows)
postgres=# explain select *, weight |=> '-1'::float4 from tbl where info @@ 'digoal' ORDER BY weight |=> '-1'::float4 limit 10;
QUERY PLAN
---------------------------------------------------------------------------------------------
Limit (cost=12.00..35.53 rows=10 width=89)
-> Index Scan using tbl_info_weight_idx on tbl (cost=12.00..4955.27 rows=2101 width=89)
Index Cond: (info @@ '''digoal'''::tsquery)
Order By: (weight |=> '-1'::real)
(4 rows)
4.3 Is it better to get as close to a designated weight as possible?
postgres=# select *, weight <=> '50'::float4 from tbl where info @@ 'digoal' ORDER BY weight <=> '50'::float4 limit 10;
id | info | weight | ?column?
------+-----------------------------------------------------------------+---------+--------------------
38 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 49.9803 | 0.0197181701660156
1590 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 50.1099 | 0.109916687011719
1153 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 50.1187 | 0.118724822998047
884 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 50.1466 | 0.146591186523438
1329 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 50.1551 | 0.155113220214844
303 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 49.8312 | 0.168792724609375
568 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 50.1816 | 0.181587219238281
1706 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 49.8142 | 0.185768127441406
1136 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 49.8095 | 0.190475463867188
1838 | 'aliyun':9 'china':11 'digoal':4 'hello':1 'love':6 'polardb':7 | 49.7829 | 0.217105865478516
(10 rows)
postgres=# explain select *, weight <=> '50'::float4 from tbl where info @@ 'digoal' ORDER BY weight <=> '50'::float4 limit 10;
QUERY PLAN
---------------------------------------------------------------------------------------------
Limit (cost=12.00..35.53 rows=10 width=89)
-> Index Scan using tbl_info_weight_idx on tbl (cost=12.00..4955.27 rows=2101 width=89)
Index Cond: (info @@ '''digoal'''::tsquery)
Order By: (weight <=> '50'::real)
(4 rows)
Without rum, this search requires a full table match, resulting in very poor performance.
Using only the GIN index allows for searching only multi-value elements, which cannot accelerate the index for weighted sorting. Sorting needs to be done back in the table, leading to poor performance.
postgres=# select relpages from pg_class where relname='tbl';
relpages
----------
1462
(1 row)
postgres=# set enable_indexscan = off;
postgres=# set enable_bitmapscan = off;
postgres=# explain (analyze,verbose,timing,costs,buffers) select *, weight <=| '100000'::float4 from tbl where info @@ 'digoal' ORDER BY weight <=| '100000'::float4 limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Limit (cost=2787.65..2787.68 rows=10 width=89) (actual time=27.140..27.144 rows=10 loops=1)
Output: id, info, weight, ((weight <=| '100000'::real))
Buffers: shared hit=1462
-> Sort (cost=2787.65..2792.91 rows=2101 width=89) (actual time=27.139..27.141 rows=10 loops=1)
Output: id, info, weight, ((weight <=| '100000'::real))
Sort Key: ((tbl.weight <=| '100000'::real))
Sort Method: top-N heapsort Memory: 27kB
Buffers: shared hit=1462
-> Seq Scan on public.tbl (cost=0.00..2742.25 rows=2101 width=89) (actual time=25.883..26.810 rows=2000 loops=1)
Output: id, info, weight, (weight <=| '100000'::real)
Filter: (tbl.info @@ '''digoal'''::tsquery)
Rows Removed by Filter: 100000
Buffers: shared hit=1462
Planning Time: 0.078 ms
Execution Time: 27.171 ms
(15 rows)
rum enables precise searching of multi-value columns and supports intra-index sorting, providing the best performance.
In the above example, only 102,000 records were processed. The data blocks scanned by rum were reduced by 70 times. If the dataset grows larger, even fewer data blocks will be scanned, resulting in a more significant improvement in performance.
Open Source PolarDB Uses pg_rational Extension to Support Stern-Brocot Trees
digoal - May 28, 2024
Alibaba Clouder - December 11, 2017
digoal - September 12, 2019
digoal - September 12, 2019
digoal - March 20, 2019
digoal - December 11, 2019
Alibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreAlibaba Cloud PolarDB for Xscale (PolarDB-X) is a cloud-native high-performance distributed database service independently developed by Alibaba Cloud.
Learn MoreAlibaba Cloud PolarDB for MySQL is a cloud-native relational database service 100% compatible with MySQL.
Learn MoreMulti-source metrics are aggregated to monitor the status of your business and services in real time.
Learn MoreMore Posts by digoal