By digoal
PolarDB's is a cloud-native database that adopts a computing-storage separation architecture. It offers affordable data storage, efficient scalability, high-speed multi-machine parallel computing capabilities, and fast data search and processing. By integrating computing algorithms, PolarDB drives the generation of business data value and transforms data into productivity.
This article focuses on how the open-source version of PolarDB utilizes parray_gin to implement efficient fuzzy searches for elements in arrays and JSON.
The test environment used in this article is macos + docker. For more information on deploying PolarDB, please see:
PG fuzzy searches use Generalized Inverted Index (GIN) indexes. The pg_trgm extension is applied to add a space before a string and two spaces after it. Then, the string is split into tokens based on consecutive groups of three characters, and inverted indexes of tokens and ctids are created.
During fuzzy searches, the search string can be split in the same manner, allowing for the quick location of the corresponding CTID based on the inverted search.
Even without using parray_gin, we can still implement fuzzy searches for elements in arrays or JSON by formatting them as large strings and utilizing the pg_trgm extension. For example:
array['abc','aaa','hello']
In this case, the element content's sep char and quote char are escaped, and 'abc', 'aaa', 'hello' are treated as strings. By enabling the pg_trgm extension and creating a GIN index, fuzzy searches can be performed.
If you need to specify a prefix or suffix for searching an element, you can include the sep char and quote char. However, using parray_gin simplifies the process.
In the following example, PolarDB + parray_gin is used to implement fuzzy searches for elements in an array.
git clone --depth 1 http://github.com/theirix/parray_gin/
cd parray_gin/
USE_PGXS=1 make
USE_PGXS=1 make install
export PGHOST=localhost
[postgres@1bbb8082aa60 parray_gin]$ psql
psql (11.9)
Type "help" for help.
postgres=# \q
[postgres@1bbb8082aa60 parray_gin]$ USE_PGXS=1 make installcheck
/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/pgxs/src/makefiles/../../src/test/regress/pg_regress --inputdir=./ --bindir='/home/postgres/tmp_basedir_polardb_pg_1100_bld/bin' --inputdir=test --dbname=contrib_regression op index
(using postmaster on localhost, default port)
============== dropping database "contrib_regression" ==============
NOTICE: database "contrib_regression" does not exist, skipping
DROP DATABASE
============== creating database "contrib_regression" ==============
CREATE DATABASE
ALTER DATABASE
============== running regression test queries ==============
test op ... ok
test index ... ok
==========================================================
All 2 tests passed.
POLARDB:
All 2 tests, 0 tests in ignore, 0 tests in polar ignore.
==========================================================
create table t (id int, info text[]);
create or replace function gen_text_arr(int) returns text[] as $$
select array(select md5(random()::text) from generate_series(1,$1));
$$ language sql strict;
postgres=# select gen_text_arr(10);
-[ RECORD 1 ]+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
gen_text_arr | {4134ee81fcdc29da486df37a1725e1cc,d0bb424307f93a6374d1af5a4b1c0451,def4b4bc24bc6aefb084df8a1571d773,aff17d39b2c3e8ccebf1059c2cd466dc,3988cb3f89372081c6444b7f8a825cf6,77d3a12d9a5159bd2e11fac1782eaf90,0ecac2cd508f60221b31934ea1128223,622819cfa7c3e3e600f70ed90265edaa,e9311e8d6f23be74b2e73eae4408aaa8,207eb23a50212cb101f83a6041211d90}
postgres=# insert into t select id , gen_text_arr(10) from generate_series(1,1000) id;
INSERT 0 1000
postgres=# select * from t where info @@> array['%4b1%'];
id |
info
-----+----------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------
14 | {745b761d7145edb79904c5217c0ec0b4,eab9d9d4de9afc8c7a2bc4cdcd3bcb2a,3116cd48046936709c56e952f5d50380,642eec5d3c17721dadb89759ac116821,49ba14c3c71b73c0a3b8
6aa6f20a4f9c,01632c5889d4ae642422fea8620187e1,078ea7bf29a6f8bf53c6abcec98df5ad,2548e08ad3cb87dfcfe55a86e47cc60f,0c7002203e72d854f9c0643bec6c59b7,cfdd57d32f4bcee
8b4b1adfe11a08a81}
33 | {639e7f990ef271b24b1ac1a1f154476b,5c0dd44f87821cf555fb579f2dd9871d,b3118d34a6f788ad9c9d3343743900bc,798abd4aece1cbe604e608294227dde6,f08757d02fd0db9d08c9
2240c55ec14b,54f206220cf2097f0e2a6f630a7871be,585d04664a022ab49607d0d6ff18fc89,f5681d20b2b923973652f9952df6b71d,1d204241c105c78ba0514bdf1dba6bbb,5f427b5c2b65e0d
e41b70e804dfcc41d}
...
postgres=# select * from t where info @@> array['%4b1ac%'];
id |
info
----+-----------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------
33 | {639e7f990ef271b24b1ac1a1f154476b,5c0dd44f87821cf555fb579f2dd9871d,b3118d34a6f788ad9c9d3343743900bc,798abd4aece1cbe604e608294227dde6,f08757d02fd0db9d08c92
240c55ec14b,54f206220cf2097f0e2a6f630a7871be,585d04664a022ab49607d0d6ff18fc89,f5681d20b2b923973652f9952df6b71d,1d204241c105c78ba0514bdf1dba6bbb,5f427b5c2b65e0de
41b70e804dfcc41d}
(1 row)
postgres=# select * from t where info @@> array['%4b1acd%'];
id | info
----+------
(0 rows)
postgres=# explain select * from t where info @@> array['%4b1ac%'];
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=28.01..32.02 rows=1 width=36)
Recheck Cond: (info @@> '{%4b1ac%}'::text[])
-> Bitmap Index Scan on t_info_idx (cost=0.00..28.01 rows=1 width=0)
Index Cond: (info @@> '{%4b1ac%}'::text[])
(4 rows)
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t where info @@> array['%4b1ac%', '%8fc89'];
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.t (cost=60.01..64.02 rows=1 width=36) (actual time=0.121..0.122 rows=1 loops=1)
Output: id, info
Recheck Cond: (t.info @@> '{%4b1ac%,%8fc89}'::text[])
Heap Blocks: exact=1
Buffers: shared hit=16
-> Bitmap Index Scan on t_info_idx (cost=0.00..60.01 rows=1 width=0) (actual time=0.109..0.109 rows=1 loops=1)
Index Cond: (t.info @@> '{%4b1ac%,%8fc89}'::text[])
Buffers: shared hit=15
Planning Time: 0.075 ms
Execution Time: 0.144 ms
(10 rows)
Time: 0.699 ms
postgres=# select * from t where info @@> array['%4b1ac%', '%8fc89'];
id |
info
----+-----------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------
33 | {639e7f990ef271b24b1ac1a1f154476b,5c0dd44f87821cf555fb579f2dd9871d,b3118d34a6f788ad9c9d3343743900bc,798abd4aece1cbe604e608294227dde6,f08757d02fd0db9d08c92
240c55ec14b,54f206220cf2097f0e2a6f630a7871be,585d04664a022ab49607d0d6ff18fc89,f5681d20b2b923973652f9952df6b71d,1d204241c105c78ba0514bdf1dba6bbb,5f427b5c2b65e0de
41b70e804dfcc41d}
(1 row)
Time: 0.733 ms
postgres=# insert into t select id , gen_text_arr(10) from generate_series(1,120000) id;
INSERT 0 100000
Time: 9242.877 ms (00:09.243)
postgres=# \dt+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+----------+-------+-------------
public | t | table | postgres | 50 MB |
(1 row)
postgres=# select * from t where info @@> array['%4b1ac%', '%8fc89'];
id |
info
----+-----------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------
33 | {639e7f990ef271b24b1ac1a1f154476b,5c0dd44f87821cf555fb579f2dd9871d,b3118d34a6f788ad9c9d3343743900bc,798abd4aece1cbe604e608294227dde6,f08757d02fd0db9d08c92
240c55ec14b,54f206220cf2097f0e2a6f630a7871be,585d04664a022ab49607d0d6ff18fc89,f5681d20b2b923973652f9952df6b71d,1d204241c105c78ba0514bdf1dba6bbb,5f427b5c2b65e0de
41b70e804dfcc41d}
(1 row)
Time: 4.783 ms
postgres=# explain (analyze,timing,costs,buffers,verbose) select * from t where info @@> array['%4b1ac%', '%8fc89'];
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.t (cost=96.94..529.04 rows=121 width=36) (actual time=4.114..4.115 rows=1 loops=1)
Output: id, info
Recheck Cond: (t.info @@> '{%4b1ac%,%8fc89}'::text[])
Heap Blocks: exact=1
Buffers: shared hit=48
-> Bitmap Index Scan on t_info_idx (cost=0.00..96.91 rows=121 width=0) (actual time=4.103..4.103 rows=1 loops=1)
Index Cond: (t.info @@> '{%4b1ac%,%8fc89}'::text[])
Buffers: shared hit=47
Planning Time: 0.090 ms
Execution Time: 4.170 ms
(10 rows)
The performance of full table scans is several dozen times better.
postgres=# set enable_bitmapscan =off;
SET
Time: 0.473 ms
postgres=# explain (analyze,timing,costs,buffers,verbose) select * from t where info @@> array['%4b1ac%', '%8fc89'];
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Seq Scan on public.t (cost=0.00..7881.50 rows=121 width=36) (actual time=0.632..193.929 rows=1 loops=1)
Output: id, info
Filter: (t.info @@> '{%4b1ac%,%8fc89}'::text[])
Rows Removed by Filter: 120999
Buffers: shared hit=6229 read=140
Planning Time: 0.081 ms
Execution Time: 193.947 ms
(7 rows)
Time: 194.697 ms
postgres=# select * from t where info @@> array['%4b1ac%', '%8fc89'];
id |
info
----+-----------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------
33 | {639e7f990ef271b24b1ac1a1f154476b,5c0dd44f87821cf555fb579f2dd9871d,b3118d34a6f788ad9c9d3343743900bc,798abd4aece1cbe604e608294227dde6,f08757d02fd0db9d08c92
240c55ec14b,54f206220cf2097f0e2a6f630a7871be,585d04664a022ab49607d0d6ff18fc89,f5681d20b2b923973652f9952df6b71d,1d204241c105c78ba0514bdf1dba6bbb,5f427b5c2b65e0de
41b70e804dfcc41d}
(1 row)
Time: 199.342 ms
With parray_gin, the data structure can be designed more flexibly. For example, the "multiple tags and multiple information" of "a time period, a group and an object" can be packed into a row array for storage. If the array is searched for elements, the qualified "a time period, a group and an object" can be quickly matched.
Open Source PolarDB Uses duckdb_fdw to Support Parquet and Efficient OLAP
digoal - December 11, 2019
digoal - September 12, 2019
digoal - December 6, 2023
digoal - December 11, 2019
digoal - September 12, 2019
digoal - October 18, 2023
Alibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreAlibaba Cloud PolarDB for Xscale (PolarDB-X) is a cloud-native high-performance distributed database service independently developed by Alibaba Cloud.
Learn MoreAlibaba Cloud PolarDB for MySQL is a cloud-native relational database service 100% compatible with MySQL.
Learn MoreMulti-source metrics are aggregated to monitor the status of your business and services in real time.
Learn MoreMore Posts by digoal