×
Community Blog Open Source PolarDB Uses parray_gin to Implement Efficient Fuzzy Searches for Elements in Arrays and JSON

Open Source PolarDB Uses parray_gin to Implement Efficient Fuzzy Searches for Elements in Arrays and JSON

This article focuses on how the open-source version of PolarDB utilizes parray_gin to implement efficient fuzzy searches for elements in arrays and JSON.

By digoal

Background

PolarDB's is a cloud-native database that adopts a computing-storage separation architecture. It offers affordable data storage, efficient scalability, high-speed multi-machine parallel computing capabilities, and fast data search and processing. By integrating computing algorithms, PolarDB drives the generation of business data value and transforms data into productivity.

This article focuses on how the open-source version of PolarDB utilizes parray_gin to implement efficient fuzzy searches for elements in arrays and JSON.

The test environment used in this article is macos + docker. For more information on deploying PolarDB, please see:

Test

PG fuzzy searches use Generalized Inverted Index (GIN) indexes. The pg_trgm extension is applied to add a space before a string and two spaces after it. Then, the string is split into tokens based on consecutive groups of three characters, and inverted indexes of tokens and ctids are created.

During fuzzy searches, the search string can be split in the same manner, allowing for the quick location of the corresponding CTID based on the inverted search.

Even without using parray_gin, we can still implement fuzzy searches for elements in arrays or JSON by formatting them as large strings and utilizing the pg_trgm extension. For example:

array['abc','aaa','hello']  

In this case, the element content's sep char and quote char are escaped, and 'abc', 'aaa', 'hello' are treated as strings. By enabling the pg_trgm extension and creating a GIN index, fuzzy searches can be performed.

If you need to specify a prefix or suffix for searching an element, you can include the sep char and quote char. However, using parray_gin simplifies the process.

In the following example, PolarDB + parray_gin is used to implement fuzzy searches for elements in an array.

git clone --depth 1 http://github.com/theirix/parray_gin/  
  
cd parray_gin/  
  
USE_PGXS=1 make  
  
USE_PGXS=1 make install  
  
export PGHOST=localhost  
  
[postgres@1bbb8082aa60 parray_gin]$ psql  
psql (11.9)  
Type "help" for help.  
  
postgres=# \q  
[postgres@1bbb8082aa60 parray_gin]$ USE_PGXS=1 make installcheck  
/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/pgxs/src/makefiles/../../src/test/regress/pg_regress --inputdir=./ --bindir='/home/postgres/tmp_basedir_polardb_pg_1100_bld/bin'      --inputdir=test --dbname=contrib_regression op index  
(using postmaster on localhost, default port)  
============== dropping database "contrib_regression" ==============  
NOTICE:  database "contrib_regression" does not exist, skipping  
DROP DATABASE  
============== creating database "contrib_regression" ==============  
CREATE DATABASE  
ALTER DATABASE  
============== running regression test queries        ==============  
test op                           ... ok  
test index                        ... ok  
  
  
==========================================================  
 All 2 tests passed.   
  
 POLARDB:  
 All 2 tests, 0 tests in ignore, 0 tests in polar ignore.   
==========================================================  
create table t (id int, info text[]);  
  
create or replace function gen_text_arr(int) returns text[] as $$  
  select array(select md5(random()::text) from generate_series(1,$1));  
$$ language sql strict;  
  
postgres=# select gen_text_arr(10);  
-[ RECORD 1 ]+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
gen_text_arr | {4134ee81fcdc29da486df37a1725e1cc,d0bb424307f93a6374d1af5a4b1c0451,def4b4bc24bc6aefb084df8a1571d773,aff17d39b2c3e8ccebf1059c2cd466dc,3988cb3f89372081c6444b7f8a825cf6,77d3a12d9a5159bd2e11fac1782eaf90,0ecac2cd508f60221b31934ea1128223,622819cfa7c3e3e600f70ed90265edaa,e9311e8d6f23be74b2e73eae4408aaa8,207eb23a50212cb101f83a6041211d90}  
  
postgres=# insert into t select id , gen_text_arr(10) from generate_series(1,1000) id;  
INSERT 0 1000  
  
postgres=# select * from t where info @@> array['%4b1%'];  
 id  |                                                                                                                                                            
          info                                                                                                                                                    
                     
-----+----------------------------------------------------------------------------------------------------------------------------------------------------------  
----------------------------------------------------------------------------------------------------------------------------------------------------------------  
-------------------  
  14 | {745b761d7145edb79904c5217c0ec0b4,eab9d9d4de9afc8c7a2bc4cdcd3bcb2a,3116cd48046936709c56e952f5d50380,642eec5d3c17721dadb89759ac116821,49ba14c3c71b73c0a3b8  
6aa6f20a4f9c,01632c5889d4ae642422fea8620187e1,078ea7bf29a6f8bf53c6abcec98df5ad,2548e08ad3cb87dfcfe55a86e47cc60f,0c7002203e72d854f9c0643bec6c59b7,cfdd57d32f4bcee  
8b4b1adfe11a08a81}  
  33 | {639e7f990ef271b24b1ac1a1f154476b,5c0dd44f87821cf555fb579f2dd9871d,b3118d34a6f788ad9c9d3343743900bc,798abd4aece1cbe604e608294227dde6,f08757d02fd0db9d08c9  
2240c55ec14b,54f206220cf2097f0e2a6f630a7871be,585d04664a022ab49607d0d6ff18fc89,f5681d20b2b923973652f9952df6b71d,1d204241c105c78ba0514bdf1dba6bbb,5f427b5c2b65e0d  
e41b70e804dfcc41d}  
...  
  
postgres=# select * from t where info @@> array['%4b1ac%'];  
 id |                                                                                                                                                             
         info                                                                                                                                                     
                    
----+-----------------------------------------------------------------------------------------------------------------------------------------------------------  
----------------------------------------------------------------------------------------------------------------------------------------------------------------  
------------------  
 33 | {639e7f990ef271b24b1ac1a1f154476b,5c0dd44f87821cf555fb579f2dd9871d,b3118d34a6f788ad9c9d3343743900bc,798abd4aece1cbe604e608294227dde6,f08757d02fd0db9d08c92  
240c55ec14b,54f206220cf2097f0e2a6f630a7871be,585d04664a022ab49607d0d6ff18fc89,f5681d20b2b923973652f9952df6b71d,1d204241c105c78ba0514bdf1dba6bbb,5f427b5c2b65e0de  
41b70e804dfcc41d}  
(1 row)  
  
postgres=# select * from t where info @@> array['%4b1acd%'];  
 id | info   
----+------  
(0 rows)  
  
postgres=# explain select * from t where info @@> array['%4b1ac%'];  
                                QUERY PLAN                                  
--------------------------------------------------------------------------  
 Bitmap Heap Scan on t  (cost=28.01..32.02 rows=1 width=36)  
   Recheck Cond: (info @@> '{%4b1ac%}'::text[])  
   ->  Bitmap Index Scan on t_info_idx  (cost=0.00..28.01 rows=1 width=0)  
         Index Cond: (info @@> '{%4b1ac%}'::text[])  
(4 rows)  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t where info @@> array['%4b1ac%', '%8fc89'];  
                                                     QUERY PLAN                                                       
--------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.t  (cost=60.01..64.02 rows=1 width=36) (actual time=0.121..0.122 rows=1 loops=1)  
   Output: id, info  
   Recheck Cond: (t.info @@> '{%4b1ac%,%8fc89}'::text[])  
   Heap Blocks: exact=1  
   Buffers: shared hit=16  
   ->  Bitmap Index Scan on t_info_idx  (cost=0.00..60.01 rows=1 width=0) (actual time=0.109..0.109 rows=1 loops=1)  
         Index Cond: (t.info @@> '{%4b1ac%,%8fc89}'::text[])  
         Buffers: shared hit=15  
 Planning Time: 0.075 ms  
 Execution Time: 0.144 ms  
(10 rows)  
  
Time: 0.699 ms  
postgres=# select * from t where info @@> array['%4b1ac%', '%8fc89'];  
 id |                                                                                                                                                             
         info                                                                                                                                                     
                    
----+-----------------------------------------------------------------------------------------------------------------------------------------------------------  
----------------------------------------------------------------------------------------------------------------------------------------------------------------  
------------------  
 33 | {639e7f990ef271b24b1ac1a1f154476b,5c0dd44f87821cf555fb579f2dd9871d,b3118d34a6f788ad9c9d3343743900bc,798abd4aece1cbe604e608294227dde6,f08757d02fd0db9d08c92  
240c55ec14b,54f206220cf2097f0e2a6f630a7871be,585d04664a022ab49607d0d6ff18fc89,f5681d20b2b923973652f9952df6b71d,1d204241c105c78ba0514bdf1dba6bbb,5f427b5c2b65e0de  
41b70e804dfcc41d}  
(1 row)  
  
Time: 0.733 ms  
  
postgres=# insert into t select id , gen_text_arr(10) from generate_series(1,120000) id;  
INSERT 0 100000  
Time: 9242.877 ms (00:09.243)  
postgres=# \dt+  
                   List of relations  
 Schema | Name | Type  |  Owner   | Size  | Description   
--------+------+-------+----------+-------+-------------  
 public | t    | table | postgres | 50 MB |   
(1 row)  
  
  
  
postgres=# select * from t where info @@> array['%4b1ac%', '%8fc89'];  
 id |                                                                                                                                                             
         info                                                                                                                                                     
                    
----+-----------------------------------------------------------------------------------------------------------------------------------------------------------  
----------------------------------------------------------------------------------------------------------------------------------------------------------------  
------------------  
 33 | {639e7f990ef271b24b1ac1a1f154476b,5c0dd44f87821cf555fb579f2dd9871d,b3118d34a6f788ad9c9d3343743900bc,798abd4aece1cbe604e608294227dde6,f08757d02fd0db9d08c92  
240c55ec14b,54f206220cf2097f0e2a6f630a7871be,585d04664a022ab49607d0d6ff18fc89,f5681d20b2b923973652f9952df6b71d,1d204241c105c78ba0514bdf1dba6bbb,5f427b5c2b65e0de  
41b70e804dfcc41d}  
(1 row)  
  
Time: 4.783 ms  
  
postgres=# explain (analyze,timing,costs,buffers,verbose) select * from t where info @@> array['%4b1ac%', '%8fc89'];  
                                                      QUERY PLAN                                                        
----------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.t  (cost=96.94..529.04 rows=121 width=36) (actual time=4.114..4.115 rows=1 loops=1)  
   Output: id, info  
   Recheck Cond: (t.info @@> '{%4b1ac%,%8fc89}'::text[])  
   Heap Blocks: exact=1  
   Buffers: shared hit=48  
   ->  Bitmap Index Scan on t_info_idx  (cost=0.00..96.91 rows=121 width=0) (actual time=4.103..4.103 rows=1 loops=1)  
         Index Cond: (t.info @@> '{%4b1ac%,%8fc89}'::text[])  
         Buffers: shared hit=47  
 Planning Time: 0.090 ms  
 Execution Time: 4.170 ms  
(10 rows)  

The performance of full table scans is several dozen times better.

postgres=# set enable_bitmapscan =off;  
SET  
Time: 0.473 ms  
postgres=# explain (analyze,timing,costs,buffers,verbose) select * from t where info @@> array['%4b1ac%', '%8fc89'];  
                                                QUERY PLAN                                                  
----------------------------------------------------------------------------------------------------------  
 Seq Scan on public.t  (cost=0.00..7881.50 rows=121 width=36) (actual time=0.632..193.929 rows=1 loops=1)  
   Output: id, info  
   Filter: (t.info @@> '{%4b1ac%,%8fc89}'::text[])  
   Rows Removed by Filter: 120999  
   Buffers: shared hit=6229 read=140  
 Planning Time: 0.081 ms  
 Execution Time: 193.947 ms  
(7 rows)  
  
Time: 194.697 ms  
postgres=# select * from t where info @@> array['%4b1ac%', '%8fc89'];  
 id |                                                                                                                                                             
         info                                                                                                                                                     
                    
----+-----------------------------------------------------------------------------------------------------------------------------------------------------------  
----------------------------------------------------------------------------------------------------------------------------------------------------------------  
------------------  
 33 | {639e7f990ef271b24b1ac1a1f154476b,5c0dd44f87821cf555fb579f2dd9871d,b3118d34a6f788ad9c9d3343743900bc,798abd4aece1cbe604e608294227dde6,f08757d02fd0db9d08c92  
240c55ec14b,54f206220cf2097f0e2a6f630a7871be,585d04664a022ab49607d0d6ff18fc89,f5681d20b2b923973652f9952df6b71d,1d204241c105c78ba0514bdf1dba6bbb,5f427b5c2b65e0de  
41b70e804dfcc41d}  
(1 row)  
  
Time: 199.342 ms   

With parray_gin, the data structure can be designed more flexibly. For example, the "multiple tags and multiple information" of "a time period, a group and an object" can be packed into a row array for storage. If the array is searched for elements, the qualified "a time period, a group and an object" can be quickly matched.

References

http://github.com/theirix/parray_gin/

0 0 0
Share on

digoal

281 posts | 24 followers

You may also like

Comments

digoal

281 posts | 24 followers

Related Products