By Digoal
PostgreSQL uses pg_trgm plug-in to support regular expressions and LIKE for fully fuzzy searches (prefix/suffix-free fuzzy searches). However, to support searches for Chinese characters, you need to ensure that lc_ctype <> C.
Although in terms of semantics, the meanings of the following two searches are the same.
select * from test where col like '%xxxxxx%';
select * from test where col ~ 'xxxxxx';
However, different processing logic is used in the internal database processing, which respectively correspond to the following code.
This may cause some performance differences, but the performance of LIKE will be much better.
The processing logic of regular expressions is more complex. Therefore, it is recommended to use LIKE when regular expressions are not necessitated, else use regular expressions.
The following two tests illustrate the performance comparison.
create or replace function gen_hanzi(int) returns text as $$
declare
res text;
begin
if $1 >=1 then
select string_agg(chr(19968+(random()*20901)::int), '') into res from generate_series(1,$1);
return res;
end if;
return null;
end;
$$ language plpgsql strict;
postgres=# create table test(id int, info text);
CREATE TABLE
postgres=# insert into test select generate_series(1,100000), gen_hanzi(100);
INSERT 0 100000
postgres=# create index idx_test_1 on test using gin (info gin_trgm_ops);
CREATE INDEX
Although an index is used, the syntax of regex searches is currently not effective for wchar characters, and the entire GIN tree is scanned.
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '婐绷乂畳';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test (cost=45261409.28..45261421.30 rows=10 width=36) (actual time=583.810..816.503 rows=1 loops=1)
Output: id, info
Recheck Cond: (test.info ~ '婐绷乂畳'::text)
Rows Removed by Index Recheck: 99999
Heap Blocks: exact=4167
Buffers: shared hit=59783
-> Bitmap Index Scan on idx_test_1 (cost=0.00..45261409.28 rows=10 width=0) (actual time=583.237..583.237 rows=100000 loops=1)
Index Cond: (test.info ~ '婐绷乂畳'::text)
Buffers: shared hit=55616
Planning time: 0.150 ms
Execution time: 816.545 ms
(11 rows)
It is important to note that the syntax of regex searches works well for ASCII characters.
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '123';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test (cost=39.40..2897.60 rows=4000 width=36) (actual time=0.046..0.046 rows=0 loops=1)
Output: id, info
Recheck Cond: (test.info ~ '123'::text)
Buffers: shared hit=4
-> Bitmap Index Scan on idx_test_1 (cost=0.00..38.40 rows=4000 width=0) (actual time=0.043..0.043 rows=0 loops=1)
Index Cond: (test.info ~ '123'::text)
Buffers: shared hit=4
Planning time: 0.146 ms
Execution time: 0.072 ms
(9 rows)
On the other hand, the syntax of LIKE works well for both ASCII and wchar characters.
-- wchar
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like '%婐绷乂畳%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test (cost=13.28..25.30 rows=10 width=36) (actual time=0.042..0.042 rows=1 loops=1)
Output: id, info
Recheck Cond: (test.info ~~ '%婐绷乂畳%'::text)
Heap Blocks: exact=1
Buffers: shared hit=8
-> Bitmap Index Scan on idx_test_1 (cost=0.00..13.27 rows=10 width=0) (actual time=0.027..0.027 rows=1 loops=1)
Index Cond: (test.info ~~ '%婐绷乂畳%'::text)
Buffers: shared hit=7
Planning time: 0.110 ms
Execution time: 0.108 ms
(10 rows)
-- ascii
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~~ '%123%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test (cost=39.40..2897.60 rows=4000 width=36) (actual time=0.018..0.018 rows=0 loops=1)
Output: id, info
Recheck Cond: (test.info ~~ '%123%'::text)
Buffers: shared hit=4
-> Bitmap Index Scan on idx_test_1 (cost=0.00..38.40 rows=4000 width=0) (actual time=0.015..0.015 rows=0 loops=1)
Index Cond: (test.info ~~ '%123%'::text)
Buffers: shared hit=4
Planning time: 0.091 ms
Execution time: 0.046 ms
(9 rows)
The preceding two tests show that the operators used for LIKE and regular expressions are different.
List of operators
Schema | Name | Left arg type | Right arg type | Result type | Function | Description
------------+------+---------------+----------------+-------------+-------------+-------------------------
pg_catalog | ~~ | text | text | boolean | textlike | matches LIKE expression
pg_catalog | ~ | text | text | boolean | textregexeq | matches regular expression, case-sensitive
The operators correspond to textlike and textregexeq, respectively and the code for both is as follows.
To achieve the optimal search effect for fully fuzzy searches in the current scenario, it is recommended to use LIKE expression or ~~ expression. Do not use the syntax of regular expressions.
Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.
PostgreSQL Responds to Fuzzy or Regular Expression-based Searches within Seconds
digoal - December 11, 2019
digoal - December 11, 2019
digoal - October 23, 2018
digoal - September 12, 2019
digoal - September 12, 2019
digoal - December 11, 2019
Alibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreAn online MPP warehousing service based on the Greenplum Database open source program
Learn MoreOpenSearch helps develop intelligent search services.
Learn MoreAn intelligent image search service with product search and generic search features to help users resolve image search requests.
Learn MoreMore Posts by digoal