PolarDB's cloud-native computing-storage separation architecture offers cost-effective data storage, scalable operations, high-speed multi-machine parallel computing, and efficient data search and processing. By combining PolarDB with computing algorithms, it enables businesses to derive value from their data and turn it into productivity.
This article describes how open source PolarDB uses pg_trgm GIN indexes to implement efficient fuzzy queries like '%xxx%'.
The test environment used for this demonstration is macOS + Docker. For detailed instructions on PolarDB deployment, please refer to the following article: Simple Deployment of PolarDB
pg_trgm adds two spaces before and one space after the string, then splits the string into groups of three consecutive characters to generate a set of tokens. For example, hello is split into {" h"," he",ell,hel,llo,"lo "}
.
postgres=# select show_trgm('hello');
show_trgm
---------------------------------
{" h"," he",ell,hel,llo,"lo "}
(1 row)
When you create a GIN index for these tokens and perform a fuzzy search (or even support a regular search), you can apply tokenization to the conditions of the target string without adding spaces before and after the target string, unless prefix or suffix restrictions are specified. This allows you to quickly match the target row using the GIN index.
1. Create a table and generate 2 million test texts.
create table tbl (id int, info text);
insert into tbl select id, md5(random()::text) from generate_series(1,1000000) id;
insert into tbl select id, md5(random()::text) from generate_series(1,1000000) id;
2. In the absence of indexes, a full table scan is required to perform a fuzzy query, which is time-consuming.
explain (analyze,verbose,timing,costs,buffers) select * from tbl where info like '%abcd%';
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where info like '%abcd%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Seq Scan on public.tbl (cost=0.00..41665.50 rows=200 width=37) (actual time=2.505..522.958 rows=851 loops=1)
Output: id, info
Filter: (tbl.info ~~ '%abcd%'::text)
Rows Removed by Filter: 1999149
Buffers: shared hit=16645 read=22 dirtied=8334
Planning Time: 1.643 ms
Execution Time: 523.138 ms
(7 rows)
3. Create the pg_trgm plug-in and the GIN index.
postgres=# create extension pg_trgm ;
CREATE EXTENSION
create index on tbl using gin (info gin_trgm_ops);
4. Use pg_trgm GIN index to realize high-efficiency fuzzy query like '%xxx%'
.
explain (analyze,verbose,timing,costs,buffers) select * from tbl where info like '%abcd%';
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where info like '%abcd%';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.tbl (cost=29.55..762.82 rows=200 width=37) (actual time=2.445..3.962 rows=851 loops=1)
Output: id, info
Recheck Cond: (tbl.info ~~ '%abcd%'::text)
Rows Removed by Index Recheck: 96
Heap Blocks: exact=926
Buffers: shared hit=946
-> Bitmap Index Scan on tbl_info_idx (cost=0.00..29.50 rows=200 width=0) (actual time=2.287..2.288 rows=947 loops=1)
Index Cond: (tbl.info ~~ '%abcd%'::text)
Buffers: shared hit=20
Planning Time: 0.239 ms
Execution Time: 4.112 ms
(11 rows)
We can see that the performance is improved by more than 100 times.
Open Source PolarDB Enhances Time Series Data with TimescaleDB
Open Source PolarDB Uses duckdb_fdw to Support Parquet and Efficient OLAP
digoal - November 28, 2023
digoal - December 11, 2019
digoal - September 12, 2019
digoal - December 11, 2019
digoal - December 11, 2019
digoal - December 11, 2019
Alibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreAlibaba Cloud PolarDB for Xscale (PolarDB-X) is a cloud-native high-performance distributed database service independently developed by Alibaba Cloud.
Learn MoreAlibaba Cloud PolarDB for MySQL is a cloud-native relational database service 100% compatible with MySQL.
Learn MoreMulti-source metrics are aggregated to monitor the status of your business and services in real time.
Learn MoreMore Posts by digoal