When the data size for a similarity search exceeds the processing capability of a single machine, you need to horizontally split the data to improve the search capability. PostgreSQL supports such similarity search scenarios.
Alternatively, you can use Alibaba Cloud's POLARDB for PostgreSQL. Similar to Oracle RAC, POLARDB for PostgreSQL allows you to add computing nodes. This solution is superior to horizontal database sharing because it allows to share and store data without splitting.
In the case of a horizontal database sharing scenario, the critical question is how to perform a parallel query when a database is divided into multiple sub-databases.
You can perform a parallel query by using DBLink asynchronous calls. The architecture is designed as follows.
In fact, many examples of running a parallel query by using DBLink asynchronous calls are available.
The following section demonstrates how to implement parallel similarity search in multiple databases using asynchronous calls.
In this example, four local databases are used as remote databases. These databases can be installed on remote computers. The approach in this example is just for test convenience.
Following shows the name of the local database.
postgres
Name of the remote databases is as follows.
db0
db1
db2
db3
Step 1. First, create a user and test databases.
create role test login encrypted password 'secret';
create database db0 with owner test;
create database db1 with owner test;
create database db2 with owner test;
create database db3 with owner test;
Step 2. Create a DBLink plug-in a local database.
create extension dblink;
Step 3. Create a server to connect to the remote databases.
CREATE SERVER db0 FOREIGN DATA WRAPPER dblink_fdw OPTIONS (hostaddr '127.0.0.1', dbname 'db0');
CREATE SERVER db1 FOREIGN DATA WRAPPER dblink_fdw OPTIONS (hostaddr '127.0.0.1', dbname 'db1');
CREATE SERVER db2 FOREIGN DATA WRAPPER dblink_fdw OPTIONS (hostaddr '127.0.0.1', dbname 'db2');
CREATE SERVER db3 FOREIGN DATA WRAPPER dblink_fdw OPTIONS (hostaddr '127.0.0.1', dbname 'db3');
Step 4. Configure the user password used for connecting to the remote databases. The user password can only be used to connect to the remote databases but not the local database.
CREATE USER MAPPING FOR postgres SERVER db0 OPTIONS (user 'test', password 'secret');
CREATE USER MAPPING FOR postgres SERVER db1 OPTIONS (user 'test', password 'secret');
CREATE USER MAPPING FOR postgres SERVER db2 OPTIONS (user 'test', password 'secret');
CREATE USER MAPPING FOR postgres SERVER db3 OPTIONS (user 'test', password 'secret');
Create a test table in each remote database, write test data to the table, and create a similarity search function.
Step 1. Create the required plug-in. (Note that the database name in the following script must be replaced with the actual database name.)
\c db3 postgres
create extension pg_trgm;
create extension dblink;
Step 2. Create the index on which the primary table and similarity search depends. (Note that the database name in the following script must be replaced with the actual database name.)
\c db3 test
create unlogged table tbl(id int primary key, info text);
create index idx_tbl_info on tbl using gin (info gin_trgm_ops);
-- alter table tbl set (parallel_workers =64);
Step 3. Now, Create partitions. The following is only for the test.
do language plpgsql $$
declare
begin
for i in 0..63
loop
execute format('drop table if exists tbl%s ', i);
execute format('create unlogged table tbl%s (like tbl including all) inherits(tbl)', i);
-- 提前设置好表级并行度,方便后面做并行测试
-- execute format('alter table tbl%s set (parallel_workers =64)', i);
end loop;
end;
$$;
Step 4. Create a connection function as shown below.
create or replace function conn(
name, -- dblink名字
text -- 连接串,URL
) returns void as $$
declare
begin
perform dblink_connect($1, $2);
return;
exception when others then
return;
end;
$$ language plpgsql strict;
Step 5. Next, create a function that generates a random function.
- Generate random Chinese strings.
create or replace function gen_hanzi(int) returns text as $$
declare
res text;
begin
if $1 >=1 then
select string_agg(chr(19968+(random()*20901)::int), '') into res from generate_series(1,$1);
return res;
end if;
return null;
end;
$$ language plpgsql strict;
Step 6. Write the test data and random text. (Note that the database name in the following script must be replaced with the actual database name.)
do language plpgsql $$
declare
dbname name := 'db3';
begin
for i in 0..63
loop
perform conn('link'||i, 'hostaddr=127.0.0.1 user=test password=secret dbname='||dbname);
perform dblink_send_query('link'||i, format('insert into tbl%s select generate_series(1, 15625), gen_hanzi(64); analyze tbl%s;', i, i));
end loop;
end;
$$;
Step 7. Create a UDF for the similarity search.
create or replace function get_res(
text, -- 要按相似搜的文本
int8, -- 限制返回多少条
float4 default 0.3, -- 相似度阈值,低于这个值不再搜搜
float4 default 0.1 -- 相似度递减步长,直至阈值
) returns setof record as $$
declare
lim float4 := 1;
begin
-- 判定
if not ($3 <= 1 and $3 > 0) then
raise notice '$3 must >0 and <=1';
return;
end if;
if not ($4 > 0 and $4 < 1) then
raise notice '$4 must >0 and <=1';
return;
end if;
loop
-- 设置相似度阈值
perform set_limit(lim);
-- 查看当前阈值下,有没有相似记录
perform similarity(info, $1) as sml, * from tbl where info % $1 limit 1;
-- 如果有,则返回N条
if found then
return query select similarity(info, $1) as sml, * from tbl where info % $1 order by sml desc limit $2;
return;
end if;
-- 否则继续,降低阈值
-- 当阈值小于0.3时,不再降阈值搜索,认为没有相似。
if lim < $3 then
return;
else
lim := lim - $4;
end if;
end loop;
end;
$$ language plpgsql strict;
Create a function that sets up a remote connection.
create or replace function conn(
name, -- dblink名字
text -- 连接串,URL
) returns void as $$
declare
begin
perform dblink_connect($1, $2);
return;
exception when others then
return;
end;
$$ language plpgsql strict;
Define UDF1 to return the cursor. If a large number of records are returned, it is recommended to use the cursor because PLPGSQL returns the records only after all records are retrieved. The response time (RT) is long when the records are returned. Refer the following example.
create or replace function get_res_cursor(
text, -- 要按相似搜的文本
int8, -- 限制返回多少条
float4 default 0.3, -- 相似度阈值,低于这个值不再搜搜
float4 default 0.1 -- 相似度递减步长,直至阈值
) returns setof refcursor as $$
declare
i int := 1;
ref refcursor[];
res refcursor;
dbname name[] := array['db0', 'db1', 'db2', 'db3']; -- 定义集群
db name;
begin
foreach db in array dbname
loop
ref[i] := 'link'||i;
res := ref[i];
perform conn('link'||i, db);
perform dblink_open('link'||i, 'link'||i, format('select * from get_res(%L, %s, %s, %s) as t(sml real, id int, info text)', $1, $2, $3, $4));
return next res;
i := i+1;
end loop;
end;
$$ language plpgsql strict;
The following example illustrates the usage.
postgres=# begin;
BEGIN
Time: 0.045 ms
postgres=# select * from get_res_cursor('怮媕苸淏倍椡帪暀虻爴荡巒讉輶魂馜虑范噞蠭鲧烳渃麠钸趥剘偣瑴鑪颭蚢佚簀哌內霡擷槧緸褫齈跊甏軙襧漆疅泅睤帍槇驗縐棂', 10, 0.1, 0.05);
get_res_cursor
----------------
link1
link2
link3
link4
(4 rows)
Time: 18.624 ms
postgres=# select * from dblink_fetch('link1','link1',10) as t(sml real, id int, info text);
sml | id | info
-----+----+------
(0 rows)
Time: 219.972 ms
postgres=# select * from dblink_fetch('link1','link1',10) as t(sml real, id int, info text);
sml | id | info
-----+----+------
(0 rows)
Time: 0.252 ms
postgres=# select * from dblink_fetch('link2','link2',10) as t(sml real, id int, info text);
sml | id | info
-----+----+------
(0 rows)
Time: 215.891 ms
postgres=# select * from dblink_fetch('link3','link3',10) as t(sml real, id int, info text);
sml | id | info
-----+----+------
(0 rows)
Time: 215.188 ms
postgres=# select * from dblink_fetch('link4','link4',10) as t(sml real, id int, info text);
sml | id | info
----------+----+----------------------------------------------------------------------------------------------------------------------------------
0.779412 | 1 | 递陊怮媕苸淏倍椡帪暀虻爴荡巒讉輶魂馜虑范噞蠭鲧烳渃麠钸趥剘偣瑴鑪颭蚢佚簀哌內霡擷槧緸褫齈跊甏軙襧漆疅泅睤帍槇驗縐棂轪氐洚重銄懟諔
(1 row)
Time: 106.692 ms
Define UDF2 to return the record. It is recommended that you specify the maximum number of returned records because PLPGSQL returns the records only after all records are retrieved. Refer the following example.
create or replace function get_res_record(
text, -- 要按相似搜的文本
int8, -- 限制返回多少条
float4 default 0.3, -- 相似度阈值,低于这个值不再搜搜
float4 default 0.1 -- 相似度递减步长,直至阈值
) returns setof record as $$
declare
i int;
ref refcursor[];
res refcursor;
dbname name[] := array['db0', 'db1', 'db2', 'db3']; -- 定义集群
db name;
begin
i := 1;
foreach db in array dbname
loop
perform conn('link'||i, db);
perform 1 from dblink_get_result('link'||i) as t(sml real, id int, info text);
perform dblink_send_query('link'||i, format('select * from get_res(%L, %s, %s, %s) as t(sml real, id int, info text)', $1, $2, $3, $4));
i := i+1;
end loop;
i := 1;
foreach db in array dbname
loop
return query SELECT * FROM dblink_get_result('link'||i) as t(sml real, id int, info text);
i := i+1;
end loop;
end;
$$ language plpgsql strict;
The following example illustrates the usage.
postgres=# select * from get_res_record('怮媕苸淏倍椡帪暀虻爴荡巒讉輶魂馜虑范噞蠭鲧烳渃麠钸趥剘偣瑴鑪颭蚢佚簀哌內霡擷槧緸褫齈跊甏軙襧漆疅泅睤帍槇驗縐棂', 10, 0.77, 0.4) as (sml real, id int, info text);
sml | id | info
----------+----+----------------------------------------------------------------------------------------------------------------------------------
0.779412 | 1 | 递陊怮媕苸淏倍椡帪暀虻爴荡巒讉輶魂馜虑范噞蠭鲧烳渃麠钸趥剘偣瑴鑪颭蚢佚簀哌內霡擷槧緸褫齈跊甏軙襧漆疅泅睤帍槇驗縐棂轪氐洚重銄懟諔
(1 row)
Time: 32.329 ms
Applying the method discussed in this article, you can use multiple PostgreSQL databases as one PostgreSQL database to linearly scale-out the parallel similarity search performance. Also, DBLink asynchronous calls can be used to linearly scale out the search performance of similar texts, without degrading the performance.
Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.
PostgreSQL Multi-meaning Array Retrieval and Condition Filtering
Shopping Guide System Optimization: Application of E-Commerce Content Deduplication and Filtering
digoal - February 5, 2020
digoal - October 12, 2022
digoal - December 14, 2018
digoal - July 4, 2019
digoal - August 20, 2021
digoal - March 20, 2019
Alibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreAn online MPP warehousing service based on the Greenplum Database open source program
Learn MoreAn on-demand database hosting service for PostgreSQL with automated monitoring, backup and disaster recovery capabilities
Learn MoreLeverage cloud-native database solutions dedicated for FinTech.
Learn MoreMore Posts by digoal