By Digoal
The goal of this article is to design a global ID assignment service using three different methods, all of which must satisfy the same set of requirements.
Let's walk through the three designs in turn.
In the first design, each group ID has its own sequence; the text and the sequence value are unique within a single group.
create or replace function get_per_gp_id(
  text,  -- sequence name prefix
  int    -- group ID, used as the sequence name suffix
) returns int8 as
$$
begin
  return nextval(($1||$2)::regclass);
exception when others then
  -- The sequence does not exist yet: create it, then return its first value
  execute 'create sequence if not exists '||$1||$2||' start with 0 minvalue 0';
  return nextval(($1||$2)::regclass);
end;
$$
language plpgsql strict;
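For illustration, the first call for a given group creates the per-group sequence and returns its first value; subsequent calls keep counting (this example assumes no sequence named seq_1 exists yet):

select get_per_gp_id('seq_', 1);  -- creates sequence seq_1 on the first call and returns 0
select get_per_gp_id('seq_', 1);  -- returns 1, 2, ... on subsequent calls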
create table tbl1(
  gid int,   -- group ID
  ts  text,  -- text
  sn  int8,  -- auto-increment sequence value
  unique(gid,ts),
  unique(gid,sn)
);
create or replace function ins1(
  int,   -- group ID
  text   -- text
) returns int8 as
$$
declare
  res int8;
begin
  -- Check whether the text already exists in this group
  select sn into res from tbl1 where gid=$1 and ts=$2;
  if found then
    return res;
  else
    -- If it does not exist, generate an ID
    insert into tbl1 (gid,ts,sn) values ($1, $2, get_per_gp_id('seq_', $1)) returning sn into res;
    return res;
  end if;
exception when others then
  -- An exception may be thrown when another parallel session inserts the same text or creates the sequence concurrently. Query again and return the SN.
  select sn into res from tbl1 where gid=$1 and ts=$2;
  if found then
    return res;
  else
    raise;
  end if;
end;
$$
language plpgsql strict;
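A quick, illustrative sanity check: calling ins1 repeatedly with the same group ID and text always returns the same SN (the group ID and texts below are arbitrary examples):

select ins1(2, 'hello');  -- assigns a group-local SN to 'hello' in group 2
select ins1(2, 'hello');  -- returns the same SN again
select ins1(2, 'world');  -- assigns the next SN within group 2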
In the second design there is no group ID; the text and the sequence value are globally unique.
create sequence seq_tbl2_sn start with 0 minvalue 0;

create table tbl2(
  ts text unique,                                                  -- text
  sn int8 default nextval('public.seq_tbl2_sn'::regclass) unique   -- sequence
);
create or replace function ins2(
  text   -- text
) returns int8 as
$$
declare
  res int8;
begin
  -- Check whether the text already exists
  select sn into res from tbl2 where ts=$1;
  if found then
    return res;
  else
    -- If it does not exist, generate an ID
    insert into tbl2 (ts) values ($1) returning sn into res;
    return res;
  end if;
exception when others then
  -- A unique violation may be thrown when another parallel session inserts the same text concurrently. Query again and return the SN.
  select sn into res from tbl2 where ts=$1;
  if found then
    return res;
  else
    raise;
  end if;
end;
$$
language plpgsql strict;
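The same idempotent behavior applies, illustratively, to ins2 (the texts are arbitrary examples):

select ins2('hello');  -- assigns and returns a globally unique SN for 'hello'
select ins2('hello');  -- returns the same SN again
select ins2('world');  -- assigns the next SN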
In the third design, all groups share one global sequence: the SN is globally unique, while (group ID, text) is unique within each group. The full INT4 range, from -2147483648 to 2147483647, covers about 4.29 billion values, so INT4 is sufficient if the dictionary holds 4 billion values or fewer; beyond that, use INT8.
create sequence seq_tbl_dict minvalue -2147483648 start with -2147483648;

create table tbl_dict(
  gid int2,                                                    -- group ID
  ts  text,                                                    -- text
  sn  int4 default nextval('public.seq_tbl_dict'::regclass),  -- sequence
  unique (gid,ts),
  unique (sn)
);
create or replace function get_sn(int2, text) returns int as
$$
declare
  res int;
begin
  -- Optimistic query first
  select sn into res from tbl_dict where gid=$1 and ts=$2;
  if found then
    return res;
  end if;
  -- Insert if not found
  insert into tbl_dict values($1,$2,nextval('public.seq_tbl_dict'::regclass)) on conflict (gid,ts) do nothing returning sn into res;
  if found then
    return res;
  else
    -- On an insertion conflict, query again and return the SN
    select sn into res from tbl_dict where gid=$1 and ts=$2;
    return res;
  end if;
end;
$$
language plpgsql strict;
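An illustrative call pattern (the function expects int2, so a literal group ID needs an explicit cast; the texts are arbitrary examples):

select get_sn(1::int2, 'hello');  -- assigns an SN, starting from -2147483648 on an empty dictionary
select get_sn(1::int2, 'hello');  -- the same (gid, text) pair always returns the same SN
select get_sn(2::int2, 'hello');  -- a different group gets its own entry and SN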
To assign IDs in batches, call the functions with a VALUES list (elided here):

select ins1(gid, ts) from (values (),(),.....()) as t(gid, ts);
select ins2(ts) from (values (),(),.....()) as t(ts);
Example and performance: assigning IDs to 100 text values takes about 2 milliseconds.
select ins1(id, 'test'||id) from generate_series(1,100) t(id);
...........
0
(100 rows)
Time: 1.979 ms
vi test1.sql
\set gid random(1,10)
\set ts random(1,100000000)
select ins1(:gid, md5(:ts::text));
pgbench -M prepared -n -r -P 1 -f ./test1.sql -c 56 -j 56 -T 120
transaction type: ./test1.sql
scaling factor: 1
query mode: prepared
number of clients: 56
number of threads: 56
duration: 120 s
number of transactions actually processed: 18082960
latency average = 0.232 ms
latency stddev = 0.517 ms
tps = 150680.114138 (including connections establishing)
tps = 150687.227354 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set gid random(1,10)
0.000 \set ts random(1,100000000)
0.230 select ins1(:gid, md5(:ts::text));
vi test2.sql
\set ts random(1,100000000)
select ins2(md5(:ts::text));
pgbench -M prepared -n -r -P 1 -f ./test2.sql -c 56 -j 56 -T 120
transaction type: ./test2.sql
scaling factor: 1
query mode: prepared
number of clients: 56
number of threads: 56
duration: 120 s
number of transactions actually processed: 11515008
latency average = 0.584 ms
latency stddev = 0.766 ms
tps = 95613.170828 (including connections establishing)
tps = 95618.249995 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set ts random(1,100000000)
0.582 select ins2(md5(:ts::text));
vi test3.sql
\set gid random(1,10)
\set ts random(1,100000000)
select get_sn(:gid, md5(:ts::text));
pgbench -M prepared -n -r -P 1 -f ./test3.sql -c 56 -j 56 -T 120
transaction type: ./test3.sql
scaling factor: 1
query mode: prepared
number of clients: 56
number of threads: 56
duration: 120 s
number of transactions actually processed: 7665708
latency average = 0.877 ms
latency stddev = 0.666 ms
tps = 63868.058538 (including connections establishing)
tps = 63875.166407 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.001 \set gid random(1,10)
0.000 \set ts random(1,100000000)
0.875 select get_sn(:gid, md5(:ts::text));
postgres=# select * from tbl_dict limit 10;
gid | ts | sn
-----+----------------------------------+-------------
9 | 8021bdb598f73577a063b50bdf0cef31 | -2147483648
3 | e1988c3c7a80dcd1b1c1bdcf2ac31fe7 | -2147483646
7 | 6ee09b73df8ae9bb97a4ebd4c51bd212 | -2147483647
1 | fa8303da6ea2b6e995a1e090fb9cd9f2 | -2147483645
7 | ca1c614104f1ad3af92d8d9a2911a5b6 | -2147483643
8 | 4641dd1162f46e8be5f643facc85df94 | -2147483644
6 | 88250e10f0d27cdebbf5c5eb4a7032a3 | -2147483641
2 | 5718da726fd20d8fd12d56e9bf2d7e9e | -2147483642
1 | 687e553016fe6bd1dba3ca6126b8b5b8 | -2147483639
10 | a4707645d604dd1ad9ba96ff303cf9d9 | -2147483638
(10 rows)
Why gaps can appear: sequence values are consumed irreversibly. Even if a transaction rolls back, the sequence values it consumed are not returned to the sequence.
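A minimal illustration of how such a gap appears (assuming these example texts are not already in the dictionary):

begin;
insert into tbl_dict (gid, ts) values (1, 'rolled_back_text');  -- nextval() consumes one sequence value
rollback;                                                       -- the row is gone, but the value is not returned
insert into tbl_dict (gid, ts) values (1, 'committed_text');    -- receives the following value, leaving a gap in sn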
The following checks on tbl1 and tbl2 confirm that the assigned IDs are contiguous, or very nearly so, which meets the requirements:
postgres=# select gid,count(*),min(sn),max(sn),((max(sn)+1)/count(*)::float8-1)*100||' %' from tbl1 group by gid;
gid | count | min | max | ?column?
-----+---------+-----+---------+----------
1 | 1790599 | 0 | 1790598 | 0 %
2 | 1793384 | 0 | 1793383 | 0 %
3 | 1791533 | 0 | 1791532 | 0 %
4 | 1792755 | 0 | 1792754 | 0 %
5 | 1793897 | 0 | 1793896 | 0 %
6 | 1794786 | 0 | 1794785 | 0 %
7 | 1792282 | 0 | 1792281 | 0 %
8 | 1790630 | 0 | 1790629 | 0 %
9 | 1791303 | 0 | 1791302 | 0 %
10 | 1790307 | 0 | 1790306 | 0 %
(10 rows)
postgres=# select count(*),min(sn),max(sn),((max(sn)+1)/count(*)::float8-1)*100||' %' from tbl2;
count | min | max | ?column?
----------+-----+----------+------------------------
10877124 | 0 | 10877128 | 4.59680334685686e-05 %
(1 row)
Once the dictionary is fully populated, a call simply looks up and returns the existing SN. You can verify this by slightly modifying the stress test script so that the key space is small and quickly exhausted:
vi test1.sql
\set gid random(1,10)
\set ts random(1,10000)
select ins1(:gid, md5(:ts::text));
pgbench -M prepared -n -r -P 1 -f ./test1.sql -c 56 -j 56 -T 120
transaction type: ./test1.sql
scaling factor: 1
query mode: prepared
number of clients: 56
number of threads: 56
duration: 120 s
number of transactions actually processed: 69229025
latency average = 0.097 ms
latency stddev = 0.040 ms
tps = 574906.288558 (including connections establishing)
tps = 575063.117108 (excluding connections establishing)
statement latencies in milliseconds:
0.002 \set gid random(1,10)
0.001 \set ts random(1,10000)
0.098 select ins1(:gid, md5(:ts::text));
You can also query the dictionary directly with a plain SELECT:
vi test3.sql
\set gid random(1,10)
\set ts random(1,10000)
select * from tbl1 where gid=:gid and ts=md5(:ts::text);
pgbench -M prepared -n -r -P 1 -f ./test3.sql -c 56 -j 56 -T 120
transaction type: ./test3.sql
scaling factor: 1
query mode: prepared
number of clients: 56
number of threads: 56
duration: 120 s
number of transactions actually processed: 90985807
latency average = 0.074 ms
latency stddev = 0.009 ms
tps = 758067.503368 (including connections establishing)
tps = 758109.672642 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set gid random(1,10)
0.001 \set ts random(1,10000)
0.074 select * from tbl1 where gid=:gid and ts=md5(:ts::text);
The database itself has not been tuned, and the test ran in an ECS virtual machine environment, so there is room for further improvement. You can also shard the database by GID and easily reach 1 million QPS.
Using UDFs, sequences, and other PostgreSQL features, you can implement the global ID assignment service described at the beginning of this article.
Note that the PostgreSQL instance used here runs on an ECS virtual machine, whose read performance is only about half that of a physical machine; on physical hardware, a read performance of 1 million QPS is easily achievable.
For future scalability, you can distribute GIDs across different instances to scale out horizontally: each single PostgreSQL instance can deliver around 1 million read QPS, and N instances deliver roughly N times that. A minimal routing sketch follows.
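A minimal sketch of GID-based routing (the pick_shard helper and the shard count are illustrative assumptions, not part of the designs above); the application or a proxy layer would connect to the instance returned here:

-- Hypothetical helper: map a group ID to one of N shards (PostgreSQL instances)
create or replace function pick_shard(gid int, nshards int default 4) returns int as
$$
  select gid % nshards;  -- shard index in the range [0, nshards - 1]
$$
language sql immutable;

select pick_shard(7);  -- with 4 shards, group 7 is routed to shard 3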
Why not use a hash function to generate the global dictionary? Wouldn't that be faster? The reason is, again, global uniqueness: hash functions can collide, and even a 64-bit (INT8) hash cannot guarantee global uniqueness or a one-to-one mapping between each (group ID, text) pair and its ID. Another reason is that the dictionary grows only with the number of unique values and leaves essentially no gaps in the sequence, so a smaller integer type (INT4 in this example) is sufficient; a smaller integer means smaller storage and faster access.
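As a rough, illustrative birthday-bound estimate (not part of the original designs): with 4 billion distinct texts hashed into a 64-bit space, the probability of at least one collision is approximately 1 - exp(-n^2 / 2^65), which you can compute directly:

select 1 - exp(-(4e9 ^ 2) / (2 * 2.0 ^ 64)) as approx_collision_probability;
-- ≈ 0.35, i.e. roughly a one-in-three chance of at least one collision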
Postgrespro has introduced a kernel-level dictionary for JSON, which frees the business layer from having to maintain the dictionary itself.