By digoal
First, an overview of the data structures used in this article.
Simplified Example:
create extension hll;
create table tab1 (
vid int8, -- contains duplicate values; each vid may have about 100 rows in the table
score numeric, -- the same vid has multiple rows, all with exactly the same score; different vids may or may not share a score
other_col int -- other attributes
);
Write 100 million test data records:
insert into tab1 select generate_series(1,1000000),
generate_series(1,1000000)/10000000000000::numeric ,
random()*10000 from generate_series(1,100);
Write several duplicate score records with different vid values:
insert into tab1 select generate_series(1000001,1000009),
(1000000-9)/10000000000000::numeric,
random()*10000 from generate_series(1,100);
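As a quick sanity check (not part of the original walkthrough), we can confirm that ten different vids now share the same score:
select vid, score, count(*)
from tab1
where score = (1000000-9)/10000000000000::numeric
group by vid, score
order by vid;
-- expected: vid 999991 plus vids 1000001..1000009, 100 rows each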
Create table tab2, which stores unique vid sets in hll values used as filter conditions for the test queries in this article:
create table tab2 (uid int8, vids hll);
The vids already contained in the hll value are filtered out during queries:
insert into tab2 values
(1,
hll_empty() || hll_hash_bigint(1000000) || hll_hash_bigint(1000000-1) || hll_hash_bigint(1000000-5) || hll_hash_bigint(1000000-8)
);
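Because appending an element that is already present does not change an hll sketch, the expression (vids || hll_hash_bigint(x) <> vids) works as an approximate membership test: true means x is new, false means it was already added. Note that hll is probabilistic, so the test can occasionally report false for an element that was never inserted. A minimal check against the value just stored:
select (vids || hll_hash_bigint(1000000)) <> vids as is_new_1000000,
       (vids || hll_hash_bigint(999999))  <> vids as is_new_999999
from tab2 where uid = 1;
-- is_new_1000000 = false (already in the sketch); is_new_999999 = true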
Query Requirements:
1. Return the top 20 rows sorted by score in descending order, deduplicated on (score, vid).
2. Exclude any vid already contained in the user's hll value in tab2.
Create an index:
create index idx_tab1_1 on tab1 (score desc,vid);
The first requirement can be queried in either of these ways:
select distinct on (score,vid) vid,score from tab1 order by score desc,vid limit 20;
select vid,score from tab1 group by score,vid order by score desc,vid limit 20;
vid | score
---------+----------------------------
1000000 | 0.000000100000000000000000
999999 | 0.000000099999900000000000
999998 | 0.000000099999800000000000
999997 | 0.000000099999700000000000
999996 | 0.000000099999600000000000
999995 | 0.000000099999500000000000
999994 | 0.000000099999400000000000
999993 | 0.000000099999300000000000
999992 | 0.000000099999200000000000
999991 | 0.000000099999100000000000
1000001 | 0.000000099999100000000000
1000002 | 0.000000099999100000000000
1000003 | 0.000000099999100000000000
1000004 | 0.000000099999100000000000
1000005 | 0.000000099999100000000000
1000006 | 0.000000099999100000000000
1000007 | 0.000000099999100000000000
1000008 | 0.000000099999100000000000
1000009 | 0.000000099999100000000000
999990 | 0.000000099999000000000000
(20 rows)
postgres=> explain (analyze,verbose,timing,costs,buffers)
select distinct on (score,vid) vid,score
from tab1 order by score desc,vid limit 20;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.57..70.50 rows=20 width=16) (actual time=0.017..1.041 rows=20 loops=1)
Output: vid, score
Buffers: shared hit=1076
-> Result (cost=0.57..34965244.98 rows=10000090 width=16) (actual time=0.016..1.035 rows=20 loops=1)
Output: vid, score
Buffers: shared hit=1076
-> Unique (cost=0.57..34965244.98 rows=10000090 width=16) (actual time=0.015..1.030 rows=20 loops=1)
Output: score, vid
Buffers: shared hit=1076
-> Index Only Scan using idx_tab1_1 on public.tab1 (cost=0.57..34465240.50 rows=100000896 width=16) (actual time=0.013..0.683 rows=1901 loops=1)
Output: score, vid
Heap Fetches: 1901
Buffers: shared hit=1076
Planning Time: 0.092 ms
Execution Time: 1.062 ms
(15 rows)
postgres=> explain (analyze,verbose,timing,costs,buffers)
select vid,score from tab1
group by score,vid order by score desc,vid limit 20;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.57..70.50 rows=20 width=16) (actual time=0.016..1.037 rows=20 loops=1)
Output: vid, score
Buffers: shared hit=1076
-> Group (cost=0.57..34965244.98 rows=10000090 width=16) (actual time=0.016..1.032 rows=20 loops=1)
Output: vid, score
Group Key: tab1.score, tab1.vid
Buffers: shared hit=1076
-> Index Only Scan using idx_tab1_1 on public.tab1 (cost=0.57..34465240.50 rows=100000896 width=16) (actual time=0.014..0.692 rows=1901 loops=1)
Output: score, vid
Heap Fetches: 1901
Buffers: shared hit=1076
Planning Time: 0.091 ms
Execution Time: 1.055 ms
(13 rows)
For the second requirement, the SQL statement is listed below:
explain (analyze,verbose,timing,costs,buffers)
with tmp as (select vids from tab2 where uid=1)
select t1.vid,t1.score from tab1 t1,tmp
where (tmp.vids || hll_hash_bigint(t1.vid) <> tmp.vids)
group by t1.score,t1.vid
order by t1.score desc,t1.vid limit 20;
vid | score
---------+----------------------------
999998 | 0.000000099999800000000000
999997 | 0.000000099999700000000000
999996 | 0.000000099999600000000000
999994 | 0.000000099999400000000000
999993 | 0.000000099999300000000000
999991 | 0.000000099999100000000000
1000001 | 0.000000099999100000000000
1000002 | 0.000000099999100000000000
1000003 | 0.000000099999100000000000
1000004 | 0.000000099999100000000000
1000005 | 0.000000099999100000000000
1000006 | 0.000000099999100000000000
1000007 | 0.000000099999100000000000
1000008 | 0.000000099999100000000000
1000009 | 0.000000099999100000000000
999990 | 0.000000099999000000000000
999989 | 0.000000099998900000000000
999988 | 0.000000099998800000000000
999987 | 0.000000099998700000000000
999986 | 0.000000099998600000000000
(20 rows)
Limit (cost=25.57..133.47 rows=20 width=16) (actual time=0.626..3.477 rows=20 loops=1)
Output: t1.vid, t1.score
Buffers: shared hit=1475 read=4 dirtied=1
I/O Timings: read=0.619
CTE tmp
-> Seq Scan on public.tab2 (cost=0.00..25.00 rows=6 width=32) (actual time=0.309..0.310 rows=1 loops=1)
Output: tab2.vids
Filter: (tab2.uid = 1)
Buffers: shared read=1 dirtied=1
I/O Timings: read=0.299
-> Group (cost=0.57..53950415.15 rows=10000090 width=16) (actual time=0.625..3.471 rows=20 loops=1)
Output: t1.vid, t1.score
Group Key: t1.score, t1.vid
Buffers: shared hit=1475 read=4 dirtied=1
I/O Timings: read=0.619
-> Nested Loop (cost=0.57..50965388.40 rows=597005349 width=16) (actual time=0.624..3.098 rows=1901 loops=1)
Output: t1.vid, t1.score
Join Filter: ((tmp.vids || hll_hash_bigint(t1.vid, 0)) <> tmp.vids)
Rows Removed by Join Filter: 400
Buffers: shared hit=1475 read=4 dirtied=1
I/O Timings: read=0.619
-> Index Only Scan using idx_tab1_1 on public.tab1 t1 (cost=0.57..34465240.50 rows=100000896 width=16) (actual time=0.012..1.190 rows=2301 loops=1)
Output: t1.score, t1.vid
Heap Fetches: 2301
Buffers: shared hit=1475 read=3
I/O Timings: read=0.320
-> CTE Scan on tmp (cost=0.00..0.12 rows=6 width=32) (actual time=0.000..0.000 rows=1 loops=2301)
Output: tmp.vids
Buffers: shared read=1 dirtied=1
I/O Timings: read=0.299
Planning Time: 0.164 ms
Execution Time: 3.541 ms
(32 rows)
Two problems stand out in this plan:
1. The join is repeated too many times: the CTE is rescanned once for every row produced by the index scan, which wastes resources.
loops=2301
2. Too many blocks are scanned: 1,475.
Every vid has many duplicate rows, so the grouping must read all heap tuples belonging to each group it emits (up to the limit).
The larger the limit, the greater the waste.
The SQL appears to execute quickly, but it wastes a lot of work (especially when there are many duplicate vid values or the limit is large). Can it be optimized further, especially under high concurrency?
Plain recursion does not seem to work here, because the same score can map to multiple vid values, so both score and vid must be considered. A single boundary condition works best for the recursion, where the boundary is the value each step jumps past (similar to a skip index scan); a sketch of that single-column pattern follows below.
If it were guaranteed that different vids never share a score, the jump boundary could be score alone, without vid.
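For reference, a minimal sketch of that single-column jump pattern (a loose index scan over score alone); note this is only correct under the assumption that no two vids share a score, which does not hold for our test data:
with recursive s as (
  (select score from tab1 order by score desc limit 1)
  union all
  select (select t.score from tab1 t
          where t.score < s.score
          order by t.score desc limit 1)  -- jump to the next smaller score
  from s where s.score is not null
)
select score from s where score is not null limit 20;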
Method:
We can fold vid into score as a low-order weight, using an expression, without changing the original sort order by score.
For example, if score is always greater than 1/1e21, we can place the vid weight in the decimal positions beyond 1e-21. The precision of the data type must be high enough to store the weighted decimal places. Here we use numeric with no declared precision, so values accurate to 1e-25 are fine; if a precision were declared, it would have to allow at least 25 decimal places.
The vid value must be converted to a number. If vid is not numeric, one of PostgreSQL's built-in hash functions can convert it, followed by a modulo operation (which controls both the vid collision space and the number of decimal places needed to store the weight).
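For instance (an illustrative assumption, not part of the original example), if vid were a text column, the built-in hashtext() function could map it into the same weighting scheme:
select mod(abs(hashtext('vid-abc')), 9999)::numeric / 1e25 as weight;
-- hashtext() returns an int4; abs() plus mod(..., 9999) confines the
-- collision space to 0..9998, which fits in 4 decimal places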
In the following code, the modulus 9999 requires 4 decimal places to store the weight, so the weight is divided by 1e25 (21+4) when creating the index:
create index idx_tab1_2 on tab1 ( (score + (mod(vid,9999)/1e25)) );
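To make the weighting concrete, here is a quick worked example using values from the test data above (vid = 1000001, score = 999991/1e13):
select (999991/10000000000000::numeric) + (mod(1000001,9999)/1e25) as weighted_key;
-- mod(1000001, 9999) = 101, so the key is 0.0000000999991000000000101:
-- the score digits are intact, and vid only perturbs digits beyond 1e-21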
explain (analyze,verbose,timing,costs,buffers)
with recursive tmp as ( -- recursive CTE
(
select array[vid::numeric, score, (score + (mod(vid,9999)/1e25))] as r
from tab1
order by (score + (mod(vid,9999)/1e25)) desc limit 1 -- fetch the single row with the largest weighted score
)
union all
(
select
(select array[vid::numeric, score, (score + (mod(vid,9999)/1e25))] as r from tab1 t1
where (t1.score + (mod(t1.vid,9999)/1e25)) < (tmp.r)[3] -- jump to the next largest weighted score; rows with the same score but a different vid are also fetched (only when their vid modulo values differ)
and (t1.score + (mod(t1.vid,9999)/1e25)) is not null
order by (t1.score + (mod(t1.vid,9999)/1e25)) desc limit 1 -- fetch one row
)
from tmp where (tmp.r)[3] is not null
)
),
tmp2 as
(
select vids from tab2 where uid = 1
)
select (v.r)[1]::int8 as vid, (v.r)[2] as score from tmp v, tmp2 t
where (t.vids || hll_hash_bigint(((v.r)[1])::int8) <> t.vids)
limit 20
;
Defect:
When multiple vids share the same score and also collide on mod(vid, 9999), only one of those vids is retrieved; see the demonstration below.
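A hypothetical demonstration of such a collision (these vids are not in the test data):
select mod(1, 9999) as m1, mod(10000, 9999) as m2;
-- both return 1: if vids 1 and 10000 shared the same score, their weighted
-- keys would be identical, and the strict '<' boundary in the recursion
-- would step past one of them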
vid | score
---------+----------------------------
999998 | 0.000000099999800000000000
999997 | 0.000000099999700000000000
999996 | 0.000000099999600000000000
999994 | 0.000000099999400000000000
999993 | 0.000000099999300000000000
1000009 | 0.000000099999100000000000
1000008 | 0.000000099999100000000000
1000007 | 0.000000099999100000000000
1000006 | 0.000000099999100000000000
1000005 | 0.000000099999100000000000
1000004 | 0.000000099999100000000000
1000003 | 0.000000099999100000000000
1000002 | 0.000000099999100000000000
1000001 | 0.000000099999100000000000
999991 | 0.000000099999100000000000
999990 | 0.000000099999000000000000
999989 | 0.000000099998900000000000
999988 | 0.000000099998800000000000
999987 | 0.000000099998700000000000
999986 | 0.000000099998600000000000
(20 rows)
Limit (cost=117.68..118.37 rows=20 width=40) (actual time=0.077..0.336 rows=20 loops=1)
Output: ((v.r[1])::bigint), (v.r[2])
Buffers: shared hit=121
CTE tmp
-> Recursive Union (cost=0.57..92.68 rows=101 width=32) (actual time=0.021..0.292 rows=24 loops=1)
Buffers: shared hit=120
-> Subquery Scan on "*SELECT* 1" (cost=0.57..0.88 rows=1 width=32) (actual time=0.021..0.022 rows=1 loops=1)
Output: "*SELECT* 1".r
Buffers: shared hit=5
-> Limit (cost=0.57..0.87 rows=1 width=64) (actual time=0.020..0.021 rows=1 loops=1)
Output: (ARRAY[(tab1.vid)::numeric, tab1.score, (tab1.score + ((mod(tab1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))]), ((tab1.score + ((mod(tab1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric)))
Buffers: shared hit=5
-> Index Scan Backward using idx_tab1_2 on public.tab1 (cost=0.57..29864985.67 rows=100000896 width=64) (actual time=0.020..0.020 rows=1 loops=1)
Output: ARRAY[(tab1.vid)::numeric, tab1.score, (tab1.score + ((mod(tab1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))], (tab1.score + ((mod(tab1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))
Buffers: shared hit=5
-> WorkTable Scan on tmp (cost=0.00..8.98 rows=10 width=32) (actual time=0.011..0.011 rows=1 loops=23)
Output: (SubPlan 1)
Filter: (tmp.r[3] IS NOT NULL)
Buffers: shared hit=115
SubPlan 1
-> Limit (cost=0.57..0.88 rows=1 width=64) (actual time=0.010..0.010 rows=1 loops=23)
Output: (ARRAY[(t1.vid)::numeric, t1.score, (t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))]), ((t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric)))
Buffers: shared hit=115
-> Index Scan Backward using idx_tab1_2 on public.tab1 t1 (cost=0.57..10296397.54 rows=33166964 width=64) (actual time=0.010..0.010 rows=1 loops=23)
Output: ARRAY[(t1.vid)::numeric, t1.score, (t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))], (t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))
Index Cond: (((t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric)) < (tmp.r)[3]) AND ((t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric)) IS NOT NULL))
Buffers: shared hit=115
CTE tmp2
-> Seq Scan on public.tab2 (cost=0.00..25.00 rows=6 width=32) (actual time=0.005..0.005 rows=1 loops=1)
Output: tab2.vids
Filter: (tab2.uid = 1)
Buffers: shared hit=1
-> Nested Loop (cost=0.00..20.82 rows=603 width=40) (actual time=0.076..0.331 rows=20 loops=1)
Output: (v.r[1])::bigint, v.r[2]
Join Filter: ((t.vids || hll_hash_bigint((v.r[1])::bigint, 0)) <> t.vids)
Rows Removed by Join Filter: 4
Buffers: shared hit=121
-> CTE Scan on tmp2 t (cost=0.00..0.12 rows=6 width=32) (actual time=0.007..0.007 rows=1 loops=1)
Output: t.vids
Buffers: shared hit=1
-> CTE Scan on tmp v (cost=0.00..2.02 rows=101 width=32) (actual time=0.022..0.300 rows=24 loops=1)
Output: v.r
Buffers: shared hit=120
Planning Time: 0.224 ms
Execution Time: 0.386 ms
(45 rows)
Optimization Result:
| Metric | Before | After | Performance Improved |
|---|---|---|---|
| Number of loops | 2,301 | 23 | 100 times |
| Number of blocks scanned | 1,475 | 121 | 12 times |
| Time consumed | 3.541 ms | 0.386 ms | 9 times |
Technically minded users enjoy PostgreSQL more the more they use it, and it can also save the company a lot of money.