By digoal
First, an overview of the data structures used in this article.
Simplified Example:
create extension hll;
create table tab1 (
vid int8, -- contains duplicate values; each vid may have about 100 rows in the table
score numeric, -- the same vid has multiple rows, all with exactly the same score; different vids may or may not share a score
other_col int -- other attributes
);
Write 100 million test data records:
insert into tab1 select generate_series(1,1000000),
generate_series(1,1000000)/10000000000000::numeric ,
random()*10000 from generate_series(1,100);
Write several duplicate score records with different vid values:
insert into tab1 select generate_series(1000001,1000009),
(1000000-9)/10000000000000::numeric,
random()*10000 from generate_series(1,100);
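As a quick sanity check (not part of the original walkthrough), we can confirm that ten different vids now share the same score:
select vid, score, count(*)
from tab1
where score = (1000000-9)/10000000000000::numeric
group by vid, score
order by vid;
-- expected: vid 999991 plus vids 1000001..1000009, 100 rows each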
Create table tab2, which stores unique vid sets in hll values used as filter conditions for the test queries in this article:
create table tab2 (uid int8, vids hll);
The vids already contained in the hll value are filtered out during queries:
insert into tab2 values
(1,
hll_empty() || hll_hash_bigint(1000000) || hll_hash_bigint(1000000-1) || hll_hash_bigint(1000000-5) || hll_hash_bigint(1000000-8)
);
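Because appending an element that is already present does not change an hll sketch, the expression (vids || hll_hash_bigint(x) <> vids) works as an approximate membership test: true means x is new, false means it was already added. Note that hll is probabilistic, so the test can occasionally report false for an element that was never inserted. A minimal check against the value just stored:
select (vids || hll_hash_bigint(1000000)) <> vids as is_new_1000000,
       (vids || hll_hash_bigint(999999))  <> vids as is_new_999999
from tab2 where uid = 1;
-- is_new_1000000 = false (already in the sketch); is_new_999999 = true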
Query Requirements:
1. Return the top 20 rows sorted by score in descending order, deduplicated on (score, vid).
2. Exclude any vid already contained in the user's hll value in tab2.
Create an index:
create index idx_tab1_1 on tab1 (score desc,vid);
The first requirement can be queried in either of these ways:
select distinct on (score,vid) vid,score from tab1 order by score desc,vid limit 20;
select vid,score from tab1 group by score,vid order by score desc,vid limit 20;
vid | score
---------+----------------------------
1000000 | 0.000000100000000000000000
999999 | 0.000000099999900000000000
999998 | 0.000000099999800000000000
999997 | 0.000000099999700000000000
999996 | 0.000000099999600000000000
999995 | 0.000000099999500000000000
999994 | 0.000000099999400000000000
999993 | 0.000000099999300000000000
999992 | 0.000000099999200000000000
999991 | 0.000000099999100000000000
1000001 | 0.000000099999100000000000
1000002 | 0.000000099999100000000000
1000003 | 0.000000099999100000000000
1000004 | 0.000000099999100000000000
1000005 | 0.000000099999100000000000
1000006 | 0.000000099999100000000000
1000007 | 0.000000099999100000000000
1000008 | 0.000000099999100000000000
1000009 | 0.000000099999100000000000
999990 | 0.000000099999000000000000
(20 rows)
postgres=> explain (analyze,verbose,timing,costs,buffers)
select distinct on (score,vid) vid,score
from tab1 order by score desc,vid limit 20;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.57..70.50 rows=20 width=16) (actual time=0.017..1.041 rows=20 loops=1)
Output: vid, score
Buffers: shared hit=1076
-> Result (cost=0.57..34965244.98 rows=10000090 width=16) (actual time=0.016..1.035 rows=20 loops=1)
Output: vid, score
Buffers: shared hit=1076
-> Unique (cost=0.57..34965244.98 rows=10000090 width=16) (actual time=0.015..1.030 rows=20 loops=1)
Output: score, vid
Buffers: shared hit=1076
-> Index Only Scan using idx_tab1_1 on public.tab1 (cost=0.57..34465240.50 rows=100000896 width=16) (actual time=0.013..0.683 rows=1901 loops=1)
Output: score, vid
Heap Fetches: 1901
Buffers: shared hit=1076
Planning Time: 0.092 ms
Execution Time: 1.062 ms
(15 rows)
postgres=> explain (analyze,verbose,timing,costs,buffers)
select vid,score from tab1
group by score,vid order by score desc,vid limit 20;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.57..70.50 rows=20 width=16) (actual time=0.016..1.037 rows=20 loops=1)
Output: vid, score
Buffers: shared hit=1076
-> Group (cost=0.57..34965244.98 rows=10000090 width=16) (actual time=0.016..1.032 rows=20 loops=1)
Output: vid, score
Group Key: tab1.score, tab1.vid
Buffers: shared hit=1076
-> Index Only Scan using idx_tab1_1 on public.tab1 (cost=0.57..34465240.50 rows=100000896 width=16) (actual time=0.014..0.692 rows=1901 loops=1)
Output: score, vid
Heap Fetches: 1901
Buffers: shared hit=1076
Planning Time: 0.091 ms
Execution Time: 1.055 ms
(13 rows)
For the second requirement, the SQL statement is listed below:
explain (analyze,verbose,timing,costs,buffers)
with tmp as (select vids from tab2 where uid=1)
select t1.vid,t1.score from tab1 t1,tmp
where (tmp.vids || hll_hash_bigint(t1.vid) <> tmp.vids)
group by t1.score,t1.vid
order by t1.score desc,t1.vid limit 20;
vid | score
---------+----------------------------
999998 | 0.000000099999800000000000
999997 | 0.000000099999700000000000
999996 | 0.000000099999600000000000
999994 | 0.000000099999400000000000
999993 | 0.000000099999300000000000
999991 | 0.000000099999100000000000
1000001 | 0.000000099999100000000000
1000002 | 0.000000099999100000000000
1000003 | 0.000000099999100000000000
1000004 | 0.000000099999100000000000
1000005 | 0.000000099999100000000000
1000006 | 0.000000099999100000000000
1000007 | 0.000000099999100000000000
1000008 | 0.000000099999100000000000
1000009 | 0.000000099999100000000000
999990 | 0.000000099999000000000000
999989 | 0.000000099998900000000000
999988 | 0.000000099998800000000000
999987 | 0.000000099998700000000000
999986 | 0.000000099998600000000000
(20 rows)
Limit (cost=25.57..133.47 rows=20 width=16) (actual time=0.626..3.477 rows=20 loops=1)
Output: t1.vid, t1.score
Buffers: shared hit=1475 read=4 dirtied=1
I/O Timings: read=0.619
CTE tmp
-> Seq Scan on public.tab2 (cost=0.00..25.00 rows=6 width=32) (actual time=0.309..0.310 rows=1 loops=1)
Output: tab2.vids
Filter: (tab2.uid = 1)
Buffers: shared read=1 dirtied=1
I/O Timings: read=0.299
-> Group (cost=0.57..53950415.15 rows=10000090 width=16) (actual time=0.625..3.471 rows=20 loops=1)
Output: t1.vid, t1.score
Group Key: t1.score, t1.vid
Buffers: shared hit=1475 read=4 dirtied=1
I/O Timings: read=0.619
-> Nested Loop (cost=0.57..50965388.40 rows=597005349 width=16) (actual time=0.624..3.098 rows=1901 loops=1)
Output: t1.vid, t1.score
Join Filter: ((tmp.vids || hll_hash_bigint(t1.vid, 0)) <> tmp.vids)
Rows Removed by Join Filter: 400
Buffers: shared hit=1475 read=4 dirtied=1
I/O Timings: read=0.619
-> Index Only Scan using idx_tab1_1 on public.tab1 t1 (cost=0.57..34465240.50 rows=100000896 width=16) (actual time=0.012..1.190 rows=2301 loops=1)
Output: t1.score, t1.vid
Heap Fetches: 2301
Buffers: shared hit=1475 read=3
I/O Timings: read=0.320
-> CTE Scan on tmp (cost=0.00..0.12 rows=6 width=32) (actual time=0.000..0.000 rows=1 loops=2301)
Output: tmp.vids
Buffers: shared read=1 dirtied=1
I/O Timings: read=0.299
Planning Time: 0.164 ms
Execution Time: 3.541 ms
(32 rows)
Two problems stand out in this plan:
1. The join is repeated too many times: the CTE is rescanned once for every row produced by the index scan, which wastes resources.
loops=2301
2. Too many blocks are scanned: 1,475.
Every vid has many duplicate rows, so the grouping must read all heap tuples belonging to each group it emits (up to the limit).
The larger the limit, the greater the waste.
The SQL appears to execute quickly, but it wastes a lot of work (especially when there are many duplicate vid values or the limit is large). Can it be optimized further, especially under high concurrency?
Plain recursion does not seem to work here, because the same score can map to multiple vid values, so both score and vid must be considered. A single boundary condition works best for the recursion, where the boundary is the value each step jumps past (similar to a skip index scan); a sketch of that single-column pattern follows below.
If it were guaranteed that different vids never share a score, the jump boundary could be score alone, without vid.
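For reference, a minimal sketch of that single-column jump pattern (a loose index scan over score alone); note this is only correct under the assumption that no two vids share a score, which does not hold for our test data:
with recursive s as (
  (select score from tab1 order by score desc limit 1)
  union all
  select (select t.score from tab1 t
          where t.score < s.score
          order by t.score desc limit 1)  -- jump to the next smaller score
  from s where s.score is not null
)
select score from s where score is not null limit 20;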
Method:
We can fold vid into score as a low-order weight, using an expression, without changing the original sort order by score.
For example, if score is always greater than 1/1e21, we can place the vid weight in the decimal positions beyond 1e-21. The precision of the data type must be high enough to store the weighted decimal places. Here we use numeric with no declared precision, so values accurate to 1e-25 are fine; if a precision were declared, it would have to allow at least 25 decimal places.
The vid value must be converted to a number. If vid is not numeric, one of PostgreSQL's built-in hash functions can convert it, followed by a modulo operation (which controls both the vid collision space and the number of decimal places needed to store the weight).
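For instance (an illustrative assumption, not part of the original example), if vid were a text column, the built-in hashtext() function could map it into the same weighting scheme:
select mod(abs(hashtext('vid-abc')), 9999)::numeric / 1e25 as weight;
-- hashtext() returns an int4; abs() plus mod(..., 9999) confines the
-- collision space to 0..9998, which fits in 4 decimal places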
In the following code, the modulus 9999 requires 4 decimal places to store the weight, so the weight is divided by 1e25 (21+4) when creating the index:
create index idx_tab1_2 on tab1 ( (score + (mod(vid,9999)/1e25)) );
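To make the weighting concrete, here is a quick worked example using values from the test data above (vid = 1000001, score = 999991/1e13):
select (999991/10000000000000::numeric) + (mod(1000001,9999)/1e25) as weighted_key;
-- mod(1000001, 9999) = 101, so the key is 0.0000000999991000000000101:
-- the score digits are intact, and vid only perturbs digits beyond 1e-21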
explain (analyze,verbose,timing,costs,buffers)
with recursive tmp as ( -- recursive CTE
(
select array[vid::numeric, score, (score + (mod(vid,9999)/1e25))] as r
from tab1
order by (score + (mod(vid,9999)/1e25)) desc limit 1 -- fetch the single row with the largest weighted score
)
union all
(
select
(select array[vid::numeric, score, (score + (mod(vid,9999)/1e25))] as r from tab1 t1
where (t1.score + (mod(t1.vid,9999)/1e25)) < (tmp.r)[3] -- jump to the next largest weighted score; rows with the same score but a different vid are also fetched (only when their vid modulo values differ)
and (t1.score + (mod(t1.vid,9999)/1e25)) is not null
order by (t1.score + (mod(t1.vid,9999)/1e25)) desc limit 1 -- fetch one row
)
from tmp where (tmp.r)[3] is not null
)
),
tmp2 as
(
select vids from tab2 where uid = 1
)
select (v.r)[1]::int8 as vid, (v.r)[2] as score from tmp v, tmp2 t
where (t.vids || hll_hash_bigint(((v.r)[1])::int8) <> t.vids)
limit 20
;
Defect:
When multiple vids share the same score and also collide on mod(vid, 9999), only one of those vids is retrieved; see the demonstration below.
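A hypothetical demonstration of such a collision (these vids are not in the test data):
select mod(1, 9999) as m1, mod(10000, 9999) as m2;
-- both return 1: if vids 1 and 10000 shared the same score, their weighted
-- keys would be identical, and the strict '<' boundary in the recursion
-- would step past one of them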
vid | score
---------+----------------------------
999998 | 0.000000099999800000000000
999997 | 0.000000099999700000000000
999996 | 0.000000099999600000000000
999994 | 0.000000099999400000000000
999993 | 0.000000099999300000000000
1000009 | 0.000000099999100000000000
1000008 | 0.000000099999100000000000
1000007 | 0.000000099999100000000000
1000006 | 0.000000099999100000000000
1000005 | 0.000000099999100000000000
1000004 | 0.000000099999100000000000
1000003 | 0.000000099999100000000000
1000002 | 0.000000099999100000000000
1000001 | 0.000000099999100000000000
999991 | 0.000000099999100000000000
999990 | 0.000000099999000000000000
999989 | 0.000000099998900000000000
999988 | 0.000000099998800000000000
999987 | 0.000000099998700000000000
999986 | 0.000000099998600000000000
(20 rows)
Limit (cost=117.68..118.37 rows=20 width=40) (actual time=0.077..0.336 rows=20 loops=1)
Output: ((v.r[1])::bigint), (v.r[2])
Buffers: shared hit=121
CTE tmp
-> Recursive Union (cost=0.57..92.68 rows=101 width=32) (actual time=0.021..0.292 rows=24 loops=1)
Buffers: shared hit=120
-> Subquery Scan on "*SELECT* 1" (cost=0.57..0.88 rows=1 width=32) (actual time=0.021..0.022 rows=1 loops=1)
Output: "*SELECT* 1".r
Buffers: shared hit=5
-> Limit (cost=0.57..0.87 rows=1 width=64) (actual time=0.020..0.021 rows=1 loops=1)
Output: (ARRAY[(tab1.vid)::numeric, tab1.score, (tab1.score + ((mod(tab1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))]), ((tab1.score + ((mod(tab1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric)))
Buffers: shared hit=5
-> Index Scan Backward using idx_tab1_2 on public.tab1 (cost=0.57..29864985.67 rows=100000896 width=64) (actual time=0.020..0.020 rows=1 loops=1)
Output: ARRAY[(tab1.vid)::numeric, tab1.score, (tab1.score + ((mod(tab1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))], (tab1.score + ((mod(tab1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))
Buffers: shared hit=5
-> WorkTable Scan on tmp (cost=0.00..8.98 rows=10 width=32) (actual time=0.011..0.011 rows=1 loops=23)
Output: (SubPlan 1)
Filter: (tmp.r[3] IS NOT NULL)
Buffers: shared hit=115
SubPlan 1
-> Limit (cost=0.57..0.88 rows=1 width=64) (actual time=0.010..0.010 rows=1 loops=23)
Output: (ARRAY[(t1.vid)::numeric, t1.score, (t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))]), ((t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric)))
Buffers: shared hit=115
-> Index Scan Backward using idx_tab1_2 on public.tab1 t1 (cost=0.57..10296397.54 rows=33166964 width=64) (actual time=0.010..0.010 rows=1 loops=23)
Output: ARRAY[(t1.vid)::numeric, t1.score, (t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))], (t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric))
Index Cond: (((t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric)) < (tmp.r)[3]) AND ((t1.score + ((mod(t1.vid, '9999'::bigint))::numeric / '10000000000000000000000000'::numeric)) IS NOT NULL))
Buffers: shared hit=115
CTE tmp2
-> Seq Scan on public.tab2 (cost=0.00..25.00 rows=6 width=32) (actual time=0.005..0.005 rows=1 loops=1)
Output: tab2.vids
Filter: (tab2.uid = 1)
Buffers: shared hit=1
-> Nested Loop (cost=0.00..20.82 rows=603 width=40) (actual time=0.076..0.331 rows=20 loops=1)
Output: (v.r[1])::bigint, v.r[2]
Join Filter: ((t.vids || hll_hash_bigint((v.r[1])::bigint, 0)) <> t.vids)
Rows Removed by Join Filter: 4
Buffers: shared hit=121
-> CTE Scan on tmp2 t (cost=0.00..0.12 rows=6 width=32) (actual time=0.007..0.007 rows=1 loops=1)
Output: t.vids
Buffers: shared hit=1
-> CTE Scan on tmp v (cost=0.00..2.02 rows=101 width=32) (actual time=0.022..0.300 rows=24 loops=1)
Output: v.r
Buffers: shared hit=120
Planning Time: 0.224 ms
Execution Time: 0.386 ms
(45 rows)
Optimization Result:
| Metric | Before | After | Performance Improved |
|---|---|---|---|
| Number of loops | 2,301 | 23 | 100 times |
| Number of blocks scanned | 1,475 | 121 | 12 times |
| Time consumed | 3.541 ms | 0.386 ms | 9 times |
Technically minded users enjoy PostgreSQL more the more they use it, and it can also save the company a lot of money.