How to Solve the Performance Problem of Deduplication Caused by Multi-Column Grouping in Window Queries

By digoal

Background

Construct Data:

create or replace function gen_res() returns setof numeric as $$  
declare  
  s numeric := random();  
begin  
  return query select s from generate_series(1,100);  
end;  
$$ language plpgsql strict;

create table a (uid int, score numeric, class text);   
  
insert into a select id, gen_res(), md5(ceil(random()*100)::text) from generate_series(1,100000) t(id);

Build 100 UIDs of the same score:

with a1 as (select max(score) score from a)  
insert into a select id, score , md5(ceil(random()*100)::text) from generate_series(100001,100100) t(id) , (select score from a1,generate_series(1,100)) t1;

postgres=# select * from a limit 10;  
 uid |       score       |              class                 
-----+-------------------+----------------------------------  
   2 | 0.979678114877171 | 44f683a84163b3523afe57c2e008bc8c  
   2 | 0.979678114877171 | d645920e395fedad7bbbed0eca3fe2e0  
   2 | 0.979678114877171 | fe9fc289c3ff0af142b6d3bead98a923  
   2 | 0.979678114877171 | d09bf41544a3365a46c9077ebb5e35c3  
   2 | 0.979678114877171 | 1679091c5a880faf6fb5e6087eb1b2dc  
   2 | 0.979678114877171 | 34173cb38f07f89ddbebc2ac9128303f  
   2 | 0.979678114877171 | 70efdf2ec9b086079795c442636b55fb  
   2 | 0.979678114877171 | fc490ca45c00b1249bbe3554a4fdf6fb  
   2 | 0.979678114877171 | fe9fc289c3ff0af142b6d3bead98a923  
   2 | 0.979678114877171 | a5771bce93e200c36f7cd9dfd0e5deaa  
(10 rows)

A uid has multiple records, and each record has the same score. The scores for different UIDs are different in most cases, but some UID scores are the same. Query UID in descending order of the score. One entry is returned for each UID. The simplest and most direct method is to use a window query.

Window: Group data by score and UID (not only by score because some UIDs may be the same as score). Then, sort the data by score (descending order), as shown below:

create index idx_a_1 on a (score desc,uid);

explain (analyze,verbose,timing,costs,buffers)  
select uid,score from (  
select row_number() over (partition by score,uid order by score desc) as rn,    
uid, score from a) t  
where t.rn=1 limit 50;  
  
                                                                       QUERY PLAN                                                                          
---------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=0.56..607.79 rows=50 width=36) (actual time=0.081..4.117 rows=50 loops=1)  
   Output: t.uid, t.score  
   Buffers: shared hit=4613  
   ->  Subquery Scan on t  (cost=0.56..607839.84 rows=50050 width=36) (actual time=0.081..4.111 rows=50 loops=1)  
         Output: t.uid, t.score  
         Filter: (t.rn = 1)  
         Rows Removed by Filter: 4851  
         Buffers: shared hit=4613  
         ->  WindowAgg  (cost=0.56..482714.24 rows=10010048 width=44) (actual time=0.080..3.853 rows=4901 loops=1)  
               Output: row_number() OVER (?), a.uid, a.score  
               Buffers: shared hit=4613  
               ->  Index Only Scan using idx_a_1 on public.a  (cost=0.56..282513.28 rows=10010048 width=36) (actual time=0.018..1.532 rows=5001 loops=1)  
                     Output: a.uid, a.score  
                     Heap Fetches: 5001  
                     Buffers: shared hit=4613  
 Planning Time: 0.074 ms  
 Execution Time: 4.135 ms  
(17 rows)

The performance seems to be good. You can find that this SQL only needs to return 50 records, but it scans 5001 rows. (If the window query uses the index for ordered traversal instead of using skip scan, can you only scan 50 rows?)

The result is listed below:

  uid   |       score         
--------+-------------------  
  31536 | 0.999998891470238  
 100001 | 0.999998891470238  
 100002 | 0.999998891470238  
 100003 | 0.999998891470238  
 100004 | 0.999998891470238  
 100005 | 0.999998891470238  
 100006 | 0.999998891470238  
 100007 | 0.999998891470238  
 100008 | 0.999998891470238  
 100009 | 0.999998891470238  
 100010 | 0.999998891470238  
 100011 | 0.999998891470238  
 100012 | 0.999998891470238  
 100013 | 0.999998891470238  
 100014 | 0.999998891470238  
 100015 | 0.999998891470238  
 100016 | 0.999998891470238  
 100017 | 0.999998891470238  
 100018 | 0.999998891470238  
 100019 | 0.999998891470238  
 100020 | 0.999998891470238  
 100021 | 0.999998891470238  
 100022 | 0.999998891470238  
 100023 | 0.999998891470238  
 100024 | 0.999998891470238  
 100025 | 0.999998891470238  
 100026 | 0.999998891470238  
 100027 | 0.999998891470238  
 100028 | 0.999998891470238  
 100029 | 0.999998891470238  
 100030 | 0.999998891470238  
 100031 | 0.999998891470238  
 100032 | 0.999998891470238  
 100033 | 0.999998891470238  
 100034 | 0.999998891470238  
 100035 | 0.999998891470238  
 100036 | 0.999998891470238  
 100037 | 0.999998891470238  
 100038 | 0.999998891470238  
 100039 | 0.999998891470238  
 100040 | 0.999998891470238  
 100041 | 0.999998891470238  
 100042 | 0.999998891470238  
 100043 | 0.999998891470238  
 100044 | 0.999998891470238  
 100045 | 0.999998891470238  
 100046 | 0.999998891470238  
 100047 | 0.999998891470238  
 100048 | 0.999998891470238  
 100049 | 0.999998891470238  
(50 rows)

If you want the query to only scan 50 rows, use skip scan instead of traversing the entire index.You can use a recursive implementation:

If you specify the limit clause by score, recursive attempts may lose the UID of the same score. Valid values:

create index idx_a_3 on a (score desc) include (uid) where score is not null;

with recursive tmp as (  
(  
select array[uid::numeric, score] as r from  a  
        WHERE score  is not null  
        order by score desc limit 1  
)  
union all  
(  
  select  
  (select array[uid::numeric, score] as r from a t1  
        where t1.score < (tmp.r)[2]  
        and t1.score is not null  
        order by t1.score desc limit 1  
  )  
  from tmp where (tmp.r)[2] is not null  
)  
)  
select (tmp.r)[1],(tmp.r)[2] from tmp   
where tmp.* is not null  
limit 50;

                                                                            QUERY PLAN                                                                              
------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=55.78..56.79 rows=50 width=64) (actual time=0.013..0.539 rows=50 loops=1)  
   Output: (tmp.r[1]), (tmp.r[2])  
   Buffers: shared hit=241  
   CTE tmp  
     ->  Recursive Union  (cost=0.50..55.78 rows=101 width=32) (actual time=0.009..0.507 rows=50 loops=1)  
           Buffers: shared hit=241  
           ->  Subquery Scan on "*SELECT* 1"  (cost=0.50..0.52 rows=1 width=32) (actual time=0.009..0.009 rows=1 loops=1)  
                 Output: "*SELECT* 1".r  
                 Buffers: shared hit=5  
                 ->  Limit  (cost=0.50..0.51 rows=1 width=43) (actual time=0.008..0.008 rows=1 loops=1)  
                       Output: (ARRAY[(a.uid)::numeric, a.score]), a.score  
                       Buffers: shared hit=5  
                       ->  Index Only Scan using idx_a_3 on public.a  (cost=0.50..125126.42 rows=10009993 width=43) (actual time=0.008..0.008 rows=1 loops=1)  
                             Output: ARRAY[(a.uid)::numeric, a.score], a.score  
                             Heap Fetches: 0  
                             Buffers: shared hit=5  
           ->  WorkTable Scan on tmp tmp_1  (cost=0.00..5.33 rows=10 width=32) (actual time=0.010..0.010 rows=1 loops=49)  
                 Output: (SubPlan 1)  
                 Filter: (tmp_1.r[2] IS NOT NULL)  
                 Buffers: shared hit=236  
                 SubPlan 1  
                   ->  Limit  (cost=0.50..0.51 rows=1 width=43) (actual time=0.009..0.009 rows=1 loops=49)  
                         Output: (ARRAY[(t1.uid)::numeric, t1.score]), t1.score  
                         Buffers: shared hit=236  
                         ->  Index Only Scan using idx_a_3 on public.a t1  (cost=0.50..41709.81 rows=3336664 width=43) (actual time=0.009..0.009 rows=1 loops=49)  
                               Output: ARRAY[(t1.uid)::numeric, t1.score], t1.score  
                               Index Cond: (t1.score < (tmp_1.r)[2])  
                               Heap Fetches: 0  
                               Buffers: shared hit=236  
   ->  CTE Scan on tmp  (cost=0.00..2.02 rows=100 width=64) (actual time=0.012..0.533 rows=50 loops=1)  
         Output: tmp.r[1], tmp.r[2]  
         Filter: (tmp.* IS NOT NULL)  
         Buffers: shared hit=241  
 Planning Time: 0.090 ms  
 Execution Time: 0.559 ms  
(35 rows)

The result is different from the window query. The UID with the same score is lost. Only one UID is returned for each score, which is inconsistent with the business logic.

   r   |         r           
-------+-------------------  
 31536 | 0.999998891470238  
 83720 | 0.999991911982715  
 55722 | 0.999975578407255  
 23814 | 0.999972620235976  
 67381 | 0.999961911696577  
 29892 | 0.999943494870404  
 90079 | 0.999925486089911  
 21214 | 0.999922738312673  
 59001 |  0.99991736006243  
 87204 | 0.999917295777077  
  7030 | 0.999915557360705  
 47014 | 0.999902190202047  
 32264 | 0.999901911139158  
 38647 | 0.999860249737811  
 32674 | 0.999837253269899  
 87613 | 0.999832584158813  
 71232 |   0.9998249229434  
 41469 | 0.999812869635608  
 95274 | 0.999804819593745  
 70312 | 0.999779352434143  
 70923 | 0.999774257144811  
 26243 | 0.999741038483439  
 79093 | 0.999711161650342  
 51332 | 0.999708612428986  
 70293 | 0.999701119867272  
   749 | 0.999676152526778  
 82356 | 0.999673161844026  
 24750 |  0.99965793726258  
 24520 | 0.999657836562118  
 94013 | 0.999634740712228  
 32113 | 0.999620645922992  
 32524 | 0.999576429034658  
 64496 | 0.999547735428802  
 99351 | 0.999533980413151  
 74897 | 0.999506182824039  
 57650 | 0.999505328313486  
 12643 | 0.999502105912729  
 62484 | 0.999499621461251  
 98690 |  0.99949818874131  
 78253 | 0.999480342172379  
 12805 | 0.999474363236803  
 24470 | 0.999473037063549  
 42317 | 0.999456625347026  
 17058 | 0.999453040472755  
 67604 | 0.999448579359846  
 14985 |  0.99943607450459  
 72078 | 0.999425638849996  
 43290 | 0.999415581298248  
 10890 | 0.999408682154407  
 33988 | 0.999400768473588  
(50 rows)

Solution: Add a weight to the score. The value is different for each UID and does not affect the final results of order by score desc.

The first few digits before the fraction remain unchanged and are appended later. For example, if the valid position of the score parameter is 15 digits after the decimal point, and the UID parameter has a total of 10 digits, move the UID decimal point by 25 digits. It will not affect the final sorting results.

If you think this movement is too long, you can also use a simple method. (There may still be some duplicates, but the probability of duplication is very low.) Model the UID as 999 and move at least 18 places backward.

order by (score::numeric + (mod(uid,999)/1e25)) desc

The following is an example:

create index idx_a_2 on a ((score + (mod(uid,999)/1e25))) include (uid) where (score + (mod(uid,999)/1e25)) is not null;

with recursive tmp as (  
(  
select array[uid::numeric, score, (score::numeric + (mod(uid,999)/1e25))] as r from  a  
        WHERE (score + (mod(uid,999)/1e25)) is not null  
        order by (score + (mod(uid,999)/1e25)) desc limit 1  
)  
union all  
(  
  select  
  (select array[uid::numeric, score, (score + (mod(uid,999)/1e25))] as r from a t1  
        where (t1.score + (mod(t1.uid,999)/1e25)) < (tmp.r)[3]  
        and (t1.score + (mod(t1.uid,999)/1e25)) is not null  
        order by (t1.score + (mod(t1.uid,999)/1e25)) desc limit 1  
  )  
  from tmp where (tmp.r)[3] is not null  
)  
)  
select (tmp.r)[1],(tmp.r)[2] from tmp   
where tmp.* is not null  
limit 50;

The same results as the window query are obtained, with no missing UID from the duplicate score.

   r    |         r           
--------+-------------------  
  31536 | 0.999998891470238  
 100100 | 0.999998891470238  
 100099 | 0.999998891470238  
 100098 | 0.999998891470238  
 100097 | 0.999998891470238  
 100096 | 0.999998891470238  
 100095 | 0.999998891470238  
 100094 | 0.999998891470238  
 100093 | 0.999998891470238  
 100092 | 0.999998891470238  
 100091 | 0.999998891470238  
 100090 | 0.999998891470238  
 100089 | 0.999998891470238  
 100088 | 0.999998891470238  
 100087 | 0.999998891470238  
 100086 | 0.999998891470238  
 100085 | 0.999998891470238  
 100084 | 0.999998891470238  
 100083 | 0.999998891470238  
 100082 | 0.999998891470238  
 100081 | 0.999998891470238  
 100080 | 0.999998891470238  
 100079 | 0.999998891470238  
 100078 | 0.999998891470238  
 100077 | 0.999998891470238  
 100076 | 0.999998891470238  
 100075 | 0.999998891470238  
 100074 | 0.999998891470238  
 100073 | 0.999998891470238  
 100072 | 0.999998891470238  
 100071 | 0.999998891470238  
 100070 | 0.999998891470238  
 100069 | 0.999998891470238  
 100068 | 0.999998891470238  
 100067 | 0.999998891470238  
 100066 | 0.999998891470238  
 100065 | 0.999998891470238  
 100064 | 0.999998891470238  
 100063 | 0.999998891470238  
 100062 | 0.999998891470238  
 100061 | 0.999998891470238  
 100060 | 0.999998891470238  
 100059 | 0.999998891470238  
 100058 | 0.999998891470238  
 100057 | 0.999998891470238  
 100056 | 0.999998891470238  
 100055 | 0.999998891470238  
 100054 | 0.999998891470238  
 100053 | 0.999998891470238  
 100052 | 0.999998891470238  
(50 rows)

  
                                                                  QUERY PLAN       
-------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=60.64..61.65 rows=50 width=64) (actual time=0.016..0.478 rows=50 loops=1)  
   Output: (tmp.r[1]), (tmp.r[2])  
   Buffers: shared hit=250  
   CTE tmp  
     ->  Recursive Union  (cost=0.50..60.64 rows=101 width=32) (actual time=0.012..0.445 rows=50 loops=1)  
           Buffers: shared hit=250  
           ->  Subquery Scan on "*SELECT* 1"  (cost=0.50..0.55 rows=1 width=32) (actual time=0.011..0.012 rows=1 loops=1)  
                 Output: "*SELECT* 1".r  
                 Buffers: shared hit=5  
                 ->  Limit  (cost=0.50..0.54 rows=1 width=64) (actual time=0.011..0.011 rows=1 loops=1)  
                       Output: (ARRAY[(a.uid)::numeric, a.score, (a.score + ((mod(a.uid, 999))::numeric / '10000000000000000000000000'::numeric))]), ((a.score + ((mod(a.uid, 999))::numeric / '10000000000000000000000000'::numeric)))  
                       Buffers: shared hit=5  
                       ->  Index Scan Backward using idx_a_2 on public.a  (cost=0.50..417341.60 rows=9959943 width=64) (actual time=0.011..0.011 rows=1 loops=1)  
                             Output: ARRAY[(a.uid)::numeric, a.score, (a.score + ((mod(a.uid, 999))::numeric / '10000000000000000000000000'::numeric))], (a.score + ((mod(a.uid, 999))::numeric / '10000000000000000000000000'::numeric))  
                             Buffers: shared hit=5  
           ->  WorkTable Scan on tmp tmp_1  (cost=0.00..5.81 rows=10 width=32) (actual time=0.008..0.008 rows=1 loops=49)  
                 Output: (SubPlan 1)  
                 Filter: (tmp_1.r[3] IS NOT NULL)  
                 Buffers: shared hit=245  
                 SubPlan 1  
                   ->  Limit  (cost=0.50..0.56 rows=1 width=64) (actual time=0.008..0.008 rows=1 loops=49)  
                         Output: (ARRAY[(t1.uid)::numeric, t1.score, (t1.score + ((mod(t1.uid, 999))::numeric / '10000000000000000000000000'::numeric))]), ((t1.score + ((mod(t1.uid, 999))::numeric / '10000000000000000000000000'::numeric)))  
                         Buffers: shared hit=245  
                         ->  Index Scan Backward using idx_a_2 on public.a t1  (cost=0.50..201702.36 rows=3319981 width=64) (actual time=0.008..0.008 rows=1 loops=49)  
                               Output: ARRAY[(t1.uid)::numeric, t1.score, (t1.score + ((mod(t1.uid, 999))::numeric / '10000000000000000000000000'::numeric))], (t1.score + ((mod(t1.uid, 999))::numeric / '10000000000000000000000000'::numeric))  
                               Index Cond: ((t1.score + ((mod(t1.uid, 999))::numeric / '10000000000000000000000000'::numeric)) < (tmp_1.r)[3])  
                               Buffers: shared hit=245  
   ->  CTE Scan on tmp  (cost=0.00..2.02 rows=100 width=64) (actual time=0.015..0.473 rows=50 loops=1)  
         Output: tmp.r[1], tmp.r[2]  
         Filter: (tmp.* IS NOT NULL)  
         Buffers: shared hit=250  
 Planning Time: 0.127 ms  
 Execution Time: 0.502 ms  
(33 rows)

It reduced from 4.1 milliseconds to 0.5 milliseconds after recursion, and only 50 rows need to be scanned with the limit of 50.

Community

How to Solve the Performance Problem of Deduplication Caused by Multi-Column Grouping in Window Queries

Background

Read previous post:

Read next post:

digoal

You may also like

Comments

digoal

Related Products

PolarDB for PostgreSQL

AnalyticDB for PostgreSQL

Elastic High Performance Computing Solution

ApsaraDB RDS for PostgreSQL