General Usage of PostgreSQL hll in Retention and UV Statistics

By digoal

Background

Retention assessment focuses on assessing the number of users on each retention day. We excluded query conditions from other dimensions and compared the efficiency of the three scenarios.

For example, if the first login is 6-1, then the 6-2 and 6-6 login refers to the retention of 1 day and 5 days.

The purpose is to count the number of users on each retention day.

Scenario 1

-- 100万用户, 每个用户100个number  (52核 pg 12)   
CREATE TABLE user_retain (  
  user_id serial PRIMARY KEY,  
  fst_login_date date,  
  pay_retained_num int []  -- 数组存储, 表示留存日.   
);  
  
create or replace function gen_rand() returns int[] as $$  
  select array_AGG((ceil(random()*365)::int)) from generate_series(1,100);  
$$ language sql strict;  
  
insert into user_retain select generate_series(1,1000000), now(), gen_rand();

explain (analyze,verbose,timing,costs,buffers)
select unnest(pay_retained_num),count(*) from user_retain group by 1;

18 seconds

Why 18 seconds?

When unnest is used, data is not sent in parallel. Only functions are sent in parallel. The aggregate operation is not sent in parallel. On the contrary, parallel processing is slower than non-parallel processing due to the use of the gather command. In my opinion, PG can be optimized at the kernel level.

                                                                      QUERY PLAN                                                                         
-------------------------------------------------------------------------------------------------------------------------------------------------------  
 HashAggregate  (cost=70326.50..86732.89 rows=1000008 width=12) (actual time=29802.202..29806.633 rows=365 loops=1)  
   Output: (unnest(pay_retained_num)), count(*)  
   Group Key: (unnest(user_retain.pay_retained_num))  
   Planned Partitions: 8  
   Peak Memory Usage: 6193 kB  
   Buffers: shared read=58824  
   ->  Gather  (cost=0.00..61420.18 rows=1000008 width=4) (actual time=4.392..16521.356 rows=100000000 loops=1)  
         Output: (unnest(pay_retained_num))  
         Workers Planned: 26  
         Workers Launched: 26  
         Buffers: shared read=58824  
         ->  ProjectSet  (cost=0.00..61420.18 rows=384620 width=4) (actual time=0.040..284.310 rows=3846154 loops=26)  
               Output: unnest(pay_retained_num)  
               Buffers: shared read=58824  
               Worker 0:  actual time=0.035..226.005 rows=3107600 loops=1  
                 Buffers: shared read=1828  
               Worker 1:  actual time=0.037..327.656 rows=4547500 loops=1  
                 Buffers: shared read=2675  
               Worker 2:  actual time=0.036..512.300 rows=6907100 loops=1  
                 Buffers: shared read=4063  
......  
                 Buffers: shared read=1908  
               Worker 22:  actual time=0.038..262.740 rows=3689000 loops=1  
                 Buffers: shared read=2170  
               Worker 23:  actual time=0.040..203.822 rows=2828800 loops=1  
                 Buffers: shared read=1664  
               Worker 24:  actual time=0.042..161.236 rows=2249100 loops=1  
                 Buffers: shared read=1323  
               Worker 25:  actual time=0.046..340.726 rows=4622300 loops=1  
                 Buffers: shared read=2719  
               ->  Parallel Seq Scan on public.user_retain  (cost=0.00..59208.62 rows=38462 width=424) (actual time=0.033..25.329 rows=38462 loops=26)  
                     Output: user_id, fst_login_date, pay_retained_num  
                     Buffers: shared read=58824  
                     Worker 0:  actual time=0.029..21.766 rows=31076 loops=1  
                       Buffers: shared read=1828  
......  
                       Buffers: shared read=1664  
                     Worker 24:  actual time=0.035..16.011 rows=22491 loops=1  
                       Buffers: shared read=1323  
                     Worker 25:  actual time=0.036..30.783 rows=46223 loops=1  
                       Buffers: shared read=2719  
 Planning Time: 0.265 ms  
   Buffers: shared hit=34 read=4  
 Execution Time: 29809.325 ms  
(124 rows)

postgres=# explain (analyze,verbose,timing,costs,buffers)     
select unnest(pay_retained_num),count(*) from user_retain group by 1;  
                                                                 QUERY PLAN                                                                   
--------------------------------------------------------------------------------------------------------------------------------------------  
 HashAggregate  (cost=215387.75..379451.57 rows=10000080 width=12) (actual time=18522.487..18524.227 rows=365 loops=1)  
   Output: (unnest(pay_retained_num)), count(*)  
   Group Key: unnest(user_retain.pay_retained_num)  
   Planned Partitions: 64  
   Peak Memory Usage: 6193 kB  
   Buffers: shared hit=58824  
   ->  ProjectSet  (cost=0.00..126324.54 rows=10000080 width=4) (actual time=0.012..6471.708 rows=100000000 loops=1)  
         Output: unnest(pay_retained_num)  
         Buffers: shared hit=58824  
         ->  Seq Scan on public.user_retain  (cost=0.00..68824.08 rows=1000008 width=424) (actual time=0.009..102.847 rows=1000000 loops=1)  
               Output: user_id, fst_login_date, pay_retained_num  
               Buffers: shared hit=58824  
 Planning Time: 0.065 ms  
 Execution Time: 18525.843 ms  
(14 rows)

Scenario 2

Expand array storage:

CREATE TABLE user_retain1 (  
  user_id serial ,             
  fst_login_date date,  
  pay_retained_num int  
);  
  
insert into user_retain1 select user_id,fst_login_date,unnest(pay_retained_num) from user_retain;  
  
alter table user_retain1 set (parallel_workers =26);

26 parallel, 0.8 seconds:

max_worker_processes = 32
max_parallel_workers_per_gather = 26
parallel_leader_participation = off
max_parallel_workers = 32
parallel_tuple_cost = 0
parallel_setup_cost = 0
min_parallel_table_scan_size = 0
min_parallel_index_scan_size = 0


explain (analyze,verbose,timing,costs,buffers)   
select pay_retained_num,count(*) from user_retain1 group by 1;  
  
                                                                     QUERY PLAN                                                                        
-----------------------------------------------------------------------------------------------------------------------------------------------------  
 Finalize HashAggregate  (cost=598284.46..598288.11 rows=365 width=12) (actual time=859.487..859.525 rows=365 loops=1)  
   Group Key: pay_retained_num  
   Peak Memory Usage: 61 kB  
   ->  Gather  (cost=598233.36..598237.01 rows=9490 width=12) (actual time=856.866..858.653 rows=9490 loops=1)  
         Workers Planned: 26  
         Workers Launched: 26  
         ->  Partial HashAggregate  (cost=598233.36..598237.01 rows=365 width=12) (actual time=851.471..851.508 rows=365 loops=26)  
               Group Key: pay_retained_num  
               Peak Memory Usage: 0 kB  
               ->  Parallel Seq Scan on user_retain1  (cost=0.00..579002.57 rows=3846157 width=4) (actual time=0.033..294.534 rows=3846154 loops=26)  
 Planning Time: 0.100 ms  
 Execution Time: 860.386 ms  
(12 rows)

Scenario 3

Valuation Calculation: Record each retention day and write the UID corresponding to the retention days to the hll type:

create extension hll;  
  
create table t_hll (  
  pay_retained_num int primary key,   
  u_hll hll   
);  
  
insert into t_hll select pay_retained_num, hll_add_agg(hll_hash_integer(user_id)) from user_retain1 group by pay_retained_num;    
  
select pay_retained_num, # u_hll from t_hll order by 1;

3 milliseconds:

                                                       QUERY PLAN                                                         
------------------------------------------------------------------------------------------------------------------------  
 Index Scan using t_hll_pkey on t_hll  (cost=0.15..81.53 rows=365 width=12) (actual time=0.020..3.101 rows=365 loops=1)  
 Planning Time: 0.050 ms  
 Execution Time: 3.121 ms  
(3 rows)

Differences in valuation calculations:

with a as (  
select pay_retained_num, count(*) as cnt from user_retain1 group by 1),  
b as (select pay_retained_num, # u_hll as cnt from t_hll)  
select a.pay_retained_num, a.cnt, b.cnt from a,b where a.pay_retained_num=b.pay_retained_num order by abs(a.cnt-b.cnt);  
  
 pay_retained_num |  cnt   |        cnt           
------------------+--------+--------------------  
               18 | 273741 | 242072.93394329798  
              226 | 273461 |   241358.471066233  
              257 | 273317 | 241062.44806733675  
              202 | 274096 | 241617.25101208987  
              245 | 273489 | 240817.31849724415  
              319 | 273263 |   240535.045171427  
... ...

Community

General Usage of PostgreSQL hll in Retention and UV Statistics

Background

Scenario 1

Scenario 2

Scenario 3

Read previous post:

Read next post:

digoal

You may also like

Comments

digoal

Related Products

AnalyticDB for PostgreSQL

PolarDB for PostgreSQL

AnalyticDB for MySQL

Database for FinTech Solution