By digoal
Retention analysis counts the number of users retained on each retention day. Query conditions from other dimensions are excluded here so that the efficiency of the three approaches can be compared directly.
For example, if a user's first login is on June 1, then logins on June 2 and June 6 count toward 1-day and 5-day retention, respectively.
The goal is to count the number of users on each retention day.
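As a minimal sketch of the arithmetic (the dates are illustrative), the retention day is simply the login date minus the first login date:

```sql
-- Retention day = login date - first login date (dates are illustrative)
select date '2021-06-02' - date '2021-06-01' as day_1_retention,  -- 1
       date '2021-06-06' - date '2021-06-01' as day_5_retention;  -- 5
```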
-- 1 million users, 100 numbers per user (52-core machine, PostgreSQL 12)
CREATE TABLE user_retain (
user_id serial PRIMARY KEY,
fst_login_date date,
pay_retained_num int[] -- array storage; each element is a retention day
);
create or replace function gen_rand() returns int[] as $$
  select array_agg(ceil(random()*365)::int) from generate_series(1,100);
$$ language sql strict;
insert into user_retain select generate_series(1,1000000), now(), gen_rand();
explain (analyze,verbose,timing,costs,buffers)
select unnest(pay_retained_num),count(*) from user_retain group by 1;
About 18 seconds without parallelism (and about 30 seconds with it).
Why so slow?
When unnest is used, only the set-returning function (the ProjectSet node) runs in the parallel workers; the aggregation itself is not parallelized. All 100 million unnested rows must pass through the Gather node to the leader, so the parallel plan is actually slower than the non-parallel one. In my opinion, PostgreSQL could optimize this at the kernel level.
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=70326.50..86732.89 rows=1000008 width=12) (actual time=29802.202..29806.633 rows=365 loops=1)
Output: (unnest(pay_retained_num)), count(*)
Group Key: (unnest(user_retain.pay_retained_num))
Planned Partitions: 8
Peak Memory Usage: 6193 kB
Buffers: shared read=58824
-> Gather (cost=0.00..61420.18 rows=1000008 width=4) (actual time=4.392..16521.356 rows=100000000 loops=1)
Output: (unnest(pay_retained_num))
Workers Planned: 26
Workers Launched: 26
Buffers: shared read=58824
-> ProjectSet (cost=0.00..61420.18 rows=384620 width=4) (actual time=0.040..284.310 rows=3846154 loops=26)
Output: unnest(pay_retained_num)
Buffers: shared read=58824
Worker 0: actual time=0.035..226.005 rows=3107600 loops=1
Buffers: shared read=1828
Worker 1: actual time=0.037..327.656 rows=4547500 loops=1
Buffers: shared read=2675
Worker 2: actual time=0.036..512.300 rows=6907100 loops=1
Buffers: shared read=4063
......
Buffers: shared read=1908
Worker 22: actual time=0.038..262.740 rows=3689000 loops=1
Buffers: shared read=2170
Worker 23: actual time=0.040..203.822 rows=2828800 loops=1
Buffers: shared read=1664
Worker 24: actual time=0.042..161.236 rows=2249100 loops=1
Buffers: shared read=1323
Worker 25: actual time=0.046..340.726 rows=4622300 loops=1
Buffers: shared read=2719
-> Parallel Seq Scan on public.user_retain (cost=0.00..59208.62 rows=38462 width=424) (actual time=0.033..25.329 rows=38462 loops=26)
Output: user_id, fst_login_date, pay_retained_num
Buffers: shared read=58824
Worker 0: actual time=0.029..21.766 rows=31076 loops=1
Buffers: shared read=1828
......
Buffers: shared read=1664
Worker 24: actual time=0.035..16.011 rows=22491 loops=1
Buffers: shared read=1323
Worker 25: actual time=0.036..30.783 rows=46223 loops=1
Buffers: shared read=2719
Planning Time: 0.265 ms
Buffers: shared hit=34 read=4
Execution Time: 29809.325 ms
(124 rows)
postgres=# explain (analyze,verbose,timing,costs,buffers)
select unnest(pay_retained_num),count(*) from user_retain group by 1;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=215387.75..379451.57 rows=10000080 width=12) (actual time=18522.487..18524.227 rows=365 loops=1)
Output: (unnest(pay_retained_num)), count(*)
Group Key: unnest(user_retain.pay_retained_num)
Planned Partitions: 64
Peak Memory Usage: 6193 kB
Buffers: shared hit=58824
-> ProjectSet (cost=0.00..126324.54 rows=10000080 width=4) (actual time=0.012..6471.708 rows=100000000 loops=1)
Output: unnest(pay_retained_num)
Buffers: shared hit=58824
-> Seq Scan on public.user_retain (cost=0.00..68824.08 rows=1000008 width=424) (actual time=0.009..102.847 rows=1000000 loops=1)
Output: user_id, fst_login_date, pay_retained_num
Buffers: shared hit=58824
Planning Time: 0.065 ms
Execution Time: 18525.843 ms
(14 rows)
Expand the array storage into one row per retention day:
CREATE TABLE user_retain1 (
user_id serial ,
fst_login_date date,
pay_retained_num int
);
insert into user_retain1 select user_id,fst_login_date,unnest(pay_retained_num) from user_retain;
alter table user_retain1 set (parallel_workers =26);
With 26 parallel workers, the query takes 0.8 seconds. Relevant settings:
max_worker_processes = 32
max_parallel_workers_per_gather = 26
parallel_leader_participation = off
max_parallel_workers = 32
parallel_tuple_cost = 0
parallel_setup_cost = 0
min_parallel_table_scan_size = 0
min_parallel_index_scan_size = 0
explain (analyze,verbose,timing,costs,buffers)
select pay_retained_num,count(*) from user_retain1 group by 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Finalize HashAggregate (cost=598284.46..598288.11 rows=365 width=12) (actual time=859.487..859.525 rows=365 loops=1)
Group Key: pay_retained_num
Peak Memory Usage: 61 kB
-> Gather (cost=598233.36..598237.01 rows=9490 width=12) (actual time=856.866..858.653 rows=9490 loops=1)
Workers Planned: 26
Workers Launched: 26
-> Partial HashAggregate (cost=598233.36..598237.01 rows=365 width=12) (actual time=851.471..851.508 rows=365 loops=26)
Group Key: pay_retained_num
Peak Memory Usage: 0 kB
-> Parallel Seq Scan on user_retain1 (cost=0.00..579002.57 rows=3846157 width=4) (actual time=0.033..294.534 rows=3846154 loops=26)
Planning Time: 0.100 ms
Execution Time: 860.386 ms
(12 rows)
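Although other dimensions are excluded from this comparison, the expanded table supports them directly; a hedged sketch that additionally groups by the first login date (assuming fst_login_date is the desired cohort dimension):

```sql
-- Hypothetical cohort breakdown: retention counts per first-login date
select fst_login_date, pay_retained_num, count(*)
from user_retain1
group by 1, 2
order by 1, 2;
```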
Approximate calculation: for each retention day, write the corresponding UIDs into an hll value:
create extension hll;
create table t_hll (
pay_retained_num int primary key,
u_hll hll
);
insert into t_hll select pay_retained_num, hll_add_agg(hll_hash_integer(user_id)) from user_retain1 group by pay_retained_num;
select pay_retained_num, # u_hll from t_hll order by 1;
3 milliseconds:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Index Scan using t_hll_pkey on t_hll (cost=0.15..81.53 rows=365 width=12) (actual time=0.020..3.101 rows=365 loops=1)
Planning Time: 0.050 ms
Execution Time: 3.121 ms
(3 rows)
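Because hll values can be merged, the precomputed table also answers cross-day questions; a sketch using hll_union_agg from the postgresql-hll extension to estimate the number of distinct users retained on at least one of the first 7 days:

```sql
-- Approximate distinct users retained on any of days 1..7
select # hll_union_agg(u_hll) as approx_users
from t_hll
where pay_retained_num between 1 and 7;
```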
Error of the approximate counts compared with the exact counts:
with a as (
select pay_retained_num, count(*) as cnt from user_retain1 group by 1),
b as (select pay_retained_num, # u_hll as cnt from t_hll)
select a.pay_retained_num, a.cnt, b.cnt from a,b where a.pay_retained_num=b.pay_retained_num order by abs(a.cnt-b.cnt);
pay_retained_num | cnt | cnt
------------------+--------+--------------------
18 | 273741 | 242072.93394329798
226 | 273461 | 241358.471066233
257 | 273317 | 241062.44806733675
202 | 274096 | 241617.25101208987
245 | 273489 | 240817.31849724415
319 | 273263 | 240535.045171427
... ...
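If the error above is too large, the hll type's precision can be raised via its type modifiers (log2m is the first); a sketch with a hypothetical log2m of 14, which trades more storage per value for a smaller relative error:

```sql
-- Hypothetical higher-precision variant (log2m=14 instead of the default 11)
create table t_hll_precise (
  pay_retained_num int primary key,
  u_hll hll(14)
);
insert into t_hll_precise
  select pay_retained_num, hll_add_agg(hll_hash_integer(user_id), 14)
  from user_retain1
  group by pay_retained_num;
```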