By digoal
Random sampling is used to randomly extract some records from the resource pool and return them to different clients.
The built-in sampling method named system shown below can access up to 10% of data blocks randomly to find records that meet the conditions and stop scanning when ten pieces of limits are found. It will scan up to 10% of video table data blocks. It will return even if there are not ten results that meet the conditions.
explain (analyze,verbose,timing,costs,buffers) select id,play_count from video as v1 TABLESAMPLE system(10) where play_count>=2000 limit 10;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..1.73 rows=10 width=12) (actual time=0.006..0.011 rows=10 loops=1)
Output: id, play_count
Buffers: shared hit=1
-> Sample Scan on public.video v1 (cost=0.00..3856.95 rows=22351 width=12) (actual time=0.006..0.009 rows=10 loops=1)
Output: id, play_count
Sampling: system ('10'::real)
Filter: (v1.play_count >= 2000)
Buffers: shared hit=1
Planning Time: 0.151 ms
Execution Time: 0.024 ms
(10 rows)
Is there a more refined sampling threshold control? If the call frequency and the sampling ratio are very high, but there are few qualified records, it may consume more resources and cause avalanches.
How can we optimize it?
Block-level random sampling method extension:
create extension tsm_system_rows ;
CREATE EXTENSION
create extension tsm_system_time ;
CREATE EXTENSION
It samples for ten milliseconds at most, and ten entries that meet the play_count>=2000 condition are returned. (If ten entries meet the criteria soon, it stops scanning. Therefore, it is quick.)
explain (analyze,verbose,timing,costs,buffers) select id,play_count from video as v1 TABLESAMPLE system_time(10) where play_count>=2000 limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..1.73 rows=10 width=12) (actual time=0.009..0.030 rows=10 loops=1)
Output: id, play_count
Buffers: shared hit=3
-> Sample Scan on public.video v1 (cost=0.00..10.71 rows=62 width=12) (actual time=0.008..0.028 rows=10 loops=1)
Output: id, play_count
Sampling: system_time ('10'::double precision)
Filter: (v1.play_count >= 2000)
Buffers: shared hit=3
Planning Time: 0.052 ms
Execution Time: 0.042 ms
(10 rows)
It samples 100 entries at most, and ten entries that meet the play_count>=2000 condition are returned. (If ten entries meet the criteria soon, it stops scanning. Therefore, it is quick.)
explain (analyze,verbose,timing,costs,buffers) select id,play_count from video as v1 TABLESAMPLE system_rows (100) where play_count>=2000 limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..1.75 rows=10 width=12) (actual time=0.023..0.031 rows=10 loops=1)
Output: id, play_count
Buffers: shared hit=1
-> Sample Scan on public.video v1 (cost=0.00..16.65 rows=95 width=12) (actual time=0.022..0.028 rows=10 loops=1)
Output: id, play_count
Sampling: system_rows ('100'::bigint)
Filter: (v1.play_count >= 2000)
Buffers: shared hit=1
Planning Time: 0.076 ms
Execution Time: 0.051 ms
(10 rows)
When the time and the cost are controllable to prevent sampling avalanches, the sampling randomness is guaranteed.
If the percentage of records of the where condition is very small, the number of limits cannot be returned after the upper sampling limit is reached. Therefore, you can choose several methods:
How PostgreSQL Returns a Dynamic Number of Columns: Row-Column Convert
digoal - December 14, 2018
digoal - October 8, 2022
digoal - March 25, 2020
digoal - March 25, 2020
ApsaraDB - August 7, 2023
Alibaba Clouder - October 15, 2020
Alibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreAn online MPP warehousing service based on the Greenplum Database open source program
Learn MoreA fully managed NoSQL cloud database service that enables storage of massive amount of structured and semi-structured data
Learn MoreAn on-demand database hosting service for PostgreSQL with automated monitoring, backup and disaster recovery capabilities
Learn MoreMore Posts by digoal