By digoal
Scenario: N rows match a condition, and M of them must be returned at random. What are the requirements in this scenario, and which method is recommended?
1. BERNOULLI sampling: scans the whole table and returns each row with a probability of N%.
Adding LIMIT N to cut the result down to N rows improves efficiency, but the data is not random enough: it is concentrated in the blocks at the head of the data file:
postgres=# select ctid,* from tbl TABLESAMPLE BERNOULLI (1) limit 10;
ctid | id | loc | beginid | endid
---------+------+------+-----------+-----------
(1,6) | 76 | 2006 | 100034889 | 100035096
(3,8) | 218 | 3280 | 100105401 | 100105571
(4,19) | 299 | 708 | 100145769 | 100146449
(6,25) | 445 | 3195 | 100220431 | 100221192
(7,3) | 493 | 1867 | 100247252 | 100248048
(9,5) | 635 | 2125 | 100318350 | 100319087
(10,51) | 751 | 1151 | 100374936 | 100375883
(12,20) | 860 | 2302 | 100430532 | 100430674
(12,33) | 873 | 15 | 100438908 | 100439548
(17,15) | 1205 | 2540 | 100607913 | 100608198
(10 rows)
Time: 0.387 ms
postgres=# select ctid,* from tbl TABLESAMPLE BERNOULLI (1) limit 10;
ctid | id | loc | beginid | endid
---------+-----+------+-----------+-----------
(1,18) | 88 | 1195 | 100038559 | 100039159
(2,20) | 160 | 4137 | 100076107 | 100076919
(4,62) | 342 | 2480 | 100169826 | 100170454
(6,2) | 422 | 392 | 100209331 | 100209964
(6,4) | 424 | 4239 | 100210066 | 100211053
(8,52) | 612 | 3086 | 100304147 | 100304853
(10,9) | 709 | 2137 | 100354108 | 100354314
(10,45) | 745 | 1808 | 100371819 | 100372439
(13,19) | 929 | 3888 | 100471124 | 100471669
(13,27) | 937 | 609 | 100476280 | 100476579
(10 rows)
Time: 0.301 ms
postgres=# select ctid,* from tbl TABLESAMPLE BERNOULLI (1) limit 10;
ctid | id | loc | beginid | endid
---------+------+------+-----------+-----------
(0,38) | 38 | 486 | 100015564 | 100015931
(0,51) | 51 | 1244 | 100022368 | 100022860
(1,29) | 99 | 4326 | 100044116 | 100044999
(1,63) | 133 | 449 | 100061611 | 100062197
(4,10) | 290 | 755 | 100141357 | 100142271
(5,13) | 363 | 960 | 100178662 | 100179447
(5,64) | 414 | 3800 | 100204948 | 100205770
(11,22) | 792 | 831 | 100395860 | 100396113
(14,17) | 997 | 2487 | 100504353 | 100504505
(16,41) | 1161 | 4053 | 100587279 | 100587984
(10 rows)
Time: 0.507 ms
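Why LIMIT biases a BERNOULLI sample toward the head of the table: rows are scanned in physical order and each is emitted with probability N%, so LIMIT stops after the first few hits. A minimal Python simulation (not from the original article; the table size and percentage are illustrative):

```python
import random

random.seed(42)

def bernoulli_limit(n_rows, pct, limit):
    """Scan rows in physical order; keep each with probability pct and
    stop after `limit` hits -- mirrors BERNOULLI sampling plus LIMIT."""
    hits = []
    for rowid in range(n_rows):
        if random.random() < pct:
            hits.append(rowid)
            if len(hits) == limit:
                break
    return hits

sample = bernoulli_limit(n_rows=1_000_000, pct=0.01, limit=10)
# With p = 1%, the 10th hit is expected around row 1000 of 1,000,000,
# so the sample never leaves the head of the table.
print(sample)
```

This matches the ctid pattern above: every run returns rows from the first few blocks only.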
2. SYSTEM sampling: samples whole blocks, returning the rows of N% of the table's blocks. It is much more efficient than BERNOULLI sampling, but every record in each sampled block is returned.
With LIMIT N, the data is again not random enough: the N rows all come from the first sampled block(s):
postgres=# select ctid,* from tbl TABLESAMPLE system (5) limit 10;
ctid | id | loc | beginid | endid
---------+-----+------+-----------+-----------
(10,1) | 701 | 2675 | 100348960 | 100349937
(10,2) | 702 | 4307 | 100349937 | 100350353
(10,3) | 703 | 475 | 100350353 | 100351093
(10,4) | 704 | 1611 | 100351093 | 100351171
(10,5) | 705 | 4307 | 100351171 | 100351692
(10,6) | 706 | 2841 | 100351692 | 100352448
(10,7) | 707 | 3680 | 100352448 | 100353372
(10,8) | 708 | 1085 | 100353372 | 100354108
(10,9) | 709 | 2137 | 100354108 | 100354314
(10,10) | 710 | 3381 | 100354314 | 100354905
(10 rows)
Time: 0.547 ms
postgres=# select ctid,* from tbl TABLESAMPLE system (5) limit 10;
ctid | id | loc | beginid | endid
---------+------+------+-----------+-----------
(30,1) | 2101 | 2279 | 101047339 | 101047873
(30,2) | 2102 | 4113 | 101047873 | 101048273
(30,3) | 2103 | 47 | 101048273 | 101048691
(30,4) | 2104 | 3766 | 101048691 | 101049398
(30,5) | 2105 | 4133 | 101049398 | 101049842
(30,6) | 2106 | 2915 | 101049842 | 101050766
(30,7) | 2107 | 2489 | 101050766 | 101050799
(30,8) | 2108 | 4325 | 101050799 | 101051401
(30,9) | 2109 | 2204 | 101051401 | 101052374
(30,10) | 2110 | 1397 | 101052374 | 101052726
(10 rows)
Time: 0.457 ms
postgres=# select ctid,* from tbl TABLESAMPLE system (5) limit 10;
ctid | id | loc | beginid | endid
--------+----+------+-----------+-----------
(0,1) | 1 | 234 | 100000000 | 100000295
(0,2) | 2 | 1153 | 100000295 | 100000793
(0,3) | 3 | 4127 | 100000793 | 100001203
(0,4) | 4 | 3287 | 100001203 | 100001450
(0,5) | 5 | 2156 | 100001450 | 100002286
(0,6) | 6 | 1097 | 100002286 | 100002351
(0,7) | 7 | 2302 | 100002351 | 100002625
(0,8) | 8 | 1768 | 100002625 | 100003382
(0,9) | 9 | 39 | 100003382 | 100003630
(0,10) | 10 | 887 | 100003630 | 100004209
(10 rows)
Time: 0.427 ms
postgres=# select ctid,* from tbl TABLESAMPLE system (5) limit 10;
ctid | id | loc | beginid | endid
---------+------+------+-----------+-----------
(43,1) | 3011 | 4174 | 101493150 | 101493628
(43,2) | 3012 | 798 | 101493628 | 101494207
(43,3) | 3013 | 1334 | 101494207 | 101495189
(43,4) | 3014 | 4870 | 101495189 | 101495937
(43,5) | 3015 | 1920 | 101495937 | 101496065
(43,6) | 3016 | 2561 | 101496065 | 101496782
(43,7) | 3017 | 1617 | 101496782 | 101497469
(43,8) | 3018 | 3226 | 101497469 | 101497864
(43,9) | 3019 | 4582 | 101497864 | 101498854
(43,10) | 3020 | 2110 | 101498854 | 101499388
(10 rows)
Time: 1.528 ms
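The behavior above can be sketched in Python: SYSTEM sampling decides per block, so a LIMIT is served entirely from the first sampled block(s). The block and row counts below are illustrative assumptions, not taken from the test table:

```python
import random

random.seed(1)

def system_sample(n_blocks, rows_per_block, pct):
    """Pick each block with probability pct and return every row in it --
    mirrors TABLESAMPLE SYSTEM."""
    rows = []
    for blk in range(n_blocks):
        if random.random() < pct:
            rows.extend((blk, line) for line in range(1, rows_per_block + 1))
    return rows

rows = system_sample(n_blocks=200, rows_per_block=70, pct=0.05)
# LIMIT 10 would take the first 10 rows, all from one sampled block:
first_ten = rows[:10]
print(first_ten)
```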
3. SYSTEM sampling + a random-probability filter solves the problem that the records returned by SYSTEM sampling alone are not discrete enough.
Performance is also good, and the advantage grows with table size.
The dispersion is controlled by the random-probability condition, and the returned results are highly discrete:
postgres=# select ctid,* from tbl TABLESAMPLE system (5) where random()<0.01 limit 10;
ctid | id | loc | beginid | endid
----------+-------+------+-----------+-----------
(3,19) | 229 | 2106 | 100110347 | 100110461
(3,45) | 255 | 4334 | 100124594 | 100125503
(149,43) | 10473 | 4125 | 105216168 | 105216240
(160,12) | 11212 | 787 | 105591453 | 105591718
(160,36) | 11236 | 106 | 105600343 | 105600824
(160,47) | 11247 | 4053 | 105606511 | 105607286
(203,48) | 14258 | 3078 | 107103358 | 107104264
(278,29) | 19489 | 3268 | 109723508 | 109723920
(286,10) | 20030 | 1048 | 109999654 | 110000489
(315,31) | 22081 | 2826 | 111013999 | 111014368
(10 rows)
Time: 2.837 ms
postgres=# select ctid,* from tbl TABLESAMPLE system (5) where random()<0.01 limit 10;
ctid | id | loc | beginid | endid
----------+-------+------+-----------+-----------
(69,26) | 4856 | 7 | 102417440 | 102417597
(71,65) | 5035 | 1092 | 102510582 | 102511527
(77,39) | 5429 | 1120 | 102704924 | 102705238
(152,31) | 10671 | 1466 | 105323336 | 105324170
(176,27) | 12347 | 4757 | 106150494 | 106151243
(176,59) | 12379 | 1195 | 106167151 | 106167416
(206,70) | 14490 | 3052 | 107221746 | 107222087
(229,46) | 16076 | 463 | 107995891 | 107996698
(242,29) | 16969 | 1865 | 108461627 | 108461736
(242,62) | 17002 | 4863 | 108477461 | 108478270
(10 rows)
Time: 2.667 ms
postgres=# select ctid,* from tbl TABLESAMPLE system (5) where random()<0.01 limit 10;
ctid | id | loc | beginid | endid
----------+-------+------+-----------+-----------
(1,54) | 124 | 1346 | 100056875 | 100057832
(9,69) | 699 | 2227 | 100347977 | 100348861
(14,54) | 1034 | 3879 | 100520998 | 100521176
(22,60) | 1600 | 2226 | 100802698 | 100803154
(28,30) | 1990 | 1133 | 100996297 | 100996580
(28,57) | 2017 | 189 | 101009211 | 101009952
(28,69) | 2029 | 2109 | 101015988 | 101016238
(106,1) | 7421 | 2469 | 103702081 | 103702435
(130,67) | 9167 | 1801 | 104582977 | 104583754
(149,20) | 10450 | 4126 | 105204544 | 105204560
(10 rows)
Time: 3.157 ms
postgres=# explain select ctid,* from tbl TABLESAMPLE system (5) where random()<0.01 limit 10;
QUERY PLAN
--------------------------------------------------------------------
Limit (cost=0.00..0.92 rows=10 width=90)
-> Sample Scan on tbl (cost=0.00..1535.40 rows=16667 width=90)
Sampling: system ('5'::real)
Filter: (random() < '0.01'::double precision)
(4 rows)
Time: 0.858 ms
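As a rough model of why this combination disperses well: block sampling keeps I/O low, and the per-row random() filter thins each sampled block so that the first LIMIT survivors span many different blocks. A hedged Python sketch with illustrative parameters:

```python
import random

random.seed(7)

def system_sample_filtered(n_blocks, rows_per_block, block_pct, row_prob, limit):
    """TABLESAMPLE SYSTEM (block_pct) plus WHERE random() < row_prob:
    whole blocks are sampled, then each row is kept independently,
    so the first `limit` survivors come from many different blocks."""
    out = []
    for blk in range(n_blocks):
        if random.random() < block_pct:
            for line in range(1, rows_per_block + 1):
                if random.random() < row_prob:
                    out.append((blk, line))
                    if len(out) == limit:
                        return out
    return out

picked = system_sample_filtered(10_000, 70, 0.05, 0.01, 10)
blocks_hit = {blk for blk, _ in picked}
print(picked, len(blocks_hit))
```

With ~0.7 surviving rows per sampled block on average, 10 returned rows typically land in close to 10 distinct blocks, matching the scattered ctids in the transcripts above.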
4. Besides the built-in sampling methods, PostgreSQL supports extension-provided sampling methods: tsm_system_time samples by time (scanning for at most the specified number of milliseconds and returning the records scanned within that window), and tsm_system_rows samples by row count (for example, returning at most 100 rows).
See also: PostgreSQL Random Sampling Application - tablesample, tsm_system_rows, tsm_system_time
-- Up to 5 milliseconds
postgres=# select ctid,* from tbl TABLESAMPLE system_time (5) where random()<0.02 limit 10;
ctid | id | loc | beginid | endid
------------+--------+------+-----------+-----------
(13943,32) | 976042 | 786 | 587732405 | 587732831
(5014,17) | 350997 | 4515 | 275520535 | 275520831
(10371,58) | 726028 | 3201 | 462659708 | 462660690
(10371,67) | 726037 | 2160 | 462664042 | 462664194
(1442,8) | 100948 | 1134 | 150431878 | 150432150
(1442,22) | 100962 | 2839 | 150436680 | 150437122
(1442,36) | 100976 | 3027 | 150444363 | 150444375
(6799,50) | 475980 | 4172 | 337780753 | 337781376
(6799,51) | 475981 | 1971 | 337781376 | 337781830
(12156,31) | 850951 | 3287 | 525181282 | 525182019
(10 rows)
Time: 0.908 ms
postgres=# explain select ctid,* from tbl TABLESAMPLE system_time (5) where random()<0.02 limit 10;
QUERY PLAN
----------------------------------------------------------------
Limit (cost=0.00..0.92 rows=10 width=90)
-> Sample Scan on tbl (cost=0.00..10.75 rows=117 width=90)
Sampling: system_time ('5'::double precision)
Filter: (random() < '0.02'::double precision)
(4 rows)
Time: 0.697 ms
-- Up to 1000 rows
-- 1000 rows * probability 0.01 = 10 expected rows, so limit 10 can be omitted.
postgres=# select ctid,* from tbl TABLESAMPLE system_rows(1000) where random()<0.01 ;
ctid | id | loc | beginid | endid
-----------+--------+------+-----------+-----------
(8522,48) | 596588 | 2148 | 397979059 | 397980022
(8454,21) | 591801 | 670 | 395580178 | 395581156
(8454,70) | 591850 | 1441 | 395608906 | 395609174
(1209,13) | 84643 | 3924 | 142242550 | 142243369
(8318,36) | 582296 | 4625 | 390811869 | 390811932
(1141,5) | 79875 | 2093 | 139858769 | 139859220
(1141,70) | 79940 | 4611 | 139892888 | 139893837
(1005,14) | 70364 | 2328 | 135109401 | 135109605
(1005,23) | 70373 | 2382 | 135113320 | 135113475
(9 rows)
Time: 2.370 ms
postgres=# select count(*) from tbl;
count
---------
1000000
(1 row)
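The arithmetic behind omitting LIMIT here, as a quick check (values taken from the query above; the actual count fluctuates around the expectation, which is why the run above returned 9 rows):

```python
# system_rows(1000) caps the scan at 1000 rows, and random() < 0.01
# keeps each of them with probability 1%, so the expected result size is:
sampled_rows = 1000
keep_prob = 0.01
expected = sampled_rows * keep_prob
print(expected)  # 10.0
```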
Performance comparison: randomly sampling 10 rows from 1 million records
IT-C02YW2EFLVDL:~ digoal$ pgbench -M prepared -n -r -P 1 -f ./t1.sql -c 12 -j 12 -T 10
progress: 1.0 s, 11.9 tps, lat 685.136 ms stddev 65.024
progress: 2.0 s, 13.1 tps, lat 669.947 ms stddev 18.605
progress: 3.0 s, 22.9 tps, lat 708.591 ms stddev 43.357
progress: 4.0 s, 13.0 tps, lat 641.470 ms stddev 24.108
progress: 5.0 s, 22.9 tps, lat 676.400 ms stddev 37.549
progress: 6.0 s, 13.0 tps, lat 689.115 ms stddev 17.644
progress: 7.0 s, 18.9 tps, lat 767.281 ms stddev 22.536
progress: 8.0 s, 15.0 tps, lat 819.248 ms stddev 39.973
progress: 9.0 s, 13.0 tps, lat 856.244 ms stddev 29.988
progress: 10.0 s, 12.0 tps, lat 829.066 ms stddev 22.431
pgbench (PostgreSQL) 14.0
transaction type: ./t1.sql
scaling factor: 1
query mode: prepared
number of clients: 12
number of threads: 12
duration: 10 s
number of transactions actually processed: 168
latency average = 749.340 ms
latency stddev = 101.206 ms
initial connection time = 11.026 ms
tps = 15.804188 (without initial connection time)
statement latencies in milliseconds:
749.340 select * from tbl order by random() limit 10;
IT-C02YW2EFLVDL:~ digoal$ pgbench -M prepared -n -r -P 1 -f ./t2.sql -c 12 -j 12 -T 10
progress: 1.0 s, 32119.4 tps, lat 0.367 ms stddev 0.932
progress: 2.0 s, 31335.6 tps, lat 0.384 ms stddev 1.286
progress: 3.0 s, 31556.0 tps, lat 0.376 ms stddev 1.679
progress: 4.0 s, 30996.3 tps, lat 0.390 ms stddev 1.413
progress: 5.0 s, 30622.0 tps, lat 0.390 ms stddev 1.006
progress: 6.0 s, 30778.7 tps, lat 0.384 ms stddev 2.160
progress: 7.0 s, 29766.9 tps, lat 0.405 ms stddev 2.657
progress: 8.0 s, 27425.5 tps, lat 0.443 ms stddev 2.396
progress: 9.0 s, 25860.8 tps, lat 0.462 ms stddev 3.315
progress: 10.0 s, 25695.2 tps, lat 0.462 ms stddev 2.708
pgbench (PostgreSQL) 14.0
transaction type: ./t2.sql
scaling factor: 1
query mode: prepared
number of clients: 12
number of threads: 12
duration: 10 s
number of transactions actually processed: 296261
latency average = 0.404 ms
latency stddev = 2.059 ms
initial connection time = 13.047 ms
tps = 29641.685598 (without initial connection time)
statement latencies in milliseconds:
0.406 select ctid,* from tbl TABLESAMPLE SYSTEM (5) where random()<0.01 limit 10;
IT-C02YW2EFLVDL:~ digoal$ pgbench -M prepared -n -r -P 1 -f ./t3.sql -c 12 -j 12 -T 10
progress: 1.0 s, 28849.9 tps, lat 0.407 ms stddev 1.055
progress: 2.0 s, 28843.1 tps, lat 0.418 ms stddev 1.950
progress: 3.0 s, 27394.0 tps, lat 0.433 ms stddev 2.971
progress: 4.0 s, 28447.4 tps, lat 0.422 ms stddev 2.686
progress: 5.0 s, 26309.0 tps, lat 0.458 ms stddev 2.987
progress: 6.0 s, 29036.4 tps, lat 0.413 ms stddev 0.963
progress: 7.0 s, 29986.4 tps, lat 0.402 ms stddev 1.382
progress: 8.0 s, 29973.1 tps, lat 0.400 ms stddev 1.649
progress: 9.0 s, 25595.4 tps, lat 0.468 ms stddev 2.718
progress: 10.0 s, 26574.2 tps, lat 0.451 ms stddev 3.553
pgbench (PostgreSQL) 14.0
transaction type: ./t3.sql
scaling factor: 1
query mode: prepared
number of clients: 12
number of threads: 12
duration: 10 s
number of transactions actually processed: 281140
latency average = 0.426 ms
latency stddev = 2.321 ms
initial connection time = 15.631 ms
tps = 28142.235104 (without initial connection time)
statement latencies in milliseconds:
0.429 select ctid,* from tbl TABLESAMPLE SYSTEM_ROWS (1000) where random()<0.01;
IT-C02YW2EFLVDL:~ digoal$ pgbench -M prepared -n -r -P 1 -f ./t4.sql -c 12 -j 12 -T 10
progress: 1.0 s, 26988.7 tps, lat 0.439 ms stddev 1.327
progress: 2.0 s, 24800.7 tps, lat 0.472 ms stddev 2.408
progress: 3.0 s, 25742.0 tps, lat 0.470 ms stddev 3.191
progress: 4.0 s, 29388.5 tps, lat 0.414 ms stddev 2.189
progress: 5.0 s, 29199.8 tps, lat 0.404 ms stddev 0.590
progress: 6.0 s, 25906.5 tps, lat 0.470 ms stddev 3.301
progress: 7.0 s, 29297.8 tps, lat 0.409 ms stddev 1.143
progress: 8.0 s, 29572.6 tps, lat 0.405 ms stddev 0.805
progress: 9.0 s, 26367.6 tps, lat 0.443 ms stddev 2.976
progress: 10.0 s, 25389.1 tps, lat 0.475 ms stddev 3.652
pgbench (PostgreSQL) 14.0
transaction type: ./t4.sql
scaling factor: 1
query mode: prepared
number of clients: 12
number of threads: 12
duration: 10 s
number of transactions actually processed: 272779
latency average = 0.439 ms
latency stddev = 2.379 ms
initial connection time = 13.144 ms
tps = 27255.103831 (without initial connection time)
statement latencies in milliseconds:
0.441 select ctid,* from tbl TABLESAMPLE SYSTEM_time (5) where random()<0.01 limit 10;
Now consider a table of 10 million rows in which only 10,000 rows meet the condition. With block-level random sampling, many blocks may have to be scanned before enough matching rows are found.
Method 1: the traditional ORDER BY random() approach
create table t (id int primary key, c1 int, info text, crt_time timestamp);
insert into t select generate_series(1,10000000), random()*999, md5(random()::text), clock_timestamp();
create index idx_t_1 on t (c1);
postgres=# explain select * from t where c1=0 limit 10;
QUERY PLAN
-------------------------------------------------------------------------------
Limit (cost=0.43..11.11 rows=10 width=49)
-> Index Scan using idx_t_1 on t (cost=0.43..10623.39 rows=9947 width=49)
Index Cond: (c1 = 0)
(3 rows)
postgres=# explain select * from t where c1=0 order by random() limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------
Limit (cost=10863.21..10863.23 rows=10 width=57)
-> Sort (cost=10863.21..10888.07 rows=9947 width=57)
Sort Key: (random())
-> Index Scan using idx_t_1 on t (cost=0.43..10648.25 rows=9947 width=57)
Index Cond: (c1 = 0)
(5 rows)
IT-C02YW2EFLVDL:~ digoal$ pgbench -M prepared -n -r -P 1 -f ./test.sql -c 12 -j 12 -T 10
progress: 1.0 s, 1165.9 tps, lat 9.960 ms stddev 14.434
progress: 2.0 s, 1298.1 tps, lat 9.319 ms stddev 12.472
progress: 3.0 s, 1313.8 tps, lat 9.137 ms stddev 11.276
progress: 4.0 s, 1161.6 tps, lat 10.068 ms stddev 13.447
progress: 5.0 s, 972.2 tps, lat 12.399 ms stddev 20.388
progress: 6.0 s, 999.8 tps, lat 12.249 ms stddev 17.811
progress: 7.0 s, 1153.3 tps, lat 10.366 ms stddev 13.479
progress: 8.0 s, 1180.6 tps, lat 9.992 ms stddev 12.386
progress: 9.0 s, 1029.6 tps, lat 11.916 ms stddev 17.798
progress: 10.0 s, 966.5 tps, lat 12.252 ms stddev 17.236
pgbench (PostgreSQL) 14.0
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 12
number of threads: 12
duration: 10 s
number of transactions actually processed: 11257
latency average = 10.658 ms
latency stddev = 15.150 ms
initial connection time = 13.008 ms
tps = 1124.835901 (without initial connection time)
statement latencies in milliseconds:
11.003 select * from t where c1=0 order by random() limit 10;
Method 2: use the index to find the blocks that contain matching rows (this step is the most time-consuming), pick ten of those blocks at random, and return ten random records from the rows in those ten blocks.
select t.* from t where exists
( select 1 from
(select blkid from (select substring(ctid::text,'(\d*),') blkid from t where c1=0 group by 1) t -- obtain all block ids that contain matching rows
order by random() limit 10) t1 -- pick 10 random block ids
where t.ctid>=('('||t1.blkid||',0)')::tid -- lower bound of each block's tid range
and t.ctid< ('('||t1.blkid::int+1||',0)')::tid -- upper bound of each block's tid range
and t.c1=0 ) -- keep only rows with c1 = 0
order by random()
limit 10; -- return 10 rows
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=26432.50..26432.52 rows=10 width=57)
-> Sort (cost=26432.50..26435.26 rows=1106 width=57)
Sort Key: (random())
-> Nested Loop Semi Join (cost=11246.98..26408.60 rows=1106 width=57)
Join Filter: ((t.ctid >= ((('('::text || t1.blkid) || ',0)'::text))::tid) AND (t.ctid < ((('('::text || (((t1.blkid)::integer + 1))::text) || ',0)'::text))::tid))
-> Index Scan using idx_t_1 on t (cost=0.43..10633.09 rows=9957 width=55)
Index Cond: (c1 = 0)
-> Materialize (cost=11246.54..11246.72 rows=10 width=32)
-> Subquery Scan on t1 (cost=11246.54..11246.67 rows=10 width=32)
-> Limit (cost=11246.54..11246.57 rows=10 width=40)
-> Sort (cost=11246.54..11271.43 rows=9957 width=40)
Sort Key: (random())
-> Subquery Scan on t_1 (cost=10732.66..11031.37 rows=9957 width=40)
-> HashAggregate (cost=10732.66..10906.91 rows=9957 width=32)
Group Key: "substring"((t_2.ctid)::text, '(\d*),'::text)
-> Index Scan using idx_t_1 on t t_2 (cost=0.43..10707.77 rows=9957 width=32)
Index Cond: (c1 = 0)
(17 rows)
Time: 0.707 ms
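The block-selection logic of Method 2 can be sketched outside SQL. The ctid values below are hypothetical stand-ins for the output of the substring() subquery; the sketch shows extracting block numbers from ctid text and building the tid ranges [(B,0), (B+1,0)) that the query scans:

```python
import random
import re

random.seed(3)

# Hypothetical ctids of rows matching c1 = 0, standing in for:
#   select substring(ctid::text,'(\d*),') from t where c1 = 0
matching_ctids = ["(17,4)", "(17,9)", "(230,1)", "(981,55)", "(981,56)", "(4410,2)"]

def blkid(ctid):
    """Extract the block number from a ctid string like '(17,4)'."""
    return int(re.match(r"\((\d+),", ctid).group(1))

# Deduplicate block ids (the GROUP BY 1 step), then sample some at random.
blocks = sorted({blkid(c) for c in matching_ctids})
chosen = random.sample(blocks, k=min(3, len(blocks)))

# For each chosen block B, the SQL scans the tid range [(B,0), (B+1,0)):
ranges = [(f"({b},0)", f"({b + 1},0)") for b in chosen]
print(blocks, ranges)
```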
IT-C02YW2EFLVDL:~ digoal$ pgbench -M prepared -n -r -P 1 -f ./test.sql -c 12 -j 12 -T 10
progress: 1.0 s, 113.9 tps, lat 99.518 ms stddev 20.071
progress: 2.0 s, 125.1 tps, lat 95.576 ms stddev 10.811
progress: 3.0 s, 117.7 tps, lat 101.022 ms stddev 14.942
progress: 4.0 s, 95.2 tps, lat 124.592 ms stddev 26.282
progress: 5.0 s, 93.8 tps, lat 128.335 ms stddev 15.483
progress: 6.0 s, 95.9 tps, lat 128.530 ms stddev 14.861
progress: 7.0 s, 101.4 tps, lat 115.433 ms stddev 15.376
progress: 8.0 s, 116.0 tps, lat 104.154 ms stddev 10.846
progress: 9.0 s, 112.7 tps, lat 105.956 ms stddev 11.932
progress: 10.0 s, 110.3 tps, lat 107.870 ms stddev 11.686
pgbench (PostgreSQL) 14.0
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 12
number of threads: 12
duration: 10 s
number of transactions actually processed: 1095
latency average = 109.931 ms
latency stddev = 19.405 ms
initial connection time = 10.773 ms
tps = 108.826246 (without initial connection time)
statement latencies in milliseconds:
109.965 select t.* from t where exists
Alternatively, the matching rows can be materialized periodically at the business level: either cache them in the application tier and return random entries from there, or store them in the database as a staging table whose content is refreshed regularly.
postgres=# create unlogged table tmp_t_c1_1 (like t);
CREATE TABLE
Time: 6.524 ms
postgres=# insert into tmp_t_c1_1 select * from t where c1=1;
INSERT 0 10001
Time: 51.261 ms
postgres=# select * from tmp_t_c1_1 tablesample system_rows(1000) where random()<0.01 ;
id | c1 | info | crt_time
---------+----+----------------------------------+----------------------------
7869277 | 1 | b4483c7728566a238de08f77eba12774 | 2021-05-27 16:19:32.566097
7881151 | 1 | 49af1161c2e3a779328b32ddd058c7f0 | 2021-05-27 16:19:32.588799
1226520 | 1 | 46981759668b3a9acb953ab8b279f4ea | 2021-05-27 16:19:16.06893
1244566 | 1 | db248939b6c6647b25b1a32584747be8 | 2021-05-27 16:19:16.10505
1251111 | 1 | 9096bc3b83f80a20342b47c483c66b3b | 2021-05-27 16:19:16.118112
4677752 | 1 | 97dbe82b4f6476ee50d6a37775fb3e6f | 2021-05-27 16:19:24.337757
7984355 | 1 | 4f996ed3be9fb0fbd26b0444f44681d6 | 2021-05-27 16:19:32.817659
7996207 | 1 | bd4d80164abf2b17153e6b10fc871725 | 2021-05-27 16:19:32.840801
8020673 | 1 | 95765bac6bf7952bc59a6bc8d5db24f8 | 2021-05-27 16:19:32.887891
1322032 | 1 | 0f769db9cb6c5a920dba9ffd364b0a7d | 2021-05-27 16:19:16.285266
4734787 | 1 | 75f8b9e9ce72e7f5733efeaf0ed40215 | 2021-05-27 16:19:24.449741
8128961 | 1 | c356f3d289459dbe099769c8bd786e2c | 2021-05-27 16:19:33.124422
1492978 | 1 | 0619d948880cf16f7ce0c8e405525621 | 2021-05-27 16:19:16.678358
(13 rows)
Time: 4.226 ms
postgres=# select * from tmp_t_c1_1 tablesample system_rows(1000) where random()<0.01 ;
id | c1 | info | crt_time
---------+----+----------------------------------+----------------------------
3659031 | 1 | edb2c2ee58aa530fbe3ec8199c31914c | 2021-05-27 16:19:21.990993
3705687 | 1 | 47dd5dce35ffdf84792a5c64f573f5cb | 2021-05-27 16:19:22.086795
2578458 | 1 | bc97c41db6329221ae30fe604eb222a2 | 2021-05-27 16:19:19.382254
1518045 | 1 | 78f90fa17050b2eacd3222c9948606dd | 2021-05-27 16:19:16.727858
1553043 | 1 | 48d2363be6446e266d074c6e98a856d5 | 2021-05-27 16:19:16.796445
1587461 | 1 | 9329578a7b41b36b4919e4c25b9e1c31 | 2021-05-27 16:19:16.890558
410551 | 1 | 6184732511cfad0b05c75c1865a9e920 | 2021-05-27 16:19:13.974389
9438915 | 1 | cb0fc9af4da9259c3f32ec37180a6774 | 2021-05-27 16:19:36.463244
9501777 | 1 | b82cb22ba0e7620b21cb6e9219d6f425 | 2021-05-27 16:19:36.620395
7345643 | 1 | 9820565758529cfb752135cc5e523adf | 2021-05-27 16:19:31.304132
6273533 | 1 | a6cdf268caeeeeda3d766d2370e84655 | 2021-05-27 16:19:28.488774
6323667 | 1 | b000452c87b882ff00e4aff2abe34cc9 | 2021-05-27 16:19:28.613903
6346355 | 1 | 6810c56b96bdcaf2337ce23797b9d118 | 2021-05-27 16:19:28.669933
5240569 | 1 | 2e33eaf8ea4e12cf8e3bcae6022c48c3 | 2021-05-27 16:19:25.660275
5243275 | 1 | 6bcdd1dea42daf07302310c0edbfdaa7 | 2021-05-27 16:19:25.666971
4184755 | 1 | 8c04c876d1fe13b7754ceb8eb4782969 | 2021-05-27 16:19:23.177027
(16 rows)
Time: 0.732 ms
The underlying problem is randomly obtaining N records under a highly selective condition. Changing the kernel's index structure may solve it more effectively: today, all matching data must be scanned sequentially through the index's linked leaf pages before randomness can even be applied.