×
Community Blog Analyzing 1TB Table from Any Dimensions in Seconds with RDS PostgreSQL

Analyzing 1TB Table from Any Dimensions in Seconds with RDS PostgreSQL

With ApsaraDB for RDS PostrgreSQL, you can analyze 1TB tables from any dimensions in seconds, satisfying the requirements of statistical analysis.

By Digoal

Value estimation is a common tool in statistics. Because of the large data size, it takes a huge amount of resources to obtain accurate values. However, statistical analysis does not require completely accurate data. Therefore, value estimation is often a compromise and is widely used in statistical analysis scenarios.

PostgreSQL is a powerful database that provides a number of methods for estimated value statistics. In PostgreSQL, we can use the HLL plug-in to determine the estimated UV and incremental UV (using the COUNT DISTINCT function). You can also use a count-min sketch top-n extension https://github.com/remzicanaksoy/cms_topn

We can determine the TOP VALUEs (including the TOP elements of an array field) and COUNT of any fields by using the statistical information bar chart. We can use pg_class.reltuples to determine the data record count of an entire table.

We can use the estimated value of EXPLAIN to determine the number of returned records (for example, when determining pagination) or to determine the estimated value of COUNT (*) (simply convert the SQL statement to select 1 from ...).

When determining the number of unique values for multiple fields, we can obtain very accurate estimates by defining custom statistics. When determining the estimated values that meet certain conditions, for example when we are required to list TOP N movie stars of a province, we can take samples first and then compute using the samples.

This article describes how to take data samples and estimate values.

Scenario Design

I have previously written about a scenario for pivot analysis of a pan-content website, which involves a large amount of data. The amount of data that needs to be fully scanned is relatively large. Let's see whether the sampling method meets the requirements.

  1. Table structure
    create table tbl (  
      id int8,  -- Sequence    
      tag1 int[],   -- Array  
      c1 int,       -- 1–100  
      c2 int,       -- 1–10000  
      c3 timestamp   -- Timestamp  
    );  

  2. Create a function to generate random values
    Value range $1-$2, taking an array of $3 random values  
      
    create or replace function gen_rand_ints(int, int, int) returns int[] as $$  
      select array(select (random()*($2-$1))::int+$1 from generate_series(1,$3));  
    $$ language sql strict;  
      
    postgres=# select gen_rand_ints(10,25,5);  
      gen_rand_ints     
    ------------------  
     {20,19,24,22,21}  
    (1 row)  

  3. Write test data
    -- Write a hot spot array of 50 million values  
    insert into tbl select id, gen_rand_ints(1,1000,10), random()*100, random()*10000, clock_timestamp() from generate_series(1,50000000) t(id);  
      
    -- Write non-hot spot arrays of 100 million values  
    insert into tbl select id, gen_rand_ints(1,1000000,10), random()*100, random()*10000, clock_timestamp() from generate_series(1,100000000) t(id);  

    The data model is as follows

    postgres=# select * from tbl limit 10;  
        id    |                   tag1                    | c1 |  c2  |             c3               
    ----------+-------------------------------------------+----+------+----------------------------  
     38931521 | {424,448,91,420,382,657,677,60,530,503}   | 59 | 6120 | 2017-09-11 14:32:06.610512  
     38931522 | {66,87,468,207,79,780,307,714,520,149}    | 44 | 7848 | 2017-09-11 14:32:06.610522  
     38931523 | {99,628,798,558,415,74,863,839,522,953}   | 26 | 9032 | 2017-09-11 14:32:06.610531  
     38931524 | {610,935,962,140,438,551,752,503,636,220} | 71 | 7136 | 2017-09-11 14:32:06.61054  
     38931525 | {998,16,428,518,164,868,303,263,496,102}  | 82 | 9102 | 2017-09-11 14:32:06.61055  
     38931526 | {175,683,749,696,637,8,599,247,942,561}   | 39 | 3796 | 2017-09-11 14:32:06.610559  
     38931527 | {112,138,882,747,356,591,461,355,605,888} | 87 | 7684 | 2017-09-11 14:32:06.610568  
     38931528 | {756,175,31,252,276,850,162,450,533,910}  | 15 | 1691 | 2017-09-11 14:32:06.610578  
     38931529 | {917,744,416,860,306,801,240,416,937,122} | 16 | 2927 | 2017-09-11 14:32:06.610587  
     38931530 | {712,623,647,317,511,519,86,267,693,116}  | 52 | 9676 | 2017-09-11 14:32:06.610596  
    (10 rows)  

    Determine the TOP N elements of tag 1 under any conditions.

  4. Analyze the table to generate a bar chart.
    postgres=# analyze tbl;  
    ANALYZE  

    Table size 16 GB.

    postgres=# \dt+ tbl  
                       List of relations  
     Schema | Name | Type  |  Owner   | Size  | Description   
    --------+------+-------+----------+-------+-------------  
     public | tbl  | table | postgres | 16 GB |   
    (1 row)  

  5. Accurately determine the TOP N elements under certain conditions. There are 1000 hot spot IDs, so the COUNT result of returning the TOP 10 elements is very similar. The TOP 10 elements obtained may not be so accurate when using value estimation, but they are surely among the 1000 IDs.
    -- Query time with 32 parallel queries  
      
    postgres=# select unnest(tag1) tag1, count(*) from tbl where c1 between 1 and 10 group by 1 order by 2 desc limit 10;  
     tag1 | count   
    ------+-------  
      134 | 50935  
      768 | 50915  
      663 | 50876  
      567 | 50821  
      146 | 50821  
      332 | 50814  
      450 | 50807  
      884 | 50789  
       58 | 50781  
      605 | 50774  
    (10 rows)  
      
    Time: 23441.247 ms (00:23.441)  
      
    -- Query time without parallel queries  
    postgres=# select unnest(tag1) tag1, count(*) from tbl where c1 between 1 and 10 group by 1 order by 2 desc limit 10;  
     tag1 | count   
    ------+-------  
      134 | 50935  
      768 | 50915  
      663 | 50876  
      567 | 50821  
      146 | 50821  
      332 | 50814  
      450 | 50807  
      884 | 50789  
       58 | 50781  
      605 | 50774  
    (10 rows)  
      
    Time: 154935.686 ms (02:34.936)  

  6. Determine the sampling TOP N under the same conditions.

    For sampling methods, refer to the end of this article. PostgreSQL has two built-in common sampling methods, and two built-in extended sampling methods. In total, there are four built-in sampling methods.

    Use block-level sampling (currently, parallel sampling is not supported).

    postgres=# select unnest(tag1) tag1, (count(*))*20      -- Multiplied by 100/sampling coefficient  
    from   
    (select * from tbl TABLESAMPLE system (5)) t     
    where c1 between 1 and 10 group by 1 order by 2 desc limit 10;  
     tag1 | ?column?   
    ------+----------  
      724 |    53380  
      798 |    52680  
       24 |    52640  
      371 |    52480  
      569 |    52400  
      531 |    52280  
      979 |    52160  
      429 |    52140  
      980 |    52080  
      350 |    51920  
    (10 rows)  
      
    -- Sampling 5%, about 7 seconds.  
    Time: 6887.745 ms (00:06.888)   
      
    postgres=# select unnest(tag1) tag1, (count(*))*50    -- Multiplied by 100/sampling coefficient  
    from   
    (select * from tbl TABLESAMPLE system (2)) t     
    where c1 between 1 and 10 group by 1 order by 2 desc limit 10;  
     tag1 | ?column?   
    ------+----------  
      324 |    55450  
      435 |    55150  
      720 |    55050  
      943 |    54950  
      475 |    54750  
      958 |    54600  
       13 |    54400  
      742 |    54300  
      739 |    54100  
      301 |    53950  
    (10 rows)  
      
    -- Sampling 2%, about 3 seconds.  
    Time: 2720.140 ms (00:02.720)  

    A larger number of samples leads to a higher accuracy. The TOP N elements are very accurate when using the sampling method, because the example uses 1000 random values, and the probability of each random value is the same. If it is required to return TOP 1000, then it will be 100% accurate.

Large Table Example

Re-design hot spot data, write 4 billion pieces of test data in total:

Level 1 hot spot data, 1, 500 million

Level 2 hot spot data, 2–4, 500 million

Level 3 hot spot data, 5–10, 500 million

Level 4 hot spot data, 11–30, 500 million

Common data, 1–100000, 2 billion

  1. Table structure design
    create table tbl1 (  
      id int8,  -- Sequence  
      c1 int8,  -- Target field  
      c2 int8,  -- 1–100  
      c3 int8,  -- 1–100000  
      c4 timestamp  -- Timestamp  
    );  

  2. Write test data
    nohup psql -c "insert into tbl1 select id, 1, random()*100, random()*100000, clock_timestamp() from generate_series(1,500000000) t(id);" >/dev/null 2>&1 &  
    nohup psql -c "insert into tbl1 select id, random()*(4-2)+2, random()*100, random()*100000, clock_timestamp() from generate_series(1,500000000) t(id);" >/dev/null 2>&1 &  
    nohup psql -c "insert into tbl1 select id, random()*(10-5)+5, random()*100, random()*100000, clock_timestamp() from generate_series(1,500000000) t(id);" >/dev/null 2>&1 &  
    nohup psql -c "insert into tbl1 select id, random()*(30-11)+11, random()*100, random()*100000, clock_timestamp() from generate_series(1,500000000) t(id);" >/dev/null 2>&1 &  
    nohup psql -c "insert into tbl1 select id, random()*100000, random()*100, random()*100000, clock_timestamp() from generate_series(1,2000000000) t(id);" >/dev/null 2>&1 &  

  3. Analyze table
    postgres=# analyze tbl1;  
    ANALYZE  
    Time: 502.421 ms  
    Table size: 254 GB
    postgres=# \dt+ tbl1  
                        List of relations  
     Schema | Name | Type  |  Owner   |  Size  | Description   
    --------+------+-------+----------+--------+-------------  
     public | tbl1 | table | postgres | 254 GB |   
    (1 row)  

  4. Accurate TOP 30
    -- Query time with 32 parallel queries  
      
    postgres=# select c1,count(*) from tbl1 where c2 between 1 and 10 group by 1 order by 2 desc limit 30;  
     c1 |  count     
    ----+----------  
      1 | 49991259  
      3 | 25006580  
      2 | 12502559  
      4 | 12498741  
      9 | 10004285  
      6 | 10002597  
      8 |  9999530  
      7 |  9999215  
      5 |  5003219  
     10 |  4998870  
     29 |  2636193  
     18 |  2635457  
     13 |  2635344  
     17 |  2634693  
     26 |  2633965  
     19 |  2633690  
     28 |  2633526  
     14 |  2633512  
     15 |  2633363  
     24 |  2633260  
     20 |  2633014  
     25 |  2632926  
     16 |  2632779  
     22 |  2632508  
     27 |  2632288  
     23 |  2632216  
     21 |  2631443  
     12 |  2631315  
     11 |  1318483  
     30 |  1318451  
    (30 rows)  
      
    Time: 20845.738 ms (00:20.846)  
      
    -- Query time without parallel queries  
      
    postgres=# select c1,count(*) from tbl1 where c2 between 1 and 10 group by 1 order by 2 desc limit 30;  
      
     c1 |  count     
    ----+----------  
      1 | 49991259  
      3 | 25006580  
      2 | 12502559  
      4 | 12498741  
      9 | 10004285  
      6 | 10002597  
      8 |  9999530  
      7 |  9999215  
      5 |  5003219  
     10 |  4998870  
     29 |  2636193  
     18 |  2635457  
     13 |  2635344  
     17 |  2634693  
     26 |  2633965  
     19 |  2633690  
     28 |  2633526  
     14 |  2633512  
     15 |  2633363  
     24 |  2633260  
     20 |  2633014  
     25 |  2632926  
     16 |  2632779  
     22 |  2632508  
     27 |  2632288  
     23 |  2632216  
     21 |  2631443  
     12 |  2631315  
     11 |  1318483  
     30 |  1318451  
    (30 rows)  
      
    Time: 471112.827 ms (07:51.113)  

  5. Sampling TOP 30
    select c1,(count(*))*20 from   -- Multiplied by 100/sampling coefficient  
    (select * from tbl1 TABLESAMPLE system (5)) t     
    where c2 between 1 and 10 group by 1 order by 2 desc limit 30;  
      
     c1 | ?column?   
    ----+----------  
      1 | 50068840  
      3 | 25108820  
      2 | 12558680  
      4 | 12513080  
      7 | 10009300  
      9 | 10006260  
      6 | 10005400  
      8 |  9987220  
      5 |  5008280  
     10 |  5007980  
     17 |  2652940  
     16 |  2648640  
     25 |  2646800  
     28 |  2646600  
     15 |  2642480  
     20 |  2642220  
     14 |  2641620  
     26 |  2640500  
     23 |  2639420  
     29 |  2637740  
     22 |  2637320  
     13 |  2636900  
     19 |  2636100  
     18 |  2635120  
     24 |  2634440  
     12 |  2631480  
     27 |  2629880  
     21 |  2624940  
     11 |  1330140  
     30 |  1316480  
    (30 rows)  
      
    Time: 31884.725 ms (00:31.885)  
      
    -- Sampling 5%, about 32 seconds.  
      
    select c1,(count(*))*50 from   -- Multiplied by 100/sampling coefficient  
    (select * from tbl1 TABLESAMPLE system (2)) t     
    where c2 between 1 and 10 group by 1 order by 2 desc limit 30;  
      
     c1 | ?column?   
    ----+----------  
      1 | 50173200  
      3 | 24993550  
      2 | 12487100  
      4 | 12474100  
      6 |  9998250  
      8 |  9980450  
      7 |  9973950  
      9 |  9960450  
     10 |  4999050  
      5 |  4995000  
     29 |  2642700  
     28 |  2640900  
     16 |  2640300  
     26 |  2630250  
     24 |  2627500  
     23 |  2623700  
     19 |  2622350  
     27 |  2622000  
     18 |  2621200  
     12 |  2619450  
     20 |  2616200  
     17 |  2616050  
     21 |  2615800  
     15 |  2613200  
     22 |  2612200  
     14 |  2607700  
     13 |  2605900  
     25 |  2604150  
     30 |  1312300  
     11 |  1311950  
    (30 rows)  
      
    Time: 12942.455 ms (00:12.942)  
      
    -- Sampling 2%, about 13 seconds.  
      
    postgres=# select c1,(count(*))*1000 from   -- Multiplied by 100/sampling coefficient  
    (select * from tbl1 TABLESAMPLE system (0.1)) t     
    where c2 between 1 and 10 group by 1 order by 2 desc limit 30;  
     c1 | ?column?   
    ----+----------  
      1 | 48077000  
      3 | 25061000  
      2 | 12762000  
      4 | 12262000  
      8 |  9851000  
      6 |  9789000  
      7 |  9718000  
      9 |  9654000  
      5 |  4971000  
     10 |  4885000  
     18 |  2731000  
     28 |  2727000  
     29 |  2710000  
     23 |  2697000  
     15 |  2687000  
     27 |  2681000  
     22 |  2672000  
     17 |  2672000  
     25 |  2670000  
     19 |  2637000  
     20 |  2632000  
     12 |  2628000  
     14 |  2628000  
     21 |  2622000  
     26 |  2618000  
     13 |  2601000  
     24 |  2522000  
     16 |  2513000  
     11 |  1406000  
     30 |  1301000  
    (30 rows)  
      
    Time: 863.604 ms  
      
    -- Sampling 0.1%, about 0.86 seconds.  

    When the sampling one thousandth of the data (only about 254 MB of data is scanned), it takes less than one second to obtain the TOP 30 with high accuracy.

    Using this feature in Greenplum will be amazing, and it is possible to complete data pivoting for one trillion pieces of data from any dimensions within seconds.

Summary

Comparison of time-consumption for sampling and accurate query

Determine the array element TOP N

1

Determine scalar type TOP N

2

Sampling computing achieves high accuracy while consuming very little resources

Although parallel computing is also very fast, it requires more CPU and I/O resources, which directly affect the degree of parallelism. Unless there are enough resources, we still recommend that you use value estimation.

Efficiency evaluation of value estimation

Since the current value estimation methods do not support multi-core parallelism, the processing speed is about 254 MB per second. To respond within 1 second, the sampling rate should be set to 0.1% for a 254 GB table, and 0.025% for a 1 TB table. This means that value estimation from any dimensions can be achieved within seconds for TB level tables.

Sampling methods

https://www.postgresql.org/docs/9.6/static/sql-select.html

TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]  

A TABLESAMPLE clause after a table_name indicates that the specified sampling_method should be used to retrieve a subset of the rows in that table. This sampling precedes the application of any other filters such as WHERE clauses. The standard PostgreSQL distribution includes two sampling methods, BERNOULLI and SYSTEM, and other sampling methods can be installed in the database via extensions.

The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. This argument can be any real-valued expression. (Other sampling methods might accept more or different arguments.) These two methods each return a randomly-chosen sample of the table that will contain approximately the specified percentage of the table's rows. The BERNOULLI method scans the whole table and selects or ignores individual rows independently with the specified probability. The SYSTEM method does block-level sampling with each block having the specified chance of being selected; all rows in each selected block are returned. The SYSTEM method is significantly faster than the BERNOULLI method when small sampling percentages are specified, but it may return a less-random sample of the table as a result of clustering effects.

The optional REPEATABLE clause specifies a seed number or expression to use for generating random numbers within the sampling method. The seed value can be any non-null floating-point value. Two queries that specify the same seed and argument values will select the same sample of the table, if the table has not been changed meanwhile. But different seed values will usually produce different samples. If REPEATABLE is not given then a new random sample is selected for each query, based upon a system-generated seed. Note that some add-on sampling methods do not accept REPEATABLE, and will always produce new samples on each use.

https://www.postgresql.org/docs/9.6/static/tsm-system-rows.html

CREATE EXTENSION tsm_system_rows;

SELECT * FROM my_table TABLESAMPLE SYSTEM_ROWS(100);

https://www.postgresql.org/docs/9.6/static/tsm-system-time.html

CREATE EXTENSION tsm_system_time;

SELECT * FROM my_table TABLESAMPLE SYSTEM_TIME(1000);

Note 1: PostgreSQL also supports many other value estimation methods.

Note 2: If sampling supports parallelism, it could estimate larger tables with greater accuracy. For example, 1% is a good sampling rate for value estimation, which means 10 GB for a 1 TB table. If parallelism is supported, it could complete the entire sampling and value estimation process within 3 seconds.

1 1 1
Share on

digoal

282 posts | 25 followers

You may also like

Comments

Raja_KT March 15, 2019 at 1:56 pm

Good one. It will be more fascinating if you can share the infra details.

digoal

282 posts | 25 followers

Related Products