PostgreSQL Business Data Quality Real-time Monitoring Practices

This article talks about the built-in statistics feature in PostgreSQL, which can meet the requirements of most business data quality monitoring scenarios.

Background

As a business system grows, individual business units have increasingly frequent data ingestion. However, this brings a new concern - data quality. For example, you need to decide whether some fields have been omitted from upstream data, whether upstream data can arrive in a timely manner, or whether upstream data itself has some problems.

These problems can be solved through business data quality monitoring.

The built-in statistics feature in PostgreSQL can meet the requirements of most business data quality monitoring scenarios.

If a higher level of business-based and custom data quality monitoring is required, PostgreSQL also provides features such as async batch consume with atomicity, stream computing, and asynchronous messaging to support real-time data quality monitoring.

A Built-In Feature That Allows Real-Time Business Data Quality Monitoring

The following section describes the built-in statistics feature in PostgreSQL:

1. Number of real-time records

postgres=# \d pg_class      
                     Table "pg_catalog.pg_class"      
       Column        |     Type     | Collation | Nullable | Default       
---------------------+--------------+-----------+----------+---------      
 relname             | name         |           | not null |   -- 对象名    
 relnamespace        | oid          |           | not null |   -- 对象所属的schema, 对应pg_namespace.oid    
 relpages            | integer      |           | not null |   -- 评估的页数(单位为block_size)    
 reltuples           | real         |           | not null |   -- 评估的记录数

2. Real-time column statistics (proportion of empty values, average length, number of unique values, high-frequency words, proportion of high-frequency words, bar chart with evenly-distributed data values, linear correlation, high-frequency elements, proportion of high-frequency elements, and bar chart for high-frequency elements)

See the following for a detailed description:

postgres=# \d pg_stats       
                     View "pg_catalog.pg_stats"      
         Column         |   Type   | Default       
------------------------+----------+---------      
 schemaname             | name     |   -- 对象所属的schema    
 tablename              | name     |   -- 对象名    
 attname                | name     |   -- 列名    
 inherited              | boolean  |   -- 是否为继承表的统计信息（false时表示当前表的统计信息，true时表示包含所有继承表的统计信息）    
 null_frac              | real     |   -- 该列空值比例    
 avg_width              | integer  |   -- 该列平均长度    
 n_distinct             | real     |   -- 该列唯一值个数（-1表示唯一，小于1表示占比，大于等于1表示实际的唯一值个数）    
 most_common_vals       | anyarray |   -- 该列高频词    
 most_common_freqs      | real[]   |   -- 该列高频词对应的出现频率    
 histogram_bounds       | anyarray |   -- 该列柱状图（表示隔出的每个BUCKET的记录数均等）    
 correlation            | real     |   -- 该列存储相关性（-1到1的区间），绝对值越小，存储越离散。小于0表示反向相关，大于0表示正向相关    
 most_common_elems      | anyarray |   -- 该列为多值类型（数组）时，多值元素的高频词    
 most_common_elem_freqs | real[]   |   -- 多值元素高频词的出现频率    
 elem_count_histogram   | real[]   |   -- 多值元素的柱状图中，每个区间的非空唯一元素个数

3. Real-time table statistics (such as the number of full table scans performed, the number of records scanned by using a full table scan, the number of index scans performed, the number of records scanned by using an index scan, the number of written records, the number of updated records, and the number of DEAD TUPLEs ).

postgres=# \d pg_stat_all_tables   
                      View "pg_catalog.pg_stat_all_tables"  
       Column        |           Type           | Default   
---------------------+--------------------------+---------  
 relid               | oid                      |   
 schemaname          | name                     |   
 relname             | name                     |   
 seq_scan            | bigint                   | -- 被全表扫多少次  
 seq_tup_read        | bigint                   | -- 使用全表扫的方法扫了多少条记录  
 idx_scan            | bigint                   | -- 被索引扫多少次  
 idx_tup_fetch       | bigint                   | -- 使用索引扫的方法扫了多少条记录  
 n_tup_ins           | bigint                   | -- 插入了多少记录  
 n_tup_upd           | bigint                   | -- 更新了多少记录  
 n_tup_del           | bigint                   | -- 删除了多少记录  
 n_tup_hot_upd       | bigint                   | -- HOT更新了多少记录  
 n_live_tup          | bigint                   | -- 多少可见记录  
 n_dead_tup          | bigint                   | -- 多少垃圾记录  
 n_mod_since_analyze | bigint                   |   
 last_vacuum         | timestamp with time zone |   
 last_autovacuum     | timestamp with time zone |   
 last_analyze        | timestamp with time zone |   
 last_autoanalyze    | timestamp with time zone |   
 vacuum_count        | bigint                   |   
 autovacuum_count    | bigint                   |   
 analyze_count       | bigint                   |   
 autoanalyze_count   | bigint                   |

4. Statistics analysis and scheduling policies

PostgreSQL automatically collects statistics based on changes in table records. The following shows the control of scheduling parameters:

#track_counts = on  
#autovacuum = on                        # Enable autovacuum subprocess?  'on'  
autovacuum_naptime = 15s                # time between autovacuum runs  
#autovacuum_analyze_threshold = 50      # min number of row updates before  
                                        # analyze  
  
默认变更 0.1% 后就会自动收集统计信息。  
  
#autovacuum_analyze_scale_factor = 0.1  # fraction of table size before analyze

The built-in statistics feature allows you to obtain the following information:

1. Number of real-time records

2. Information about each column (proportion of empty values, average length, number of unique values, high-frequency words, proportion of high frequency words, bar chart with evenly distributed data values, linear correlation, high-frequency elements, proportion of high-frequency elements, bar chart for high-frequency elements)

Business data quality can be monitored in real time with the preceding feedback information.

Example

1. Create a test table

create table test(id int primary key, c1 int, c2 int, info text, crt_time timestamp);  
create index idx_test_1 on test (crt_time);

2. Create stress testing scripts

vi test.sql  
  
\set id random(1,10000000)  
insert into test values (:id, random()*100, random()*10000, random()::text, now()) on conflict (id) do update set crt_time=now();

3. Perform stress testing

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 32 -j 32 -T 1200

4. Create a data cleaning schedule that retains data within 30 seconds.

delete from test where ctid = any (array(  
  select ctid from test where crt_time < now()-interval '30 second'  
));

Schedule the task once every 0.1 seconds

psql   
  
delete from test where ctid = any (array(  
  select ctid from test where crt_time < now()-interval '30 second'  
));  
  
\watch 0.1

日志如下  
  
DELETE 18470  
  
Fri 08 Dec 2017 04:31:54 PM CST (every 0.1s)  
  
DELETE 19572  
  
Fri 08 Dec 2017 04:31:55 PM CST (every 0.1s)  
  
DELETE 20159  
  
Fri 08 Dec 2017 04:31:55 PM CST (every 0.1s)  
  
DELETE 20143  
  
Fri 08 Dec 2017 04:31:55 PM CST (every 0.1s)  
  
DELETE 21401  
  
Fri 08 Dec 2017 04:31:55 PM CST (every 0.1s)  
  
DELETE 21956  
  
Fri 08 Dec 2017 04:31:56 PM CST (every 0.1s)  
  
DELETE 19978  
  
Fri 08 Dec 2017 04:31:56 PM CST (every 0.1s)  
  
DELETE 21916

5. Monitor statistics in real time

Statistics per column

postgres=# select attname,null_frac,avg_width,n_distinct,most_common_vals,most_common_freqs,histogram_bounds,correlation from pg_stats where tablename='test';  
  
attname           | id  
null_frac         | 0  
avg_width         | 4  
n_distinct        | -1  
most_common_vals  |   
most_common_freqs |   
histogram_bounds  | {25,99836,193910,289331,387900,492669,593584,695430,795413,890787,1001849,1100457,1203161,1301537,1400265,1497824,1595610,1702278,1809415,1912946,2006274,2108505,2213771,2314440,2409333,2513067,2616217,2709052,2813209,2916342,3016292,3110554,3210817,3305896,3406145,3512379,3616638,3705990,3804538,3902207,4007939,4119100,4214497,4314986,4405492,4513675,4613327,4704905,4806556,4914360,5020248,5105998,5194904,5292779,5394640,5497986,5600441,5705246,5806209,5905498,6006522,6115688,6212831,6308451,6408320,6516028,6622895,6720613,6817877,6921460,7021999,7118151,7220074,7315355,7413563,7499978,7603076,7695692,7805120,7906168,8000492,8099783,8200918,8292854,8389462,8491879,8589691,8696502,8798076,8892978,8992364,9089390,9192142,9294759,9399562,9497099,9601571,9696437,9800758,9905327,9999758}  
correlation       | -0.00220302  
.....  
  
  
attname           | c2  
null_frac         | 0  
avg_width         | 4  
n_distinct        | 9989  
most_common_vals  | {3056,6203,1352,1649,1777,3805,7029,420,430,705,1015,1143,2810,3036,3075,3431,3792,4459,4812,5013,5662,5725,5766,6445,6882,7034,7064,7185,7189,7347,8266,8686,8897,9042,9149,9326,9392,9648,9652,9802,63,164,235,453,595,626,672,813,847,1626,1636,1663,1749,1858,2026,2057,2080,2106,2283,2521,2596,2666,2797,2969,3131,3144,3416,3500,3870,3903,3956,3959,4252,4265,4505,4532,4912,5048,5363,5451,5644,5714,5734,5739,5928,5940,5987,6261,6352,6498,6646,6708,6886,6914,7144,7397,7589,7610,7640,7687}  
most_common_freqs | {0.000366667,0.000366667,0.000333333,0.000333333,0.000333333,0.000333333,0.000333333,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667}  
histogram_bounds  | {0,103,201,301,399,495,604,697,802,904,1009,1121,1224,1320,1419,1514,1623,1724,1820,1930,2045,2147,2240,2335,2433,2532,2638,2738,2846,2942,3038,3143,3246,3342,3443,3547,3644,3744,3852,3966,4064,4162,4262,4354,4460,4562,4655,4755,4851,4948,5046,5143,5237,5340,5428,5532,5625,5730,5830,5932,6048,6144,6248,6349,6456,6562,6657,6768,6859,6964,7060,7161,7264,7357,7454,7547,7638,7749,7852,7956,8046,8138,8240,8337,8445,8539,8626,8728,8825,8924,9016,9116,9214,9311,9420,9512,9603,9709,9811,9911,10000}  
correlation       | -0.00246515  
  
...  
  
attname           | crt_time  
null_frac         | 0  
avg_width         | 8  
n_distinct        | -0.931747  
most_common_vals  | {"2017-12-08 16:32:53.836223","2017-12-08 16:33:02.700473","2017-12-08 16:33:03.226319","2017-12-08 16:33:03.613826","2017-12-08 16:33:08.171908","2017-12-08 16:33:14.727654","2017-12-08 16:33:20.857187","2017-12-08 16:33:22.519299","2017-12-08 16:33:23.388035","2017-12-08 16:33:23.519205"}  
most_common_freqs | {6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05}  
histogram_bounds  | {"2017-12-08 16:32:50.397367","2017-12-08 16:32:50.987576","2017-12-08 16:32:51.628523","2017-12-08 16:32:52.117421","2017-12-08 16:32:52.610271","2017-12-08 16:32:53.152021","2017-12-08 16:32:53.712685","2017-12-08 16:32:54.3036","2017-12-08 16:32:54.735576","2017-12-08 16:32:55.269238","2017-12-08 16:32:55.691081","2017-12-08 16:32:56.066085","2017-12-08 16:32:56.541396","2017-12-08 16:32:56.865717","2017-12-08 16:32:57.350169","2017-12-08 16:32:57.698694","2017-12-08 16:32:58.062828","2017-12-08 16:32:58.464265","2017-12-08 16:32:58.92354","2017-12-08 16:32:59.27284","2017-12-08 16:32:59.667347","2017-12-08 16:32:59.984229","2017-12-08 16:33:00.310772","2017-12-08 16:33:00.644104","2017-12-08 16:33:00.976184","2017-12-08 16:33:01.366153","2017-12-08 16:33:01.691384","2017-12-08 16:33:02.021643","2017-12-08 16:33:02.382856","2017-12-08 16:33:02.729636","2017-12-08 16:33:03.035666","2017-12-08 16:33:03.508461","2017-12-08 16:33:03.829351","2017-12-08 16:33:04.151727","2017-12-08 16:33:04.4596","2017-12-08 16:33:04.76933","2017-12-08 16:33:05.125295","2017-12-08 16:33:05.537555","2017-12-08 16:33:05.83828","2017-12-08 16:33:06.15387","2017-12-08 16:33:06.545922","2017-12-08 16:33:06.843679","2017-12-08 16:33:07.111281","2017-12-08 16:33:07.414602","2017-12-08 16:33:07.707961","2017-12-08 16:33:08.119891","2017-12-08 16:33:08.388883","2017-12-08 16:33:08.674867","2017-12-08 16:33:08.979336","2017-12-08 16:33:09.339377","2017-12-08 16:33:09.647791","2017-12-08 16:33:09.94157","2017-12-08 16:33:10.232294","2017-12-08 16:33:10.652072","2017-12-08 16:33:10.921087","2017-12-08 16:33:11.17986","2017-12-08 16:33:11.477399","2017-12-08 16:33:11.776529","2017-12-08 16:33:12.110676","2017-12-08 16:33:12.382742","2017-12-08 16:33:12.70362","2017-12-08 16:33:13.020485","2017-12-08 16:33:13.477398","2017-12-08 16:33:13.788134","2017-12-08 16:33:14.072125","2017-12-08 16:33:14.346058","2017-12-08 16:33:14.625692","2017-12-08 16:33:14.889661","2017-12-08 16:33:15.139977","2017-12-08 16:33:15.390732","2017-12-08 16:33:15.697878","2017-12-08 16:33:16.127449","2017-12-08 16:33:16.438117","2017-12-08 16:33:16.725608","2017-12-08 16:33:17.01954","2017-12-08 16:33:17.344609","2017-12-08 16:33:17.602447","2017-12-08 16:33:17.919983","2017-12-08 16:33:18.201386","2017-12-08 16:33:18.444387","2017-12-08 16:33:18.714402","2017-12-08 16:33:19.099394","2017-12-08 16:33:19.402888","2017-12-08 16:33:19.673556","2017-12-08 16:33:19.991907","2017-12-08 16:33:20.23329","2017-12-08 16:33:20.517752","2017-12-08 16:33:20.783084","2017-12-08 16:33:21.032402","2017-12-08 16:33:21.304109","2017-12-08 16:33:21.725122","2017-12-08 16:33:21.998994","2017-12-08 16:33:22.232959","2017-12-08 16:33:22.462384","2017-12-08 16:33:22.729792","2017-12-08 16:33:23.001244","2017-12-08 16:33:23.251215","2017-12-08 16:33:23.534155","2017-12-08 16:33:23.772144","2017-12-08 16:33:24.076088","2017-12-08 16:33:24.471151"}  
correlation       | 0.760231

Number of records

postgres=# select reltuples from pg_class where relname='test';  
-[ RECORD 1 ]----------  
reltuples | 3.74614e+06

DML activity statistics

postgres=# select * from pg_stat_all_tables where relname ='test';  
-[ RECORD 1 ]-------+------------------------------  
relid               | 591006  
schemaname          | public  
relname             | test  
seq_scan            | 2  
seq_tup_read        | 0  
idx_scan            | 28300980  
idx_tup_fetch       | 24713736  
n_tup_ins           | 19730476  
n_tup_upd           | 8567352  
n_tup_del           | 16143587  
n_tup_hot_upd       | 0  
n_live_tup          | 3444573  
n_dead_tup          | 24748887  
n_mod_since_analyze | 547474  
last_vacuum         |   
last_autovacuum     | 2017-12-08 16:31:10.820459+08  
last_analyze        |   
last_autoanalyze    | 2017-12-08 16:35:16.75293+08  
vacuum_count        | 0  
autovacuum_count    | 1  
analyze_count       | 0  
autoanalyze_count   | 124

Data Cleaning Schedule

Since not all data needs to be retained for data quality monitoring, we can use the following method to clean data efficiently without an impact on reading and writing data.

How to clean expired data efficiently by row number - non-partitioned tables: data aging practices

Around 2.63 million rows can be cleaned per second on a single instance.

How to Clean Statistics

postgres=# select pg_stat_reset_single_table_counters('test'::regclass);

How to Enforce Manual Statistics Collection

postgres=# analyze verbose test;  
INFO:  analyzing "public.test"  
INFO:  "test": scanned 30000 of 238163 pages, containing 560241 live rows and 4294214 dead rows; 30000 rows in sample, 4319958 estimated total rows  
ANALYZE

Customized and Real-Time Business Data Quality Monitoring

Use the async batch consume with atomicity method to monitor data quality in real time

Examples:

HTAP database PostgreSQL scenarios and performance tests - no. 32 (OLTP) high throughput data input/output (storage, row scanning, no indexes) - async batch consume with atomicity (JSON + functional stream computing)

HTAP database PostgreSQL scenarios and performance tests - no. 31 (OLTP) high throughput data input/output (storage, row scanning, no indexes) - async batch consume with atomicity (paralleled testing for high throughput read/write)

HTAP database PostgreSQL scenarios and performance tests - no. 27 (OLTP) IoT - FEED logs, stream computing, and async batch consume with atomicity (CTE)

PostgreSQL asynchronous message practice - feed system real-time monitoring and response like in proactive e-commerce services - minute level to millisecond level implementation

Data Cleaning Schedule

Since not all data needs to be retained for data quality monitoring, we can use the following method to clean data efficiently without any impact on reading and writing data.

How to clean expired data efficiently by row number - non-partitioned tables: data aging practices

Around 2.63 million rows can be cleaned per second on a single instance.

: Dikky Ryan Pratama May 5, 2023 at 4:15 am

I really enjoyed reading your article. Your writing style is engaging and kept me interested from start to finish.Your article was very informative and well-researched. I learned a lot from it and appreciate the effort you put into creating such a valuable resource.I love the way you approached this topic. Your unique perspective and insights gave me a fresh understanding and appreciation for the subject matter.

0 0

Community

PostgreSQL Business Data Quality Real-time Monitoring Practices

Background

A Built-In Feature That Allows Real-Time Business Data Quality Monitoring

Data Cleaning Schedule

How to Clean Statistics

How to Enforce Manual Statistics Collection

Customized and Real-Time Business Data Quality Monitoring

Data Cleaning Schedule

Read previous post:

Read next post:

digoal

You may also like

Comments

Dikky Ryan Pratama May 5, 2023 at 4:15 am

digoal

Related Products

ApsaraDB RDS for PostgreSQL

PolarDB for PostgreSQL

ApsaraDB RDS for MySQL

ApsaraDB RDS for MariaDB