As a business system grows, individual business units have increasingly frequent data ingestion. However, this brings a new concern - data quality. For example, you need to decide whether some fields have been omitted from upstream data, whether upstream data can arrive in a timely manner, or whether upstream data itself has some problems.
These problems can be solved through business data quality monitoring.
The built-in statistics feature in PostgreSQL can meet the requirements of most business data quality monitoring scenarios.
If a higher level of business-based and custom data quality monitoring is required, PostgreSQL also provides features such as async batch consume with atomicity, stream computing, and asynchronous messaging to support real-time data quality monitoring.
The following section describes the built-in statistics feature in PostgreSQL:
1. Number of real-time records
postgres=# \d pg_class
Table "pg_catalog.pg_class"
Column | Type | Collation | Nullable | Default
---------------------+--------------+-----------+----------+---------
relname | name | | not null | -- 对象名
relnamespace | oid | | not null | -- 对象所属的schema, 对应pg_namespace.oid
relpages | integer | | not null | -- 评估的页数(单位为block_size)
reltuples | real | | not null | -- 评估的记录数
2. Real-time column statistics (proportion of empty values, average length, number of unique values, high-frequency words, proportion of high-frequency words, bar chart with evenly-distributed data values, linear correlation, high-frequency elements, proportion of high-frequency elements, and bar chart for high-frequency elements)
See the following for a detailed description:
postgres=# \d pg_stats
View "pg_catalog.pg_stats"
Column | Type | Default
------------------------+----------+---------
schemaname | name | -- 对象所属的schema
tablename | name | -- 对象名
attname | name | -- 列名
inherited | boolean | -- 是否为继承表的统计信息(false时表示当前表的统计信息,true时表示包含所有继承表的统计信息)
null_frac | real | -- 该列空值比例
avg_width | integer | -- 该列平均长度
n_distinct | real | -- 该列唯一值个数(-1表示唯一,小于1表示占比,大于等于1表示实际的唯一值个数)
most_common_vals | anyarray | -- 该列高频词
most_common_freqs | real[] | -- 该列高频词对应的出现频率
histogram_bounds | anyarray | -- 该列柱状图(表示隔出的每个BUCKET的记录数均等)
correlation | real | -- 该列存储相关性(-1到1的区间),绝对值越小,存储越离散。小于0表示反向相关,大于0表示正向相关
most_common_elems | anyarray | -- 该列为多值类型(数组)时,多值元素的高频词
most_common_elem_freqs | real[] | -- 多值元素高频词的出现频率
elem_count_histogram | real[] | -- 多值元素的柱状图中,每个区间的非空唯一元素个数
3. Real-time table statistics (such as the number of full table scans performed, the number of records scanned by using a full table scan, the number of index scans performed, the number of records scanned by using an index scan, the number of written records, the number of updated records, and the number of DEAD TUPLEs ).
postgres=# \d pg_stat_all_tables
View "pg_catalog.pg_stat_all_tables"
Column | Type | Default
---------------------+--------------------------+---------
relid | oid |
schemaname | name |
relname | name |
seq_scan | bigint | -- 被全表扫多少次
seq_tup_read | bigint | -- 使用全表扫的方法扫了多少条记录
idx_scan | bigint | -- 被索引扫多少次
idx_tup_fetch | bigint | -- 使用索引扫的方法扫了多少条记录
n_tup_ins | bigint | -- 插入了多少记录
n_tup_upd | bigint | -- 更新了多少记录
n_tup_del | bigint | -- 删除了多少记录
n_tup_hot_upd | bigint | -- HOT更新了多少记录
n_live_tup | bigint | -- 多少可见记录
n_dead_tup | bigint | -- 多少垃圾记录
n_mod_since_analyze | bigint |
last_vacuum | timestamp with time zone |
last_autovacuum | timestamp with time zone |
last_analyze | timestamp with time zone |
last_autoanalyze | timestamp with time zone |
vacuum_count | bigint |
autovacuum_count | bigint |
analyze_count | bigint |
autoanalyze_count | bigint |
4. Statistics analysis and scheduling policies
PostgreSQL automatically collects statistics based on changes in table records. The following shows the control of scheduling parameters:
#track_counts = on
#autovacuum = on # Enable autovacuum subprocess? 'on'
autovacuum_naptime = 15s # time between autovacuum runs
#autovacuum_analyze_threshold = 50 # min number of row updates before
# analyze
默认变更 0.1% 后就会自动收集统计信息。
#autovacuum_analyze_scale_factor = 0.1 # fraction of table size before analyze
The built-in statistics feature allows you to obtain the following information:
1. Number of real-time records
2. Information about each column (proportion of empty values, average length, number of unique values, high-frequency words, proportion of high frequency words, bar chart with evenly distributed data values, linear correlation, high-frequency elements, proportion of high-frequency elements, bar chart for high-frequency elements)
Business data quality can be monitored in real time with the preceding feedback information.
Example
1. Create a test table
create table test(id int primary key, c1 int, c2 int, info text, crt_time timestamp);
create index idx_test_1 on test (crt_time);
2. Create stress testing scripts
vi test.sql
\set id random(1,10000000)
insert into test values (:id, random()*100, random()*10000, random()::text, now()) on conflict (id) do update set crt_time=now();
3. Perform stress testing
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 32 -j 32 -T 1200
4. Create a data cleaning schedule that retains data within 30 seconds.
delete from test where ctid = any (array(
select ctid from test where crt_time < now()-interval '30 second'
));
Schedule the task once every 0.1 seconds
psql
delete from test where ctid = any (array(
select ctid from test where crt_time < now()-interval '30 second'
));
\watch 0.1
日志如下
DELETE 18470
Fri 08 Dec 2017 04:31:54 PM CST (every 0.1s)
DELETE 19572
Fri 08 Dec 2017 04:31:55 PM CST (every 0.1s)
DELETE 20159
Fri 08 Dec 2017 04:31:55 PM CST (every 0.1s)
DELETE 20143
Fri 08 Dec 2017 04:31:55 PM CST (every 0.1s)
DELETE 21401
Fri 08 Dec 2017 04:31:55 PM CST (every 0.1s)
DELETE 21956
Fri 08 Dec 2017 04:31:56 PM CST (every 0.1s)
DELETE 19978
Fri 08 Dec 2017 04:31:56 PM CST (every 0.1s)
DELETE 21916
5. Monitor statistics in real time
Statistics per column
postgres=# select attname,null_frac,avg_width,n_distinct,most_common_vals,most_common_freqs,histogram_bounds,correlation from pg_stats where tablename='test';
attname | id
null_frac | 0
avg_width | 4
n_distinct | -1
most_common_vals |
most_common_freqs |
histogram_bounds | {25,99836,193910,289331,387900,492669,593584,695430,795413,890787,1001849,1100457,1203161,1301537,1400265,1497824,1595610,1702278,1809415,1912946,2006274,2108505,2213771,2314440,2409333,2513067,2616217,2709052,2813209,2916342,3016292,3110554,3210817,3305896,3406145,3512379,3616638,3705990,3804538,3902207,4007939,4119100,4214497,4314986,4405492,4513675,4613327,4704905,4806556,4914360,5020248,5105998,5194904,5292779,5394640,5497986,5600441,5705246,5806209,5905498,6006522,6115688,6212831,6308451,6408320,6516028,6622895,6720613,6817877,6921460,7021999,7118151,7220074,7315355,7413563,7499978,7603076,7695692,7805120,7906168,8000492,8099783,8200918,8292854,8389462,8491879,8589691,8696502,8798076,8892978,8992364,9089390,9192142,9294759,9399562,9497099,9601571,9696437,9800758,9905327,9999758}
correlation | -0.00220302
.....
attname | c2
null_frac | 0
avg_width | 4
n_distinct | 9989
most_common_vals | {3056,6203,1352,1649,1777,3805,7029,420,430,705,1015,1143,2810,3036,3075,3431,3792,4459,4812,5013,5662,5725,5766,6445,6882,7034,7064,7185,7189,7347,8266,8686,8897,9042,9149,9326,9392,9648,9652,9802,63,164,235,453,595,626,672,813,847,1626,1636,1663,1749,1858,2026,2057,2080,2106,2283,2521,2596,2666,2797,2969,3131,3144,3416,3500,3870,3903,3956,3959,4252,4265,4505,4532,4912,5048,5363,5451,5644,5714,5734,5739,5928,5940,5987,6261,6352,6498,6646,6708,6886,6914,7144,7397,7589,7610,7640,7687}
most_common_freqs | {0.000366667,0.000366667,0.000333333,0.000333333,0.000333333,0.000333333,0.000333333,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667,0.000266667}
histogram_bounds | {0,103,201,301,399,495,604,697,802,904,1009,1121,1224,1320,1419,1514,1623,1724,1820,1930,2045,2147,2240,2335,2433,2532,2638,2738,2846,2942,3038,3143,3246,3342,3443,3547,3644,3744,3852,3966,4064,4162,4262,4354,4460,4562,4655,4755,4851,4948,5046,5143,5237,5340,5428,5532,5625,5730,5830,5932,6048,6144,6248,6349,6456,6562,6657,6768,6859,6964,7060,7161,7264,7357,7454,7547,7638,7749,7852,7956,8046,8138,8240,8337,8445,8539,8626,8728,8825,8924,9016,9116,9214,9311,9420,9512,9603,9709,9811,9911,10000}
correlation | -0.00246515
...
attname | crt_time
null_frac | 0
avg_width | 8
n_distinct | -0.931747
most_common_vals | {"2017-12-08 16:32:53.836223","2017-12-08 16:33:02.700473","2017-12-08 16:33:03.226319","2017-12-08 16:33:03.613826","2017-12-08 16:33:08.171908","2017-12-08 16:33:14.727654","2017-12-08 16:33:20.857187","2017-12-08 16:33:22.519299","2017-12-08 16:33:23.388035","2017-12-08 16:33:23.519205"}
most_common_freqs | {6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05}
histogram_bounds | {"2017-12-08 16:32:50.397367","2017-12-08 16:32:50.987576","2017-12-08 16:32:51.628523","2017-12-08 16:32:52.117421","2017-12-08 16:32:52.610271","2017-12-08 16:32:53.152021","2017-12-08 16:32:53.712685","2017-12-08 16:32:54.3036","2017-12-08 16:32:54.735576","2017-12-08 16:32:55.269238","2017-12-08 16:32:55.691081","2017-12-08 16:32:56.066085","2017-12-08 16:32:56.541396","2017-12-08 16:32:56.865717","2017-12-08 16:32:57.350169","2017-12-08 16:32:57.698694","2017-12-08 16:32:58.062828","2017-12-08 16:32:58.464265","2017-12-08 16:32:58.92354","2017-12-08 16:32:59.27284","2017-12-08 16:32:59.667347","2017-12-08 16:32:59.984229","2017-12-08 16:33:00.310772","2017-12-08 16:33:00.644104","2017-12-08 16:33:00.976184","2017-12-08 16:33:01.366153","2017-12-08 16:33:01.691384","2017-12-08 16:33:02.021643","2017-12-08 16:33:02.382856","2017-12-08 16:33:02.729636","2017-12-08 16:33:03.035666","2017-12-08 16:33:03.508461","2017-12-08 16:33:03.829351","2017-12-08 16:33:04.151727","2017-12-08 16:33:04.4596","2017-12-08 16:33:04.76933","2017-12-08 16:33:05.125295","2017-12-08 16:33:05.537555","2017-12-08 16:33:05.83828","2017-12-08 16:33:06.15387","2017-12-08 16:33:06.545922","2017-12-08 16:33:06.843679","2017-12-08 16:33:07.111281","2017-12-08 16:33:07.414602","2017-12-08 16:33:07.707961","2017-12-08 16:33:08.119891","2017-12-08 16:33:08.388883","2017-12-08 16:33:08.674867","2017-12-08 16:33:08.979336","2017-12-08 16:33:09.339377","2017-12-08 16:33:09.647791","2017-12-08 16:33:09.94157","2017-12-08 16:33:10.232294","2017-12-08 16:33:10.652072","2017-12-08 16:33:10.921087","2017-12-08 16:33:11.17986","2017-12-08 16:33:11.477399","2017-12-08 16:33:11.776529","2017-12-08 16:33:12.110676","2017-12-08 16:33:12.382742","2017-12-08 16:33:12.70362","2017-12-08 16:33:13.020485","2017-12-08 16:33:13.477398","2017-12-08 16:33:13.788134","2017-12-08 16:33:14.072125","2017-12-08 16:33:14.346058","2017-12-08 16:33:14.625692","2017-12-08 16:33:14.889661","2017-12-08 16:33:15.139977","2017-12-08 16:33:15.390732","2017-12-08 16:33:15.697878","2017-12-08 16:33:16.127449","2017-12-08 16:33:16.438117","2017-12-08 16:33:16.725608","2017-12-08 16:33:17.01954","2017-12-08 16:33:17.344609","2017-12-08 16:33:17.602447","2017-12-08 16:33:17.919983","2017-12-08 16:33:18.201386","2017-12-08 16:33:18.444387","2017-12-08 16:33:18.714402","2017-12-08 16:33:19.099394","2017-12-08 16:33:19.402888","2017-12-08 16:33:19.673556","2017-12-08 16:33:19.991907","2017-12-08 16:33:20.23329","2017-12-08 16:33:20.517752","2017-12-08 16:33:20.783084","2017-12-08 16:33:21.032402","2017-12-08 16:33:21.304109","2017-12-08 16:33:21.725122","2017-12-08 16:33:21.998994","2017-12-08 16:33:22.232959","2017-12-08 16:33:22.462384","2017-12-08 16:33:22.729792","2017-12-08 16:33:23.001244","2017-12-08 16:33:23.251215","2017-12-08 16:33:23.534155","2017-12-08 16:33:23.772144","2017-12-08 16:33:24.076088","2017-12-08 16:33:24.471151"}
correlation | 0.760231
Number of records
postgres=# select reltuples from pg_class where relname='test';
-[ RECORD 1 ]----------
reltuples | 3.74614e+06
DML activity statistics
postgres=# select * from pg_stat_all_tables where relname ='test';
-[ RECORD 1 ]-------+------------------------------
relid | 591006
schemaname | public
relname | test
seq_scan | 2
seq_tup_read | 0
idx_scan | 28300980
idx_tup_fetch | 24713736
n_tup_ins | 19730476
n_tup_upd | 8567352
n_tup_del | 16143587
n_tup_hot_upd | 0
n_live_tup | 3444573
n_dead_tup | 24748887
n_mod_since_analyze | 547474
last_vacuum |
last_autovacuum | 2017-12-08 16:31:10.820459+08
last_analyze |
last_autoanalyze | 2017-12-08 16:35:16.75293+08
vacuum_count | 0
autovacuum_count | 1
analyze_count | 0
autoanalyze_count | 124
Since not all data needs to be retained for data quality monitoring, we can use the following method to clean data efficiently without an impact on reading and writing data.
How to clean expired data efficiently by row number - non-partitioned tables: data aging practices
Around 2.63 million rows can be cleaned per second on a single instance.
postgres=# select pg_stat_reset_single_table_counters('test'::regclass);
postgres=# analyze verbose test;
INFO: analyzing "public.test"
INFO: "test": scanned 30000 of 238163 pages, containing 560241 live rows and 4294214 dead rows; 30000 rows in sample, 4319958 estimated total rows
ANALYZE
Use the async batch consume with atomicity method to monitor data quality in real time
Examples:
Since not all data needs to be retained for data quality monitoring, we can use the following method to clean data efficiently without any impact on reading and writing data.
How to clean expired data efficiently by row number - non-partitioned tables: data aging practices
Around 2.63 million rows can be cleaned per second on a single instance.
PostgreSQL Time-Series Best Practices: Stock Exchange System Database
Alibaba Clouder - August 14, 2020
Alibaba Cloud Community - March 29, 2022
Alibaba Clouder - February 15, 2018
digoal - December 23, 2020
Hologres - June 16, 2022
digoal - May 16, 2019
An on-demand database hosting service for PostgreSQL with automated monitoring, backup and disaster recovery capabilities
Learn MoreAlibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreAn on-demand database hosting service for MySQL with automated monitoring, backup and disaster recovery capabilities
Learn MoreApsaraDB RDS for MariaDB supports multiple storage engines, including MySQL InnoDB to meet different user requirements.
Learn MoreMore Posts by digoal
Dikky Ryan Pratama May 5, 2023 at 4:15 am
I really enjoyed reading your article. Your writing style is engaging and kept me interested from start to finish.Your article was very informative and well-researched. I learned a lot from it and appreciate the effort you put into creating such a valuable resource.I love the way you approached this topic. Your unique perspective and insights gave me a fresh understanding and appreciation for the subject matter.