AnalyticDB for PostgreSQL V7.0 supports column group statistics. You can collect joint statistics on multiple target columns so that the optimizer no longer needs to assume that the columns are independently distributed during cost estimation. This improves the accuracy of cost estimates and, in turn, query performance.
Usage notes
Column group statistics are currently supported only by the Planner optimizer (formerly called the Legacy optimizer). For information about how to configure the optimizer, see Parameter configuration.
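The following is a minimal sketch of checking and switching the optimizer for the current session. It assumes the Greenplum-style `optimizer` parameter (`on` for the ORCA optimizer, `off` for the Planner optimizer), which is an assumption here rather than something stated above; see Parameter configuration for the authoritative procedure.

```sql
-- Check which optimizer is active; "on" means ORCA, "off" means the Planner optimizer.
-- Assumption: the instance exposes the Greenplum-style "optimizer" parameter.
SHOW optimizer;

-- Switch the current session to the Planner (Postgres) optimizer.
SET optimizer = off;
```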
Syntax
```sql
CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
    [ ( statistics_kind [, ... ] ) ]
    ON column_name, column_name [, ...]
    FROM table_name
```

Parameters
| Parameter | Description |
| --- | --- |
| statistics_name | The name of the column group statistics object. |
| statistics_kind | The kind of statistics to collect. Three kinds are currently supported, matching community PostgreSQL's CREATE STATISTICS: ndistinct, dependencies, and mcv. |
| column_name | The names of the columns covered by the column group statistics. You must specify at least two columns. |
| table_name | The name of the table to which the preceding columns belong. |
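The following is a minimal usage sketch of the syntax above. The table `t` and the columns `city` and `zip` are hypothetical names used only for illustration; the catalog query relies on the standard PostgreSQL `pg_statistic_ext` view.

```sql
-- Create an ndistinct column group statistics object on two correlated columns.
-- "t", "city", and "zip" are hypothetical names used only for illustration.
CREATE STATISTICS IF NOT EXISTS s_city_zip (ndistinct)
    ON city, zip
    FROM t;

-- Extended statistics are only populated when ANALYZE runs on the table.
ANALYZE t;

-- Confirm that the statistics object exists (standard PostgreSQL catalog view).
SELECT stxname, stxkind
FROM pg_statistic_ext
WHERE stxname = 's_city_zip';

-- Remove the statistics object when it is no longer needed.
DROP STATISTICS IF EXISTS s_city_zip;
```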
Examples
- Create a table named ndistinct and insert data into it.

  ```sql
  CREATE TABLE ndistinct(a int, b int, c int, d int) DISTRIBUTED BY (d);

  -- Populate all four columns; a and b are derived from the same counter,
  -- so their values are strongly correlated.
  INSERT INTO ndistinct(a, b, c, d)
  SELECT i/100, i/200, i/100, i
  FROM generate_series(1, 10000000) i;
  ```

  The values of columns a and b are strongly correlated and do not satisfy the independence assumption.
- Create a function that returns the estimated row count and the actual row count of a statement.

  ```sql
  CREATE FUNCTION check_estimated_rows(text) RETURNS TABLE (estimated int, actual int)
  LANGUAGE plpgsql AS $$
  DECLARE
      ln text;
      tmp text[];
      first_row bool := true;
  BEGIN
      -- Run EXPLAIN ANALYZE on the given statement and parse the first plan line,
      -- which contains both the estimated and the actual row counts.
      FOR ln IN EXECUTE format('explain analyze %s', $1)
      LOOP
          IF first_row THEN
              first_row := false;
              tmp := regexp_match(ln, 'rows=(\d*) .* rows=(\d*)');
              RETURN QUERY SELECT tmp[1]::int, tmp[2]::int;
          END IF;
      END LOOP;
  END;
  $$;
  ```

- Check the estimated and actual row counts.
  ```sql
  SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM ndistinct GROUP BY a, b');
  ```

  The following information is returned:

  ```
   estimated | actual
  -----------+--------
     1000000 | 100001
  (1 row)
  ```

  In this example, the estimated row count is 1,000,000 while the actual row count is 100,001, which is a large gap.
- View the execution plan.

  ```sql
  EXPLAIN ANALYZE SELECT count(*) FROM ndistinct GROUP BY a,b,c;
  ```

  The following execution plan is returned. Use it for comparison with the plan generated after the column group statistics are created:

  ```
                                                                      QUERY PLAN
  -------------------------------------------------------------------------------------------------------------------------------------------------------------
   Gather Motion 3:1  (slice1; segments: 3)  (cost=137814.33..154481.00 rows=1000000 width=20) (actual time=3986.143..4014.601 rows=100001 loops=1)
     ->  HashAggregate  (cost=137814.33..141147.67 rows=333333 width=20) (actual time=3977.037..4002.222 rows=33661 loops=1)
           Group Key: a, b, c
           Peak Memory Usage: 0 kB
           ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=0.00..104481.00 rows=3333333 width=12) (actual time=0.079..1492.983 rows=3366100 loops=1)
                 Hash Key: a, b, c
                 ->  Seq Scan on ndistinct  (cost=0.00..37814.33 rows=3333333 width=12) (actual time=0.050..1101.632 rows=3334839 loops=1)
   Planning Time: 0.161 ms
     (slice0)    Executor memory: 12336K bytes.
     (slice1)    Executor memory: 13884K bytes avg x 3 workers, 13899K bytes max (seg0).  Work_mem: 16401K bytes max.
     (slice2)    Executor memory: 37K bytes avg x 3 workers, 37K bytes max (seg0).
   Memory used:  128000kB
   Optimizer: Postgres query optimizer
   Execution Time: 4041.613 ms
  (14 rows)
  ```

- Create the column group statistics and run ANALYZE to collect them. (A sketch for inspecting the collected statistics appears after this example list.)
  ```sql
  CREATE STATISTICS s_a_b(ndistinct) ON a,b FROM ndistinct;
  ANALYZE ndistinct;
  ```

- Check the estimated and actual row counts again.
  ```sql
  SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM ndistinct GROUP BY a, b');
  ```

  The following information is returned:

  ```
   estimated | actual
  -----------+--------
       99431 | 100001
  (1 row)
  ```

  In this example, the estimated row count is 99,431 and the actual row count is 100,001. Compared with the estimate before the column group statistics were created, the estimate is much more accurate.
- View the execution plan again.

  ```sql
  EXPLAIN ANALYZE SELECT count(*) FROM ndistinct GROUP BY a,b,c;
  ```

  The following execution plan is returned. Because the row estimate is now more accurate, the plan contains a Partial HashAggregate, which shortens the query time:

  ```
                                                                       QUERY PLAN
  ----------------------------------------------------------------------------------------------------------------------------------------------------------------
   Gather Motion 3:1  (slice1; segments: 3)  (cost=75124.91..76782.09 rows=99431 width=20) (actual time=2854.765..2879.734 rows=100001 loops=1)
     ->  Finalize HashAggregate  (cost=75124.91..75456.34 rows=33144 width=20) (actual time=2853.610..2868.194 rows=33661 loops=1)
           Group Key: a, b, c
           Peak Memory Usage: 0 kB
           ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=71147.67..74130.60 rows=99431 width=20) (actual time=2269.435..2759.413 rows=100983 loops=1)
                 Hash Key: a, b, c
                 ->  Partial HashAggregate  (cost=71147.67..72141.98 rows=99431 width=20) (actual time=2744.039..2794.808 rows=100001 loops=1)
                       Group Key: a, b, c
                       Peak Memory Usage: 0 kB
                       ->  Seq Scan on ndistinct  (cost=0.00..37814.33 rows=3333333 width=12) (actual time=0.028..454.030 rows=3334839 loops=1)
   Planning Time: 0.173 ms
     (slice0)    Executor memory: 4670K bytes.
     (slice1)    Executor memory: 3134K bytes avg x 3 workers, 3149K bytes max (seg0).  Work_mem: 5649K bytes max.
     (slice2)    Executor memory: 13848K bytes avg x 3 workers, 13848K bytes max (seg0).  Work_mem: 14353K bytes max.
   Memory used:  128000kB
   Optimizer: Postgres query optimizer
   Execution Time: 2893.470 ms
  (17 rows)
  ```
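As an optional follow-up to the example above, the following sketch shows one way to inspect the joint n-distinct values that ANALYZE collected for s_a_b and to clean up the example objects. It assumes the PostgreSQL 12-style catalogs (`pg_statistic_ext` plus `pg_statistic_ext_data`); the cleanup statements are not part of the original walkthrough.

```sql
-- Show the joint n-distinct estimates collected for the s_a_b statistics object.
-- Assumption: PostgreSQL 12-style split catalogs (pg_statistic_ext + pg_statistic_ext_data).
SELECT s.stxname,
       d.stxdndistinct
FROM pg_statistic_ext AS s
JOIN pg_statistic_ext_data AS d
  ON d.stxoid = s.oid
WHERE s.stxname = 's_a_b';

-- Clean up the objects created in this example (not part of the original walkthrough).
DROP STATISTICS IF EXISTS s_a_b;
DROP TABLE IF EXISTS ndistinct;
```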