注意事項
列組統計資訊功能目前僅支援Planner最佳化器(原Legacy最佳化器),如何設定最佳化器,請參見參數配置。
文法
CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
[ ( statistics_kind [, ... ] ) ]
ON column_name, column_name [, ...]
FROM table_name
參數介紹
參數 | 說明 |
statistics_name | 列組統計資訊的名稱。 |
statistics_kind | 需要收集的統計資訊種類,目前支援以下三種:- ndistinct:多列組合值中不同值(DISTINCT)的數量。
- dependencies:多列之間的Functional Dependency,可以理解為多列之間的相關性。
- mcv:多列組合值中出現最多的值及其頻率。
不指定該參數的話,以上三種類型的統計資訊都會採集。 |
column_name | 列組統計資訊包含的列的名稱,至少需要指定兩個列。 |
table_name | 上述列所屬的表名。 |
樣本
- 建立表ndistinct,並插入資料。
CREATE TABLE ndistinct(a int, b int, c int, d int) DISTRIBUTED BY (d);
INSERT INTO ndistinct(a, b) SELECT i/100, i/200, i/100, i FROM generate_series(1, 10000000) i;
其中a、b列的值有強相關性,不符合獨立分布的假設。
- 建立統計估算行數和實際行數的函數。
CREATE FUNCTION check_estimated_rows(text) returns table (estimated int, actual int)
LANGUAGE plpgsql AS
$$
DECLARE
ln text;
tmp text[];
first_row bool := true;
BEGIN
FOR ln in
execute format('explain analyze %s', $1)
LOOP
IF first_row then
first_row := false;
tmp := regexp_match(ln, 'rows=(\d*) .* rows=(\d*)');
return query select tmp[1]::int, tmp[2]::int;
END IF;
END LOOP;
END;
$$;
- 查看估算行數和實際行數。
SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM ndistinct GROUP BY a, b');
返回資訊如下:
estimated | actual
-----------+--------
1000000 | 100001
(1 row)
本次樣本中估算行數為1000000行,實際行數為100001行,差距較大。
- 查看執行計畫。
EXPLAIN ANALYZE SELECT count(*) FROM ndistinct GROUP BY a,b,c;
執行計畫返回資訊如下,用於和建立列組統計資訊後的執行計畫進行對比:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Motion 3:1 (slice1; segments: 3) (cost=137814.33..154481.00 rows=1000000 width=20) (actual time=3986.143..4014.601 rows=100001 loops=1)
-> HashAggregate (cost=137814.33..141147.67 rows=333333 width=20) (actual time=3977.037..4002.222 rows=33661 loops=1)
Group Key: a, b, c
Peak Memory Usage: 0 kB
-> Redistribute Motion 3:3 (slice2; segments: 3) (cost=0.00..104481.00 rows=3333333 width=12) (actual time=0.079..1492.983 rows=3366100 loops=1)
Hash Key: a, b, c
-> Seq Scan on ndistinct (cost=0.00..37814.33 rows=3333333 width=12) (actual time=0.050..1101.632 rows=3334839 loops=1)
Planning Time: 0.161 ms
(slice0) Executor memory: 12336K bytes.
(slice1) Executor memory: 13884K bytes avg x 3 workers, 13899K bytes max (seg0). Work_mem: 16401K bytes max.
(slice2) Executor memory: 37K bytes avg x 3 workers, 37K bytes max (seg0).
Memory used: 128000kB
Optimizer: Postgres query optimizer
Execution Time: 4041.613 ms
(14 rows)
- 建立列組統計資訊,並執行ANALYZE收集統計資訊。
CREATE STATISTICS s_a_b(ndistinct) ON a,b FROM ndistinct;
ANALYZE ndistinct;
- 再次查看估算行數和實際行數。
SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM ndistinct GROUP BY a, b');
返回資訊如下:
estimated | actual
-----------+--------
99431 | 100001
(1 row)
本次樣本中估算行數為99431行,實際行數為100001行,相比較建立列組統計資訊前,估算的行數更準確。
- 查看執行計畫。
EXPLAIN ANALYZE SELECT count(*) FROM ndistinct GROUP BY a,b,c;
執行計畫返回資訊如下,從執行計畫可以看出,因為估算更準確,所以執行計畫中產生了Partial HashAggregate,進而縮短了查詢時間:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Motion 3:1 (slice1; segments: 3) (cost=75124.91..76782.09 rows=99431 width=20) (actual time=2854.765..2879.734 rows=100001 loops=1)
-> Finalize HashAggregate (cost=75124.91..75456.34 rows=33144 width=20) (actual time=2853.610..2868.194 rows=33661 loops=1)
Group Key: a, b, c
Peak Memory Usage: 0 kB
-> Redistribute Motion 3:3 (slice2; segments: 3) (cost=71147.67..74130.60 rows=99431 width=20) (actual time=2269.435..2759.413 rows=100983 loops=1)
Hash Key: a, b, c
-> Partial HashAggregate (cost=71147.67..72141.98 rows=99431 width=20) (actual time=2744.039..2794.808 rows=100001 loops=1)
Group Key: a, b, c
Peak Memory Usage: 0 kB
-> Seq Scan on ndistinct (cost=0.00..37814.33 rows=3333333 width=12) (actual time=0.028..454.030 rows=3334839 loops=1)
Planning Time: 0.173 ms
(slice0) Executor memory: 4670K bytes.
(slice1) Executor memory: 3134K bytes avg x 3 workers, 3149K bytes max (seg0). Work_mem: 5649K bytes max.
(slice2) Executor memory: 13848K bytes avg x 3 workers, 13848K bytes max (seg0). Work_mem: 14353K bytes max.
Memory used: 128000kB
Optimizer: Postgres query optimizer
Execution Time: 2893.470 ms
(17 rows)