The Beam storage engine of AnalyticDB for PostgreSQL provides the dictionary encoding feature. This feature can compress low-cardinality string data to integer data to improve storage efficiency and accelerate filter-based and aggregate-based queries.
Usage notes
Only AnalyticDB for PostgreSQL V7.0.x instances in elastic storage mode support the dictionary encoding feature of the Beam storage engine.
The dictionary encoding feature of the Beam storage engine becomes generally available in AnalyticDB for PostgreSQL V7.0.6.2 in elastic storage mode, after the public preview is complete. This version fixes issues that occurred during the public preview. We recommend that you update your AnalyticDB for PostgreSQL instance to V7.0.6.2 or later.
Use dictionary encoding to compress storage
The data compression feature reduces the storage space of tables and the amount of data that must be read from disks for queries. This reduces I/O operations and improves query performance. Choosing compression algorithms that suit different types of data helps improve database performance.
AnalyticDB for PostgreSQL uses dictionary encoding to compress only string data. You can perform dictionary encoding on columns or tables.
Column-level dictionary encoding: compresses specific columns of the STRING type.
Table-level dictionary encoding: compresses all low-cardinality string columns of a table. Non-string columns are not dictionary-encoded when you create the table.
In this example, the nation table of TPC-H is used. When you create a Beam table, specify compresstype='gdict' to enable dictionary encoding compression for the table. This way, dictionary encoding is performed on the n_name and n_comment columns. Sample statement:
CREATE TABLE NATION (
n_nationkey integer NOT NULL,
n_name character(25) NOT NULL,
n_regionkey integer NOT NULL,
n_comment character varying(152)
)
USING beam WITH (compresstype = 'gdict')
DISTRIBUTED by (n_nationkey);
You can also perform dictionary encoding only on the n_name column. Sample statement:
CREATE TABLE NATION (
n_nationkey integer NOT NULL,
n_name character(25) NOT NULL ENCODING (compresstype='gdict'),
n_regionkey integer NOT NULL,
n_comment character varying(152)
)
USING beam WITH (compresstype = 'lz4', compresslevel = 9)
DISTRIBUTED BY (n_nationkey);
After you perform dictionary encoding on the n_name column, the data size of each value is reduced from 25 bytes to 2 bytes, which greatly reduces data storage. For the original CHARACTER(25) type, a string that is shorter than 25 characters is padded with trailing spaces.
| n_name string value | Original data size (bytes) | Dictionary value | Data size after compression (bytes) |
| --- | --- | --- | --- |
| ALGERIA | 25 | 0 | 2 |
| ARGENTINA | 25 | 1 | 2 |
| BRAZIL | 25 | 2 | 2 |
| CANADA | 25 | 3 | 2 |
| CHINA | 25 | 4 | 2 |
| EGYPT | 25 | 5 | 2 |
| ETHIOPIA | 25 | 6 | 2 |
| FRANCE | 25 | 7 | 2 |
| Total | 200 | / | 16 |
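The savings in the preceding table can be reproduced with a short calculation. The following sketch (plain Python, independent of AnalyticDB for PostgreSQL) builds a dictionary for the eight n_name values and compares fixed-width CHARACTER(25) storage against 2-byte dictionary codes:

```python
# Build a dictionary for a low-cardinality string column and
# compare storage sizes, mirroring the n_name example above.
values = ["ALGERIA", "ARGENTINA", "BRAZIL", "CANADA",
          "CHINA", "EGYPT", "ETHIOPIA", "FRANCE"]

# Assign dictionary codes in sorted order (order-preserving).
dictionary = {v: i for i, v in enumerate(sorted(values))}

CHAR_WIDTH = 25   # CHARACTER(25): shorter strings are space-padded
CODE_WIDTH = 2    # each dictionary code fits in 2 bytes

original_size = len(values) * CHAR_WIDTH
encoded_size = len(values) * CODE_WIDTH

print(dictionary["ALGERIA"], dictionary["FRANCE"])  # 0 7
print(original_size, encoded_size)                  # 200 16
```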
Use dictionary encoding to accelerate queries
After data is compressed by using dictionary encoding, you can further use this feature on the compressed data to accelerate queries. In filtering, aggregating, and sorting scenarios that involve low-cardinality string data, dictionary encoding can improve query performance by 10% to 200% for different SQL statements.
To enable dictionary encoding acceleration, set the adbpg_enable_encode_optimize parameter to ON. By default, this parameter is set to OFF. After you enable dictionary encoding acceleration, the optimizer automatically generates appropriate execution plans for SQL statements that meet dictionary encoding conditions. Sample statement:
SELECT
n_name,
max(n_regionkey)
FROM
nation
WHERE
n_name > 'ALGERIA'
AND n_name < 'FRANCE'
GROUP BY
n_name;
You can use an EXPLAIN statement to query the execution plan of the preceding statement. The execution plan shows the used encoding method and a Decode operator for decoding.
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Motion 3:1 (slice1; segments: 3) (cost=76.72..78.27 rows=81 width=36)
-> GroupAggregate (cost=76.72..77.19 rows=27 width=36)
Group Key: n_name
-> Sort (cost=76.72..76.79 rows=28 width=36)
Sort Key: n_name
-> Redistribute Motion 3:3 (slice2; segments: 3) (cost=0.00..76.05 rows=28 width=36)
Hash Key: n_name
-> Decode (cost=0.00..75.50 rows=28 width=36)
Params: dictKeyIdx[1], dictID[0]
-> Seq Scan on nation (cost=0.00..75.50 rows=28 width=36)
Filter: ((ordered_encode(n_name) > ('ALGERIA'::text)::integer with dict ID 0) AND (ordered_encode(n_name) < ('FRANCE'::text)::integer with dict ID 0))
Optimizer: Postgres query optimizer
(12 rows)
Compared with operating on string data, operating on the integer data that is generated by dictionary encoding effectively improves query performance. In addition, the optimizer adds a Decode operator to the execution plan to decode the encoded data, which ensures the accuracy of query results.
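Because the dictionary codes are assigned in string order (the ordered_encode call in the plan above), a range predicate on the string column can be evaluated directly on the integer codes, and only surviving rows need to be decoded. The following Python sketch illustrates this idea; it is a conceptual model, not the engine's actual implementation:

```python
import bisect

# Sorted dictionary: code order matches string order, so string
# range predicates map directly to integer code ranges.
dict_values = sorted(["ALGERIA", "ARGENTINA", "BRAZIL", "CANADA",
                      "CHINA", "EGYPT", "ETHIOPIA", "FRANCE"])

def ordered_encode(s: str) -> int:
    """Map a string to its order-preserving dictionary code."""
    i = bisect.bisect_left(dict_values, s)
    assert i < len(dict_values) and dict_values[i] == s, "not in dictionary"
    return i

# WHERE n_name > 'ALGERIA' AND n_name < 'FRANCE' becomes a code range.
lo = ordered_encode("ALGERIA")   # 0
hi = ordered_encode("FRANCE")    # 7

# The column is stored as codes; the filter runs on integers only.
column_codes = [ordered_encode(v) for v in
                ["CHINA", "ALGERIA", "EGYPT", "FRANCE", "BRAZIL"]]
matches = [c for c in column_codes if lo < c < hi]

# Decode only the surviving rows (the Decode operator in the plan).
print([dict_values[c] for c in matches])  # ['CHINA', 'EGYPT', 'BRAZIL']
```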
Use automatic compression encoding
The Beam storage engine supports an adaptive compression algorithm. If you are not sure whether dictionary encoding is suitable, you can specify compresstype = 'auto' when you create a table. This way, the Beam storage engine automatically selects an appropriate encoding and compression algorithm for each column of the table.
CREATE TABLE NATION (
n_nationkey integer NOT NULL,
n_name character(25) NOT NULL,
n_regionkey integer NOT NULL,
n_comment character varying(152)
)
USING beam WITH (compresstype = 'auto')
DISTRIBUTED BY (n_nationkey);
The following table describes the encoding method for each column of the nation table when automatic compression encoding is configured.
| Column | Data type | Encoding method | Description |
| --- | --- | --- | --- |
| n_nationkey | integer | LZ4 or integer encoding | This column stores integer data. By default, LZ4 or integer encoding is used. |
| n_name | character(25) | Dictionary encoding | This column stores low-cardinality string data. Dictionary encoding is used. |
| n_regionkey | integer | LZ4 or integer encoding | This column stores integer data. By default, LZ4 or integer encoding is used. |
| n_comment | character varying(152) | Dictionary encoding | This column stores low-cardinality string data. Dictionary encoding is used. |
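The engine's selection logic is internal, but the idea can be illustrated: choose dictionary encoding for string columns whose distinct-value count is low relative to the row count, and a generic algorithm such as LZ4 otherwise. The threshold below is a made-up value for illustration, not the engine's actual heuristic:

```python
def choose_encoding(column, is_string: bool, max_ratio: float = 0.1) -> str:
    """Illustrative heuristic: dictionary-encode low-cardinality string
    columns; fall back to a generic algorithm otherwise. max_ratio is an
    assumed threshold, not the engine's actual value."""
    if not is_string:
        return "lz4"  # integer columns: LZ4 or integer encoding
    distinct_ratio = len(set(column)) / len(column)
    return "gdict" if distinct_ratio <= max_ratio else "lz4"

# 1000 rows drawn from only 4 nation names: low cardinality.
names = ["ALGERIA", "BRAZIL", "CHINA", "FRANCE"] * 250
print(choose_encoding(names, is_string=True))        # gdict
print(choose_encoding(range(1000), is_string=False)) # lz4
```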
Delta support for dictionary encoding
The Beam storage engine consists of the following parts:
The row-oriented Delta storage that handles real-time writes.
The PAX-based column-oriented Base storage that handles batch writes and large-scale scanning.
AnalyticDB for PostgreSQL instances of V7.0.2.3 or later support the hybrid row-column Base storage.
AnalyticDB for PostgreSQL instances of V7.0.4.0 or later support the row-oriented Delta storage for easier use of dictionary encoding.
For example, execute the following statements and use the EXPLAIN statement to query the execution plan. The execution plan shows a Decode operator for decoding, which indicates that dictionary encoding also takes effect on data that is written to the Delta storage.
CREATE TABLE test_auto (a int, b text)
USING beam WITH (compresstype = 'auto', compresslevel = 5);
INSERT INTO test_auto values (1,'adbpg1'),(2,'adbpg2'), (3, 'adbpg3'),(10,'adbpg10');
SET adbpg_enable_encode_optimize to ON;
EXPLAIN SELECT * FROM test_auto WHERE b = 'adbpg1';
-----------------------------------------------------------------------------
Gather Motion 3:1 (slice1; segments: 3) (cost=0.00..2.04 rows=1 width=36)
-> Decode (cost=0.00..2.02 rows=1 width=36)
Params: dictKeyIdx[2], dictID[0]
-> Seq Scan on test_auto (cost=0.00..2.02 rows=1 width=36)
Filter: (encode(b) = ('adbpg1'::text)::integer with dict ID 0)
Optimizer: Postgres-based planner