Dictionary encoding

Last Updated: May 28, 2024

The Beam storage engine of AnalyticDB for PostgreSQL provides the dictionary encoding feature. This feature encodes low-cardinality string data as integer data to improve storage efficiency and accelerate filter-based and aggregate-based queries.

Usage notes

  • Only AnalyticDB for PostgreSQL V7.0.x instances in elastic storage mode support the dictionary encoding feature of the Beam storage engine.

  • The dictionary encoding feature of the Beam storage engine becomes generally available in AnalyticDB for PostgreSQL V7.0.6.2 in elastic storage mode after the public preview is complete. This version fixes issues found during the public preview. We recommend that you update your AnalyticDB for PostgreSQL instance to V7.0.6.2 or later.

Use dictionary encoding to compress storage

The data compression feature reduces the storage space that tables occupy and the amount of data that must be read from disks for queries, which reduces I/O operations and improves query performance. Choosing compression algorithms that suit the data types of your columns further improves database performance.

AnalyticDB for PostgreSQL uses dictionary encoding to compress only string data. You can perform dictionary encoding on columns or tables.

  • Column-level dictionary encoding: compresses specific string columns.

  • Table-level dictionary encoding: compresses all low-cardinality string columns of a table. Non-string columns are not compressed when you create the table.

In this example, the nation table of the TPC-H benchmark is used. When you create a Beam table, specify compresstype='gdict' in the WITH clause to enable dictionary encoding for the table. This way, dictionary encoding is performed on the n_name and n_comment columns. Sample statement:

CREATE TABLE NATION (
    n_nationkey integer NOT NULL,
    n_name character(25) NOT NULL,
    n_regionkey integer NOT NULL,
    n_comment character varying(152)
)
USING beam WITH (compresstype = 'gdict') 
DISTRIBUTED by (n_nationkey);

You can also perform dictionary encoding only on the n_name column. Sample statement:

CREATE TABLE NATION (
    n_nationkey integer NOT NULL,
    n_name character(25) NOT NULL ENCODING (compresstype='gdict'),
    n_regionkey integer NOT NULL,
    n_comment character varying(152)
)
USING beam WITH (compresstype = 'lz4', compresslevel = 9) 
DISTRIBUTED by (n_nationkey);

After you perform dictionary encoding on the n_name column, each value is compressed from 25 bytes to 2 bytes, which greatly reduces storage usage. For the original CHARACTER(25) type, strings that are shorter than 25 characters are padded with trailing spaces, so every value occupies 25 bytes.

| n_name (string value) | Original data size (bytes) | Dictionary value | Data size after compression (bytes) |
| --------------------- | -------------------------- | ---------------- | ----------------------------------- |
| ALGERIA               | 25                         | 0                | 2                                   |
| ARGENTINA             | 25                         | 1                | 2                                   |
| BRAZIL                | 25                         | 2                | 2                                   |
| CANADA                | 25                         | 3                | 2                                   |
| CHINA                 | 25                         | 4                | 2                                   |
| EGYPT                 | 25                         | 5                | 2                                   |
| ETHIOPIA              | 25                         | 6                | 2                                   |
| FRANCE                | 25                         | 7                | 2                                   |
| Total                 | 200                        | /                | 16                                  |
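
To verify this behavior on your own data, you can check the padded byte length of the stored values and compare table sizes by using standard PostgreSQL functions. The following is a minimal sketch; nation_nocompress is a hypothetical copy of the table that was created without dictionary encoding.

-- Each CHARACTER(25) value is padded with trailing spaces to 25 bytes.
SELECT n_name, octet_length(n_name) AS original_bytes
FROM nation
LIMIT 3;

-- Compare the on-disk size of the dictionary-encoded table with the
-- hypothetical uncompressed copy.
SELECT pg_size_pretty(pg_total_relation_size('nation')) AS encoded_size,
       pg_size_pretty(pg_total_relation_size('nation_nocompress')) AS plain_size;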

Use dictionary encoding to accelerate queries

After data is compressed by using dictionary encoding, the encoded data can also be used directly to accelerate queries. In filtering, aggregation, and sorting scenarios that involve low-cardinality string data, dictionary encoding can improve query performance by 10% to 200%, depending on the SQL statement.

To enable dictionary encoding acceleration, set the adbpg_enable_encode_optimize parameter to ON. By default, this parameter is set to OFF. After you enable dictionary encoding acceleration, the optimizer automatically generates appropriate execution plans for SQL statements that meet the dictionary encoding conditions.
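
For example, you can enable the parameter at the session level and confirm the current setting. The SET statement is the same one that is used later in this topic; SHOW is the standard PostgreSQL way to check the value of a parameter.

SET adbpg_enable_encode_optimize = ON;
SHOW adbpg_enable_encode_optimize;

The following sample statement filters and aggregates on the dictionary-encoded n_name column: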

SELECT
	n_name,
	max(n_regionkey)
FROM
	nation
WHERE
	n_name > 'ALGERIA'
	AND n_name < 'FRANCE'
GROUP BY
	n_name;

You can use an EXPLAIN statement to query the execution plan of the preceding statement. The execution plan shows the encoding method that is used and a Decode operator that decodes the encoded data.

 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=76.72..78.27 rows=81 width=36)
   ->  GroupAggregate  (cost=76.72..77.19 rows=27 width=36)
         Group Key: n_name
         ->  Sort  (cost=76.72..76.79 rows=28 width=36)
               Sort Key: n_name
               ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=0.00..76.05 rows=28 width=36)
                     Hash Key: n_name
                     ->  Decode  (cost=0.00..75.50 rows=28 width=36)
                           Params: dictKeyIdx[1], dictID[0]
                           ->  Seq Scan on nation  (cost=0.00..75.50 rows=28 width=36)
                                 Filter: ((ordered_encode(n_name) > ('ALGERIA'::text)::integer with dict ID 0) AND (ordered_encode(n_name) < ('FRANCE'::text)::integer with dict ID 0))
 Optimizer: Postgres query optimizer
(12 rows)

Compared with the original string data, the integer data that is generated by dictionary encoding can be filtered and aggregated more efficiently, which improves query performance. In addition, the optimizer adds a Decode operator to the execution plan to decode the encoded data, which ensures that the query returns correct results.
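
Sorting on a dictionary-encoded column can benefit in the same way. The following is a minimal sketch; whether the plan actually uses the encoded representation depends on your instance version and on whether adbpg_enable_encode_optimize is set to ON.

EXPLAIN
SELECT n_name
FROM nation
ORDER BY n_name;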

Use automatic compression encoding

The Beam storage engine supports an adaptive compression algorithm. If you are not sure whether dictionary encoding is suitable, you can specify compresstype = 'auto' when you create a table. This way, the Beam storage engine automatically selects an appropriate encoding compression algorithm for the table.

CREATE TABLE NATION (
    n_nationkey integer NOT NULL,
    n_name character(25) NOT NULL,
    n_regionkey integer NOT NULL,
    n_comment character varying(152)
)
USING beam WITH (compresstype = 'auto') 
DISTRIBUTED by (n_nationkey);

The following table describes the encoding methods for each column of the nation table that is configured with automatic compression encoding.

| Column | Data type | Encoding method | Description |
| ------ | --------- | --------------- | ----------- |
| n_nationkey | Integer | LZ4 or integer encoding | This column contains integer data. By default, LZ4 or integer encoding is used. |
| n_name | Character(25) | Dictionary encoding | The n_name column contains the names of specific countries, which are a fixed set of fixed-length strings. Dictionary encoding is suitable for this column. |
| n_regionkey | Integer | LZ4 or integer encoding | This column contains integer data. By default, LZ4 or integer encoding is used. |
| n_comment | Character varying(152) | Dictionary encoding | The n_comment column contains unique string data. If the number of data rows is greater than the encoding threshold of 255, LZ4 encoding is used for this column instead of dictionary encoding. |
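
If you want to confirm the table-level compression options that are stored for a table, you can query the reloptions column of the pg_class catalog. This is a minimal sketch that uses a standard PostgreSQL catalog; it shows only the options specified in the WITH clause, and whether the encoding that the automatic algorithm selects for each column is exposed in system catalogs may vary by instance version.

SELECT relname, reloptions
FROM pg_class
WHERE relname = 'nation';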

Delta support for dictionary encoding

The Beam storage engine consists of the following parts:

  • The row-oriented Delta storage that handles real-time writes.

  • The PAX-based column-oriented Base storage that handles batch writes and large-scale scanning.

Important

AnalyticDB for PostgreSQL instances of V7.0.2.3 or later support the hybrid row-column Base storage.

AnalyticDB for PostgreSQL instances of V7.0.4.0 or later support dictionary encoding on the row-oriented Delta storage, which makes dictionary encoding easier to use.

For example, execute the following statements and then use an EXPLAIN statement to query the execution plan. The execution plan shows a Decode operator for decoding, which indicates that dictionary encoding also takes effect on data that is written to the row-oriented Delta storage in real time. This improves data processing performance while ensuring correct results.

CREATE TABLE test_auto(a int, b text)
USING beam WITH (compresstype = 'auto', compresslevel = 5);

INSERT INTO test_auto VALUES (1,'adbpg1'),(2,'adbpg2'),(3,'adbpg3'),(10,'adbpg10');

SET adbpg_enable_encode_optimize TO ON;

EXPLAIN SELECT * FROM test_auto WHERE b = 'adbpg1';
 QUERY PLAN
-----------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..2.04 rows=1 width=36)
   ->  Decode  (cost=0.00..2.02 rows=1 width=36)
         Params: dictKeyIdx[2], dictID[0]
         ->  Seq Scan on test_auto  (cost=0.00..2.02 rows=1 width=36)
               Filter: (encode(b) = ('adbpg1'::text)::integer with dict ID 0)
 Optimizer: Postgres-based planner