By digoal
MD5 hash example: c4ca4238a0b923820dcc509a6f75849b. In databases, text or varchar is required to store MD5 hashes, which is space-wasting. The reason is that the text type requires a header of 1 to 4 bytes, and each character consumes 1 byte. So, 33 bytes are needed to store the above MD5 hash.
In fact, you can compress the hash to 128 bits for storage. This is what the md5hash plug-in does. It defines a md5hash data type, which is stored at the bottom with 128 bits.
This extension provides a simple data type storing 128-bit values (e.g., MD5 hashes) in a bit more efficient way - in a fixed-length column and inline.
The extension defines the 'md5hash' type itself, casts, operators, and an operator class for b-tree indexes. It's fairly straightforward and simple to add more operators or classes.
Take the usage of this plug-in in PostgreSQL (PG) 12 as an example.
1) Install PG 12.
wget http://api.pgxn.org/dist/md5hash/1.0.1/md5hash-1.0.1.zip
unzip md5hash-1.0.1.zip
cd md5hash-1.0.1/
export PATH=$PGHOME/bin:$PATH
USE_PGXS=1 make
USE_PGXS=1 make install
2) Install the plug-in.
postgres=# create extension md5hash;
CREATE EXTENSION
3) Use the md5hash data type.
postgres=# create table ts(id md5hash primary key);
CREATE TABLE
postgres=# create table ts1(id md5hash, id1 text);
CREATE TABLE
postgres=# insert into ts1 select md5(i::text),md5(i::text) from generate_series(1,100) i;
INSERT 0 100
postgres=# select * from ts1 limit 10;
id | id1
----------------------------------+----------------------------------
c4ca4238a0b923820dcc509a6f75849b | c4ca4238a0b923820dcc509a6f75849b
c81e728d9d4c2f636f067f89cc14862c | c81e728d9d4c2f636f067f89cc14862c
eccbc87e4b5ce2fe28308fd9f2a7baf3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
a87ff679a2f3e71d9181a67b7542122c | a87ff679a2f3e71d9181a67b7542122c
e4da3b7fbbce2345d7772b0674a318d5 | e4da3b7fbbce2345d7772b0674a318d5
1679091c5a880faf6fb5e6087eb1b2dc | 1679091c5a880faf6fb5e6087eb1b2dc
8f14e45fceea167a5a36dedd4bea2543 | 8f14e45fceea167a5a36dedd4bea2543
c9f0f895fb98ab9159f51fd0297e236d | c9f0f895fb98ab9159f51fd0297e236d
45c48cce2e2d7fbdea1afc51c7c6ad26 | 45c48cce2e2d7fbdea1afc51c7c6ad26
d3d9446802a44259755d38e6d163e820 | d3d9446802a44259755d38e6d163e820
(10 rows)
When comparing md5hash with text, compulsory conversion is needed.
postgres=# select * from ts1 where id::text=id1 limit 10;
id | id1
----------------------------------+----------------------------------
c4ca4238a0b923820dcc509a6f75849b | c4ca4238a0b923820dcc509a6f75849b
c81e728d9d4c2f636f067f89cc14862c | c81e728d9d4c2f636f067f89cc14862c
eccbc87e4b5ce2fe28308fd9f2a7baf3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
a87ff679a2f3e71d9181a67b7542122c | a87ff679a2f3e71d9181a67b7542122c
e4da3b7fbbce2345d7772b0674a318d5 | e4da3b7fbbce2345d7772b0674a318d5
1679091c5a880faf6fb5e6087eb1b2dc | 1679091c5a880faf6fb5e6087eb1b2dc
8f14e45fceea167a5a36dedd4bea2543 | 8f14e45fceea167a5a36dedd4bea2543
c9f0f895fb98ab9159f51fd0297e236d | c9f0f895fb98ab9159f51fd0297e236d
45c48cce2e2d7fbdea1afc51c7c6ad26 | 45c48cce2e2d7fbdea1afc51c7c6ad26
d3d9446802a44259755d38e6d163e820 | d3d9446802a44259755d38e6d163e820
(10 rows)
postgres=# select * from ts1 where id=id1::md5hash limit 10;
id | id1
----------------------------------+----------------------------------
c4ca4238a0b923820dcc509a6f75849b | c4ca4238a0b923820dcc509a6f75849b
c81e728d9d4c2f636f067f89cc14862c | c81e728d9d4c2f636f067f89cc14862c
eccbc87e4b5ce2fe28308fd9f2a7baf3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
a87ff679a2f3e71d9181a67b7542122c | a87ff679a2f3e71d9181a67b7542122c
e4da3b7fbbce2345d7772b0674a318d5 | e4da3b7fbbce2345d7772b0674a318d5
1679091c5a880faf6fb5e6087eb1b2dc | 1679091c5a880faf6fb5e6087eb1b2dc
8f14e45fceea167a5a36dedd4bea2543 | 8f14e45fceea167a5a36dedd4bea2543
c9f0f895fb98ab9159f51fd0297e236d | c9f0f895fb98ab9159f51fd0297e236d
45c48cce2e2d7fbdea1afc51c7c6ad26 | 45c48cce2e2d7fbdea1afc51c7c6ad26
d3d9446802a44259755d38e6d163e820 | d3d9446802a44259755d38e6d163e820
(10 rows)
The comparison of the storage space shows that md5hash saves space.
postgres=# select pg_column_size(id), pg_column_size(id1) from ts1 limit 10;
pg_column_size | pg_column_size
----------------+----------------
16 | 33
16 | 33
16 | 33
16 | 33
16 | 33
16 | 33
16 | 33
16 | 33
16 | 33
16 | 33
(10 rows)
In addition to space savings, md5hash is more efficient in data querying.
CREATE TABLE test_md5hash (id md5hash PRIMARY KEY);
CREATE TABLE test_text (id varchar(32) PRIMARY KEY);
INSERT INTO test_md5hash SELECT md5(i::text)
FROM generate_series(1,1000000) s(i);
INSERT INTO test_text SELECT md5(i::text)
FROM generate_series(1,1000000) s(i);
SELECT relname,
(pg_relation_size(oid)/1024) AS relation_size_kB,
(pg_total_relation_size(oid)/1024) AS total_size_kB
FROM pg_class WHERE relname LIKE 'test_%';
relname | relation_size_kb | total_size_kb
-------------------+------------------+---------------
test_md5hash | 43248 | 82744
test_md5hash_pkey | 39464 | 39464
test_text | 66672 | 141816
test_text_pkey | 75096 | 75096
(4 rows)
Now, let's see the difference when querying the column. To eliminate planning overhead, we've used prepared statements and executed 1.000.000 queries with 'WHERE id = $1' condition (using the PK index).
TEXT 129 seconds
md5hash 117 seconds
The md5hash plug-in stores MD5 hashes in 128 bits, saving the storage space and the usage of memories such as shared buffers. For queries on the pkey column, md5hash is more efficient than text.
PostgreSQL pg_pathman: Partition Tables into Native Partition Tables
digoal - December 11, 2019
Yunlei - March 4, 2024
ApsaraDB - October 20, 2020
Alibaba Clouder - March 6, 2020
digoal - May 9, 2020
digoal - May 17, 2021
Alibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreAn online MPP warehousing service based on the Greenplum Database open source program
Learn MorePlan and optimize your storage budget with flexible storage services
Learn MoreA cost-effective, efficient and easy-to-manage hybrid cloud storage solution.
Learn MoreMore Posts by digoal