By Digoal
It is easy to implement full-text searches and fuzzy queries in PostgreSQL. The pg_trgm plugin allows you to use indexes to speed up fuzzy queries (fuzzy prefix, suffix, or middle-string query), similarity queries, and regexp queries. You can also use word-break plugins to break down characters in multiple languages, for example, to break down Chinese characters you can use pg_jieba, pg_scws, and zhparser.
However, what if you want your database to support encrypted storage and perform fuzzy queries or full-text searches on data before it is encrypted? This article addresses these challenges in detail in the following sections.
There are several types of data encryption and we recommend that you choose a proper encryption type depending on your actual scenario.
Reversible encryption (for example, pgcrypto) supports encryption methods like symmetric encryption. and allows you to store data in an encrypted form. However, it is not easy to create an index on encrypted data.
Fortunately, PostgreSQL supports expression indexes. It means you can create indexes on decrypted expressions to implement acceleration. However, note that expressions of indexes change to the form prior to the encryption. Therefore, avoid this encryption type to reduce the risk.
Comparatively, it is even more difficult to create indexes on data that uses the irreversible encryption (for example, MD5). However, creating indexes is executable for this encryption type and is very secure.
Transparent encryption requires the transformation of kernels. Data is stored in encrypted form and encryption/decryption is automatically performed during the input/output process.
The following example illustrates how to implement full-text search and fuzzy queries for reversibly encrypted data.
Step 1. Create the crypto plugin.
create extension pgcrypto;
Step 2. Create an immutable key function by using the superuser account.
create or replace function f1(int) returns text as $$
select md5(md5(md5($1::text)));
$$ language sql strict immutable;
If it is a C function, you can hide it better. Also, you can store the key in an external table such as another database, LDAP or another network service that is accessible through the PostgreSQL fdw interface. It is significant to note that the user can obtain the key only after the corresponding query permission is granted.
Step 3. Hide the code of the key function as shown below.
revoke select on pg_proc from public;
grant select(oid,proname,pronamespace,proowner,prolang,procost,prorows,provariadic,protransform,proisagg,proiswindow,prosecdef,proleakproof,proisstrict,proretset,provolatile,pronargs,pronargdefaults,prorettype,proargtypes,proallargtypes,proargmodes,proargnames,proargdefaults,probin,proconfig,proacl) on pg_proc to public;
revoke select(prosrc) on pg_proc from public;
revoke execute on function pg_get_functiondef(oid) from public;
Step 4. Next, set access to the key function.
grant execute on function f1(int) to digoal;
Step 5. Create a test table.
create table test (id int , info bytea);
Step 6. Now create an expression (decrypted) index.
create or replace function enc(bytea,text,text) returns text as $$
select pgp_sym_decrypt($1,$2,$3);
$$ language sql strict immutable;
create index idx1 on test (enc(info, f1(1), 'cipher-algo=bf, compress-algo=2, compress-level=9'));
Step 7. Write data as shown in the example below.
insert into test values (1, pgp_sym_encrypt('abcdefg', f1(1), 'cipher-algo=bf, compress-algo=2, compress-level=9'));
Step 8. Now, perform a query as shown in the example below.
postgres=> select * from test;
id | info
----+------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | \xc30d040403029b1c64cd9b1093ba62d23b019368155e5c6ff91bb144bc1c2852c9ab21971d62ea529056ff3a588229044ff54fe15292db6765c9d69ad0e6649f57b34f6e374883c87903b099
(1 row)
Query the data using the index.
postgres=> select * from test where enc(info, f1(1), 'cipher-algo=bf, compress-algo=2, compress-level=9') = 'abcdefg';
id | info
----+------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | \xc30d040403029b1c64cd9b1093ba62d23b019368155e5c6ff91bb144bc1c2852c9ab21971d62ea529056ff3a588229044ff54fe15292db6765c9d69ad0e6649f57b34f6e374883c87903b099
(1 row)
postgres=> explain select * from test where enc(info, f1(1), 'cipher-algo=bf, compress-algo=2, compress-level=9') = 'abcdefg';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=1.40..6.78 rows=6 width=36)
Recheck Cond: (pgp_sym_decrypt(info, '40f5888b67c748df7efba008e7c2f9d2'::text, 'cipher-algo=bf, compress-algo=2, compress-level=9'::text) = 'abcdefg'::text)
-> Bitmap Index Scan on idx1 (cost=0.00..1.40 rows=6 width=0)
Index Cond: (pgp_sym_decrypt(info, '40f5888b67c748df7efba008e7c2f9d2'::text, 'cipher-algo=bf, compress-algo=2, compress-level=9'::text) = 'abcdefg'::text)
(4 rows)
Step 9. Support full-text searches and fuzzy queries. Create an index based on the expression- enc(info, f1(1), 'cipher-algo=bf, compress-algo=2, compress-level=9') to implement full-text searches and fuzzy queries.
To support full-text searches and fuzzy queries for the irreversible encryption type, you need to convert strings to tokens, irreversibly encrypt tokens and store them into arrays.
While executing queries, convert the query string to tokens, irreversibly encrypt these tokens and retrieve inclusive or overlapping data in arrays by using the index. This will speed up the query and ensure the highest security.
Consider the following example table.
create table tbl(
id int,
info bytea, -- 原始字符串加密存储,使用可逆加密(秘钥存储在客户端),不建立索引。
info_arr text[] -- 客户端解密后,TOKEN化,然后使用不可逆加密,得到加密后的TOKEN数组。
);
Step 1. Use a language-specific full-text search plugin to tokenize strings for performing a full-text search. Obtain the tokens of words and phrases (tsvector) and convert tsvector to arrays.
tsvector_to_array(tsvector) 得到 text[]
Step 2. Convert content strings to tokens (double-character tokens).
create or replace function two_token(text) returns text[] as $$
declare
res text[] := '{}'::text[];
len int := length($1);
begin
if len<=1 then
return array[$1];
end if;
for i in 1..len-1 loop
res := array_append(res, substring($1, i, 2));
end loop;
return res;
end;
$$ language plpgsql strict immutable;
postgres=> select two_token('abcde');
two_token
---------------
{ab,bc,cd,de}
(1 row)
Step 3. Convert content strings to tokens (single-character tokens). Use regexp_split_to_array to obtain single-character arrays.
postgres=# select regexp_split_to_array('abcde','');
regexp_split_to_array
-----------------------
{a,b,c,d,e}
(1 row)
Step 4. Now, encrypt tokens as shown below.
create or replace function md5_token(text[]) returns text[] as $$
declare
res text[] := '{}'::text[];
i text;
begin
foreach i in array $1 loop
res := array_append(res, md5(i));
end loop;
return res;
end;
$$ language plpgsql strict immutable;
postgres=> select md5_token(two_token('abcde'));
md5_token
---------------------------------------------------------------------------------------------------------------------------------------
{187ef4436122d1cc2f40dc2b92f0eba0,5360af35bde9ebd8f01f492dc059593c,6865aeb3a9ed28f9a79ec454b259e5d0,5f02f0889301fd7be1ac972c11bf3e7d}
(1 row)
Step 5. Store encrypted arrays.
insert into tbl values (1, 客户端加密的bytea, 加密后的TEXT数组);
Step 6. Create an index on arrays using GIN.
create index idx on tbl using gin (info_arr);
Step 1. Convert the query string to tokens and encrypt tokens.
postgres=> select md5_token(two_token('abcde'));
md5_token
---------------------------------------------------------------------------------------------------------------------------------------
{187ef4436122d1cc2f40dc2b92f0eba0,5360af35bde9ebd8f01f492dc059593c,6865aeb3a9ed28f9a79ec454b259e5d0,5f02f0889301fd7be1ac972c11bf3e7d}
(1 row)
Step 2. Now, perform the first level of filtering.
select * from tbl where info_arr @> md5_token(two_token('abcde'));
或
select * from tbl where info_arr && md5_token(two_token('abcde'));
Step 3. Perform the second level of filtering by using the power of CPUs. The first level of filtering uses the GIN index and the second level filters require CPUs to perform operations. This ensures both high efficiency and security.
select * from tbl where info_arr && md5_token(two_token('abcde')) and 对称加密解密(info,'秘钥') ~ '正则表达式';
(1) Transparent encryption is of course the best. However, implementing transparent encryption requires the transformation of PostgreSQL kernels.
(2) Irreversible encryption features high security, but can exponentially increase storage usage.
(3) Reversible encryption has an average level of security. The security level is based on the prerequisite that secure key functions are used (not plaintext keys). However, even if secure key functions are used, the plaintext can still be stolen as a result of database attacks.
Data is not secured when the superuser or owner's permission is obtained without authorization.
Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.
PostgreSQL Responds to Fuzzy or Regular Expression-based Searches within Seconds
digoal - September 12, 2019
digoal - December 11, 2019
Alibaba Cloud Community - December 28, 2022
digoal - September 12, 2019
digoal - December 11, 2019
digoal - December 11, 2019
Alibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreTair is a Redis-compatible in-memory database service that provides a variety of data structures and enterprise-level capabilities.
Learn MoreAn online MPP warehousing service based on the Greenplum Database open source program
Learn MoreIndustry-standard hardware security modules (HSMs) deployed on Alibaba Cloud.
Learn MoreMore Posts by digoal