By digoal
PolarDB's is a cloud-native database that features a computing-storage separation architecture. It offers affordable data storage, efficient scalability, high-speed parallel computing capabilities, and fast data search and processing. By combining PolarDB with computing algorithms, it enables the conversion of data into productivity and drives business data output.
This article explains how the open-source version of PolarDB utilizes pg_similarity to implement 17 types of text similarity search. This includes tokenization and normalization techniques to retrieve similar texts based on their text similarity.
The test environment is macos + docker. For more information about deploying PolarDB, see:
pg_similarity supports the following 17 similarity algorithms
• L1 Distance (as known as City Block or Manhattan Distance);
• Cosine Distance;
• Dice Coefficient;
• Euclidean Distance;
• Hamming Distance;
• Jaccard Coefficient;
• Jaro Distance;
• Jaro-Winkler Distance;
• Levenshtein Distance;
• Matching Coefficient;
• Monge-Elkan Coefficient;
• Needleman-Wunsch Coefficient;
• Overlap Coefficient;
• Q-Gram Distance;
• Smith-Waterman Coefficient;
• Smith-Waterman-Gotoh Coefficient;
• Soundex Distance.
Most of the similarity algorithms mentioned above support index operations. For more information, please refer to: https://github.com/eulerto/pg_similarity
Note:
• The tokenization and normalization algorithm is determined by parameters. If the parameter is set to a when your data is written, the written text will be tokenized using algorithm a. If the parameter is changed to b in the future, the tokenization algorithm used may differ from the previous one. However, if the business allows, the tokenization algorithms can remain the same.
• The same principle applies when comparing text similarity.
1. Download and compile.
git clone --depth 1 https://github.com/eulerto/pg_similarity.git
cd pg_similarity/
USE_PGXS=1 make
USE_PGXS=1 make install
export PGHOST=127.0.0.1
[postgres@67e1eed1b4b6 pg_similarity]$ USE_PGXS=1 make installcheck
/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/pgxs/src/makefiles/../../src/test/regress/pg_regress --inputdir=./ --bindir='/home/postgres/tmp_basedir_polardb_pg_1100_bld/bin' --dbname=contrib_regression test1 test2 test3 test4
(using postmaster on 127.0.0.1, default port)
============== dropping database "contrib_regression" ==============
DROP DATABASE
============== creating database "contrib_regression" ==============
CREATE DATABASE
ALTER DATABASE
============== running regression test queries ==============
test test1 ... ok
test test2 ... ok
test test3 ... ok
test test4 ... ok
==========================================================
All 4 tests passed.
POLARDB:
All 4 tests, 0 tests in ignore, 0 tests in polar ignore.
==========================================================
2. Load the pg_similarity extension.
postgres=# create database db1;
CREATE DATABASE
postgres=# \c db1
You are now connected to database "db1" as user "postgres".
db1=# create extension pg_similarity ;
CREATE EXTENSION
3. The pg_similarity extension adds some functions and operators for similarity searches.
db1=# \df
List of functions
Schema | Name | Result data type | Argument data types | Type
--------+-------------------------+------------------+-------------------------------------------------------------------------------+------
public | block | double precision | text, text | func
public | block_op | boolean | text, text | func
public | cosine | double precision | text, text | func
public | cosine_op | boolean | text, text | func
public | dice | double precision | text, text | func
public | dice_op | boolean | text, text | func
public | euclidean | double precision | text, text | func
public | euclidean_op | boolean | text, text | func
public | gin_extract_query_token | internal | internal, internal, smallint, internal, internal, internal, internal | func
public | gin_extract_value_token | internal | internal, internal, internal | func
public | gin_token_consistent | boolean | internal, smallint, internal, integer, internal, internal, internal, internal | func
public | hamming | double precision | bit varying, bit varying | func
public | hamming_op | boolean | bit varying, bit varying | func
public | hamming_text | double precision | text, text | func
public | hamming_text_op | boolean | text, text | func
public | jaccard | double precision | text, text | func
public | jaccard_op | boolean | text, text | func
public | jaro | double precision | text, text | func
public | jaro_op | boolean | text, text | func
public | jarowinkler | double precision | text, text | func
public | jarowinkler_op | boolean | text, text | func
public | lev | double precision | text, text | func
public | lev_op | boolean | text, text | func
public | matchingcoefficient | double precision | text, text | func
public | matchingcoefficient_op | boolean | text, text | func
public | mongeelkan | double precision | text, text | func
public | mongeelkan_op | boolean | text, text | func
public | needlemanwunsch | double precision | text, text | func
public | needlemanwunsch_op | boolean | text, text | func
public | overlapcoefficient | double precision | text, text | func
public | overlapcoefficient_op | boolean | text, text | func
public | qgram | double precision | text, text | func
public | qgram_op | boolean | text, text | func
public | smithwaterman | double precision | text, text | func
public | smithwaterman_op | boolean | text, text | func
public | smithwatermangotoh | double precision | text, text | func
public | smithwatermangotoh_op | boolean | text, text | func
public | soundex | double precision | text, text | func
public | soundex_op | boolean | text, text | func
(39 rows)
db1=# \do
List of operators
Schema | Name | Left arg type | Right arg type | Result type | Description
--------+------+---------------+----------------+-------------+-------------
public | ~!! | text | text | boolean |
public | ~!~ | text | text | boolean |
public | ~## | text | text | boolean |
public | ~#~ | text | text | boolean |
public | ~%% | text | text | boolean |
public | ~** | text | text | boolean |
public | ~*~ | text | text | boolean |
public | ~++ | text | text | boolean |
public | ~-~ | text | text | boolean |
public | ~== | text | text | boolean |
public | ~=~ | text | text | boolean |
public | ~?? | text | text | boolean |
public | ~@@ | text | text | boolean |
public | ~@~ | text | text | boolean |
public | ~^^ | text | text | boolean |
public | ~|| | text | text | boolean |
public | ~~~ | text | text | boolean |
(17 rows)
4. Common configuration of pg_similarity. We only need to configure pg_similarity to the shared_preload_libraries to start testing.
[postgres@67e1eed1b4b6 pg_similarity]$ cat pg_similarity.conf.sample
#-----------------------------------------------------------------------
# postgresql.conf
#-----------------------------------------------------------------------
# the former needs a restart every time you upgrade pg_similarity and
# the later needs that you create a $libdir/plugins directory and move
# pg_similarity.so to it (it doesn't require a restart; just open a new
# connection).
#shared_preload_libraries = 'pg_similarity'
# - or -
#local_preload_libraries = 'pg_similarity'
#-----------------------------------------------------------------------
# pg_similarity
#-----------------------------------------------------------------------
# - Block -
#pg_similarity.block_tokenizer = 'alnum' # alnum, camelcase, gram, or word
#pg_similarity.block_threshold = 0.7 # 0.0 .. 1.0
#pg_similarity.block_is_normalized = true
# - Cosine -
#pg_similarity.cosine_tokenizer = 'alnum'
#pg_similarity.cosine_threshold = 0.7
#pg_similarity.cosine_is_normalized = true
# - Dice -
#pg_similarity.dice_tokenizer = 'alnum'
#pg_similarity.dice_threshold = 0.7
#pg_similarity.dice_is_normalized = true
# - Euclidean -
#pg_similarity.euclidean_tokenizer = 'alnum'
#pg_similarity.euclidean_threshold = 0.7
#pg_similarity.euclidean_is_normalized = true
# - Hamming -
#pg_similarity.hamming_threshold = 0.7
#pg_similarity.hamming_is_normalized = true
# - Jaccard -
#pg_similarity.jaccard_tokenizer = 'alnum'
#pg_similarity.jaccard_threshold = 0.7
#pg_similarity.jaccard_is_normalized = true
# - Jaro -
#pg_similarity.jaro_threshold = 0.7
#pg_similarity.jaro_is_normalized = true
# - Jaro -
#pg_similarity.jaro_threshold = 0.7
#pg_similarity.jaro_is_normalized = true
# - Jaro-Winkler -
#pg_similarity.jarowinkler_threshold = 0.7
#pg_similarity.jarowinkler_is_normalized = true
# - Levenshtein -
#pg_similarity.levenshtein_threshold = 0.7
#pg_similarity.levenshtein_is_normalized = true
# - Matching Coefficient -
#pg_similarity.matching_tokenizer = 'alnum'
#pg_similarity.matching_threshold = 0.7
#pg_similarity.matching_is_normalized = true
# - Monge-Elkan -
#pg_similarity.mongeelkan_tokenizer = 'alnum'
#pg_similarity.mongeelkan_threshold = 0.7
#pg_similarity.mongeelkan_is_normalized = true
# - Needleman-Wunsch -
#pg_similarity.nw_threshold = 0.7
#pg_similarity.nw_is_normalized = true
# - Overlap Coefficient -
#pg_similarity.overlap_tokenizer = 'alnum'
#pg_similarity.overlap_threshold = 0.7
#pg_similarity.overlap_is_normalized = true
# - Q-Gram -
#pg_similarity.qgram_tokenizer = 'qgram'
#pg_similarity.qgram_threshold = 0.7
#pg_similarity.qgram_is_normalized = true
# - Smith-Waterman -
#pg_similarity.sw_threshold = 0.7
#pg_similarity.sw_is_normalized = true
# - Smith-Waterman-Gotoh -
#pg_similarity.swg_threshold = 0.7
#pg_similarity.swg_is_normalized = true
5. Test the similarity search and import test data.
[postgres@67e1eed1b4b6 ~]$ cd pg_similarity/
[postgres@67e1eed1b4b6 pg_similarity]$ psql
psql (11.9)
Type "help" for help.
postgres=# CREATE TABLE simtst (a text);
CREATE TABLE
postgres=#
postgres=# INSERT INTO simtst (a) VALUES
postgres-# ('Euler Taveira de Oliveira'),
postgres-# ('EULER TAVEIRA DE OLIVEIRA'),
postgres-# ('Euler T. de Oliveira'),
postgres-# ('Oliveira, Euler T.'),
postgres-# ('Euler Oliveira'),
postgres-# ('Euler Taveira'),
postgres-# ('EULER TAVEIRA OLIVEIRA'),
postgres-# ('Oliveira, Euler'),
postgres-# ('Oliveira, E. T.'),
postgres-# ('ETO');
INSERT 0 10
postgres=#
postgres=# \copy simtst FROM 'data/similarity.data'
COPY 2999
6. Test the similarity search and create a GIN index.
https://github.com/eulerto/pg_similarity/blob/master/pg_similarity--1.0.sql
The following operators support index retrieval:
CREATE OPERATOR CLASS gin_similarity_ops
FOR TYPE text USING gin
AS
OPERATOR 1 ~++, -- block
OPERATOR 2 ~##, -- cosine
OPERATOR 3 ~-~, -- dice
OPERATOR 4 ~!!, -- euclidean
OPERATOR 5 ~??, -- jaccard
-- OPERATOR 6 ~%%, -- jaro
-- OPERATOR 7 ~@@, -- jarowinkler
-- OPERATOR 8 ~==, -- lev
OPERATOR 9 ~^^, -- matchingcoefficient
-- OPERATOR 10 ~||, -- mongeelkan
-- OPERATOR 11 ~#~, -- needlemanwunsch
OPERATOR 12 ~**, -- overlapcoefficient
OPERATOR 13 ~~~, -- qgram
-- OPERATOR 14 ~=~, -- smithwaterman
-- OPERATOR 15 ~!~, -- smithwatermangotoh
-- OPERATOR 16 ~*~, -- soundex
FUNCTION 1 bttextcmp(text, text),
FUNCTION 2 gin_extract_value_token(internal, internal, internal),
FUNCTION 3 gin_extract_query_token(internal, internal, int2, internal, internal, internal, internal),
FUNCTION 4 gin_token_consistent(internal, int2, internal, int4, internal, internal, internal, internal),
STORAGE text;
postgres=# create index on simtst using gin (a gin_similarity_ops);
CREATE INDEX
7. Test the similarity search and use the index to lock the target data at high speed according to the similarity.
The target data can be adjusted according to the threshold, and only the similarity that is greater than or equal to it will be returned.
The larger the similarity of the threshold, the more convergent the range and the better the performance.
You can set the threshold in the function and return it in stages.
postgres=# show pg_similarity.cosine_tokenizer;
pg_similarity.cosine_tokenizer
--------------------------------
alnum
(1 row)
postgres=# show pg_similarity.cosine_threshold;
pg_similarity.cosine_threshold
--------------------------------
0.7
(1 row)
postgres=# show pg_similarity.cosine_is_normalized;
pg_similarity.cosine_is_normalized
------------------------------------
on
(1 row)
postgres=# select *, cosine(a, 'hello') from simtst where a ~## 'hello' limit 10;
a | cosine
---+--------
(0 rows)
postgres=# select *, cosine(a, 'EULER TAVEIRA DE OLIVEI') from simtst where a ~## 'EULER TAVEIRA DE OLIVEI' limit 10;
a | cosine
---------------------------+--------
EULER TAVEIRA DE OLIVEIRA | 0.75
(1 row)
postgres=# explain select *, cosine(a, 'EULER TAVEIRA DE OLIVEI') from simtst where a ~## 'EULER TAVEIRA DE OLIVEI' limit 10;
QUERY PLAN
----------------------------------------------------------------------------------
Limit (cost=36.02..44.29 rows=3 width=40)
-> Bitmap Heap Scan on simtst (cost=36.02..44.29 rows=3 width=40)
Recheck Cond: (a ~## 'EULER TAVEIRA DE OLIVEI'::text)
-> Bitmap Index Scan on simtst_a_idx (cost=0.00..36.02 rows=3 width=0)
Index Cond: (a ~## 'EULER TAVEIRA DE OLIVEI'::text)
(5 rows)
postgres=# set pg_similarity.cosine_threshold=0.75;
SET
postgres=# select *, cosine(a, 'EULER TAVEIRA DE OLIVEI') from simtst where a ~## 'EULER TAVEIRA DE OLIVEI' limit 10;
a | cosine
---------------------------+--------
EULER TAVEIRA DE OLIVEIRA | 0.75
(1 row)
postgres=# set pg_similarity.cosine_threshold=0.76;
SET
postgres=# select *, cosine(a, 'EULER TAVEIRA DE OLIVEI') from simtst where a ~## 'EULER TAVEIRA DE OLIVEI' limit 10;
a | cosine
---+--------
(0 rows)
Open Source PolarDB Uses pg_rational Extension to Support Stern-Brocot Trees
Open Source PolarDB Uses Orafce to Support Oracle Compatibility
digoal - October 12, 2023
digoal - February 5, 2020
digoal - October 12, 2022
digoal - February 3, 2020
Alibaba Clouder - March 8, 2017
ApsaraDB - December 11, 2024
Alibaba Cloud PolarDB for MySQL is a cloud-native relational database service 100% compatible with MySQL.
Learn MoreAlibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreAlibaba Cloud PolarDB for Xscale (PolarDB-X) is a cloud-native high-performance distributed database service independently developed by Alibaba Cloud.
Learn MoreMulti-source metrics are aggregated to monitor the status of your business and services in real time.
Learn MoreMore Posts by digoal