Open Source PolarDB Uses pg_similarity to Implement 17 Types of Text Similarity Search

By digoal

Background

PolarDB's is a cloud-native database that features a computing-storage separation architecture. It offers affordable data storage, efficient scalability, high-speed parallel computing capabilities, and fast data search and processing. By combining PolarDB with computing algorithms, it enables the conversion of data into productivity and drives business data output.

This article explains how the open-source version of PolarDB utilizes pg_similarity to implement 17 types of text similarity search. This includes tokenization and normalization techniques to retrieve similar texts based on their text similarity.

The test environment is macos + docker. For more information about deploying PolarDB, see:

Simple Deployment of PolarDB

pg_similarity for PolarDB

pg_similarity supports the following 17 similarity algorithms

• L1 Distance (as known as City Block or Manhattan Distance);
• Cosine Distance;
• Dice Coefficient;
• Euclidean Distance;
• Hamming Distance;
• Jaccard Coefficient;
• Jaro Distance;
• Jaro-Winkler Distance;
• Levenshtein Distance;
• Matching Coefficient;
• Monge-Elkan Coefficient;
• Needleman-Wunsch Coefficient;
• Overlap Coefficient;
• Q-Gram Distance;
• Smith-Waterman Coefficient;
• Smith-Waterman-Gotoh Coefficient;
• Soundex Distance.

Most of the similarity algorithms mentioned above support index operations. For more information, please refer to: https://github.com/eulerto/pg_similarity

Note:

• The tokenization and normalization algorithm is determined by parameters. If the parameter is set to a when your data is written, the written text will be tokenized using algorithm a. If the parameter is changed to b in the future, the tokenization algorithm used may differ from the previous one. However, if the business allows, the tokenization algorithms can remain the same.

• The same principle applies when comparing text similarity.

Deploy pg_similarity for PolarDB

1. Download and compile.

git clone --depth 1 https://github.com/eulerto/pg_similarity.git  
  
  
cd pg_similarity/  
  
USE_PGXS=1 make  
USE_PGXS=1 make install  
export PGHOST=127.0.0.1  
  
[postgres@67e1eed1b4b6 pg_similarity]$ USE_PGXS=1 make installcheck  
/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/pgxs/src/makefiles/../../src/test/regress/pg_regress --inputdir=./ --bindir='/home/postgres/tmp_basedir_polardb_pg_1100_bld/bin'      --dbname=contrib_regression test1 test2 test3 test4  
(using postmaster on 127.0.0.1, default port)  
============== dropping database "contrib_regression" ==============  
DROP DATABASE  
============== creating database "contrib_regression" ==============  
CREATE DATABASE  
ALTER DATABASE  
============== running regression test queries        ==============  
test test1                        ... ok  
test test2                        ... ok  
test test3                        ... ok  
test test4                        ... ok  
  
  
==========================================================  
 All 4 tests passed.   
  
 POLARDB:  
 All 4 tests, 0 tests in ignore, 0 tests in polar ignore.   
==========================================================

2. Load the pg_similarity extension.

postgres=# create database db1;  
CREATE DATABASE  
  
postgres=# \c db1  
You are now connected to database "db1" as user "postgres".  
db1=# create extension pg_similarity ;  
CREATE EXTENSION

3. The pg_similarity extension adds some functions and operators for similarity searches.

db1=# \df  
                                                             List of functions  
 Schema |          Name           | Result data type |                              Argument data types                              | Type   
--------+-------------------------+------------------+-------------------------------------------------------------------------------+------  
 public | block                   | double precision | text, text                                                                    | func  
 public | block_op                | boolean          | text, text                                                                    | func  
 public | cosine                  | double precision | text, text                                                                    | func  
 public | cosine_op               | boolean          | text, text                                                                    | func  
 public | dice                    | double precision | text, text                                                                    | func  
 public | dice_op                 | boolean          | text, text                                                                    | func  
 public | euclidean               | double precision | text, text                                                                    | func  
 public | euclidean_op            | boolean          | text, text                                                                    | func  
 public | gin_extract_query_token | internal         | internal, internal, smallint, internal, internal, internal, internal          | func  
 public | gin_extract_value_token | internal         | internal, internal, internal                                                  | func  
 public | gin_token_consistent    | boolean          | internal, smallint, internal, integer, internal, internal, internal, internal | func  
 public | hamming                 | double precision | bit varying, bit varying                                                      | func  
 public | hamming_op              | boolean          | bit varying, bit varying                                                      | func  
 public | hamming_text            | double precision | text, text                                                                    | func  
 public | hamming_text_op         | boolean          | text, text                                                                    | func  
 public | jaccard                 | double precision | text, text                                                                    | func  
 public | jaccard_op              | boolean          | text, text                                                                    | func  
 public | jaro                    | double precision | text, text                                                                    | func  
 public | jaro_op                 | boolean          | text, text                                                                    | func  
 public | jarowinkler             | double precision | text, text                                                                    | func  
 public | jarowinkler_op          | boolean          | text, text                                                                    | func  
 public | lev                     | double precision | text, text                                                                    | func  
 public | lev_op                  | boolean          | text, text                                                                    | func  
 public | matchingcoefficient     | double precision | text, text                                                                    | func  
 public | matchingcoefficient_op  | boolean          | text, text                                                                    | func  
 public | mongeelkan              | double precision | text, text                                                                    | func  
 public | mongeelkan_op           | boolean          | text, text                                                                    | func  
 public | needlemanwunsch         | double precision | text, text                                                                    | func  
 public | needlemanwunsch_op      | boolean          | text, text                                                                    | func  
 public | overlapcoefficient      | double precision | text, text                                                                    | func  
 public | overlapcoefficient_op   | boolean          | text, text                                                                    | func  
 public | qgram                   | double precision | text, text                                                                    | func  
 public | qgram_op                | boolean          | text, text                                                                    | func  
 public | smithwaterman           | double precision | text, text                                                                    | func  
 public | smithwaterman_op        | boolean          | text, text                                                                    | func  
 public | smithwatermangotoh      | double precision | text, text                                                                    | func  
 public | smithwatermangotoh_op   | boolean          | text, text                                                                    | func  
 public | soundex                 | double precision | text, text                                                                    | func  
 public | soundex_op              | boolean          | text, text                                                                    | func  
(39 rows)  
  
db1=# \do  
                             List of operators  
 Schema | Name | Left arg type | Right arg type | Result type | Description   
--------+------+---------------+----------------+-------------+-------------  
 public | ~!!  | text          | text           | boolean     |   
 public | ~!~  | text          | text           | boolean     |   
 public | ~##  | text          | text           | boolean     |   
 public | ~#~  | text          | text           | boolean     |   
 public | ~%%  | text          | text           | boolean     |   
 public | ~**  | text          | text           | boolean     |   
 public | ~*~  | text          | text           | boolean     |   
 public | ~++  | text          | text           | boolean     |   
 public | ~-~  | text          | text           | boolean     |   
 public | ~==  | text          | text           | boolean     |   
 public | ~=~  | text          | text           | boolean     |   
 public | ~??  | text          | text           | boolean     |   
 public | ~@@  | text          | text           | boolean     |   
 public | ~@~  | text          | text           | boolean     |   
 public | ~^^  | text          | text           | boolean     |   
 public | ~||  | text          | text           | boolean     |   
 public | ~~~  | text          | text           | boolean     |   
(17 rows)

4. Common configuration of pg_similarity. We only need to configure pg_similarity to the shared_preload_libraries to start testing.

[postgres@67e1eed1b4b6 pg_similarity]$ cat pg_similarity.conf.sample   
#-----------------------------------------------------------------------  
# postgresql.conf  
#-----------------------------------------------------------------------  
# the former needs a restart every time you upgrade pg_similarity and   
# the later needs that you create a $libdir/plugins directory and move   
# pg_similarity.so to it (it doesn't require a restart; just open a new  
# connection).  
#shared_preload_libraries = 'pg_similarity'  
# - or -  
#local_preload_libraries = 'pg_similarity'  
  
#-----------------------------------------------------------------------  
# pg_similarity  
#-----------------------------------------------------------------------  
  
# - Block -  
#pg_similarity.block_tokenizer = 'alnum'  # alnum, camelcase, gram, or word  
#pg_similarity.block_threshold = 0.7    # 0.0 .. 1.0  
#pg_similarity.block_is_normalized = true  
  
# - Cosine -  
#pg_similarity.cosine_tokenizer = 'alnum'  
#pg_similarity.cosine_threshold = 0.7  
#pg_similarity.cosine_is_normalized = true  
  
# - Dice -  
#pg_similarity.dice_tokenizer = 'alnum'  
#pg_similarity.dice_threshold = 0.7  
#pg_similarity.dice_is_normalized = true  
  
# - Euclidean -  
#pg_similarity.euclidean_tokenizer = 'alnum'  
#pg_similarity.euclidean_threshold = 0.7  
#pg_similarity.euclidean_is_normalized = true  
  
# - Hamming -  
#pg_similarity.hamming_threshold = 0.7  
#pg_similarity.hamming_is_normalized = true  
  
# - Jaccard -  
#pg_similarity.jaccard_tokenizer = 'alnum'  
#pg_similarity.jaccard_threshold = 0.7  
#pg_similarity.jaccard_is_normalized = true  
  
# - Jaro -  
#pg_similarity.jaro_threshold = 0.7  
#pg_similarity.jaro_is_normalized = true  
  
# - Jaro -  
#pg_similarity.jaro_threshold = 0.7  
#pg_similarity.jaro_is_normalized = true  
  
# - Jaro-Winkler -  
#pg_similarity.jarowinkler_threshold = 0.7  
#pg_similarity.jarowinkler_is_normalized = true  
  
# - Levenshtein -  
#pg_similarity.levenshtein_threshold = 0.7  
#pg_similarity.levenshtein_is_normalized = true  
  
# - Matching Coefficient -  
#pg_similarity.matching_tokenizer = 'alnum'  
#pg_similarity.matching_threshold = 0.7  
#pg_similarity.matching_is_normalized = true  
  
# - Monge-Elkan -  
#pg_similarity.mongeelkan_tokenizer = 'alnum'  
#pg_similarity.mongeelkan_threshold = 0.7  
#pg_similarity.mongeelkan_is_normalized = true  
  
# - Needleman-Wunsch -  
#pg_similarity.nw_threshold = 0.7  
#pg_similarity.nw_is_normalized = true  
  
# - Overlap Coefficient -  
#pg_similarity.overlap_tokenizer = 'alnum'  
#pg_similarity.overlap_threshold = 0.7  
#pg_similarity.overlap_is_normalized = true  
  
# - Q-Gram -  
#pg_similarity.qgram_tokenizer = 'qgram'  
#pg_similarity.qgram_threshold = 0.7  
#pg_similarity.qgram_is_normalized = true  
  
# - Smith-Waterman -  
#pg_similarity.sw_threshold = 0.7  
#pg_similarity.sw_is_normalized = true  
  
# - Smith-Waterman-Gotoh -  
#pg_similarity.swg_threshold = 0.7  
#pg_similarity.swg_is_normalized = true

5. Test the similarity search and import test data.

[postgres@67e1eed1b4b6 ~]$ cd pg_similarity/  
[postgres@67e1eed1b4b6 pg_similarity]$ psql  
psql (11.9)  
Type "help" for help.  
  
postgres=# CREATE TABLE simtst (a text);  
CREATE TABLE  
postgres=#   
postgres=# INSERT INTO simtst (a) VALUES  
postgres-# ('Euler Taveira de Oliveira'),  
postgres-# ('EULER TAVEIRA DE OLIVEIRA'),  
postgres-# ('Euler T. de Oliveira'),  
postgres-# ('Oliveira, Euler T.'),  
postgres-# ('Euler Oliveira'),  
postgres-# ('Euler Taveira'),  
postgres-# ('EULER TAVEIRA OLIVEIRA'),  
postgres-# ('Oliveira, Euler'),  
postgres-# ('Oliveira, E. T.'),  
postgres-# ('ETO');  
INSERT 0 10  
postgres=#   
postgres=# \copy simtst FROM 'data/similarity.data'  
COPY 2999

6. Test the similarity search and create a GIN index.

https://github.com/eulerto/pg_similarity/blob/master/pg_similarity--1.0.sql

The following operators support index retrieval:

CREATE OPERATOR CLASS gin_similarity_ops  
FOR TYPE text USING gin  
AS  
    OPERATOR    1   ~++,    -- block  
    OPERATOR    2   ~##,    -- cosine  
    OPERATOR    3   ~-~,    -- dice  
    OPERATOR    4   ~!!,    -- euclidean  
    OPERATOR    5   ~??,    -- jaccard  
--    OPERATOR    6   ~%%,    -- jaro  
--    OPERATOR    7   ~@@,    -- jarowinkler  
--    OPERATOR    8   ~==,    -- lev  
    OPERATOR    9   ~^^,    -- matchingcoefficient  
--    OPERATOR    10  ~||,    -- mongeelkan  
--    OPERATOR    11  ~#~,    -- needlemanwunsch  
    OPERATOR    12  ~**,    -- overlapcoefficient  
    OPERATOR    13  ~~~,    -- qgram  
--    OPERATOR    14  ~=~,    -- smithwaterman  
--    OPERATOR    15  ~!~,    -- smithwatermangotoh  
--    OPERATOR    16  ~*~,    -- soundex  
    FUNCTION    1   bttextcmp(text, text),  
    FUNCTION    2   gin_extract_value_token(internal, internal, internal),  
    FUNCTION    3   gin_extract_query_token(internal, internal, int2, internal, internal, internal, internal),  
    FUNCTION    4   gin_token_consistent(internal, int2, internal, int4, internal, internal, internal, internal),  
    STORAGE text;  
postgres=# create index on simtst using gin (a gin_similarity_ops);  
CREATE INDEX

7. Test the similarity search and use the index to lock the target data at high speed according to the similarity.

The target data can be adjusted according to the threshold, and only the similarity that is greater than or equal to it will be returned.

The larger the similarity of the threshold, the more convergent the range and the better the performance.

You can set the threshold in the function and return it in stages.

postgres=# show pg_similarity.cosine_tokenizer;  
 pg_similarity.cosine_tokenizer   
--------------------------------  
 alnum  
(1 row)  
  
postgres=# show pg_similarity.cosine_threshold;  
 pg_similarity.cosine_threshold   
--------------------------------  
 0.7  
(1 row)  
  
postgres=# show pg_similarity.cosine_is_normalized;  
 pg_similarity.cosine_is_normalized   
------------------------------------  
 on  
(1 row)  
  
postgres=# select *, cosine(a, 'hello')  from simtst where  a ~## 'hello' limit 10;  
 a | cosine   
---+--------  
(0 rows)  
  
postgres=# select *, cosine(a, 'EULER TAVEIRA DE OLIVEI')  from simtst where  a ~## 'EULER TAVEIRA DE OLIVEI' limit 10;  
             a             | cosine   
---------------------------+--------  
 EULER TAVEIRA DE OLIVEIRA |   0.75  
(1 row)  
  
postgres=# explain select *, cosine(a, 'EULER TAVEIRA DE OLIVEI')  from simtst where  a ~## 'EULER TAVEIRA DE OLIVEI' limit 10;  
                                    QUERY PLAN                                      
----------------------------------------------------------------------------------  
 Limit  (cost=36.02..44.29 rows=3 width=40)  
   ->  Bitmap Heap Scan on simtst  (cost=36.02..44.29 rows=3 width=40)  
         Recheck Cond: (a ~## 'EULER TAVEIRA DE OLIVEI'::text)  
         ->  Bitmap Index Scan on simtst_a_idx  (cost=0.00..36.02 rows=3 width=0)  
               Index Cond: (a ~## 'EULER TAVEIRA DE OLIVEI'::text)  
(5 rows)  
  
postgres=# set pg_similarity.cosine_threshold=0.75;  
SET  
postgres=# select *, cosine(a, 'EULER TAVEIRA DE OLIVEI')  from simtst where  a ~## 'EULER TAVEIRA DE OLIVEI' limit 10;  
             a             | cosine   
---------------------------+--------  
 EULER TAVEIRA DE OLIVEIRA |   0.75  
(1 row)  
  
postgres=# set pg_similarity.cosine_threshold=0.76;  
SET  
postgres=# select *, cosine(a, 'EULER TAVEIRA DE OLIVEI')  from simtst where  a ~## 'EULER TAVEIRA DE OLIVEI' limit 10;  
 a | cosine   
---+--------  
(0 rows)

References

https://github.com/eulerto/pg_similarity

Community

Open Source PolarDB Uses pg_similarity to Implement 17 Types of Text Similarity Search

Background

pg_similarity for PolarDB

Deploy pg_similarity for PolarDB

References

Read previous post:

Read next post:

digoal

You may also like

Comments

digoal

Related Products

PolarDB for MySQL

PolarDB for PostgreSQL

PolarDB for Xscale

Managed Service for Prometheus