ApsaraDB RDS for PostgreSQL provides the fuzzystrmatch extension. This extension supports the Soundex, Levenshtein, Metaphone, and Double Metaphone algorithms. You can use these algorithms to calculate the similarity and distance between strings.
Enable or disable the fuzzystrmatch extension
- Enable the fuzzystrmatch extension.
CREATE EXTENSION fuzzystrmatch;
- Disable the fuzzystrmatch extension.
DROP EXTENSION fuzzystrmatch;
Soundex
The Soundex algorithm converts similar-sounding words into the same code. However, this algorithm is unsuitable for non-English words.
The Soundex algorithm provides the following functions:
soundex(text) returns text
difference(text, text) returns int
- The soundex function converts a string into its Soundex code, such as A550.
- The difference function converts two strings into their Soundex codes. Then, the difference function reports the number of code matching positions between the two strings. A Soundex code consists of four characters. Therefore, the number of code matching positions ranges from 0 to 4. The value 0 indicates a zero match, and the value 4 indicates an exact match.
Examples:
SELECT soundex('hello world!');
SELECT soundex('Anne'), soundex('Andrew'), difference('Anne', 'Andrew');
SELECT soundex('Anne'), soundex('Margaret'), difference('Anne', 'Margaret');
CREATE TABLE s (nm text);
INSERT INTO s VALUES ('john');
INSERT INTO s VALUES ('joan');
INSERT INTO s VALUES ('wobbly');
INSERT INTO s VALUES ('jack');
SELECT * FROM s WHERE soundex(nm) = soundex('john');
SELECT * FROM s WHERE difference(s.nm, 'john') > 2;
Levenshtein
The Levenshtein algorithm calculates the Levenshtein distance between two strings.
The Levenshtein algorithm provides the following functions:
levenshtein(text source, text target, int ins_cost, int del_cost, int sub_cost) returns int
levenshtein(text source, text target) returns int
levenshtein_less_equal(text source, text target, int ins_cost, int del_cost, int sub_cost, int max_d) returns int
levenshtein_less_equal(text source, text target, int max_d) returns int
The following table describes the parameters that you must configure in the preceding functions.
Parameter | Description |
---|---|
source | The first string. The string cannot be empty and can contain up to 255 characters in length. |
target | The second string. The string cannot be empty and can contain up to 255 characters in length. |
ins_cost | The overhead that is required to insert characters. |
del_cost | The overhead that is required to delete characters. |
sub_cost | The overhead that is required to replace characters. |
max_d | The maximum Levenshtein distance that is allowed between the two specified strings. |
- If the actual distance is less than or equal to the value of the max_d parameter, the levenshtein_less_equal function returns the exact distance that is calculated.
- If the actual distance is greater than the value of the max_d parameter, the levenshtein_less_equal function returns a random distance that is greater than the value of the max_d parameter.
- If the value of the max_d parameter is negative, the levenshtein_less_equal and levenshtein functions return the same distance.
Examples:
SELECT levenshtein('GUMBO', 'GAMBOL');
SELECT levenshtein('GUMBO', 'GAMBOL', 2,1,1);
SELECT levenshtein_less_equal('extensive', 'exhaustive',2);
SELECT levenshtein_less_equal('extensive', 'exhaustive',4);
Metaphone
The Metaphone algorithm works in the same way as the Soundex algorithm. The Metaphone algorithm constructs a representative code for each specified string. If two strings have the same representative code, the Metaphone algorithm considers them to be similar.
The Metaphone algorithm provides the following functions:
metaphone(text source, int max_output_length) returns text
The following table describes the parameters that you must configure in the preceding functions.
Parameter | Description |
---|---|
source | A string that is not empty. The string can contain up to 255 characters in length. |
max_output_length | The maximum length of the Metaphone code that can be returned. If the Metaphone code exceeds the maximum length, the Metaphone algorithm truncates the Metaphone code to the maximum length. |
Example:
SELECT metaphone('GUMBO', 4);
Double Metaphone
The Double Metaphone algorithm obtains two similar-sounding codes for a specified string. These codes include a primary code and a secondary code. In most cases, the two codes are the same. They may be slightly different when you specify a non-English word. The difference varies based on the pronunciation.
The Double Metaphone algorithm provides the following functions:
dmetaphone(text source) returns text
dmetaphone_alt(text source) returns text
Examples:
select dmetaphone('gumbo');
select dmetaphone_alt('gumbo');