Similarity calculation of large amounts of data is key in areas such as e-commerce and search engine. Simple implementations of similarity calculation consume large amounts of resources and provides low performance. smlar is an open-source third-party extension for PolarDB for PostgreSQL. It provides functions that can calculate data similarity in databases and similarity operators that support GiST and GIN indexes. The smlar extension supports all built-in data types of PostgreSQL.
Prerequisites
The extension is supported on the PolarDB for PostgreSQL clusters that run the following engine:
PostgreSQL 14 (revision version 14.5.1.0 or later)
PostgreSQL 11 (revision version 1.1.28 or later)
You can execute one of the following statements to view the revision version of a PolarDB for PostgreSQL cluster:
PostgreSQL 14
SELECT version();
PostgreSQL 11
SHOW polar_version;
Usage
Install the smlar extension.
CREATE EXTENSION smlar;
NoteThe
%
operator of smlar conflicts with that of rum. Therefore, you cannot create thses two extensions in the same schema.Execute the following statement to calculate the similarity of arrays.
SELECT smlar('{3,2}'::int[], '{3,2,1}'); smlar ---------- 0.816497 (1 row) SELECT smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / (N.a + N.b)' ); smlar ---------- 0.333333 (1 row)
NoteFor more information about other functions, see Functions and operators.
Remove the smlar extension.
DROP EXTENSION smlar;
Functions and operators
Function and operator | Description |
float4 smlar(anyarray, anyarray) | Calculates the similarity of two arrays of the same data type. The arrays must use the same data type. |
float4 smlar(anyarray, anyarray, bool useIntersect) | Calculates the similarity of two arrays of custom composite types (elements and weights). The useIntersect parameter specifies whether only overlapping elements or all elements are involved in the operation. Note Execute the following statement to define composite types:
|
float4 smlar(anyarray a, anyarray b, text formula) | Calculates the similarity between two arrays of the same data type. The arrays are specified by the formula parameter. The predefined variables for the formula parameter:
|
anyarray % anyarray | Returns TRUE if the similarity of the two arrays exceeds the threshold. Otherwise, returns FALSE. |
text[] tsvector2textarray(tsvector) | Converts an array of the tsvector type to a string. |
anyarray array_unique(anyarray) | Sorts and deduplicates the array. |
float4 inarray(anyarray, anyelement) | Returns 1.0 if the element exists in the array. Otherwise, returns 0. |
float4 inarray(anyarray, anyelement, float4, float4) | Returns the third parameter if the element exists in the array. Otherwise, returns the fourth parameter. |
Parameters
Parameter | Description |
smlar.threshold FLOAT | The similarity threshold. It is used to determine whether two arrays combined with the percent sign (%) operator are similar. |
smlar.persistent_cache BOOL | Specifies whether to store the cached global statistics in transaction-independent memory. |
smlar.type STRING | The formula used to calculate the similarity. The following similarity types are available: consine (default), tfidf, and overlap. |
smlar.stattable STRING | The name of the table that stores collection range statistics. You can execute the following statement to create a table:
|
smlar.tf_method STRING | The method used to calculate the term frequency (TF). Default value: n. Valid values:
|
smlar.idf_plus_one BOOL | The method used to calculate the inverse document frequency (IDF). Default value: FALSE. Valid values:
|
References
For more information about the smlar extension, see: