All Products
Search
Document Center

PolarDB:smlar

Last Updated:Oct 27, 2024

Similarity calculation of large amounts of data is key in areas such as e-commerce and search engine. Simple implementations of similarity calculation consume large amounts of resources and provides low performance. smlar is an open-source third-party extension for PolarDB for PostgreSQL. It provides functions that can calculate data similarity in databases and similarity operators that support GiST and GIN indexes. The smlar extension supports all built-in data types of PostgreSQL.

Prerequisites

The extension is supported on the PolarDB for PostgreSQL clusters that run the following engine:

  • PostgreSQL 14 (revision version 14.5.1.0 or later)

  • PostgreSQL 11 (revision version 1.1.28 or later)

Note

You can execute one of the following statements to view the revision version of a PolarDB for PostgreSQL cluster:

  • PostgreSQL 14

    SELECT version();
  • PostgreSQL 11

    SHOW polar_version;

Usage

  1. Install the smlar extension.

    CREATE EXTENSION smlar;
    Note

    The % operator of smlar conflicts with that of rum. Therefore, you cannot create thses two extensions in the same schema.

  2. Execute the following statement to calculate the similarity of arrays.

    SELECT smlar('{3,2}'::int[], '{3,2,1}');
      smlar
    ----------
     0.816497
    (1 row)
    
    SELECT smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / (N.a + N.b)' );
      smlar
    ----------
     0.333333
    (1 row)
    Note

    For more information about other functions, see Functions and operators.

  3. Remove the smlar extension.

    DROP EXTENSION smlar;

Functions and operators

Function and operator

Description

float4 smlar(anyarray, anyarray)

Calculates the similarity of two arrays of the same data type. The arrays must use the same data type.

float4 smlar(anyarray, anyarray, bool useIntersect)

Calculates the similarity of two arrays of custom composite types (elements and weights). The useIntersect parameter specifies whether only overlapping elements or all elements are involved in the operation.

Note

Execute the following statement to define composite types:

CREATE TYPE type_name AS (element_name anytype, weight_name FLOAT4);

float4 smlar(anyarray a, anyarray b, text formula)

Calculates the similarity between two arrays of the same data type. The arrays are specified by the formula parameter. The predefined variables for the formula parameter:

  • N.i: the number of identical elements in two arrays (intersection).

  • N.a: the number of unique elements in the first array.

  • N.b: the number of unique elements in the second array.

anyarray % anyarray

Returns TRUE if the similarity of the two arrays exceeds the threshold. Otherwise, returns FALSE.

text[] tsvector2textarray(tsvector)

Converts an array of the tsvector type to a string.

anyarray array_unique(anyarray)

Sorts and deduplicates the array.

float4 inarray(anyarray, anyelement)

Returns 1.0 if the element exists in the array. Otherwise, returns 0.

float4 inarray(anyarray, anyelement, float4, float4)

Returns the third parameter if the element exists in the array. Otherwise, returns the fourth parameter.

Parameters

Parameter

Description

smlar.threshold FLOAT

The similarity threshold. It is used to determine whether two arrays combined with the percent sign (%) operator are similar.

smlar.persistent_cache BOOL

Specifies whether to store the cached global statistics in transaction-independent memory.

smlar.type STRING

The formula used to calculate the similarity. The following similarity types are available: consine (default), tfidf, and overlap.

smlar.stattable STRING

The name of the table that stores collection range statistics. You can execute the following statement to create a table:

CREATE TABLE table_name (
  value   data_type UNIQUE,
  ndoc    int4 (or bigint)  NOT NULL CHECK (ndoc>0)
);

smlar.tf_method STRING

The method used to calculate the term frequency (TF). Default value: n. Valid values:

  • n: simple counting

  • log: 1 + log(n)

  • const: 1

smlar.idf_plus_one BOOL

The method used to calculate the inverse document frequency (IDF). Default value: FALSE. Valid values:

  • FALSE :log(d/df)

  • TRUE :log(1 + d/df).

References

For more information about the smlar extension, see: