Full-text search

Full-text search is an information retrieval technology designed to quickly and accurately find queried information from a large amount of text data. PolarDB for PostgreSQL provides various full-text search capabilities.

Background information

Unlike traditional keyword-based searches, which match only specific fields or tags, full-text search processes the content of the entire document.

In most cases, a full-text search includes the following steps:

Text preprocessing: prepares the text to enhance retrieval efficiency and accuracy. This step includes tasks such as tokenization, removing stopwords, and stemming.
Index construction: constructs an index on the processed text, usually by using an inverted index structure, to track the position of each word within the documents.
Query processing: analyzes the query statement after you enter a query and converts the query into a format suitable for information search and retrieval.
Result sorting: sorts the search results based on relevance algorithms and returns the results that best meet your requirements.
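The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how PolarDB for PostgreSQL implements full-text search: the stopword list, the toy stemming rule that strips a trailing "s", and the occurrence-count score are all hypothetical simplifications.

```python
import re
from collections import defaultdict

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "is", "of", "and", "in", "to"}

def preprocess(text):
    """Tokenize, lowercase, drop stopwords, and strip a trailing 's' (toy stemming)."""
    lexemes = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in STOPWORDS:
            continue
        if len(token) > 3 and token.endswith("s"):
            token = token[:-1]
        lexemes.append(token)
    return lexemes

def build_index(docs):
    """Build an inverted index: lexeme -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, lexeme in enumerate(preprocess(text)):
            index[lexeme].setdefault(doc_id, []).append(pos)
    return index

def search(index, query):
    """Return ids of documents that contain every query lexeme, best score first."""
    terms = preprocess(query)
    if not terms:
        return []
    matching = set(index.get(terms[0], {}))
    for term in terms[1:]:
        matching &= set(index.get(term, {}))

    def score(doc_id):
        # Toy relevance: total number of occurrences of the query terms.
        return sum(len(index[t][doc_id]) for t in terms)

    return sorted(matching, key=lambda d: (-score(d), d))

docs = {
    1: "The contract covers database backups.",
    2: "Backups of the database and more backups.",
    3: "A report on network latency.",
}
index = build_index(docs)
results = search(index, "database backups")
print(results)  # [2, 1]
```

Document 2 ranks first because the query terms occur three times in it, compared with twice in document 1.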

Scenarios

Full-text search is applicable to various scenarios.

Document management system: allows you to quickly find internal documents, reports, and contracts, which significantly improves overall work efficiency.
Online search engine: helps you quickly find information on the Internet.
Academic research: allows researchers to quickly find relevant literature and data in academic databases.
E-commerce: allows customers to easily find the products they are looking for, which improves their shopping experience.
Social media: allows you to search for posts, comments, and images based on keywords, which improves access to information.
Legal document retrieval: helps lawyers and legal professionals quickly find relevant cases and legal provisions.
Medical records management: allows hospitals to quickly find patient information and historical records from medical records and reports.
Customer support: helps customers quickly find frequently asked questions (FAQ) and support documents on the online customer service system.
Content management system: helps visitors quickly find relevant articles and materials from websites and blogs.
Library and information retrieval: allows readers to easily find books and articles in the library.

Features

Tokenization

The full-text search feature of PolarDB for PostgreSQL can preprocess a document and save an index for subsequent searches. Text preprocessing includes the following steps:

Parse the document into various types of tokens. The tokens can include numbers, words, complex words, and email addresses. This differentiation allows the tokens to be processed in different ways based on their types.
Normalize the tokens into lexemes by standardizing the forms of words. This process includes converting uppercase letters into lowercase letters and removing suffixes such as s or es. This allows different variations of a word to be treated as a single entity, which enhances search efficiency and accuracy.
Store the preprocessed documents in a manner that facilitates searching. For example, each document is represented as an ordered array of lexemes. In addition to lexemes, positional information for ranking purposes also needs to be stored. This information helps determine the relevance of a document to a query. For instance, a document that contains the queried terms in a "dense" area ranks higher compared to a document in which the queried terms are scattered.
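The first two steps, parsing into typed tokens and normalizing them into lexemes, can be illustrated with a small Python sketch. The three token types, their regular expressions, and the suffix rules below are assumptions made for this example. The actual parser recognizes more token types, and the actual normalization is driven by dictionaries.

```python
import re

# Hypothetical token types, tried in this order at each position.
TOKEN_PATTERNS = [
    ("email", r"[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"),
    ("number", r"\d+(?:\.\d+)?"),
    ("word", r"[A-Za-z]+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))

def parse(text):
    """Split text into (token_type, token) pairs."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

def normalize(token_type, token):
    """Turn a token into a lexeme; only tokens of type 'word' are changed."""
    if token_type != "word":
        return token
    lexeme = token.lower()
    if len(lexeme) > 4 and lexeme.endswith("es"):
        return lexeme[:-2]
    if len(lexeme) > 3 and lexeme.endswith("s") and not lexeme.endswith("ss"):
        return lexeme[:-1]
    return lexeme

tokens = parse("Email Bob@Example.com about 42 Boxes")
lexemes = [normalize(token_type, token) for token_type, token in tokens]
print(lexemes)  # ['email', 'Bob@Example.com', 'about', '42', 'box']
```

Because each token carries its type, the email address and the number pass through unchanged while the words are lowercased and stripped of their suffixes.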

tsvector

The tsvector data type in PolarDB for PostgreSQL is designed for efficient full-text search. The tsvector data type can efficiently store processed text and supports fast searching and matching. The tsvector data type stores a sorted list of lexemes of a document together with their positions within the document.
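As a rough illustration of the structure described above, the following Python sketch builds a sorted list of distinct lexemes with their positions. The function names to_vector and format_vector are hypothetical, and the printed layout only approximates how a tsvector value is displayed.

```python
def to_vector(lexemes):
    """Return sorted (lexeme, [1-based positions]) pairs with duplicates merged."""
    positions = {}
    for pos, lexeme in enumerate(lexemes, start=1):
        positions.setdefault(lexeme, []).append(pos)
    return sorted(positions.items())

def format_vector(vector):
    """Render the pairs in a compact 'lexeme':pos,pos form."""
    return " ".join(
        f"'{lexeme}':{','.join(str(p) for p in pos_list)}"
        for lexeme, pos_list in vector
    )

vector = to_vector(["fat", "cat", "sat", "mat", "fat", "rat"])
print(format_vector(vector))  # 'cat':2 'fat':1,5 'mat':4 'rat':6 'sat':3
```

Keeping the lexemes sorted allows a lookup by binary search, and the positions remain available for ranking.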

pg_bigm

pg_bigm is an extension of PolarDB for PostgreSQL. The pg_bigm extension allows you to perform fuzzy search on text and is especially effective when you want to find similar strings. The extension is most commonly used in applications that handle large amounts of text data, such as search engines and content management systems. The main idea of pg_bigm is to improve the efficiency and accuracy of text search by indexing bigrams (2-grams), which are contiguous sequences of two characters extracted from a string.

Note

The pg_bigm extension can significantly improve the efficiency of wildcard searches, such as LIKE searches with patterns in the %xxxx% format.
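To illustrate why a two-character index helps with %xxxx% patterns, the following Python sketch uses an index to narrow down candidate rows and then rechecks each candidate. It is a simplified model, not the pg_bigm implementation: the function names are hypothetical, and details such as padding are ignored.

```python
def bigrams(text):
    """Return the set of contiguous two-character sequences in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def build_bigram_index(rows):
    """Map each two-character sequence to the ids of the rows that contain it."""
    index = {}
    for row_id, text in rows.items():
        for gram in bigrams(text):
            index.setdefault(gram, set()).add(row_id)
    return index

def substring_search(rows, index, keyword):
    """Find rows that contain keyword, using the index to narrow down candidates."""
    grams = bigrams(keyword)
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # A keyword shorter than two characters cannot use the index.
        candidates = set(rows)
    # Recheck: sharing all sequences does not guarantee a real substring match.
    return sorted(r for r in candidates if keyword in rows[r])

rows = {1: "postgresql database", 2: "polar bear", 3: "polardb for postgresql"}
index = build_bigram_index(rows)
print(substring_search(rows, index, "polardb"))  # [3]
print(substring_search(rows, index, "sql"))      # [1, 3]
```

Only rows that contain every two-character sequence of the keyword are compared against the keyword itself, so most rows are never scanned.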

pg_trgm

pg_trgm is an extension of PolarDB for PostgreSQL that supports trigrams. A trigram is a contiguous sequence of three characters extracted from a string and is particularly useful for fuzzy matching and text similarity searching. The pg_trgm extension improves query efficiency for large text datasets by creating indexes and query operators. The pg_trgm extension is suitable for various scenarios, such as full-text search, auto-completion, and spelling correction.

Note

The pg_trgm extension can significantly improve the efficiency of wildcard searches, such as LIKE searches with patterns in the %xxxx% format.
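Trigram similarity can be sketched in Python as follows. The sketch follows the padding convention described in the PostgreSQL pg_trgm documentation, in which each word is treated as if it had two spaces prefixed and one space suffixed. The word-splitting rule is a simplification for ASCII text, and the function names are chosen for this example only.

```python
import re

def trigrams(text):
    """Return the set of trigrams of text, padding each word with spaces."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a, b):
    """Number of shared trigrams divided by the number of distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

print(sorted(trigrams("cat")))                    # ['  c', ' ca', 'at ', 'cat']
print(round(similarity("word", "two words"), 6))  # 0.363636
```

In the second call, 4 of the 11 distinct trigrams are shared, which gives a similarity of about 0.36.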

Chinese text segmentation

In Chinese, words are the smallest meaningful units. Unlike English text, Chinese text does not use spaces to separate words. This makes it difficult for the default full-text search parser of PostgreSQL to segment Chinese text into words in accordance with Chinese semantics.

To effectively process Chinese text, PolarDB for PostgreSQL provides the pg_jieba and zhparser extensions.
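The following Python sketch shows why splitting on spaces fails for Chinese text and how a dictionary makes segmentation possible. It uses forward maximum matching, a deliberately simple technique chosen for illustration. It is not the algorithm used by Jieba or SCWS, and the tiny dictionary is hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedily take the longest dictionary word at each position.

    Characters that start no dictionary word become single-character words.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

sentence = "我爱北京天安门"
dictionary = {"北京", "天安门"}  # hypothetical two-word dictionary

print(sentence.split())                         # ['我爱北京天安门']
print(forward_max_match(sentence, dictionary))  # ['我', '爱', '北京', '天安门']
```

Splitting on spaces returns the whole sentence as a single token, whereas the dictionary-based approach recovers the individual words.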

pg_jieba

Jieba is a widely used Chinese text segmentation library that can accurately identify and segment words in Chinese sentences. The pg_jieba extension integrates the word segmentation capability of Jieba into the database to enable efficient processing of Chinese text and enhance full-text search.

Zhparser

Simple Chinese Word Segmentation (SCWS) is an open source Chinese word segmentation engine based on word frequency dictionaries. SCWS can accurately segment Chinese text into individual words.

Zhparser is a Chinese word segmentation extension developed based on SCWS. The Zhparser extension is compatible with the full-text search feature of PostgreSQL and provides a wide range of feature configuration options and custom dictionaries.

Indexing

PolarDB for PostgreSQL provides various indexing structures to enhance full-text search capabilities.

GIN indexing

The generalized inverted index (GIN) is a type of index in PostgreSQL that supports full-text search. GIN indexing is particularly advantageous for handling large amounts of text data. GIN indexing supports rapid query operations, especially for complex text queries that use the tsvector and tsquery data types. GIN indexes also support other data types, such as JSONB.

RUM indexing

RUM is an extension of PostgreSQL that provides a RUM indexing type for full-text search and other indexing requirements. RUM indexing is designed to enhance the performance of full-text searches, particularly in scenarios that require ranking documents based on relevance.

A RUM index is an inverted index similar to the built-in GIN index. The key difference between the two indexes lies in the additional information that RUM indexes can store. The additional information facilitates faster access to results of sorting or other related operations. For example, in a full-text search, a RUM index can store the positions of terms within documents. This position information is used during queries to compute relevance rankings more efficiently.
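The benefit of keeping position information in the index can be sketched in Python. In the following example, the index stores term positions, so documents can be ordered by how close together the query terms appear without rereading the documents. The smallest-window score is an assumption made for this illustration, not the ranking formula that RUM uses.

```python
from itertools import product

def build_positional_index(docs):
    """Build an index of term -> {doc_id: [positions]}."""
    index = {}
    for doc_id, words in docs.items():
        for pos, word in enumerate(words):
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index

def smallest_window(position_lists):
    """Width of the narrowest span that holds one position from each list."""
    return min(max(combo) - min(combo) for combo in product(*position_lists))

def rank_by_proximity(index, terms):
    """Return ids of documents that contain all terms, tightest grouping first."""
    postings = [index.get(term, {}) for term in terms]
    if not postings:
        return []
    common = set(postings[0]).intersection(*postings[1:])
    scored = [(smallest_window([p[d] for p in postings]), d) for d in common]
    return [doc_id for _, doc_id in sorted(scored)]

docs = {
    1: "full text search needs a fast index".split(),
    2: "full disclosure the index covers text".split(),
}
index = build_positional_index(docs)
print(rank_by_proximity(index, ["full", "text"]))  # [1, 2]
```

Document 1 ranks first because "full" and "text" are adjacent in it, whereas they are five positions apart in document 2.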

Query processing

tsquery

The tsquery data type represents a text query in full-text search. tsquery allows you to create complex search conditions to efficiently find information in large amounts of text data. PolarDB for PostgreSQL also provides the to_tsquery function, which converts text to a tsquery value. You can match a tsquery value against a tsvector value by using full-text search operators to complete full-text search queries.

A tsquery value is matched against a tsvector value by using the @@ (match) operator. Within a tsquery, you can combine terms by using Boolean operators such as & (AND), | (OR), and ! (NOT). This allows you to construct compound search conditions.
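The way Boolean operators combine into a compound condition can be modeled with a short Python sketch. The nested-tuple query representation and the matches function are hypothetical and serve only to show the evaluation logic. They are not part of the database API.

```python
def matches(query, lexemes):
    """Evaluate a nested-tuple Boolean query against a set of lexemes."""
    if isinstance(query, str):
        return query in lexemes
    op, *args = query
    if op == "&":
        return all(matches(arg, lexemes) for arg in args)
    if op == "|":
        return any(matches(arg, lexemes) for arg in args)
    if op == "!":
        return not matches(args[0], lexemes)
    raise ValueError(f"unknown operator: {op!r}")

doc = {"fat", "cat", "sat"}

# Models the condition: fat & (cat | dog) & !rat
query = ("&", "fat", ("|", "cat", "dog"), ("!", "rat"))
print(matches(query, doc))              # True
print(matches(("&", "fat", "dog"), doc))  # False
```

The first query matches because the document contains "fat" and "cat" and does not contain "rat".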

Sorting

ts_rank

ts_rank is a function in PostgreSQL and is used in full-text search to calculate a score that indicates the relevance of a document to a query. You can use the score to assess the importance or relevance of a document in relation to specific search criteria.
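The idea of a relevance score can be illustrated with the following Python sketch, which counts query-term occurrences and optionally divides the count by the document length. This is not the formula that ts_rank uses. The function name simple_rank and the normalize_by_length parameter are assumptions made for this example.

```python
def simple_rank(doc_lexemes, query_terms, normalize_by_length=False):
    """Toy relevance score: occurrences of query terms, optionally per lexeme."""
    hits = sum(doc_lexemes.count(term) for term in query_terms)
    if normalize_by_length and doc_lexemes:
        return hits / len(doc_lexemes)
    return float(hits)

short_doc = ["cat", "sat"]
long_doc = ["cat"] + ["filler"] * 9 + ["cat"]

print(simple_rank(short_doc, ["cat"]))  # 1.0
print(simple_rank(long_doc, ["cat"]))   # 2.0
print(simple_rank(short_doc, ["cat"], normalize_by_length=True))  # 0.5
```

Without normalization, the long document scores higher because it contains the term twice. With length normalization, the short document scores 0.5 and the long document scores 2/11, so the order is reversed.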