The Alibaba Cloud 2021 Double 11 Cloud Services Sale is live now! For a limited time only you can turbocharge your cloud journey with core Alibaba Cloud products available from just $1, while you can win up to $1,111 in cash plus $1,111 in Alibaba Cloud credits in the Number Guessing Contest.
PostgreSQL has a good reputation in the search field. This is particularly evident in the GIS field where it has taken the leading position in the industry for many years. Fortunately for us, PostgreSQL turns out to be a great solution for e-commerce applications as well. In this article, we will introduce PostgreSQL index searching and word segmentation technologies used for Alibaba's Double Eleven (Singles' Day) annual online shopping festival.
Full text search involves two data types: phrases and search word combinations.
https://www.postgresql.org/docs/9.6/static/datatype-textsearch.html
Character strings are converted into phrases based on the selected phrasing rule. The phrases can be considered as a group of abstracted lexemes.
Example:
postgres=> select * from pg_ts_config;
cfgname | cfgnamespace | cfgowner | cfgparser
------------+--------------+----------+-----------
simple | 11 | 10 | 3722
danish | 11 | 10 | 3722
dutch | 11 | 10 | 3722
english | 11 | 10 | 3722
finnish | 11 | 10 | 3722
french | 11 | 10 | 3722
german | 11 | 10 | 3722
hungarian | 11 | 10 | 3722
italian | 11 | 10 | 3722
norwegian | 11 | 10 | 3722
portuguese | 11 | 10 | 3722
romanian | 11 | 10 | 3722
russian | 11 | 10 | 3722
spanish | 11 | 10 | 3722
swedish | 11 | 10 | 3722
turkish | 11 | 10 | 3722
(16 rows)
Phrase conversion: followed by location information
postgres=> select to_tsvector('english', 'Hi i''m digoal, a pger, are you pger?');
to_tsvector
------------------------------------
'digoal':4 'hi':1 'm':3 'pger':6,9
(1 row)
Stop words are filtered out, which are meaningless, for example, i, an, a, am, yes, and no (user defined).
The search words refer to the words you want to search, for example, postgresql, or postgresql or mysql.
You also need to select a configuration rule to judge and filter lexemes. The phrases you want to search can be combined randomly.
postgres=> select to_tsquery('a');
NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
to_tsquery
------------
(1 row)
postgres=> select to_tsquery('english', 'mysql|postgresql');
to_tsquery
------------------------
'mysql' | 'postgresql'
(1 row)
postgres=> select to_tsquery('english', 'mysql&postgresql');
to_tsquery
------------------------
'mysql' & 'postgresql'
(1 row)
postgres=> select to_tsquery('english', 'mysql&postgresql|abc');
to_tsquery
--------------------------------
'mysql' & 'postgresql' | 'abc'
(1 row)
postgres=> select to_tsquery('english', '(mysql&postgresql)|abc');
to_tsquery
--------------------------------
'mysql' & 'postgresql' | 'abc'
(1 row)
postgres=> select to_tsquery('english', 'mysql&(postgresql|abc)');
to_tsquery
------------------------------------
'mysql' & ( 'postgresql' | 'abc' )
(1 row)
Moreover, tsquery supports prefix query, for example:
postgres=> select to_tsquery('english', 'postgres:*' );
to_tsquery
------------
'postgr':*
(1 row)
Lexeme starting with postgr
Tsvector also involves the section and weight concepts, for example, body, title, and sub-title. There are four levels, A, B, C, and D, indicating the layer and location where the lexeme resides.
Lexemes that have positions can further be labeled with a weight, which can be A, B, C, or D. D is the default and hence is not shown on output:
SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
tsvector
----------------------------
'a':1A 'cat':5 'fat':2B,4C
We have talked about data types. Next, let's learn about the data type operators. Like the operators in mathematics, word segmentation type has the match, include, add, and adjacent operators.
Example:
Whether the phrase includes the words to be searched.
to_tsvector('fat cats ate rats') @@ to_tsquery('cat & rat')
return
true
Combine the two phrases.
'a:1 b:2'::tsvector || 'c:1 d:2 b:3'::tsvector
return
'a':1 'b':2,5 'c':3 'd':4
Execute the AND operation on the two search words
'fat | rat'::tsquery && 'cat'::tsquery
return
( 'fat' | 'rat' ) & 'cat'
Execute the OR operation on the two search words
'fat | rat'::tsquery || 'cat'::tsquery
return
( 'fat' | 'rat' ) | 'cat'
Not search word.
!! 'cat'::tsquery
return
!'cat'
It indicates two adjacent search words. This is a brilliant function newly added to 9.6. For example, if you type in "hello <-> world", it matches hello world, but not hello digoal world. This is because the location information is used.
to_tsquery('fat') <-> to_tsquery('rat')
return
'fat' <-> 'rat'
postgres=> select to_tsvector('english', 'Hi i''m digoal, a pger, are you pger?') @@ to_tsquery(
$$
'digoal' <-> 'pger'
$$
);
?column?
----------
f
(1 row)
postgres=> select to_tsvector('english', 'Hi i''m digoal, a pger, are you pger?');
to_tsvector
------------------------------------
'digoal':4 'hi':1 'm':3 'pger':6,9
(1 row)
postgres=> select to_tsvector('english', 'hello world');
to_tsvector
---------------------
'hello':1 'world':2
(1 row)
postgres=> select to_tsvector('english', 'hello world') @@ to_tsquery(
$$
'hello' <-> 'world'
$$
);
?column?
----------
t
(1 row)
postgres=> select to_tsvector('english', 'hello world') @@ to_tsquery(
$$
'world' <-> 'hello'
$$
);
?column?
----------
f
(1 row)
postgres=> select to_tsvector('english', 'hello world') @@ to_tsquery(
$$
'world' & 'hello'
$$
);
?column?
----------
t
(1 row)
Include relationship between two words.
'cat'::tsquery @> 'cat & rat'::tsquery
return
false
'cat'::tsquery <@ 'cat & rat'::tsquery
return
true
These functions are database built-in functions of the tsvector or tsquery type, supporting various commonly used functions.
Example:
Converting array into the phrase type, obtaining the current tsconfig (English, Chinese, etc.), obtaining the phrase length (number of lexemes), converting character strings into search words, and converting phrases into search words (including location information)
The following describes the phrase conversion function added to PG 9.6, which supports adjacency.
For example, when you type in Chinese Taoist Culture, it is segmented into Chinese <-> Taoist <-> Culture. The match is positive only when the words are adjacent. Otherwise, the words cannot be matched.
The word segmentation such as "Chinese population census, Taoist proportion xx, education level xx" cannot be matched. If you want a positive match, type in Chinese & Taoist & Culture. This function is convenient.
postgres=# select phraseto_tsquery('hello digoal');
phraseto_tsquery
----------------------
'hello' <-> 'digoal'
(1 row)
postgres=# select phraseto_tsquery('hello digoal zhou');
phraseto_tsquery
---------------------------------
'hello' <-> 'digoal' <-> 'zhou'
(1 row)
postgres=# select plainto_tsquery('hello digoal zhou');
plainto_tsquery
-----------------------------
'hello' & 'digoal' & 'zhou'
(1 row)
postgres=# select plainto_tsquery('hello digoal zhou, this is china');
plainto_tsquery
---------------------------------------
'hello' & 'digoal' & 'zhou' & 'china'
(1 row)
postgres=# select phraseto_tsquery('hello digoal zhou, this is china');
phraseto_tsquery
---------------------------------------------
'hello' <-> 'digoal' <-> 'zhou' <3> 'china'
(1 row)
Phrases supporting or not supporting indexing.
querytree('foo & ! bar'::tsquery)
foo supports indexing.
!bar does not support indexing.
Add or remove the weight (namely, ABCD) of tsvector.
setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A') -- Add weight A to all lexemes.
'cat':3A 'fat':2,4 'rat':5A
setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A', '{cat,rat}') -- Add weight A to cat and rat.
'cat':3A 'fat':2,4 'rat':5A
strip('fat:2,4 cat:3 rat:5A'::tsvector) -- Remove weights.
'cat' 'fat' 'rat'
Remove the specified lexeme from tsvector. If you know some words are meaningless, remove them.
ts_delete('fat:2,4 cat:3 rat:5A'::tsvector, 'fat') Return 'cat':3 'rat':5A
ts_delete('fat:2,4 cat:3 rat:5A'::tsvector, ARRAY['fat','rat']) Return 'cat':3
Filter phrases based on weight. For example, only check whether the title and sub-title are matched (assume that the weights of title and sub-title are A and B respectively).
ts_filter('fat:2,4 cat:3b rat:5A'::tsvector, '{a,b}') Return 'cat':3B 'rat':5A
The words in bold are positively matched. This is a useful function, especially in demonstration.
For example, some blog platforms set sensitive words, which will be filtered out. With this function, I can quickly find the words that match the sensitive word rules.
ts_headline('x y z', 'z'::tsquery) Return x y <b>z</b>
The match percentage is useful in many scenarios. For example, rank the words based on match degree. The word with the highest match percentage is ranked first.
ts_rank(textsearch, query) 0.818
select *, ts_rank(fts,to_tsquery('supernovae & x-ray')) as rank
from apod
where fts @@ to_tsquery('supernovae & x-ray')
order by rank desc limit 5;
Or use the <=> operator.
select *
from apod
where fts @@ to_tsquery('supernovae & x-ray')
order by fts <=> to_tsquery('supernovae & x-ray') desc limit 5;
Phrases include the title, sub-title, body, and section (namely, weight), so different coefficients can be set for different weights, for example:
That is, the match degree is calculated based on the weights you have set.
Set the coefficients of A, B, C, and D to {0.1, 0.2, 0.4, 1.0}.
ts_rank_cd('{0.1, 0.2, 0.4, 1.0}', textsearch, query) Return 2.01317
Rewrite search words, like SQL REWRITE or the text replacement function.
Replace a with foo|bar.
ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'foo|bar'::tsquery) Return 'b' & ( 'foo' | 'bar' )
Batch replacement is supported. For example, use QUERY to replace the s fields with t.
SELECT ts_rewrite('a & b'::tsquery, 'SELECT t,s FROM aliases') Return 'b' & ( 'foo' | 'bar' )
Calculate the phrase and convert it into tsquery, including the lexeme distance information.
tsquery_phrase(to_tsquery('fat'), to_tsquery('cat')) Return 'fat' <-> 'cat'-- Create a phrase in which fat and cat are adjacent.
tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'), 10) Return 'fat' <10> 'cat'-- Create a phrase in which 10 tokens (including the tokens filtered out) exist between fat and cat.
Convert tsvector into array, which does not include location information.
tsvector_to_array('fat:2,4 cat:3 rat:5A'::tsvector) Return {cat,fat,rat}
The automatic phrase update function is important. If you have updated the text fields, how are the phrases automatically updated?
The answer is two built-in triggers.
CREATE TRIGGER ... tsvector_update_trigger(tsvcol, 'pg_catalog.swedish', title, body)
CREATE TRIGGER ... tsvector_update_trigger_column(tsvcol, configcol, title, body)
Expand tsvector to convert it into multiple records.
unnest('fat:2,4 cat:3 rat:5A'::tsvector) (cat,{3},{D}) ...
If you have added a dictionary, and set or modified the phrasing rule, you may need to view the phrasing result after modification.
Phrasing includes several steps, for example, splitting a character string into tokens (including location information and weight) according to the dictionary, filtering unneeded tokens based on the token properties and ts config, and returning tsvector.
Using the debugging functions can display the original splitting information.
https://www.postgresql.org/docs/9.6/static/functions-textsearch.html
Example:
In addition to type and function, a database must support indexing of a data type; otherwise, the database cannot fully support this data type.
The phrases in PostgreSQL support indexes, including GIN, GiST, and SP-Gist.
The following is an example of using indexes.
create table test(id serial primary key, content text, ts tsvector);
create index idx_test_ts on test using gin (ts);
select * from test where ts @@ to_tsquery('english', 'hello <-> digoal'); -- Indexing is implemented.
PostgreSQL supports customized dictionary and phrasing configuration.
Chinese phrases can be added. There are many methods of adding Chinese phrases on the Internet.
https://github.com/fxsjy/jieba
https://github.com/jaiminpan/pg_scws
For this purpose, you can consider using Alibaba Cloud RDS PostgreSQL because it contains the zhparser phrase.
Word segmentation and fuzzy search are two functions. Word segmentation cannot provide the fuzzy match function. For example, if you type in a regular expression or condition with fuzzy prefix and suffix for search, fuzzy search can hit the text that cannot be searched by word segmentation.
Word segmentation can match only the prefixes, but do not support the fuzzy prefixes, suffixes, or regular expressions.
postgres=# select to_tsquery('postgresql:*');
to_tsquery
----------------
'postgresql':*
(1 row)
postgres=# select to_tsquery('postgres:*');
to_tsquery
------------
'postgr':*
(1 row)
If you require the fuzzy search function, use the pg_trgm plugin in PostgreSQL for indexing.
Similarity-based search and fuzzy match are supported.
postgres=# create extension pg_trgm;
CREATE EXTENSION
postgres=# explain select * from tb where info ~ '5821a'; -- Fuzzy prefix and suffix.
QUERY PLAN
----------------------------------------------------------------------------
Bitmap Heap Scan on tb (cost=103.75..3677.71 rows=1000 width=9)
Recheck Cond: (info ~ '5821a'::text)
-> Bitmap Index Scan on idx_tb_2 (cost=0.00..103.50 rows=1000 width=0)
Index Cond: (info ~ '5821a'::text)
(4 rows)
Time: 0.647 ms
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_split where info ~ '^33.+7.+9$' limit 10; -- Regular expression search.
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Limit (cost=43.75..80.78 rows=10 width=57) (actual time=19.573..21.212 rows=10 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8
Buffers: shared hit=566
-> Bitmap Heap Scan on public.t_split (cost=43.75..3746.56 rows=1000 width=57) (actual time=19.571..21.206 rows=10 loops=1)
Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8
Recheck Cond: (t_split.info ~ '^33.+7.+9$'::text)
Rows Removed by Index Recheck: 647
Heap Blocks: exact=552
Buffers: shared hit=566
-> Bitmap Index Scan on idx9 (cost=0.00..43.50 rows=1000 width=0) (actual time=11.712..11.712 rows=39436 loops=1)
Index Cond: (t_split.info ~ '^33.+7.+9$'::text)
Buffers: shared hit=14
Planning time: 0.301 ms
Execution time: 21.255 ms
(14 rows)
Time: 21.995 ms
As you can see from the examples above, it is really easy to apply PostgreSQL to real-life e-commerce applications. PostgreSQL's powerful search and indexing features has helped Alibaba make the Double Eleven online shopping festival successful year after year. The problems used to be solved by search engine or external plugin can be solved by PostgreSQL today. Here are some of the advantages of using PostgreSQL for word segmentation and index searching:
PostgreSQL has a long history. It originated from University of California, Berkeley with a BSD-LIKE license. PostgreSQL is user-friendly; many products have been developed based on PostgreSQL.
To learn more about PostgreSQL on Alibaba Cloud, visit www.alibabacloud.com/product/apsaradb-for-rds-postgresql
Double Eleven Technology Series: Flash Sales Optimization on PostgreSQL
Double Eleven Technology Series: Logistics and Dynamic Path Planning
digoal - October 23, 2018
ApsaraDB - June 15, 2023
Alibaba Cloud MaxCompute - October 18, 2021
Alibaba Cloud Storage - February 27, 2020
Alibaba Cloud Community - September 5, 2024
GarvinLi - October 24, 2018
Get started on cloud with $1. Start your cloud innovation journey here and now.
Learn MoreAlibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreAn online MPP warehousing service based on the Greenplum Database open source program
Learn MoreMore Posts by digoal