The open source third-party extension pg_jieba is used for full-text search in Chinese.
Prerequisites
The extension is supported on the PolarDB for PostgreSQL clusters that run the following engines:
PostgreSQL 15 (revision version 15.7.2.0 or later)
PostgreSQL 14 (revision version 14.5.2.0 or later)
PostgreSQL 11 (revision version 1.1.28 or later)
You can execute the following statements to view the revision version of a PolarDB for PostgreSQL cluster:
PostgreSQL 15 and PostgreSQL 14
SELECT version();
PostgreSQL 11
SHOW polar_version;
Overview
The pg_jieba extension provides the following text search modes:
jiebacfg: exact mode. This mode precisely splits the text without redundant words.
jiebaqry: full mode. This mode obtains all matching words in the text. Redundant words may exist.
jiebacfg_pos: exact mode (with subscript position). The subscript information of each word is added to the result of the exact mode, The stopwords that are ignored in exact mode (highly frequent words that are not used in the search) are also displayed.
Usage
Create and delete the extension
Create the pg_jieba extension.
CREATE EXTENSION pg_jieba;
Delete the pg_jieba extension.
DROP EXTENSION pg_jieba;
Only privileged accounts are authorized to execute the preceding statements.
Examples
Example 1:
Exact mode
SELECT * FROM to_tsvector('jiebacfg', '小明硕士毕业于中国科学院计算所,后在日本京都大学深造');
Sample result:
to_tsvector ---------------------------------------------------------------------------------- '中国科学院':5 '小明':1 '日本京都大学':10 '毕业':3 '深造':11 '硕士':2 '计算所':6 (1 row)
Full-mode
SELECT * FROM to_tsvector('jiebaqry', '小明硕士毕业于中国科学院计算所,后在日本京都大学深造');
Sample result:
to_tsvector ----------------------------------------------------------------------------------------------------------------------------------------------------------------- '中国':5 '中国科学院':9 '京都':16 '大学':17 '学院':7 '小明':1 '日本':15 '日本京都大学':18 '毕业':3 '深造':19 '硕士':2 '科学':6 '科学院':8 '计算':10 '计算所':11 (1 row)
Exact mode (with subscript position)
SELECT * FROM to_tsvector('jiebacfg_pos', '小明硕士毕业于中国科学院计算所,后在日本京都大学深造');
Sample result:
to_tsvector ------------------------------------------------------------------------------------------------------------------------------------------ '中国科学院:7':5 '于:6':4 '后:16':8 '在:17':9 '小明:0':1 '日本京都大学:18':10 '毕业:4':3 '深造:24':11 '硕士:2':2 '计算所:12':6 ',:15':7 (1 row)
Example 2:
Exact mode
SELECT * FROM to_tsvector('jiebacfg', '李小福是创新办主任也是云计算方面的专家');
Sample result:
to_tsvector ------------------------------------------------------------------- '专家':11 '主任':5 '云计算':8 '创新':3 '办':4 '方面':9 '李小福':1 (1 row)
Full-mode
SELECT * FROM to_tsvector('jiebaqry', '李小福是创新办主任也是云计算方面的专家');
Sample result:
to_tsvector ----------------------------------------------------------------------------- '专家':12 '主任':5 '云计算':9 '创新':3 '办':4 '方面':10 '李小福':1 '计算':8 (1 row)
Exact mode (with subscript position)
SELECT * FROM to_tsvector('jiebacfg_pos', '李小福是创新办主任也是云计算方面的专家');
Sample result:
to_tsvector --------------------------------------------------------------------------------------------------------------------------- '专家:17':11 '主任:7':5 '也:9':6 '云计算:11':8 '创新:4':3 '办:6':4 '方面:14':9 '是:10':7 '是:3':2 '李小福:0':1 '的:16':10 (1 row)
Extended features
The pg_jieba extension allows you to configure multiple custom dictionaries and switch between the dictionaries.
To use custom dictionaries, you need to add pg_jieba
to the shared_preload_libraries
parameter. You can configure shared_preload_libraries
in the console. For more information, see Specify cluster parameters. The cluster is restarted after you configure this parameter. Procedd with caution.
Insert data into the first custom dictionary. By default, data is inserted in to the first custom dictionary. The first custom dictionary is represented by 0. The weight value of the first custom dictionary is 10.
INSERT INTO jieba_user_dict VALUES ('阿里云'); INSERT INTO jieba_user_dict VALUES ('研发工程师',0,10);
Use the dictionary predefined in the pg_jieba extension to segment Chinese text.
SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');
Sample result:
to_tsvector ------------------------------------------------------ 'zth':1 '一个':6 '云':4 '工程师':8 '研发':7 '阿里':3 (1 row)
Switch to custom dictionary 0.
SELECT jieba_load_user_dict(0);
Sample result:
jieba_load_user_dict ---------------------- (1 row)
Use custom dictionary 0 to split words.
SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');
Sample result:
to_tsvector -------------------------------------------- 'zth':1 '一个':5 '研发工程师':6 '阿里云':3 (1 row)