This topic describes how to use the pg_jieba extension to run Chinese full-text searches on an ApsaraDB RDS for PostgreSQL instance.
Prerequisites
Your RDS instance runs PostgreSQL 10.0 or a later version.
If the major engine version of the RDS instance meets the requirement but the extension is still not supported, update the minor engine version of the RDS instance. For more information, see Update the minor engine version.
pg_jieba is added to the value of the shared_preload_libraries parameter of your RDS instance.
For more information about how to add pg_jieba to the value of the shared_preload_libraries parameter, see Modify the parameters of an ApsaraDB RDS for PostgreSQL instance.
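As a quick check, assuming you can connect to the instance with a SQL client such as psql, the standard PostgreSQL SHOW command displays the current value of the parameter. This is only a sketch for verification; pg_jieba should appear in the result once the parameter change has taken effect.

```sql
-- Display the libraries that are preloaded at server start.
-- pg_jieba is expected to be part of the returned value.
SHOW shared_preload_libraries;
```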
Methods to use the pg_jieba extension
Create the pg_jieba extension.
CREATE EXTENSION pg_jieba;
Note: Only privileged accounts are authorized to run the preceding command.
Delete the pg_jieba extension.
DROP EXTENSION pg_jieba;
Note: Only privileged accounts are authorized to run the preceding command.
Example 1:
SELECT * FROM to_tsvector('jiebacfg', '小明硕士毕业于中国科学院计算所,后在日本京都大学深造');
                                                 to_tsvector
--------------------------------------------------------------------------------------------------------------
 '中国科学院':5 '于':4 '后':8 '在':9 '小明':1 '日本京都大学':10 '毕业':3 '深造':11 '硕士':2 '计算所':6 ',':7
(1 row)
Example 2:
SELECT * FROM to_tsvector('jiebacfg', '李小福是创新办主任也是云计算方面的专家');
                                        to_tsvector
-------------------------------------------------------------------------------------------
 '专家':11 '主任':5 '也':6 '云计算':8 '创新':3 '办':4 '方面':9 '是':2,7 '李小福':1 '的':10
(1 row)
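The two examples above only show how text is segmented. The following sketch shows one way the same jiebacfg configuration might be used for an actual search, by combining the standard PostgreSQL to_tsquery() function, the @@ match operator, and a GIN expression index. The table name articles, its columns, and the index name are hypothetical and are used for illustration only.

```sql
-- Hypothetical table that stores Chinese text.
CREATE TABLE articles (
    id   serial PRIMARY KEY,
    body text
);

INSERT INTO articles (body) VALUES
    ('小明硕士毕业于中国科学院计算所,后在日本京都大学深造'),
    ('李小福是创新办主任也是云计算方面的专家');

-- Expression index on the segmented text so that searches can avoid a full scan.
CREATE INDEX articles_body_jieba_idx
    ON articles USING gin (to_tsvector('jiebacfg', body));

-- Find rows whose segmented text matches the query term.
-- The WHERE expression must match the indexed expression for the index to be usable.
SELECT id, body
FROM articles
WHERE to_tsvector('jiebacfg', body) @@ to_tsquery('jiebacfg', '云计算');
```

Which rows match depends on how pg_jieba segments both the stored text and the query term, so the result may change after you load a custom dictionary.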
Extended features
You can view the extended features of the pg_jieba extension based on the version of the extension that you have installed.
Execute the following SQL statement to query the version of the pg_jieba extension:
SELECT * FROM pg_available_extensions WHERE name='pg_jieba';
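The pg_available_extensions view reports both the default version that is available on the instance and the version that is installed in the current database. If you only need the installed version, a minimal alternative is to query the standard pg_extension catalog:

```sql
-- Returns one row if pg_jieba is installed in the current database,
-- and no rows if it is not installed.
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'pg_jieba';
```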
Extended features in version 1.1.0
The pg_jieba extension allows you to configure multiple custom dictionaries and switch between the dictionaries.
-- Insert data into the first custom dictionary. By default, data is inserted into the first custom dictionary.
-- The first custom dictionary is represented by 0. The weight value of the first custom dictionary is 10.
INSERT INTO jieba_user_dict VALUES ('阿里云');
INSERT INTO jieba_user_dict VALUES ('研发工程师',0,10);

-- Use the dictionary predefined in the pg_jieba extension to segment Chinese text.
SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');
                     to_tsvector
------------------------------------------------------
 'zth':1 '一个':6 '云':4 '工程师':8 '研发':7 '阿里':3
(1 row)

-- Switch to the first custom dictionary. The jieba_load_user_dict() parameter specifies the sequence number of the custom dictionary.
SELECT jieba_load_user_dict(0);
 jieba_load_user_dict
----------------------

(1 row)

SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');
                to_tsvector
--------------------------------------------
 'zth':1 '一个':5 '研发工程师':6 '阿里云':3
(1 row)
The pg_jieba extension can return each segmented word together with its starting offset in the original text. To do this, use the jiebacfg_pos configuration instead of jiebacfg.
SELECT * FROM to_tsvector('jiebacfg_pos', 'zth是阿里云的一个研发工程师');
                                     to_tsvector
--------------------------------------------------------------------------------------
 'zth:0':1 '一个:8':6 '云:6':4 '工程师:12':8 '是:3':2 '的:7':5 '研发:10':7 '阿里:4':3
(1 row)
Extended features in version 1.2.0
The jieba_load_user_dict() function is optimized to reduce its CPU utilization and memory usage.
A new parameter is added to the jieba_load_user_dict() function to specify whether to use custom dictionaries during retrieval.
Syntax
jieba_load_user_dict(parameter1, parameter2)
Parameter description
Parameter     Description
parameter1    Specifies the sequence number of the custom dictionary that you want to load.
parameter2    Specifies whether to load the default dictionary. Valid values: 0 (loads the default dictionary) and 1 (does not load the default dictionary).
Examples
INSERT INTO jieba_user_dict VALUES ('阿里云');
INSERT 0 1
INSERT INTO jieba_user_dict VALUES ('研发工程师',0,10);
INSERT 0 1

-- The first 0 indicates the sequence number of the custom dictionary, and the second 0 indicates that the default dictionary is loaded.
SELECT jieba_load_user_dict(0,0);
 jieba_load_user_dict
----------------------

(1 row)

SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');
                to_tsvector
--------------------------------------------
 'zth':1 '一个':5 '研发工程师':6 '阿里云':3
(1 row)

SELECT jieba_load_user_dict(0,1);
 jieba_load_user_dict
----------------------

(1 row)

SELECT * FROM to_tsvector('jiebacfg', 'zth是阿里云的一个研发工程师');
                     to_tsvector
------------------------------------------------------
 'zth':1 '一个':6 '云':4 '工程师':8 '研发':7 '阿里':3
(1 row)
Note: If the jieba_user_dict table or the jieba_load_user_dict() function does not exist, you must update the minor engine version of your RDS instance to 20220730 and reinstall the extension. For more information about how to update the minor engine version, see Update the minor engine version of an ApsaraDB RDS for PostgreSQL instance.
Execute the following statements to reinstall the extension:
DROP EXTENSION pg_jieba;
CREATE EXTENSION pg_jieba;
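To check whether the jieba_user_dict table and the jieba_load_user_dict() function exist, for example before you decide to reinstall or after the reinstallation is complete, the following sketch uses only standard PostgreSQL catalog lookups. It assumes that the table is reachable through the current search_path; if the extension was created in a different schema, qualify the table name accordingly.

```sql
-- Returns the table name if the table exists, or NULL if it does not.
SELECT to_regclass('jieba_user_dict') AS user_dict_table;

-- Returns one row per overload of the function, or no rows if it does not exist.
SELECT proname, pg_get_function_arguments(oid) AS arguments
FROM pg_proc
WHERE proname = 'jieba_load_user_dict';
```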
References
For more information about how to use the pg_jieba extension, see the official pg_jieba documentation.