All Products
Search
Document Center

ApsaraDB RDS:Use the zhparser extension

Last Updated:Dec 04, 2024

This topic describes how to use the zhparser extension to run Chinese full-text searches on an ApsaraDB RDS for PostgreSQL instance.

Prerequisites

  • The RDS instance runs PostgreSQL 10 or later.

  • Your RDS instance runs a minor engine version of 20230830 or later.

    Important

    This extension is supported in some minor engine versions earlier than 20230830. However, ApsaraDB RDS plans to optimize vulnerable extensions in minor engine version updates for standardized extension management and enhanced security. Therefore, you cannot create this extension for RDS instances that run a minor engine version earlier than 20230830. For more information, see [Product changes/Feature changes] Limits on extension creation for ApsaraDB RDS for PostgreSQL instances.

    • If you have already created this extension for your RDS instance that runs a minor engine version earlier than 20230830, you can continue using this extension.

    • If you are creating this extension for the first time or need to recreate the extension for your RDS instance, you must update the minor engine version of the RDS instance to the latest version. For more information, see Update the minor engine version.

  • zhparser is added to the value of the shared_preload_libraries parameter of your RDS instance.

    For more information about how to add the name of the zhparser extension to the value of the shared_preload_libraries parameter, see Modify the parameters of an ApsaraDB RDS for PostgreSQL instance.

Background information

Open source PostgreSQL provides a built-in parser extension that can segment text in languages such as English that not only have punctuation marks but also have spaces between words. However, text in Chinese does not have spaces between words and needs to be segmented based on semantics, and the text length is not fixed. Therefore, we recommend that you use the zhparser extension rather than the parser extension to segment text in Chinese.

The zhparser extension of PostgreSQL can segment text in Chinese. After you install the zhparser extension on your RDS instance, you can run Chinese full-text searches on your RDS instance.

Enable the zhparser extension

Execute the following statements to enable the zhparser extension:

CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION testzhcfg (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;
--Optional parameter configuration
ALTER ROLE CURRENT_ROLE SET zhparser.multi_short=on;
--Simple test
SELECT * FROM ts_parse('zhparser', 'hello world! 2010年保障房建设在全国范围内获全面启动,从中央到地方纷纷加大 了 保 障 房 的 建 设 和 投 入 力 度 。 2011年,保障房进入了更大规模的建设阶段。 住房城乡建设部党组书记、部长姜伟新去年底在全国住房城乡建设工作会议上表示,要继续推进保障性安居工程建设。 ');
SELECT to_tsvector('testzhcfg','“今年保障房新开工数量虽然有所下调,但实际的年度在建规模以及竣工规模会超以往年份,相对应的对资金的需求也会创历史纪录。”陈国强说。 在他看来,与2011年相比,2012年的保障房建设在资金配套上的压力将更为严峻。 ');
SELECT to_tsquery('testzhcfg', '保障房资金压力');

Execute the following statements to enable the zhparser extension to run a full-text search:

-- Create a full-text index for the name field of the t1 table. You must replace t1 and name in the SQL statements with the actual table name and field value based on your business requirements. 
CREATE index idx_t1 ON t1 using gin (to_tsvector('testzhcfg',upper(name) ));
-- Use the full-text index.
SELECT * FROM t1 WHERE to_tsvector('testzhcfg',upper(t1.name)) @@ to_tsquery('testzhcfg','(防火)');

Customize a Chinese word segment dictionary

Execute the following statements to customize a Chinese word segment dictionary

-- The segmentation result
SELECT to_tsquery('testzhcfg', '保障房资金压力');
-- Insert a new word segment to the dictionary
INSERT INTO pg_ts_custom_word VALUES ('保障房资');
-- Make the inserted word segment effective
SELECT zhprs_sync_dict_xdb();
-- End the connection
\c
-- Requery to obtain new segmentation results
SELECT to_tsquery('testzhcfg', '保障房资金压力');

Instructions to use custom word segments:

  • A maximum of 1 million custom word segments can be added. If the number of word segments exceed the limit, the word segments outside the limit are not processed. Ensure that the number of word segments is within this range. The custom and default word segmentation dictionaries take effect at the same time.

  • Each word segment can be a maximum of 128 bytes in length. The section after the 128th byte will be truncated.

  • After adding, deleting, or changing word segments, execute the SELECT zhprs_sync_dict_xdb(); statement and re-establish a connection to make the operation take effect.