By Digoal.
Full-text search feature of PostgreSQL offers a rich array of functions. Besides the open parser and dictionary, it also support built-in and also some extended ranking algorithms.
In fact, location filtering is also an important aspect involved in search queries. Consider the term "Fast and Furious", for example, it contains location information based on how the words are segmented. Therefore, when conducting a search, we should mind the distance between the words "Fast" and "Furious" so to be able to improve the accuracy of the search.
The distance search syntax of PostgreSQL is as follows:
select * from tbl where ts @@ 'Speed <Distance Value> Passion'::tsquery;
For example:
select * from tbl where ts @@ 'Speed <1> Passion'::tsquery;
In the above example, the first line is the general template, and the last line is an example. The distance value of 1 indicates the relative distance between the terms 'speed' and 'passion'.
Now let's consider a more complex example. For this example, let's look at two long phrases, and see how this system can analysis and work with these long, dense phrases.
For this you'll want to create a test table.
postgres=# create table ts_test (id int, info text, ts tsvector);
CREATE TABLE
After that, you'll want to write test data. Consider this data below:
postgres=# insert into ts_test values (1, 'Passion, Innovation, Persistence, Speed—Beijing North Venture Taxi Co., LTD.'s quality service highlights', to_tsvector('scwscfg', 'Passion, Innovation, Persistence, Speed—Beijing North Venture Taxi Co., LTD.'s quality service highlights'));
INSERT 0 1
postgres=# insert into ts_test values (1, 'The Speed and Passion of Film 8 at the Box Office', to_tsvector('scwscfg', 'The Speed and Passion of Film 8 at the Box Office'));
INSERT 0 1
Next, view the segmentation results. In these results, you'll see that the location information has been included in the segmentation.
postgres=# select * from ts_test;
id | info | ts
----+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------
1 | Passion, Innovation, Persistence, and Speed—Beijing North Venture Taxi CO., LTD.'s quality service highlights | 'quality service':9 'CO.':8 'Taxi':7 'Venture':6 'Innovation':2 'Beijing':5 'Persistence':3 'high':10 'Passion':1 'lights':11 'Speed':4
1 | The Speed and Passion of Film 8 at the Box Office | 'Passion':3 'Film':1 'at the':5 'Box Office':4 'Speed':2
(2 rows)
Now you'll want to write more interference data.
postgres=# insert into ts_test select 2, 'Passion, Innovation, Persistence, and Speed—Beijing North Venture Taxi CO., LTD.'s quality service highlights', to_tsvector('scwscfg', 'Passion, Innovation, Persistence, and Speed—Beijing North Venture Taxi CO., LTD.'s quality service highlights') from generate_series(1,1000000);
INSERT 0 1000000
After this, create an index plug-in RUM, which supports location search.
postgres=# create extension rum;
CREATE EXTENSION
Create a RUM segmentation index.
postgres=# CREATE INDEX rumidx ON ts_test USING rum (ts rum_tsvector_ops);
CREATE INDEX
Next, you'll want to check the query results. The first query does not include the location, so we can see that all records have been matched.
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ts_test where ts @@ 'Speed & Passion'::tsquery;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Index Scan using rumidx on public.ts_test (cost=9.60..41.24 rows=25 width=68) (actual time=223.602..456.956 rows=1000002 loops=1)
Output: id, info, ts
Index Cond: (ts_test.ts @@ '''Speed'' & ''Passion'''::tsquery)
Buffers: shared hit=38132
Planning time: 0.226 ms
Execution time: 541.545 ms
(6 rows)
The second query includes a location, and only matches one record, which is the "Fast and Furious" we want.
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ts_test where ts @@ 'Speed <1> Passion'::tsquery;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Index Scan using rumidx on public.ts_test (cost=9.60..41.24 rows=25 width=68) (actual time=270.675..270.677 rows=1 loops=1)
Output: id, info, ts
Index Cond: (ts_test.ts @@ '''Speed'' <-> ''Passion'''::tsquery)
Buffers: shared hit=1095
Planning time: 0.111 ms
Execution time: 273.840 ms
(6 rows)
PostgreSQL segmenter is relatively simple to use. Simply follow the instructions given below.
With a Distance Value Search
postgres=# select * from ts_test where ts @@ 'Speed <1> Passion'::tsquery;
id | info | ts
----+---------------------------+--------------------------------------------
1 | The Speed and Passion of Film 8 at the Box Office | 'Passion':3 'Film':1 'at the':5 'Box Office':4 'Speed':2
(1 row)
Without a Distance Value Search
postgres=# select * from ts_test where ts @@ 'Speed & Passion'::tsquery limit 5;
id | info | ts
----+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------
1 | Passion, Innovation, Persistence, and Speed—Beijing North Venture Taxi CO., LTD.'s quality service highlights | 'Quality Service':9 'Co.':8 'Taxi':7 'Venture':6 'Innovation':2 'Beijing':5 'Persistence':3 'High':10 'Passion':1 'light':11 'Speed':4
1 | The Speed and Passion of Film 8 at the Box Office | 'Passion':3 'Film':1 'at the':5 'Box Office':4 'Speed':2
2 | Passion, Innovation, Persistence, and Speed—Beijing North Venture Taxi CO., LTD.'s quality service highlights | 'Quality Service':9 'Co.':8 'Tax':7 'Venture':6 'Innovation':2 'Beijing':5 'Persistence':3 'High':10 'passion':1 'light':11 'Speed':4
2 | Passion, Innovation, Persistence, and Speed—Beijing North Venture Taxi CO., LTD.'s quality service highlights | 'Quality Service':9 'Co.':8 'Taxi':7 'Venture':6 'Innovation':2 'Beijing':5 'Persistence':3 'High':10 'Passion':1 'light':11 'Speed':4
2 | Passion, Innovation, Persistence, and Speed—Beijing North Venture Taxi CO., LTD.'s quality service highlights | 'Quality Service':9 'Co.':8 'Taxi':7 'Venture':6 'Innovation':2 'Beijing':5 'Persistence':3 'High':10 'Passion':1 'light':11 'Speed':4
(5 rows)
Last, if the distance is a range, we can also use user-defined function (UDF) to implement filtering.
-- Create a UDF that retrieves the distance internal between two words.
create or replace function get_lexeme_pos_range(tsvector, text, text) returns int4range as
$$
declare
a_pos int[];
b_pos int[];
m1 int;
m2 int;
begin
-- All the positions where the first word appears.
select positions into a_pos from (select * from unnest($1)) t where lexeme=$2;
-- All the positions where the second word appears.
select positions into b_pos from (select * from unnest($1)) t where lexeme=$3;
-- The position of where the two words appear, find the interval.
-- Return the range type.
select min(abs(t1.pos-t2.pos)), max(abs(t1.pos-t2.pos)) into m1,m2 from
(select unnest(a_pos) pos) t1 cross join
(select unnest(b_pos) pos) t2;
return int4range(m1,m2+1);
end;
$$
language plpgsql strict immutable;
-- Test, find the distance of 1 to 2 (excluding 2).
postgres=# select get_lexeme_pos_range(ts, 'Speed', 'Passion'), * from ts_test where ts @@ tsquery 'Speed & Passion' and get_lexeme_pos_range(ts, 'Speed', 'Passion') && int4range(1,2) limit 1;
get_lexeme_pos_range | id | info | ts
----------------------+----+---------------------------+--------------------------------------------
[1,2) | 1 | The Speed and Passion of Film 8 at the Box Office | 'Passion':3 'Film':1 'at the':5 'Box Office':4 'Speed':2
(1 row)
Time: 0.713 ms
-- Test, take the distance of 2 to 5 (excluding 5)
postgres=# select get_lexeme_pos_range(ts, 'Speed', 'Passion'), * from ts_test where ts @@ tsquery 'Speed & Passion' and get_lexeme_pos_range(ts, 'Speed', 'Passion') && int4range(2,5) limit 1;
get_lexeme_pos_range | id | info | ts
----------------------+----+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------
[3,4) | 1 | Passion, Innovation, Persistence, Speed—Beijing North Venture Taxi Co., LTD.'s quality service highlights | 'quality service':9 'Co.':8 'Taxi':7 'Venture':6 'Innovation':2 'Beijing':5 'Persistence':3 'High':10 'Passion':1 'light':11 'Speed':4
(1 row)
Time: 0.682 ms
How to Use PostgreSQL to Efficiently Search With Split Fields?
digoal - September 12, 2019
digoal - September 12, 2019
digoal - December 11, 2019
ApsaraDB - June 15, 2023
Data Geek - April 25, 2024
digoal - December 11, 2019
Alibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreAn online MPP warehousing service based on the Greenplum Database open source program
Learn MoreOpenSearch helps develop intelligent search services.
Learn MoreAn intelligent image search service with product search and generic search features to help users resolve image search requests.
Learn MoreMore Posts by digoal