By Digoal
This article describes PostgreSQL's performance on K-nearest-neighbor (KNN) queries over location data.
The test data type is point, and the index type is GiST.
PostGIS also supports KNN queries, with performance similar to the results shown here.
The test covers 10 billion data records.
Test result: with 64 concurrent clients issuing KNN queries on random points, the average response time for a single query is 0.848 milliseconds.
postgres=# create table tbl_point(id serial8, poi point);
CREATE TABLE
postgres=# \d tbl_point
Table "benchmarksql.tbl_point"
Column | Type | Modifiers
--------+--------+--------------------------------------------------------
id | bigint | not null default nextval('tbl_point_id_seq'::regclass)
poi | point |
postgres=# alter sequence tbl_point_id_seq cache 10000;
ALTER SEQUENCE
The x and y coordinates of each point range from -50,000 to +50,000, so about 10 billion (100,000 × 100,000) distinct points can be generated in total.
This meets the test data volume.
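As a quick sanity check, the SQL expression used below to generate coordinates can be sketched in Python (a hypothetical translation, not part of the test itself); it shows that each coordinate is an integer with about 100,000 possible values, giving roughly 10^10 distinct points:

```python
import math
import random

def random_point():
    """Python translation of the SQL expression
    point(trunc(100000*(0.5-random())), trunc(100000*(0.5-random()))):
    each coordinate is an integer in roughly [-50000, 50000]."""
    x = math.trunc(100000 * (0.5 - random.random()))
    y = math.trunc(100000 * (0.5 - random.random()))
    return (x, y)

# About 100,000 possible values per axis gives roughly
# 100000 * 100000 = 10^10 distinct points, matching the
# 10-billion-record test target.
distinct_points = 100000 * 100000
```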
vi test.sql
insert into tbl_point(poi) select point(trunc(100000*(0.5-random())), trunc(100000*(0.5-random()))) from generate_series(1,10000);
With pgbench, about 2.33 million location records are inserted per second (233 transactions per second × 10,000 rows per transaction).
pgbench -M prepared -n -r -f ./test.sql -P 1 -c 96 -j 96 -T 1100
tps = 233.018365 (including connections establishing)
tps = 233.150940 (excluding connections establishing)
Data volume:
postgres=# select count(*) from tbl_point;
count
------------
2532820000
(1 row)
Size of the current table:
postgres=# \dt+
benchmarksql | tbl_point | table | postgres | 123 GB |
Create a GiST index for the point type.
postgres=# create index idx_tbl_point on tbl_point using gist(poi) with (buffering=on);
postgres=# \d+ tbl_point
Table "benchmarksql.tbl_point"
Column | Type | Modifiers | Storage | Stats target | Description
--------+--------+--------------------------------------------------------+---------+--------------+-------------
id | bigint | not null default nextval('tbl_point_id_seq'::regclass) | plain | |
poi | point | | plain | |
Indexes:
"idx_tbl_point" gist (poi) WITH (buffering='on')
Index size:
\di+
benchmarksql | idx_tbl_point | index | postgres | tbl_point | 170 GB |
After the index is created, insertion performance drops to approximately 550,000 location records per second (55 transactions per second × 10,000 rows per transaction).
pgbench -M prepared -n -r -f ./test.sql -P 1 -c 96 -j 96 -T 100
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 96
number of threads: 96
duration: 100 s
number of transactions actually processed: 5587
latency average: 1726.947 ms
latency stddev: 118.223 ms
tps = 55.390665 (including connections establishing)
tps = 55.419003 (excluding connections establishing)
statement latencies in milliseconds:
1726.946947 insert into tbl_point(poi) select point(trunc(100000*(0.5-random())), trunc(100000*(0.5-random()))) from generate_series(1,10000);
The insertion continues until all 10 billion records have been inserted.
pgbench -M prepared -n -r -f ./test.sql -P 1 -c 64 -j 64 -T 13600
Table size: 500GB
Index size: 720GB
postgres=# select *,poi <-> point(1000,1000) dist from tbl_point where poi <-> point(1000,1000) < 100 order by poi <-> point(1000,1000) limit 10;
id | poi | dist
------------+-------------+------------------
399588621 | (1000,999) | 1
1030719903 | (1001,999) | 1.4142135623731
2698052191 | (1001,1001) | 1.4142135623731
3291219762 | (999,999) | 1.4142135623731
2757190006 | (1002,1000) | 2
2862610530 | (998,1001) | 2.23606797749979
3450459141 | (998,1001) | 2.23606797749979
3124756442 | (1002,1001) | 2.23606797749979
3105439886 | (1001,998) | 2.23606797749979
473144305 | (998,1002) | 2.82842712474619
(10 rows)
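The dist column above can be verified by hand: for point values, PostgreSQL's <-> operator returns the Euclidean distance. A minimal Python sketch (the function name is my own):

```python
import math

def point_distance(a, b):
    """Euclidean distance between two (x, y) tuples, which is what
    PostgreSQL's <-> operator computes for two point values."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# e.g. point_distance((1001, 999), (1000, 1000)) is sqrt(2),
# matching the 1.4142135623731 rows in the result above.
```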
Both the sorting and the filtering use the GiST index.
For example, the following query scans only 16 data blocks: eight are hit in the shared buffer, and eight are read from the OS cache or directly from the block device.
postgres=# explain (analyze,verbose,buffers,timing,costs) select *,poi <-> point(10090,10090) dist from tbl_point where poi <-> point(10090,10090) < 100 order by poi <-> point(10090,10090) limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.56..13.15 rows=10 width=24) (actual time=0.469..1.309 rows=10 loops=1)
Output: id, poi, ((poi <-> '(10090,10090)'::point))
Buffers: shared hit=8 read=8 dirtied=1
-> Index Scan using idx_tbl_point on benchmarksql.tbl_point (cost=0.56..1510464450.86 rows=1199422376 width=24) (actual time=0.468..1.306 rows=10 loops=1)
Output: id, poi, (poi <-> '(10090,10090)'::point)
Order By: (tbl_point.poi <-> '(10090,10090)'::point)
Filter: ((tbl_point.poi <-> '(10090,10090)'::point) < '100'::double precision)
Buffers: shared hit=8 read=8 dirtied=1
Planning time: 0.084 ms
Execution time: 1.347 ms
(10 rows)
Now test the performance of PostgreSQL KNN queries over the 10 billion location records.
The test script generates a random point, searches for the points within a distance of 100 from it, sorts them by distance, and returns the nearest one.
vi test.sql
\setrandom x -50000 50000
\setrandom y -50000 50000
select * from tbl_point where poi <-> point(:x,:y) <100 order by poi <-> point(:x,:y) limit 1;
Test result: The average response time for a single request is 0.848 milliseconds.
pgbench -M prepared -n -r -f ./test.sql -P 1 -c 64 -j 64 -T 100
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 100 s
number of transactions actually processed: 7418337
latency average: 0.858 ms
latency stddev: 0.564 ms
tps = 74151.604194 (including connections establishing)
tps = 74184.255934 (excluding connections establishing)
statement latencies in milliseconds:
0.007518 \setrandom x -50000 50000
0.002193 \setrandom y -50000 50000
0.847847 select * from tbl_point where poi <-> point(:x,:y) <100 order by poi <-> point(:x,:y) limit 1;
The table and indexes together exceed 1 TB, far larger than memory. The 0.858-millisecond response time is largely attributable to the AliFlash PCI-E SSD cards, which complete a single I/O request in about 0.01 millisecond on average (svctm), with an average queue wait of about 0.19 millisecond (await).
avg-cpu: %user %nice %system %iowait %steal %idle
69.54 0.00 24.11 5.87 0.00 0.47
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
dfa 0.00 0.00 26100.00 2096.00 417600.00 33536.00 16.00 5.05 0.18 0.03 98.00
dfb 0.00 0.00 26150.00 2038.00 418400.00 32600.00 16.00 5.01 0.18 0.03 98.40
dfc 0.00 0.00 25931.00 2026.00 414896.00 32384.00 16.00 6.15 0.22 0.04 99.70
dm-0 0.00 0.00 78178.00 6160.00 1250848.00 98520.00 16.00 16.73 0.19 0.01 101.00
When the query point is far from all stored data (no point falls within the requested distance), the KNN scan has to visit far more GiST index pages.
Example:
The following query takes a long time to execute, and the result set may contain zero records:
explain (analyze,verbose,buffers,timing,costs) select *,poi <-> point(10090,10090000) dist from tbl_point where poi <-> point(10090,10090000) < 100 order by poi <-> point(10090,10090000) limit 10;
To resolve this problem, bound the scan with ORDER BY and LIMIT in a subquery, then filter the returned points by distance:
postgres=# explain (analyze,verbose,buffers,timing,costs) select * from (select *,poi <-> point(10090,10090000) dist from tbl_point order by poi <-> point(10090,10090000) limit 1000 ) t where poi <-> point(10090,10090000) < 100 limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.56..13.51 rows=10 width=32) (actual time=3.769..3.769 rows=0 loops=1)
Output: t.id, t.poi, t.dist
Buffers: shared hit=1174
-> Subquery Scan on t (cost=0.56..432.00 rows=333 width=32) (actual time=3.769..3.769 rows=0 loops=1)
Output: t.id, t.poi, t.dist
Filter: ((t.poi <-> '(10090,10090000)'::point) < '100'::double precision)
Rows Removed by Filter: 1000
Buffers: shared hit=1174
-> Limit (cost=0.56..417.00 rows=1000 width=24) (actual time=0.106..3.596 rows=1000 loops=1)
Output: tbl_point.id, tbl_point.poi, ((tbl_point.poi <-> '(10090,10090000)'::point))
Buffers: shared hit=1174
-> Index Scan using idx_tbl_point on benchmarksql.tbl_point (cost=0.56..1498470227.10 rows=3598267127 width=24) (actual time=0.105..3.505 rows=1000 loops=1)
Output: tbl_point.id, tbl_point.poi, (tbl_point.poi <-> '(10090,10090000)'::point)
Order By: (tbl_point.poi <-> '(10090,10090000)'::point)
Buffers: shared hit=1174
Planning time: 0.069 ms
Execution time: 3.793 ms
(17 rows)
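The logic of the rewritten query can be sketched in Python (function and parameter names are my own, for illustration): the inner ORDER BY ... LIMIT caps how far the index scan runs, and the outer query filters and limits the result.

```python
import math

def knn_filter(points, origin, radius, k, scan_cap=1000):
    """Sketch of the rewritten query: the inner ORDER BY ... LIMIT takes
    at most scan_cap nearest neighbors, bounding the index scan; the
    outer query filters by distance and returns at most k rows."""
    dist = lambda p: math.hypot(p[0] - origin[0], p[1] - origin[1])
    nearest = sorted(points, key=dist)[:scan_cap]       # inner: order by <-> ... limit scan_cap
    within = [p for p in nearest if dist(p) < radius]   # outer: distance filter
    return within[:k]                                   # outer: limit k
```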
This measure is also applicable to PostGIS.
digoal=# select * from (select *,ST_Distance(jwd, ST_Transform(ST_GeomFromText('POINT(120.19 30.26)', 4326), 2163)) AS dist from cust_jw order by jwd <-> ST_Transform(ST_GeomFromText('POINT(120.19 30.26)', 4326), 2163) limit 1000) t where dist<15000;
dz | jwd | dist
------+----------------------------------------------------+------------------
杭州 | 0101000020730800004C94087D5D4F54C173AA7759E8FB5D41 | 0
余杭 | 0101000020730800000E6E5A20494854C121FC688DA9EF5D41 | 14483.9823187612
(2 rows)
Time: 0.634 ms
The following is an even more economical and resource-saving optimization that solves the same problem with a cursor: the scan stops as soon as the limit is reached or the distance threshold is exceeded, so at most one extra record needs to be scanned.
digoal=# do language plpgsql $$
declare
v_rec record;
v_limit int := 1000;
begin
set enable_seqscan=off; -- force the index scan; exit as soon as enough rows have been scanned
for v_rec in select *,ST_Distance(jwd, ST_Transform(ST_GeomFromText('POINT(120.19 30.26)', 4326), 2163)) AS dist from cust_jw order by jwd <-> ST_Transform(ST_GeomFromText('POINT(120.19 30.26)', 4326), 2163) loop
if v_limit <=0 then
raise notice 'enough rows fetched';
return;
end if;
if v_rec.dist > 20000 then
raise notice 'all points within the distance limit have been output';
return;
else
raise notice 'do something, v_rec:%', v_rec;
end if;
v_limit := v_limit -1;
end loop;
end;
$$;
NOTICE: do something, v_rec:(杭州,0101000020730800004C94087D5D4F54C173AA7759E8FB5D41,0)
NOTICE: do something, v_rec:(余杭,0101000020730800000E6E5A20494854C121FC688DA9EF5D41,14483.9823187612)
NOTICE: all points within the distance limit have been output
DO
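The cursor loop above can be sketched in Python (a hypothetical analogue with invented names, assuming rows arrive in distance order as the index scan delivers them): stop when either the limit is exhausted or the distance threshold is passed, so at most one row beyond what is needed is examined.

```python
import math

def scan_knn(points, origin, radius, limit):
    """Python analogue of the plpgsql cursor loop: walk rows in
    distance order and stop as soon as the limit is reached or the
    distance exceeds the radius."""
    out = []
    dist = lambda p: math.hypot(p[0] - origin[0], p[1] - origin[1])
    for p in sorted(points, key=dist):  # index scan yields rows in distance order
        d = dist(p)
        if d > radius:
            break  # all remaining rows are farther away
        out.append((p, d))
        if len(out) >= limit:
            break  # enough rows fetched
    return out
```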
References:
1) http://www.postgresql.org/docs/9.5/static/gist-intro.html
2) Sorting GiST Indexes by Leveraging Nearest-neighbor Enhancement in PostgreSQL 9.1
3) PostgreSQL: Optimizing Single-column Conditions-based Queries