PostgreSQL: Nearest Neighbor Query Performance on Billions of Geolocation Records

This article discusses the performance of nearest neighbor queries in PostgreSQL and talks about optimization measures that address performance issues.

By Digoal

Background

This article mainly describes PostgreSQL's performance in K-nearest neighbor (KNN) queries among location information.

The test data type is point, and the index type is GiST.

PostGIS also supports KNN queries, and its performance is similar to that of PostgreSQL.

The test covers tens of billions of data records.

Test result: Test random points with a concurrency of 64 threads. The average response time for a single KNN query request is 0.848 milliseconds.

Test Details

Create a Test Table

postgres=# create table tbl_point(id serial8, poi point);  
CREATE TABLE  
postgres=# \d tbl_point  
                      Table "benchmarksql.tbl_point"  
 Column |  Type  |                       Modifiers                          
--------+--------+--------------------------------------------------------  
 id     | bigint | not null default nextval('tbl_point_id_seq'::regclass)  
 poi    | point  |   
  
postgres=# alter sequence tbl_point_id_seq cache 10000;  
ALTER SEQUENCE

Generate Test Data

The values of x and y of a point range from -50000 to +50000. Therefore, 10 billion different points can be generated in total.

This meets the test data volume.

vi test.sql  
insert into tbl_point(poi) select point(trunc(100000*(0.5-random())), trunc(100000*(0.5-random()))) from generate_series(1,10000);

About 2.33 million records of location information are inserted per second by using pgbench.

pgbench -M prepared -n -r -f ./test.sql -P 1 -c 96 -j 96 -T 1100  
tps = 233.018365 (including connections establishing)  
tps = 233.150940 (excluding connections establishing)

Data volume:

postgres=# select count(*) from tbl_point;  
   count      
------------  
 2532820000  
(1 row)

Size of the current table:

postgres=# \dt+  
 benchmarksql | tbl_point  | table | postgres | 123 GB   |

Create a GiST index for the point type.

postgres=# create index idx_tbl_point on tbl_point using gist(poi) with (buffering=on);  
  
postgres=# \d+ tbl_point  
                                         Table "benchmarksql.tbl_point"  
 Column |  Type  |                       Modifiers                        | Storage | Stats target | Description   
--------+--------+--------------------------------------------------------+---------+--------------+-------------  
 id     | bigint | not null default nextval('tbl_point_id_seq'::regclass) | plain   |              |   
 poi    | point  |                                                        | plain   |              |   
Indexes:  
    "idx_tbl_point" gist (poi) WITH (buffering='on')

Index size:

\di+  
 benchmarksql | idx_tbl_point      | index | postgres | tbl_point  | 170 GB  |

After the index is created, the insertion performance deteriorates, and approximately 550,000 location information records are inserted per second.

pgbench -M prepared -n -r -f ./test.sql -P 1 -c 96 -j 96 -T 100  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 96  
number of threads: 96  
duration: 100 s  
number of transactions actually processed: 5587  
latency average: 1726.947 ms  
latency stddev: 118.223 ms  
tps = 55.390665 (including connections establishing)  
tps = 55.419003 (excluding connections establishing)  
statement latencies in milliseconds:  
        1726.946947     insert into tbl_point(poi) select point(trunc(100000*(0.5-random())), trunc(100000*(0.5-random()))) from generate_series(1,10000);

The insertion operation continues until all the 10 billion records are inserted.

pgbench -M prepared -n -r -f ./test.sql -P 1 -c 64 -j 64 -T 13600

Table size: 500GB

Index size: 720GB

Sample KNN Query

postgres=# select *,poi <-> point(1000,1000) dist from tbl_point where poi <-> point(1000,1000) < 100 order by poi <-> point(1000,1000) limit 10;  
     id     |     poi     |       dist         
------------+-------------+------------------  
  399588621 | (1000,999)  |                1  
 1030719903 | (1001,999)  |  1.4142135623731  
 2698052191 | (1001,1001) |  1.4142135623731  
 3291219762 | (999,999)   |  1.4142135623731  
 2757190006 | (1002,1000) |                2  
 2862610530 | (998,1001)  | 2.23606797749979  
 3450459141 | (998,1001)  | 2.23606797749979  
 3124756442 | (1002,1001) | 2.23606797749979  
 3105439886 | (1001,998)  | 2.23606797749979  
  473144305 | (998,1002)  | 2.82842712474619  
(10 rows)

KNN Execution Plan

Both the sorting and query are based on the GiST index.

For example, 16 data blocks are scanned for the following query, where eight data blocks are hit in the shared buffer, and eight data blocks are read from the OS cache or directly from the block dev.

postgres=# explain (analyze,verbose,buffers,timing,costs) select *,poi <-> point(10090,10090) dist from tbl_point where poi <-> point(10090,10090) < 100 order by poi <-> point(10090,10090) limit 10;  
                                                                           QUERY PLAN                                                                             
----------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=0.56..13.15 rows=10 width=24) (actual time=0.469..1.309 rows=10 loops=1)  
   Output: id, poi, ((poi <-> '(10090,10090)'::point))  
   Buffers: shared hit=8 read=8 dirtied=1  
   ->  Index Scan using idx_tbl_point on benchmarksql.tbl_point  (cost=0.56..1510464450.86 rows=1199422376 width=24) (actual time=0.468..1.306 rows=10 loops=1)  
         Output: id, poi, (poi <-> '(10090,10090)'::point)  
         Order By: (tbl_point.poi <-> '(10090,10090)'::point)  
         Filter: ((tbl_point.poi <-> '(10090,10090)'::point) < '100'::double precision)  
         Buffers: shared hit=8 read=8 dirtied=1  
 Planning time: 0.084 ms  
 Execution time: 1.347 ms  
(10 rows)

KNN Query Stress Test

Test the performance of PostgreSQL in KNN queries among 10 billion geographic location records.

Use the following test script to generate a random point, search for the points within a distance of 100 meters from the point, sort the obtained points by distance, and retrieve one point.

vi test.sql  
\setrandom x -50000 50000  
\setrandom y -50000 50000  
select * from tbl_point where poi <-> point(:x,:y) <100 order by poi <-> point(:x,:y) limit 1;

Test result: The average response time for a single request is 0.848 millisecond.

pgbench -M prepared -n -r -f ./test.sql -P 1 -c 64 -j 64 -T 100  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 64  
number of threads: 64  
duration: 100 s  
number of transactions actually processed: 7418337  
latency average: 0.858 ms  
latency stddev: 0.564 ms  
tps = 74151.604194 (including connections establishing)  
tps = 74184.255934 (excluding connections establishing)  
statement latencies in milliseconds:  
        0.007518        \setrandom x -50000 50000  
        0.002193        \setrandom y -50000 50000  
        0.847847        select * from tbl_point where poi <-> point(:x,:y) <100 order by poi <-> point(:x,:y) limit 1;

The data volume of the memory and indexes has exceeded 1 TB, far greater than the memory size. The response time of 0.858 milliseconds is benefited from the performance of the AliFlash PCI-E SSD card, which completes a single request within 0.01 millisecond on average. The wait time in the queue is 0.19 millisecond.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle  
          69.54    0.00   24.11    5.87    0.00    0.47  
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util  
dfa               0.00     0.00 26100.00 2096.00 417600.00 33536.00    16.00     5.05    0.18   0.03  98.00  
dfb               0.00     0.00 26150.00 2038.00 418400.00 32600.00    16.00     5.01    0.18   0.03  98.40  
dfc               0.00     0.00 25931.00 2026.00 414896.00 32384.00    16.00     6.15    0.22   0.04  99.70  
dm-0              0.00     0.00 78178.00 6160.00 1250848.00 98520.00    16.00    16.73    0.19   0.01 101.00

Optimization Measure 1

When the distance of the requested data is beyond the specified range, the scanned GiST index page is enlarged. To resolve this problem, use the ORDER BY and LIMIT clauses and filter points that meet the distance condition.

Example:

Execution of this query takes a long time, and the result may be 0 records.

explain (analyze,verbose,buffers,timing,costs) select *,poi <-> point(10090,10090000) dist from tbl_point where poi <-> point(10090,10090000) < 100 order by poi <-> point(10090,10090000) limit 10;

To resolve this problem, use the ORDER BY and LIMIT clauses and filter points that meet the distance condition.

postgres=# explain (analyze,verbose,buffers,timing,costs) select * from (select *,poi <-> point(10090,10090000) dist from tbl_point order by poi <-> point(10090,10090000) limit 1000 ) t where poi <-> point(10090,10090000) < 100 limit 10;   
                                                                                  QUERY PLAN                                                                                    
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=0.56..13.51 rows=10 width=32) (actual time=3.769..3.769 rows=0 loops=1)  
   Output: t.id, t.poi, t.dist  
   Buffers: shared hit=1174  
   ->  Subquery Scan on t  (cost=0.56..432.00 rows=333 width=32) (actual time=3.769..3.769 rows=0 loops=1)  
         Output: t.id, t.poi, t.dist  
         Filter: ((t.poi <-> '(10090,10090000)'::point) < '100'::double precision)  
         Rows Removed by Filter: 1000  
         Buffers: shared hit=1174  
         ->  Limit  (cost=0.56..417.00 rows=1000 width=24) (actual time=0.106..3.596 rows=1000 loops=1)  
               Output: tbl_point.id, tbl_point.poi, ((tbl_point.poi <-> '(10090,10090000)'::point))  
               Buffers: shared hit=1174  
               ->  Index Scan using idx_tbl_point on benchmarksql.tbl_point  (cost=0.56..1498470227.10 rows=3598267127 width=24) (actual time=0.105..3.505 rows=1000 loops=1)  
                     Output: tbl_point.id, tbl_point.poi, (tbl_point.poi <-> '(10090,10090000)'::point)  
                     Order By: (tbl_point.poi <-> '(10090,10090000)'::point)  
                     Buffers: shared hit=1174  
 Planning time: 0.069 ms  
 Execution time: 3.793 ms  
(17 rows)

This measure is also applicable to PostGIS.

digoal=# select * from (select *,ST_Distance(jwd, ST_Transform(ST_GeomFromText('POINT(120.19 30.26)', 4326), 2163)) AS dist from cust_jw order by jwd <-> ST_Transform(ST_GeomFromText('POINT(120.19 30.26)', 4326), 2163) limit 1000) t where dist<15000;  
  dz  |                        jwd                         |       dist         
------+----------------------------------------------------+------------------  
 杭州 | 0101000020730800004C94087D5D4F54C173AA7759E8FB5D41 |                0  
 余杭 | 0101000020730800000E6E5A20494854C121FC688DA9EF5D41 | 14483.9823187612  
(2 rows)  
Time: 0.634 ms

Optimization Measure 2

The following is a more economical and resource-saving optimization measure that achieves optimal resource utilization and solves the foregoing problem by using cursors. In this measure, at most, one more page needs to be scanned.

digoal=# do language plpgsql $$  
declare  
  v_rec record;  
  v_limit int := 1000;  
begin  
  set enable_seqscan=off;  -- 强制索引, 因为扫描行数够就退出.  
  for v_rec in select *,ST_Distance(jwd, ST_Transform(ST_GeomFromText('POINT(120.19 30.26)', 4326), 2163)) AS dist from cust_jw order by jwd <-> ST_Transform(ST_GeomFromText('POINT(120.19 30.26)', 4326), 2163) loop  
    if v_limit <=0 then   
      raise notice '已经取足数据';  
      return;  
    end if;  
    if v_rec.dist > 20000 then   
      raise notice '满足条件的点已输出完毕';  
      return;  
    else  
      raise notice 'do someting, v_rec:%', v_rec;  
    end if;  
    v_limit := v_limit -1;  
  end loop;  
end;  
$$;  
NOTICE:  do someting, v_rec:(杭州,0101000020730800004C94087D5D4F54C173AA7759E8FB5D41,0)  
NOTICE:  do someting, v_rec:(余杭,0101000020730800000E6E5A20494854C121FC688DA9EF5D41,14483.9823187612)  
NOTICE:  满足条件的点已输出完毕  
DO

References

1) http://www.postgresql.org/docs/9.5/static/gist-intro.html

Community

PostgreSQL: Nearest Neighbor Query Performance on Billions of Geolocation Records

Background

Test Details

Create a Test Table

Generate Test Data

Sample KNN Query

KNN Execution Plan

KNN Query Stress Test

Optimization Measure 1

Optimization Measure 2

References

Read previous post:

Read next post:

digoal

You may also like

Comments

digoal

Related Products

PolarDB for PostgreSQL

AnalyticDB for PostgreSQL

ApsaraDB RDS for PostgreSQL

Database for FinTech Solution