Use Geometry Split to Optimize PostgreSQL Performance in GiST Index-based Polygon Search

In this article, the author discusses geometry split in PostgreSQL and describes how to use it to optimize performance in GiST index-based polygon search with example code.

By digoal

Background

Our article "Optimize Spatial Searches in PostgreSQL Using ST_Contains and ST_Within" describes a method for reducing CPU usage during searches. This method splits a long-strip geometry (polygon) that accounts for a small area proportion in its bounding box (covering a relatively small area in its bounding box) into a spatial geometries collection. After the split, the new geometries account for higher area proportions in their bounding boxes than the original long-strip polygon, significantly reducing the number of data blocks to be scanned and increasing the search hits.

The key to this method is to split the irregularly shaped polygon. PostGIS provides a split function in this regard, but it can only split a geometry into two parts. This article describes a method that allows splitting a geometry into as many parts as needed. All you have to do is to enter the target geometry to be split, the number of horizontal cuts, the number of vertical cuts, and the area proportion above which cutting is not performed.

Does it sound like cutting a Hearts and Arrows diamond?

The purpose of geometry split is to reduce the invalid areas in a geometry's bounding box. In the following figure, the invalid area is too large and seriously affects the database's search performance.

Example of Geometry Split

Logic

1) Enter the target geometry to be split, the number of horizontal cuts, the number of vertical cuts, and the area ratio to stop splitting.

2) Calculate the target geometry's area.

3) Determine the target geometry's bounding box and calculate the bounding box's area.

4) Determine whether it is necessary to split the geometry. If it is necessary to split the geometry, determine the cutting lines first. Then cut the geometry.

5) Determine the geometry collection obtained through the split operation.

6) Convert the geometry collection into a geo array.

Code Sample

create or replace function split_geo(  
  i_geo geometry,   -- 被切割的目标对象  
  i_srid int,      -- SRID  
  i_x int2,         -- X方向切多少刀  
  i_y int2,         -- Y方向切多少刀  
  i_aratio float4   -- 面积占比阈值，高于它则不切割  
)   
returns geometry[] as $$  
declare  
  res geometry[];         -- 结果  
  tmp_geo geometry;       -- 切割后的临时对象(geometry collection)  
  split_geos geometry[];  -- 切割线数组  
  split_line geometry;    -- 线段  
  v_area_obj float8;      -- 目标对象面积  
  v_area_box float8;      -- 目标对象的bound box的面积  
  v_xmin float8 := ST_XMin(i_geo);          -- 目标对象BOUND BOX，XMIN  
  v_ymin float8 := ST_YMin(i_geo);          -- 目标对象BOUND BOX，YMIN  
  v_xmax float8 := ST_XMax(i_geo);          -- 目标对象BOUND BOX，XMAX  
  v_ymax float8 := ST_YMax(i_geo);          -- 目标对象BOUND BOX，YMAX  
  v_box geometry;         -- 目标对象的BOUND BOX  
  x_geo geometry;         -- 分解geometry collection临时对象  
begin  
  -- 求边界  
  v_box := st_setsrid(ST_MakeBox2D(st_makepoint(v_xmin,v_ymin), st_makepoint(v_xmax,v_ymax)),i_srid);  
    
  -- 求面积  
  v_area_obj := st_area(i_geo);  
  v_area_box := st_area(v_box);  
  
  -- split 前的空间占比  
  raise notice '%', (v_area_obj/v_area_box);  
    
  -- 计算面积占比，判断是否需要切割  
  if (v_area_obj/v_area_box) > i_aratio then    
    -- 大于面积比，不切割  
    return array[i_geo];  
  else  
    -- 计算切割线段X位点  
    for i in 1..i_x  
    loop  
      split_geos := coalesce  
        (   
        array_append(split_geos, st_setsrid(st_makeline(st_makepoint(v_xmin+i*((v_xmax-v_xmin)/(i_x+1)), v_ymin), st_makepoint(v_xmin+i*((v_xmax-v_xmin)/(i_x+1)), v_ymax)), i_srid))  
        ,   
        array[st_setsrid(st_makeline(st_makepoint(v_xmin+i*((v_xmax-v_xmin)/(i_x+1)), v_ymin), st_makepoint(v_xmin+i*((v_xmax-v_xmin)/(i_x+1)), v_ymax)), i_srid)]   
        );   
    end loop;  
  
    -- 计算切割线段Y位点  
    for i in 1..i_y  
    loop  
      split_geos := coalesce  
        (   
        array_append(split_geos, st_setsrid(st_makeline(st_makepoint(v_xmin, v_ymin+i*((v_ymax-v_ymin)/(i_y+1))), st_makepoint(v_xmax, v_ymin+i*((v_ymax-v_ymin)/(i_y+1)))), i_srid))  
        ,   
        array[st_setsrid(st_makeline(st_makepoint(v_xmin, v_ymin+i*((v_ymax-v_ymin)/(i_y+1))), st_makepoint(v_xmax, v_ymin+i*((v_ymax-v_ymin)/(i_y+1)))), i_srid)]   
        );   
    end loop;  
  
    -- 切割  
    foreach split_line in array split_geos  
    loop  
      tmp_geo := coalesce  
        (  
st_split(tmp_geo, split_line)  
    ,  
        st_split(i_geo, split_line)  
    );  
    end loop;  
  end if;  
  
  -- 将geometry collection转换为geometry数组  
  for i in 1..ST_NumGeometries(tmp_geo)  
  loop  
    res := coalesce(array_append(res, ST_GeometryN(tmp_geo, i)), array[ST_GeometryN(tmp_geo, i)]);  
  end loop;  
  
  -- split 后的空间占比  
  v_area_obj := 0;  
  v_area_box := 0;  
  
  foreach x_geo in array res  
  loop  
    v_area_obj := v_area_obj + st_area(x_geo);  
    v_area_box := v_area_box + st_area(st_setsrid(ST_MakeBox2D(st_makepoint(st_xmin(x_geo),st_ymin(x_geo)), st_makepoint(st_xmax(x_geo),st_ymax(x_geo))),i_srid));  
  end loop;  
  
  -- split 后的空间占比  
  raise notice '%', (v_area_obj/v_area_box);  
  
  return res;  
end;  
$$ language plpgsql strict immutable;

Verify Whether the Split is Successful

1) If two horizontal cuts and two vertical cuts are made, we can expect at most nine geometries after the split. If a cut fails to fall on a valid location, no parts are returned. Therefore, the actual number of new geometries may be smaller than nine.

select st_astext(  
unnest(  
split_geo(  
  st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(0 0,1 0,1 2.5,6 2.5,6 4,7 4,7 5,5 5,5 3,0 3,0 0)')), 4326),  
  4326::int4,  
  2::int2,  
  2::int2,  
  0.9::float4  
)));  
  
  
NOTICE:  0.242857142857143  
NOTICE:  0.818181818181818  
                                                        st_astext                                                          
-------------------------------------------------------------------------------------------------------------------------  
 POLYGON((2.33333333333333 2.5,1 2.5,1 1.66666666666667,0 1.66666666666667,0 3,2.33333333333333 3,2.33333333333333 2.5))  
 POLYGON((1 1.66666666666667,1 0,0 0,0 1.66666666666667,1 1.66666666666667))  
 POLYGON((2.33333333333333 3,4.66666666666667 3,4.66666666666667 2.5,2.33333333333333 2.5,2.33333333333333 3))  
 POLYGON((4.66666666666667 3,5 3,5 3.33333333333333,6 3.33333333333333,6 2.5,4.66666666666667 2.5,4.66666666666667 3))  
 POLYGON((5 3.33333333333333,5 5,7 5,7 4,6 4,6 3.33333333333333,5 3.33333333333333))  
(5 rows)

Before the split, the original geometry accounts for an area proportion of 24%. After the split, the new geometries account for an area proportion of 82%, which significantly reduces the number of data blocks to be scanned.

2) If two horizontal cuts and two vertical cuts are made, we can expect at most nine geometries after the split.

Before the split, the original geometry accounts for an area proportion of 24%. After the split, the new geometries account for an area proportion of 98%, even better.

postgres=# select st_astext(  
unnest(  
split_geo(  
  st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(0 0,1 0,1 2.5,6 2.5,6 4,7 4,7 5,5 5,5 3,0 3,0 0)')), 4326),  
  4326::int4,  
  4::int2,  
  4::int2,  
  0.9::float4  
)));  
  
  
NOTICE:  0.242857142857143  
NOTICE:  0.977011494252874  
                     st_astext                        
----------------------------------------------------  
 POLYGON((1.4 2.5,1 2.5,1 2,0 2,0 3,1.4 3,1.4 2.5))  
 POLYGON((1 2,1 1,0 1,0 2,1 2))  
 POLYGON((1 1,1 0,0 0,0 1,1 1))  
 POLYGON((1.4 3,2.8 3,2.8 2.5,1.4 2.5,1.4 3))  
 POLYGON((2.8 3,4.2 3,4.2 2.5,2.8 2.5,2.8 3))  
 POLYGON((4.2 3,5 3,5.6 3,5.6 2.5,4.2 2.5,4.2 3))  
 POLYGON((5 3,5 4,5.6 4,5.6 3,5 3))  
 POLYGON((5 4,5 5,5.6 5,5.6 4,5 4))  
 POLYGON((5.6 5,7 5,7 4,6 4,5.6 4,5.6 5))  
 POLYGON((6 4,6 3,5.6 3,5.6 4,6 4))  
 POLYGON((6 3,6 2.5,5.6 2.5,5.6 3,6 3))  
(11 rows)

Performance Comparison Before and After Optimization

Before the geometry split, 1,648 data blocks are scanned, and 26,590 invalid data records are filtered out.

explain (analyze,verbose,timing,costs,buffers)  
select * from f where st_within(pos, st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(0 0,1 0,1 2.5,6 2.5,6 4,7 4,7 5,5 5,5 3,0 3,0 0)')), 4326));  
  
  
 Index Scan using idx_f on public.f  (cost=0.42..15026.72 rows=3333 width=40) (actual time=1.519..35.773 rows=8491 loops=1)  
   Output: id, pos  
   Index Cond: ('0103000020E6100000010000000B00000000000000000000000000000000000000000000000000F03F0000000000000000000000000000F03F000000000000044000000000000018400000000000000440000000000000184000000000000010400000000000001C4000000000000010400000000000001C40000000000000144000000000000014400000000000001440000000000000144000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry ~ f.pos)  
   Filter: _st_contains('0103000020E6100000010000000B00000000000000000000000000000000000000000000000000F03F0000000000000000000000000000F03F000000000000044000000000000018400000000000000440000000000000184000000000000010400000000000001C4000000000000010400000000000001C40000000000000144000000000000014400000000000001440000000000000144000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry, f.pos)  
   Rows Removed by Filter: 26590  
   Buffers: shared hit=1648  
 Planning time: 0.274 ms  
 Execution time: 36.212 ms

After the geometry split, 610 data blocks are scanned, and 1,932 invalid data records are filtered out.

explain (analyze,verbose,timing,costs,buffers)  
select * from f where st_within(pos, st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(2.33333333333333 2.5,1 2.5,1 1.66666666666667,0 1.66666666666667,0 3,2.33333333333333 3,2.33333333333333 2.5)')), 4326))  
union all  
select * from f where st_within(pos, st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(1 1.66666666666667,1 0,0 0,0 1.66666666666667,1 1.66666666666667)')), 4326))  
union all  
select * from f where st_within(pos, st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(2.33333333333333 3,4.66666666666667 3,4.66666666666667 2.5,2.33333333333333 2.5,2.33333333333333 3)')), 4326))  
union all  
select * from f where st_within(pos, st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(4.66666666666667 3,5 3,5 3.33333333333333,6 3.33333333333333,6 2.5,4.66666666666667 2.5,4.66666666666667 3)')), 4326))  
union all  
select * from f where st_within(pos, st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(5 3.33333333333333,5 5,7 5,7 4,6 4,6 3.33333333333333,5 3.33333333333333)')), 4326))  
;  
  
  
 Append  (cost=0.42..75300.24 rows=16665 width=40) (actual time=0.113..11.690 rows=8491 loops=1)  
   Buffers: shared hit=610  
   ->  Index Scan using idx_f on public.f  (cost=0.42..15026.72 rows=3333 width=40) (actual time=0.113..3.365 rows=2053 loops=1)  
         Output: f.id, f.pos  
         Index Cond: ('0103000020E61000000100000007000000A3AAAAAAAAAA02400000000000000440000000000000F03F0000000000000440000000000000F03FBAAAAAAAAAAAFA3F0000000000000000BAAAAAAAAAAAFA3F00000000000000000000000000000840A3AAAAAAAAAA02400000000000000840A3AAAAAAAAAA02400000000000000440'::geometry ~ f.pos)  
         Filter: _st_contains('0103000020E61000000100000007000000A3AAAAAAAAAA02400000000000000440000000000000F03F0000000000000440000000000000F03FBAAAAAAAAAAAFA3F0000000000000000BAAAAAAAAAAAFA3F00000000000000000000000000000840A3AAAAAAAAAA02400000000000000840A3AAAAAAAAAA02400000000000000440'::geometry, f.pos)  
         Rows Removed by Filter: 1142  
         Buffers: shared hit=189  
   ->  Index Scan using idx_f on public.f f_1  (cost=0.42..15026.72 rows=3333 width=40) (actual time=0.084..1.734 rows=1699 loops=1)  
         Output: f_1.id, f_1.pos  
         Index Cond: ('0103000020E61000000100000005000000000000000000F03FBAAAAAAAAAAAFA3F000000000000F03F0000000000000000000000000000000000000000000000000000000000000000BAAAAAAAAAAAFA3F000000000000F03FBAAAAAAAAAAAFA3F'::geometry ~ f_1.pos)  
         Filter: _st_contains('0103000020E61000000100000005000000000000000000F03FBAAAAAAAAAAAFA3F000000000000F03F0000000000000000000000000000000000000000000000000000000000000000BAAAAAAAAAAAFA3F000000000000F03FBAAAAAAAAAAAFA3F'::geometry, f_1.pos)  
         Buffers: shared hit=92  
   ->  Index Scan using idx_f on public.f f_2  (cost=0.42..15026.72 rows=3333 width=40) (actual time=0.075..1.283 rows=1158 loops=1)  
         Output: f_2.id, f_2.pos  
         Index Cond: ('0103000020E61000000100000005000000A3AAAAAAAAAA02400000000000000840AEAAAAAAAAAA12400000000000000840AEAAAAAAAAAA12400000000000000440A3AAAAAAAAAA02400000000000000440A3AAAAAAAAAA02400000000000000840'::geometry ~ f_2.pos)  
         Filter: _st_contains('0103000020E61000000100000005000000A3AAAAAAAAAA02400000000000000840AEAAAAAAAAAA12400000000000000840AEAAAAAAAAAA12400000000000000440A3AAAAAAAAAA02400000000000000440A3AAAAAAAAAA02400000000000000840'::geometry, f_2.pos)  
         Buffers: shared hit=74  
   ->  Index Scan using idx_f on public.f f_3  (cost=0.42..15026.72 rows=3333 width=40) (actual time=0.095..1.214 rows=986 loops=1)  
         Output: f_3.id, f_3.pos  
         Index Cond: ('0103000020E61000000100000007000000AEAAAAAAAAAA12400000000000000840000000000000144000000000000008400000000000001440A3AAAAAAAAAA0A400000000000001840A3AAAAAAAAAA0A4000000000000018400000000000000440AEAAAAAAAAAA12400000000000000440AEAAAAAAAAAA12400000000000000840'::geometry ~ f_3.pos)  
         Filter: _st_contains('0103000020E61000000100000007000000AEAAAAAAAAAA12400000000000000840000000000000144000000000000008400000000000001440A3AAAAAAAAAA0A400000000000001840A3AAAAAAAAAA0A4000000000000018400000000000000440AEAAAAAAAAAA12400000000000000440AEAAAAAAAAAA12400000000000000840'::geometry, f_3.pos)  
         Rows Removed by Filter: 127  
         Buffers: shared hit=68  
   ->  Index Scan using idx_f on public.f f_4  (cost=0.42..15026.72 rows=3333 width=40) (actual time=0.105..3.331 rows=2595 loops=1)  
         Output: f_4.id, f_4.pos  
         Index Cond: ('0103000020E610000001000000070000000000000000001440A3AAAAAAAAAA0A40000000000000144000000000000014400000000000001C4000000000000014400000000000001C400000000000001040000000000000184000000000000010400000000000001840A3AAAAAAAAAA0A400000000000001440A3AAAAAAAAAA0A40'::geometry ~ f_4.pos)  
         Filter: _st_contains('0103000020E610000001000000070000000000000000001440A3AAAAAAAAAA0A40000000000000144000000000000014400000000000001C4000000000000014400000000000001C400000000000001040000000000000184000000000000010400000000000001840A3AAAAAAAAAA0A400000000000001440A3AAAAAAAAAA0A40'::geometry, f_4.pos)  
         Rows Removed by Filter: 663  
         Buffers: shared hit=187  
 Planning time: 0.397 ms  
 Execution time: 12.150 ms  
(32 rows)

PostgreSQL supports the ANY, SOME, and ALL (array) operators. Therefore, we can simplify the code to further reduce the number of data blocks to be scanned to 278 and the invalid data records to 1,932.

explain (analyze,verbose,timing,costs,buffers) select * from f where pos @ any(split_geo(  
  st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(0 0,1 0,1 2.5,6 2.5,6 4,7 4,7 5,5 5,5 3,0 3,0 0)')), 4326),  
  4326::int4,  
  2::int2,  
  2::int2,  
  0.9::float4  
))
and
st_within(pos, st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(0 0,1 0,1 2.5,6 2.5,6 4,7 4,7 5,5 5,5 3,0 3,0 0)')), 4326));

NOTICE:  0.242857142857143
NOTICE:  0.818181818181818


 Bitmap Heap Scan on public.f  (cost=9.09..87.16 rows=17 width=40) (actual time=2.270..10.578 rows=8491 loops=1)
   Output: id, pos
   Recheck Cond: ((f.pos @ ANY ('{0103000020E61000000100000007000000ABAAAAAAAAAA02400000000000000440000000000000F03F0000000000000440000000000000F03FABAAAAAAAAAAFA3F0000000000000000ABAAAAAAAAAAFA3F00000000000000000000000000000840ABAAAAAAAAAA02400000000000000840ABAAAAAAAAAA02400000000000000440:0103000020E61000000100000005000000000000000000F03FABAAAAAAAAAAFA3F000000000000F03F0000000000000000000000000000000000000000000000000000000000000000ABAAAAAAAAAAFA3F000000000000F03FABAAAAAAAAAAFA3F:0103000020E61000000100000005000000ABAAAAAAAAAA02400000000000000840ABAAAAAAAAAA12400000000000000840ABAAAAAAAAAA12400000000000000440ABAAAAAAAAAA02400000000000000440ABAAAAAAAAAA02400000000000000840:0103000020E61000000100000007000000ABAAAAAAAAAA12400000000000000840000000000000144000000000000008400000000000001440ABAAAAAAAAAA0A400000000000001840ABAAAAAAAAAA0A4000000000000018400000000000000440ABAAAAAAAAAA12400000000000000440ABAAAAAAAAAA12400000000000000840:0103000020E610000001000000070000000000000000001440ABAAAAAAAAAA0A40000000000000144000000000000014400000000000001C4000000000000014400000000000001C400000000000001040000000000000184000000000000010400000000000001840ABAAAAAAAAAA0A400000000000001440ABAAAAAAAAAA0A40}'::geometry[])) AND ('0103000020E6100000010000000B00000000000000000000000000000000000000000000000000F03F0000000000000000000000000000F03F000000000000044000000000000018400000000000000440000000000000184000000000000010400000000000001C4000000000000010400000000000001C40000000000000144000000000000014400000000000001440000000000000144000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry ~ f.pos))
   Filter: _st_contains('0103000020E6100000010000000B00000000000000000000000000000000000000000000000000F03F0000000000000000000000000000F03F000000000000044000000000000018400000000000000440000000000000184000000000000010400000000000001C4000000000000010400000000000001C40000000000000144000000000000014400000000000001440000000000000144000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry, f.pos)
   Rows Removed by Filter: 1932
   Heap Blocks: exact=133
   Buffers: shared hit=278
   ->  Bitmap Index Scan on idx_f  (cost=0.00..9.09 rows=50 width=0) (actual time=2.244..2.244 rows=10423 loops=1)
         Index Cond: ((f.pos @ ANY ('{0103000020E61000000100000007000000ABAAAAAAAAAA02400000000000000440000000000000F03F0000000000000440000000000000F03FABAAAAAAAAAAFA3F0000000000000000ABAAAAAAAAAAFA3F00000000000000000000000000000840ABAAAAAAAAAA02400000000000000840ABAAAAAAAAAA02400000000000000440:0103000020E61000000100000005000000000000000000F03FABAAAAAAAAAAFA3F000000000000F03F0000000000000000000000000000000000000000000000000000000000000000ABAAAAAAAAAAFA3F000000000000F03FABAAAAAAAAAAFA3F:0103000020E61000000100000005000000ABAAAAAAAAAA02400000000000000840ABAAAAAAAAAA12400000000000000840ABAAAAAAAAAA12400000000000000440ABAAAAAAAAAA02400000000000000440ABAAAAAAAAAA02400000000000000840:0103000020E61000000100000007000000ABAAAAAAAAAA12400000000000000840000000000000144000000000000008400000000000001440ABAAAAAAAAAA0A400000000000001840ABAAAAAAAAAA0A4000000000000018400000000000000440ABAAAAAAAAAA12400000000000000440ABAAAAAAAAAA12400000000000000840:0103000020E610000001000000070000000000000000001440ABAAAAAAAAAA0A40000000000000144000000000000014400000000000001C4000000000000014400000000000001C400000000000001040000000000000184000000000000010400000000000001840ABAAAAAAAAAA0A400000000000001440ABAAAAAAAAAA0A40}'::geometry[])) AND ('0103000020E6100000010000000B00000000000000000000000000000000000000000000000000F03F0000000000000000000000000000F03F000000000000044000000000000018400000000000000440000000000000184000000000000010400000000000001C4000000000000010400000000000001C40000000000000144000000000000014400000000000001440000000000000144000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry ~ f.pos))
         Buffers: shared hit=145
 Planning time: 2.828 ms
 Execution time: 11.024 ms
(12 rows)

ST_Subdivide

ST_Subdivide is also a split function that splits a spatial geometry into a collection of spatial geometries.

Summary

1) After the split, the valid area proportion is increased. This reduces the I/Os and the CPU usage caused by invalid data scanning and significantly improves the search performance.

2) Note that if the geometry is split into too many new ones, the non-leaf nodes of the GiST index created on the bounding boxes of these new geometries may be repeatedly scanned. However, the overhead can be ignored if most of these nodes can be found in the cache.

References

The splitting methods described in this article can be further improved. We encourage you to share your ideas with us.

Community

Use Geometry Split to Optimize PostgreSQL Performance in GiST Index-based Polygon Search

Background

Example of Geometry Split

Logic

Code Sample

Verify Whether the Split is Successful

Performance Comparison Before and After Optimization

ST_Subdivide

Summary

References

Read previous post:

Read next post:

digoal

You may also like

Comments

digoal

Related Products

PolarDB for PostgreSQL

Database for FinTech Solution

Oracle Database Migration Solution

Database Migration Solution