By digoal
The goal of optimization is to minimize computing, IO, filter, recheck, etc. For example, the following query is difficult to optimize at first glance:
select * from tbl where ts between x and y
order by z limit x;
Because ts and z are two fields, index (ts,z) or index (z,ts) cannot be used to completely perform indexing.
First, let’s understand the background. z is pk. z and ts have linear correlation (positive correlation or negative correlation), and both have directionality, such as increasing or decreasing. Therefore, this query can be modified.
select * from tbl where ts between x and y
and z?
order by z limit x;
n records are queried each time, and all records meeting the conditions of ts between x and y
are successively filtered and processed.
Optimization Ideas:
1. Supplement the condition of z as the initial condition, so there is no need to filter too much.
select min(z), max(z) into id1,id2 from tbl where ts between x and y;
select * from tbl where ts between x and y
and z>=id1 and z<=id2
order by z desc limit x;
min(z) into id2
2. Each time a batch of records is found, the id is passed in as a new condition, so there is no need to filter too much every time.
select * from tbl where ts between x and y and z<id2 order by z desc limit x;
min(z) into id2
create table tbl (id int primary key, info text, crt_time timestamp);
insert into tbl select generate_series(1,100000000), md5(random()::text) , clock_timestamp();
Original SQL statement:
postgres=# select min(crt_time), max(crt_time) from tbl;
min | max
----------------------------+----------------------------
2020-07-10 14:04:46.600332 | 2020-07-10 14:08:06.737822
(1 row)
select * from tbl where crt_time between '2020-07-10 14:05:45' and '2020-07-10 14:07:00'
order by id limit 100;
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where crt_time between '2020-07-10 14:05:45' and '2020-07-10 14:07:00'
order by id limit 100;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.57..9.22 rows=100 width=45) (actual time=4003.046..4003.071 rows=100 loops=1)
Output: id, info, crt_time
Buffers: shared hit=288650 read=63193
-> Index Scan using tbl_pkey on public.tbl (cost=0.57..3208769.85 rows=37078282 width=45) (actual time=4003.045..4003.062 rows=100 loops=1)
Output: id, info, crt_time
Filter: ((tbl.crt_time >= '2020-07-10 14:05:45'::timestamp without time zone) AND (tbl.crt_time <= '2020-07-10 14:07:00'::timestamp without time zone))
Rows Removed by Filter: 29130419
Buffers: shared hit=288650 read=63193
Planning Time: 0.093 ms
Execution Time: 4003.096 ms
(10 rows)
-- 大量filter
Optimize the initial SQL:
select min(id),max(id) from tbl where crt_time between '2020-07-10 14:05:45' and '2020-07-10 14:07:00';
min | max
----------+----------
29130420 | 66423729
(1 row)
select * from tbl where crt_time between '2020-07-10 14:05:45' and '2020-07-10 14:07:00'
and id>=29130420 and id<=66423729
order by id desc limit 100;
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where crt_time between '2020-07-10 14:05:45' and '2020-07-10 14:07:00'
and id>=29130420 and id<=66423729
order by id desc limit 100;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.57..10.57 rows=100 width=45) (actual time=0.022..0.051 rows=100 loops=1)
Output: id, info, crt_time
Buffers: shared hit=6
-> Index Scan Backward using tbl_pkey on public.tbl (cost=0.57..1374883.82 rows=13745329 width=45) (actual time=0.021..0.043 rows=100 loops=1)
Output: id, info, crt_time
Index Cond: ((tbl.id >= 29130420) AND (tbl.id <= 66423729))
Filter: ((tbl.crt_time >= '2020-07-10 14:05:45'::timestamp without time zone) AND (tbl.crt_time <= '2020-07-10 14:07:00'::timestamp without time zone))
Buffers: shared hit=6
Planning Time: 0.121 ms
Execution Time: 0.068 ms
(10 rows)
Optimize SQL statements for batch query:
select * from tbl where
crt_time between '2020-07-10 14:05:45' and '2020-07-10 14:07:00'
and id<66423630
order by id desc limit 100;
select * from tbl where
crt_time between '2020-07-10 14:05:45' and '2020-07-10 14:07:00'
and id<66423530
order by id desc limit 100;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.57..9.90 rows=100 width=45) (actual time=0.042..0.074 rows=100 loops=1)
Output: id, info, crt_time
Buffers: shared hit=8
-> Index Scan Backward using tbl_pkey on public.tbl (cost=0.57..2298576.97 rows=24640916 width=45) (actual time=0.041..0.065 rows=100 loops=1)
Output: id, info, crt_time
Index Cond: (tbl.id < 66423530)
Filter: ((tbl.crt_time >= '2020-07-10 14:05:45'::timestamp without time zone) AND (tbl.crt_time <= '2020-07-10 14:07:00'::timestamp without time zone))
Buffers: shared hit=8
Planning Time: 0.107 ms
Execution Time: 0.097 ms
(10 rows)
It has 40000 times better performance.
Use Mixed Storage of Rows and Columns in PostgreSQL Zedstore
ApsaraDB - May 18, 2022
Alibaba Cloud MaxCompute - February 18, 2024
digoal - December 23, 2020
digoal - December 11, 2019
zhuodao - July 30, 2020
digoal - January 19, 2021
Alibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreAn online MPP warehousing service based on the Greenplum Database open source program
Learn MoreAn on-demand database hosting service for PostgreSQL with automated monitoring, backup and disaster recovery capabilities
Learn MoreLeverage cloud-native database solutions dedicated for FinTech.
Learn MoreMore Posts by digoal