By digoal
In e-commerce scenarios, upstream and downstream relationships are common. After a user places an order, both the upstream and downstream vendors generate an order record in their respective systems and exchange the information. Upon receiving the feedback order, the local system caches the order record in a queue; a background process then retrieves orders from the queue and processes them.
The core of this process involves high-speed writing to the queue, sequential extraction of orders from the queue, high-speed processing, and removal of processed order records from the queue.
However, highly concurrent processing can create hotspots, leading to bottlenecks such as lock conflicts, wasted I/O scans and CPU computation, and further I/O and CPU waste caused by dead index entries that are not cleaned up until some time after the orders have been processed.
This article provides optimization methods and a demo of database performance optimization for queue-processing workloads. These optimizations can improve performance by 10 to 20 times.
1. Test environment
MacBook Pro (15-inch, 2018)
2.2 GHz six-core Intel Core i7
32 GB 2400 MHz DDR4
PostgreSQL 15.1
Because the environment is macOS, you may need to raise the open file limit with ulimit.
ulimit -n 1000000
2. Processing queue table for orders written by the upstream
create table t_order_q (
id serial8 primary key, --Auto-increment primary key
order_id uuid unique, --The order number passed from the upstream
cts timestamp not null --The creation time of the order passed from the upstream.
);
-- create index on t_order_q (cts); --If orders are retrieved by order time, create an index on the time field. Orders can also be processed in auto-increment primary key order, in which case the time index is not required.
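If orders need to be consumed in creation-time order rather than ID order, the dequeue queries shown later in this article would order by cts instead of id. A minimal sketch, assuming the commented-out cts index above has been created:
with tmp as
(delete from t_order_q where ctid = (select ctid from t_order_q where pg_try_advisory_xact_lock(id) order by cts limit 1) returning order_id, cts)
insert into t_order_u (order_id,cts,uts,status) select tmp.order_id, tmp.cts, now(), 1 from tmp;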
3. State table of orders taken out and processed
create table t_order_u (
id serial8 primary key, --Auto-increment primary key
order_id uuid unique, --The order number passed from the upstream
cts timestamp not null, --The creation time of the order passed from the upstream
uts timestamp not null, --Order processing time
status int not null --Order processing status marker
);
4. Write 1 million orders into the queue
insert into t_order_q (order_id, cts) select gen_random_uuid(), clock_timestamp() from generate_series(1,1000000);
5. Create a pgbench stress test script that retrieves data from the queue and uses an advisory lock (AdLock) on the queue ID to determine whether it is already being processed. The transaction-level AdLock is released automatically when the transaction completes. AdLock is also commonly used to relieve contention in flash sale scenarios.
vi t.sql
with tmp as
(delete from t_order_q where ctid = (select ctid from t_order_q where pg_try_advisory_xact_lock(id) order by id limit 1) returning order_id, cts)
insert into t_order_u (order_id,cts,uts,status) select tmp.order_id, tmp.cts, now(), 1 from tmp;
Or
begin;
select id as v_id from t_order_q where pg_try_advisory_xact_lock(id) order by id limit 1 \gset
with tmp as (delete from t_order_q where id = :v_id returning order_id, cts)
insert into t_order_u (order_id,cts,uts,status) select tmp.order_id, tmp.cts, now(), 1 from tmp;
end;
Or (use \sleep to simulate the additional application-side processing time after the application receives the order to be processed)
begin;
select id as v_id from t_order_q where pg_try_advisory_xact_lock(id) order by id limit 1 \gset
\sleep 10ms
with tmp as (delete from t_order_q where id = :v_id returning order_id, cts)
insert into t_order_u (order_id,cts,uts,status) select tmp.order_id, tmp.cts, now(), 1 from tmp;
end;
6. Run the stress test with 256 concurrent consumers; on average, each connection handles 3906 transactions.
select 1000000/256.0;
3906.2500000000000
7. Results of stress test
pgbench -M extended -f ./t.sql -n -r -P 1 -c 256 -j 2 -t 3906
transaction type: ./t.sql
scaling factor: 1
query mode: extended
number of clients: 256
number of threads: 2
maximum number of tries: 1
number of transactions per client: 3906
number of transactions actually processed: 999936/999936
number of failed transactions: 0 (0.000%)
latency average = 8.111 ms
latency stddev = 5.376 ms
initial connection time = 429.698 ms
tps = 25379.081141 (without initial connection time)
statement latencies in milliseconds and failures:
8.114 0 with tmp as
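As a sanity check after the run, compare the remaining queue depth with the processed count (a simple verification sketch; these queries are not part of the original test output):
select count(*) from t_order_q; -- orders still waiting in the queue
select count(*) from t_order_u; -- orders already processed
With 999936 transactions processed, 64 of the 1000000 orders should remain in t_order_q.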
1. For comparison, create a pgbench stress test script that retrieves data from the queue using SELECT ... FOR UPDATE row locks instead of AdLock.
vi t1.sql
begin;
select id as vid from t_order_q order by id for update limit 1 \gset
with tmp as
(delete from t_order_q where id = :vid returning order_id, cts)
insert into t_order_u (order_id,cts,uts,status) select tmp.order_id, tmp.cts, now(), 1 from tmp;
end;
2. Results of stress test
pgbench -M extended -f ./t1.sql -n -r -P 1 -c 256 -j 2 -t 3906
TPS is about 1200.
After SKIP LOCKED is added, TPS only reaches about 2500. After reducing concurrency, the performance of SKIP LOCKED can be improved to about 8K TPS.
begin;
select id as vid from t_order_q order by id for update skip locked limit 1 \gset
with tmp as
(delete from t_order_q where id = :vid returning order_id, cts)
insert into t_order_u (order_id,cts,uts,status) select tmp.order_id, tmp.cts, now(), 1 from tmp;
end;
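For example, a lower-concurrency run might look like the following (assuming the SKIP LOCKED variant above is saved as t1.sql; the client count of 64 is illustrative, not from the original test):
pgbench -M extended -f ./t1.sql -n -r -P 1 -c 64 -j 2 -t 15625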
1. Reduce wasted I/O and CPU calculations:
• Under concurrency, order by id limit 1 needs to scan several rows instead of 1 row, because some IDs may already be locked by other sessions' AdLocks. With n concurrent consumers, the wasted pg_try_advisory_xact_lock() CPU operations per round are approximately n + (n-1) + (n-2) + ... + 1, and the wasted I/O is approximately n.
Optimization method
• Use a fixed number of N connections, and have each connection retrieve only its own data shard based on the ID hash modulus, to minimize wasted I/O and CPU computation (see the sketch after this list).
• Alternatively, split the queue table into several partition tables, routing orders to partitions by ID hash modulus when they are written, and assign each partition to a different consumer process. This reduces conflicts and wasted scans and improves concurrency.
2. Increase the frequency of index vacuuming to reduce the CPU and I/O waste caused by checking dead index entries that have not yet been cleaned up. Set autovacuum_work_mem large enough to hold all dead tuple ctids, so that each vacuum needs only one pass over the indexes.
Optimization method
• Configure the following parameters: autovacuum_naptime and autovacuum_work_mem (maintenance_work_mem in older versions); a configuration sketch follows this list.
3. Use parallel vacuum and configure max_parallel_maintenance_workers.
4. Configure vacuum to prefetch blocks, reducing prolonged vacuum runs caused by I/O latency. (Suitable for disks with high single-I/O latency but no throughput bottleneck.)
5. Retrieve and process multiple orders at a time in batches.
6. Use a local NVMe SSD with high IOPS and low single-I/O latency.
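The following is a minimal sketch of the ID-hash-modulus idea from optimization 1; it is not the original test script. Each of N consumers only touches IDs in its own residue class, so concurrent consumers rarely collide. The file name t_shard.sql and the nconsumers/consumer_id variables are hypothetical pgbench variables passed with -D (they could also be derived from pgbench's built-in :client_id).
-- t_shard.sql (hypothetical): start one pgbench per consumer_id (0..N-1), e.g.
-- pgbench -M extended -f ./t_shard.sql -n -c 1 -j 1 -t 3906 -D nconsumers=256 -D consumer_id=0
with tmp as
(delete from t_order_q where ctid = (select ctid from t_order_q where mod(id, :nconsumers) = :consumer_id and pg_try_advisory_xact_lock(id) order by id limit 1) returning order_id, cts)
insert into t_order_u (order_id,cts,uts,status) select tmp.order_id, tmp.cts, now(), 1 from tmp;
A configuration sketch for optimization 2; the values are illustrative, not tuned recommendations:
alter system set autovacuum_naptime = '5s'; -- vacuum more frequently
alter system set autovacuum_work_mem = '1GB'; -- hold all dead-tuple ctids in a single pass
alter system set max_parallel_maintenance_workers = 4; -- allow parallel vacuum workers
select pg_reload_conf(); -- reload so the new settings take effect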