By digoal
In e-commerce scenarios, upstream and downstream relationships are common. After a user places an order, both the upstream and downstream vendors generate an order record in their respective systems and exchange the information. Upon receiving the feedback order, the local system caches the order record in a queue; a background process then retrieves orders from the queue and processes them.
The core of this process involves high-speed writing to the queue, sequential extraction of orders from the queue, high-speed processing, and removal of processed order records from the queue.
However, highly concurrent processing can create hotspots, leading to bottlenecks such as lock conflicts, wasted I/O scans and CPU computation, and further I/O and CPU waste caused by dead index entries that are not cleaned up until some time after the orders have been processed.
This article provides optimization methods and a demo of database performance optimization for queue-processing workloads. These optimizations can improve performance by 10 to 20 times.
1. Test environment
MacBook Pro (15-inch, 2018)
2.2 GHz six-core Intel Core i7
32 GB 2400 MHz DDR4
PostgreSQL 15.1
Because the environment is macOS, you may need to raise the open file limit with ulimit.
ulimit -n 1000000
2. Processing queue table for orders written by the upstream
create table t_order_q (
id serial8 primary key, --Auto-increment primary key
order_id uuid unique, --The order number passed from the upstream
cts timestamp not null --The creation time of the order passed from the upstream.
);
-- create index on t_order_q (cts); --If orders are retrieved by order time, create an index on the time field. Orders can also be processed in auto-increment primary key order, in which case the time index is not required.
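If orders need to be consumed in creation-time order rather than ID order, the dequeue queries shown later in this article would order by cts instead of id. A minimal sketch, assuming the commented-out cts index above has been created:
with tmp as
(delete from t_order_q where ctid = (select ctid from t_order_q where pg_try_advisory_xact_lock(id) order by cts limit 1) returning order_id, cts)
insert into t_order_u (order_id,cts,uts,status) select tmp.order_id, tmp.cts, now(), 1 from tmp;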
3. State table of orders taken out and processed
create table t_order_u (
id serial8 primary key, --Auto-increment primary key
order_id uuid unique, --The order number passed from the upstream
cts timestamp not null, --The creation time of the order passed from the upstream
uts timestamp not null, --Order processing time
status int not null --Order processing status marker
);
4. Write 1 million orders into the queue
insert into t_order_q (order_id, cts) select gen_random_uuid(), clock_timestamp() from generate_series(1,1000000);
5. Create a pgbench stress test script that retrieves data from the queue and uses an advisory lock (AdLock) on the queue ID to determine whether it is already being processed. The transaction-level AdLock is released automatically when the transaction completes. AdLock is also commonly used to relieve contention in flash sale scenarios.
vi t.sql
with tmp as
(delete from t_order_q where ctid = (select ctid from t_order_q where pg_try_advisory_xact_lock(id) order by id limit 1) returning order_id, cts)
insert into t_order_u (order_id,cts,uts,status) select tmp.order_id, tmp.cts, now(), 1 from tmp;
Or
begin;
select id as v_id from t_order_q where pg_try_advisory_xact_lock(id) order by id limit 1 \gset
with tmp as (delete from t_order_q where id = :v_id returning order_id, cts)
insert into t_order_u (order_id,cts,uts,status) select tmp.order_id, tmp.cts, now(), 1 from tmp;
end;
Or (use \sleep to simulate the additional application-side processing time after the application receives the order to be processed)
begin;
select id as v_id from t_order_q where pg_try_advisory_xact_lock(id) order by id limit 1 \gset
\sleep 10ms
with tmp as (delete from t_order_q where id = :v_id returning order_id, cts)
insert into t_order_u (order_id,cts,uts,status) select tmp.order_id, tmp.cts, now(), 1 from tmp;
end;
6. Run the stress test with 256 concurrent consumers; on average, each connection handles 3906 transactions.
select 1000000/256.0;
3906.2500000000000
7. Results of stress test
pgbench -M extended -f ./t.sql -n -r -P 1 -c 256 -j 2 -t 3906
transaction type: ./t.sql
scaling factor: 1
query mode: extended
number of clients: 256
number of threads: 2
maximum number of tries: 1
number of transactions per client: 3906
number of transactions actually processed: 999936/999936
number of failed transactions: 0 (0.000%)
latency average = 8.111 ms
latency stddev = 5.376 ms
initial connection time = 429.698 ms
tps = 25379.081141 (without initial connection time)
statement latencies in milliseconds and failures:
8.114 0 with tmp as
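As a sanity check after the run, compare the remaining queue depth with the processed count (a simple verification sketch; these queries are not part of the original test output):
select count(*) from t_order_q; -- orders still waiting in the queue
select count(*) from t_order_u; -- orders already processed
With 999936 transactions processed, 64 of the 1000000 orders should remain in t_order_q.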
1. For comparison, create a pgbench stress test script that retrieves data from the queue using SELECT ... FOR UPDATE row locks instead of AdLock.
vi t1.sql
begin;
select id as vid from t_order_q order by id for update limit 1 \gset
with tmp as
(delete from t_order_q where id = :vid returning order_id, cts)
insert into t_order_u (order_id,cts,uts,status) select tmp.order_id, tmp.cts, now(), 1 from tmp;
end;
2. Results of stress test
pgbench -M extended -f ./t1.sql -n -r -P 1 -c 256 -j 2 -t 3906
TPS is about 1200.
After SKIP LOCKED is added, TPS only reaches about 2500. After reducing concurrency, the performance of SKIP LOCKED can be improved to about 8K TPS.
begin;
select id as vid from t_order_q order by id for update skip locked limit 1 \gset
with tmp as
(delete from t_order_q where id = :vid returning order_id, cts)
insert into t_order_u (order_id,cts,uts,status) select tmp.order_id, tmp.cts, now(), 1 from tmp;
end;
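For example, a lower-concurrency run might look like the following (assuming the SKIP LOCKED variant above is saved as t1.sql; the client count of 64 is illustrative, not from the original test):
pgbench -M extended -f ./t1.sql -n -r -P 1 -c 64 -j 2 -t 15625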
1. Reduce wasted I/O and CPU calculations:
• Under concurrency, order by id limit 1 needs to scan several rows instead of 1 row, because some IDs may already be locked by other sessions' AdLocks. With n concurrent consumers, the wasted pg_try_advisory_xact_lock() CPU operations per round are approximately n + (n-1) + (n-2) + ... + 1, and the wasted I/O is approximately n.
Optimization method
• Use a fixed number of N connections, and have each connection retrieve only its own data shard based on the ID hash modulus, to minimize wasted I/O and CPU computation (see the sketch after this list).
• Alternatively, split the queue table into several partition tables, routing orders to partitions by ID hash modulus when they are written, and assign each partition to a different consumer process. This reduces conflicts and wasted scans and improves concurrency.
2. Increase the frequency of index vacuuming to reduce the CPU and I/O waste caused by checking dead index entries that have not yet been cleaned up. Set autovacuum_work_mem large enough to hold all dead tuple ctids, so that each vacuum needs only one pass over the indexes.
Optimization method
• Configure the following parameters: autovacuum_naptime and autovacuum_work_mem (maintenance_work_mem in older versions); a configuration sketch follows this list.
3. Use parallel vacuum and configure max_parallel_maintenance_workers.
4. Configure vacuum to prefetch blocks, reducing prolonged vacuum runs caused by I/O latency. (Suitable for disks with high single-I/O latency but no throughput bottleneck.)
5. Retrieve and process multiple orders at a time in batches.
6. Use a local NVMe SSD with high IOPS and low single-I/O latency.
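The following is a minimal sketch of the ID-hash-modulus idea from optimization 1; it is not the original test script. Each of N consumers only touches IDs in its own residue class, so concurrent consumers rarely collide. The file name t_shard.sql and the nconsumers/consumer_id variables are hypothetical pgbench variables passed with -D (they could also be derived from pgbench's built-in :client_id).
-- t_shard.sql (hypothetical): start one pgbench per consumer_id (0..N-1), e.g.
-- pgbench -M extended -f ./t_shard.sql -n -c 1 -j 1 -t 3906 -D nconsumers=256 -D consumer_id=0
with tmp as
(delete from t_order_q where ctid = (select ctid from t_order_q where mod(id, :nconsumers) = :consumer_id and pg_try_advisory_xact_lock(id) order by id limit 1) returning order_id, cts)
insert into t_order_u (order_id,cts,uts,status) select tmp.order_id, tmp.cts, now(), 1 from tmp;
A configuration sketch for optimization 2; the values are illustrative, not tuned recommendations:
alter system set autovacuum_naptime = '5s'; -- vacuum more frequently
alter system set autovacuum_work_mem = '1GB'; -- hold all dead-tuple ctids in a single pass
alter system set max_parallel_maintenance_workers = 4; -- allow parallel vacuum workers
select pg_reload_conf(); -- reload so the new settings take effect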