By digoal
Zedstore is an open-source storage plug-in developed based on the AM interfaces of PostgreSQL. It is an open-source plug-in developed by pivotal, a major commercial PG company. Its source code is available via GitHub.
Click the following links to see the design documents:
Why is the zedstore a string-column hybrid storage? Some columns that do not need to be counted can be packed together for row storage to improve kv retrieval efficiency. The following example shows how to split and store columns for statistics. This helps reduce the I/O throughput of statistical queries and increases the compression ratio.
What business scenario problems does zedstore solve?
The first two are the common problems of the heap engine for PG. The third is the problem that PG must solve as a high-performance OLTP + OLAP hybrid database.
If zedstore is released in a stable manner, it can be a superb storage engine for PG.
git clone https://github.com/greenplum-db/postgres/
git branch -a
git checkout -b remotes/origin/zedstore
useradd digoal
mkdir /data01/digoal
chown digoal:digoal /data01/digoal
cp -r postgres /home/digoal/
chown -R digoal:digoal /home/digoal/postgres
Install lz4. Otherwise, zedstore will be slow:
yum install -y lz4-devel
su - digoal
cd postgres
./configure --prefix=/home/digoal/pg_zedstore --with-lz4
make world -j 4
make install-world
vi ~/.bash_profile
# 追加
export PS1="$USER@`/bin/hostname -s`-> "
export PGPORT=1921
export PGDATA=/data01/digoal/pg_root
export LANG=en_US.utf8
export PGHOME=/home/digoal/pg_zedstore
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH
export DATE=`date +"%Y%m%d%H%M"`
export PATH=$PGHOME/bin:$PATH:.
export MANPATH=$PGHOME/share/man:$MANPATH
export PGHOST=$PGDATA
export PGUSER=postgres
export PGDATABASE=postgres
alias rm='rm -i'
alias ll='ls -lh'
unalias vi
initdb -D $PGDATA -U postgres -E UTF8 --lc-collate=C --lc-ctype=en_US.utf8
vi postgresql.conf
listen_addresses = '0.0.0.0'
port = 1921
max_connections = 300
unix_socket_directories = '/tmp, .'
unix_socket_permissions = 0700
tcp_keepalives_idle = 60
tcp_keepalives_interval = 10
tcp_keepalives_count = 10
shared_buffers = 8GB
maintenance_work_mem = 1GB
dynamic_shared_memory_type = posix
vacuum_cost_delay = 0
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 10.0
synchronous_commit = off
wal_compression = on
wal_writer_delay = 10ms
max_wal_size = 24GB
min_wal_size = 8GB
random_page_cost = 1.1
log_destination = 'csvlog'
logging_collector = on
log_checkpoints = on
log_timezone = 'Asia/Shanghai'
log_autovacuum_min_duration = 0
autovacuum_freeze_max_age = 1200000000
Weight = 1400000000
autovacuum_vacuum_cost_delay = 0ms
vacuum_freeze_table_age = 1150000000
Price = 1150000000
datestyle = 'iso, mdy'
timezone = 'Asia/Shanghai'
lc_messages = 'en_US.utf8'
lc_monetary = 'en_US.utf8'
lc_numeric = 'en_US.utf8'
lc_time = 'en_US.utf8'
default_text_search_config = 'pg_catalog.english'
pg_ctl start
Test the data:
postgres=# do language plpgsql $$
declare
sql text;
begin
sql := 'create table t_heap (id int8,';
for i in 1..10 loop
sql := sql||'int'||i||' int,';
end loop;
for i in 1..10 loop
sql := sql||'text'||i||' text,';
end loop;
sql := sql||'crt_time timestamp) using heap';
execute sql;
end;
$$;
postgres=# do language plpgsql $$
declare
sql text;
begin
sql := 'create table t_zedstore (id int8,';
for i in 1..10 loop
sql := sql||'int'||i||' int,';
end loop;
for i in 1..10 loop
sql := sql||'text'||i||' text,';
end loop;
sql := sql||'crt_time timestamp) using zedstore';
execute sql;
end;
$$;
postgres=# \d+ t_heap
Table "public.t_heap"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
----------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
id | bigint | | | | plain | |
int1 | integer | | | | plain | |
int2 | integer | | | | plain | |
int3 | integer | | | | plain | |
int4 | integer | | | | plain | |
int5 | integer | | | | plain | |
int6 | integer | | | | plain | |
int7 | integer | | | | plain | |
int8 | integer | | | | plain | |
int9 | integer | | | | plain | |
int10 | integer | | | | plain | |
text1 | text | | | | extended | |
text2 | text | | | | extended | |
text3 | text | | | | extended | |
text4 | text | | | | extended | |
text5 | text | | | | extended | |
text6 | text | | | | extended | |
text7 | text | | | | extended | |
text8 | text | | | | extended | |
text9 | text | | | | extended | |
text10 | text | | | | extended | |
crt_time | timestamp without time zone | | | | plain | |
Access method: heap
postgres=# \d+ t_zedstore
Table "public.t_zedstore"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
----------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
id | bigint | | | | plain | |
int1 | integer | | | | plain | |
int2 | integer | | | | plain | |
int3 | integer | | | | plain | |
int4 | integer | | | | plain | |
int5 | integer | | | | plain | |
int6 | integer | | | | plain | |
int7 | integer | | | | plain | |
int8 | integer | | | | plain | |
int9 | integer | | | | plain | |
int10 | integer | | | | plain | |
text1 | text | | | | extended | |
text2 | text | | | | extended | |
text3 | text | | | | extended | |
text4 | text | | | | extended | |
text5 | text | | | | extended | |
text6 | text | | | | extended | |
text7 | text | | | | extended | |
text8 | text | | | | extended | |
text9 | text | | | | extended | |
text10 | text | | | | extended | |
crt_time | timestamp without time zone | | | | plain | |
Access method: zedstore
\timing
insert into t_heap
select
id,
random()*10,
random()*100,
random()*1000,
random()*10000,
random()*100000,
random()*10,
random()*100,
random()*1000,
random()*10000,
random()*100000,
md5((random()*10)::int::text),
md5((random()*100)::int::text),
md5((random()*1000)::int::text),
md5((random()*10000)::int::text),
md5((random()*10)::int::text),
md5((random()*100)::int::text),
md5((random()*1000)::int::text),
md5((random()*10000)::int::text),
md5((random()*100)::int::text),
md5((random()*100)::int::text),
clock_timestamp()
from
generate_series(1,10000000) t(id);
INSERT 0 10000000
Time: 55137.382 ms (00:55.137)
checkpoint;
insert into t_zedstore
select
id,
random()*10,
random()*100,
random()*1000,
random()*10000,
random()*100000,
random()*10,
random()*100,
random()*1000,
random()*10000,
random()*100000,
md5((random()*10)::int::text),
md5((random()*100)::int::text),
md5((random()*1000)::int::text),
md5((random()*10000)::int::text),
md5((random()*10)::int::text),
md5((random()*100)::int::text),
md5((random()*1000)::int::text),
md5((random()*10000)::int::text),
md5((random()*100)::int::text),
md5((random()*100)::int::text),
clock_timestamp()
from
generate_series(1,10000000) t(id);
INSERT 0 10000000
Time: 156857.767 ms (02:36.858)
postgres=# \dt+
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
--------+------------+-------+----------+-------------+---------+-------------
public | t_heap | table | postgres | permanent | 4113 MB |
public | t_zedstore | table | postgres | permanent | 1941 MB |
(2 rows)
Statistical analysis test:
postgres=# explain select int1,count(*) from t_heap group by 1;
QUERY PLAN
-----------------------------------------------------------------------
HashAggregate (cost=676315.01..676315.12 rows=11 width=12)
Group Key: int1
-> Seq Scan on t_heap (cost=0.00..626315.34 rows=9999934 width=4)
(3 rows)
Time: 0.405 ms
postgres=# explain select int1,count(*) from t_zedstore group by 1;
QUERY PLAN
---------------------------------------------------------------------------
HashAggregate (cost=151860.92..151861.03 rows=11 width=12)
Group Key: int1
-> Seq Scan on t_zedstore (cost=0.00..101861.95 rows=9999795 width=4)
(3 rows)
Time: 0.155 ms
postgres=# select int1,count(*) from t_zedstore group by 1;
int1 | count
------+---------
5 | 999956
4 | 1000395
10 | 501530
6 | 998901
2 | 999192
7 | 1000936
1 | 999489
8 | 998940
9 | 1000325
3 | 1000801
0 | 499535
(11 rows)
Time: 1343.633 ms (00:01.344)
postgres=# select text1,count(*) from t_zedstore group by 1;
text1 | count
----------------------------------+---------
c4ca4238a0b923820dcc509a6f75849b | 1001059
8f14e45fceea167a5a36dedd4bea2543 | 999705
d3d9446802a44259755d38e6d163e820 | 499813
1679091c5a880faf6fb5e6087eb1b2dc | 1001648
e4da3b7fbbce2345d7772b0674a318d5 | 999392
eccbc87e4b5ce2fe28308fd9f2a7baf3 | 999374
cfcd208495d565ef66e7dff9f98764da | 501183
c9f0f895fb98ab9159f51fd0297e236d | 1001479
45c48cce2e2d7fbdea1afc51c7c6ad26 | 998835
a87ff679a2f3e71d9181a67b7542122c | 999037
c81e728d9d4c2f636f067f89cc14862c | 998475
(11 rows)
Time: 1863.111 ms (00:01.863)
postgres=# select int1,count(*) from t_heap group by 1;
int1 | count
------+---------
5 | 1000662
4 | 999872
10 | 499901
6 | 1000942
2 | 1000378
7 | 999201
1 | 1001494
8 | 998922
9 | 998764
3 | 999863
0 | 500001
(11 rows)
Time: 2655.144 ms (00:02.655)
postgres=# select text1,count(*) from t_heap group by 1;
text1 | count
----------------------------------+---------
c4ca4238a0b923820dcc509a6f75849b | 999925
8f14e45fceea167a5a36dedd4bea2543 | 1000277
d3d9446802a44259755d38e6d163e820 | 497869
1679091c5a880faf6fb5e6087eb1b2dc | 1000732
e4da3b7fbbce2345d7772b0674a318d5 | 999516
eccbc87e4b5ce2fe28308fd9f2a7baf3 | 1002371
cfcd208495d565ef66e7dff9f98764da | 500511
c9f0f895fb98ab9159f51fd0297e236d | 996678
45c48cce2e2d7fbdea1afc51c7c6ad26 | 1000367
a87ff679a2f3e71d9181a67b7542122c | 1001158
c81e728d9d4c2f636f067f89cc14862c | 1000596
(11 rows)
Time: 3062.112 ms (00:03.062)
alter table t_heap add constraint pk primary key(id);
alter table t_zedstore add constraint pk1 primary key(id);
postgres=# \di+
List of relations
Schema | Name | Type | Owner | Table | Persistence | Size | Description
--------+------+-------+----------+------------+-------------+--------+-------------
public | pk | index | postgres | t_heap | permanent | 214 MB |
public | pk1 | index | postgres | t_zedstore | permanent | 214 MB |
(2 rows)
Large width indicates fine KV query test:
vi test1.sql
\set id random(1,10000000)
select * from t_heap where id=:id;
select * from t_zedstore where id=:id;
pgbench -M prepared -n -r -P 1 -f ./test1.sql -c 52 -j 52 -T 120
transaction type: ./test1.sql
scaling factor: 1
query mode: prepared
number of clients: 52
number of threads: 52
duration: 120 s
number of transactions actually processed: 6932745
latency average = 0.900 ms
latency stddev = 0.176 ms
tps = 57772.190636 (including connections establishing)
tps = 57774.347762 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set id random(1,10000000)
0.061 select * from t_heap where id=:id;
0.838 select * from t_zedstore where id=:id;
High Concurrency update test:
vi test2.sql
\set id random(1,100000000)
insert into t_heap values (:id,
random()*10,
random()*100,
random()*1000,
random()*10000,
random()*100000,
random()*10,
random()*100,
random()*1000,
random()*10000,
random()*100000,
md5((random()*10)::int::text),
md5((random()*100)::int::text),
md5((random()*1000)::int::text),
md5((random()*10000)::int::text),
md5((random()*10)::int::text),
md5((random()*100)::int::text),
md5((random()*1000)::int::text),
md5((random()*10000)::int::text),
md5((random()*100)::int::text),
md5((random()*100)::int::text),
clock_timestamp()
) on conflict (id)
do update set
int1=excluded.int1,
int2=excluded.int2,
int3=excluded.int3,
int4=excluded.int4,
int5=excluded.int5,
int6=excluded.int6,
int7=excluded.int7,
int8=excluded.int8,
int9=excluded.int9,
int10=excluded.int10,
text1=excluded.text1,
text2=excluded.text2,
text3=excluded.text3,
text4=excluded.text4,
text5=excluded.text5,
text6=excluded.text6,
text7=excluded.text7,
text8=excluded.text8,
text9=excluded.text9,
text10=excluded.text10,
crt_time=excluded.crt_time;
insert into t_zedstore values (:id,
random()*10,
random()*100,
random()*1000,
random()*10000,
random()*100000,
random()*10,
random()*100,
random()*1000,
random()*10000,
random()*100000,
md5((random()*10)::int::text),
md5((random()*100)::int::text),
md5((random()*1000)::int::text),
md5((random()*10000)::int::text),
md5((random()*10)::int::text),
md5((random()*100)::int::text),
md5((random()*1000)::int::text),
md5((random()*10000)::int::text),
md5((random()*100)::int::text),
md5((random()*100)::int::text),
clock_timestamp()
) on conflict (id)
do update set
int1=excluded.int1,
int2=excluded.int2,
int3=excluded.int3,
int4=excluded.int4,
int5=excluded.int5,
int6=excluded.int6,
int7=excluded.int7,
int8=excluded.int8,
int9=excluded.int9,
int10=excluded.int10,
text1=excluded.text1,
text2=excluded.text2,
text3=excluded.text3,
text4=excluded.text4,
text5=excluded.text5,
text6=excluded.text6,
text7=excluded.text7,
text8=excluded.text8,
text9=excluded.text9,
text10=excluded.text10,
crt_time=excluded.crt_time;
pgbench -M prepared -n -r -P 1 -f ./test2.sql -c 52 -j 52 -T 120
progress: 1.0 s, 1096.0 tps, lat 6.098 ms stddev 2.041
progress: 2.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 3.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 4.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 5.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 6.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 7.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 8.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 9.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
postgres=# select wait_event from pg_stat_activity ;
wait_event
---------------------
AutoVacuumMain
LogicalLauncherMain
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
page
buffer_content
buffer_content
page
page
buffer_content
buffer_content
page
buffer_content
buffer_content
page
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
page
buffer_content
page
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
page
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
buffer_content
BgWriterHibernate
CheckpointerMain
WalWriterMain
There are a large number of conflicts, and changing it to do nothing
still leads to a large number of conflicts. Currently, the high-concurrency write operation in zedstore has some problems.
insert into t_zedstore values (:id,
random()*10,
random()*100,
random()*1000,
random()*10000,
random()*100000,
random()*10,
random()*100,
random()*1000,
random()*10000,
random()*100000,
md5((random()*10)::int::text),
md5((random()*100)::int::text),
md5((random()*1000)::int::text),
md5((random()*10000)::int::text),
md5((random()*10)::int::text),
md5((random()*100)::int::text),
md5((random()*1000)::int::text),
md5((random()*10000)::int::text),
md5((random()*100)::int::text),
md5((random()*100)::int::text),
clock_timestamp()
) on conflict (id)
do nothing;
pgbench -M prepared -n -r -P 1 -f ./test2.SQL -c 52 -j 52 -T 120
progress:1.0 s, 9536.4 tps, lat 5.403 ms stddev 2.373
progress:2.0 s, 9579.0 tps, lat 5.421 ms stddev 2.730
progress:3.0 s, 9886.0 tps, lat 5.265 ms stddev 2.483
progress:4.0 s, 10011.0 tps, lat 5.201 ms stddev 2.710
progress:5.0 s, 10345.0 tps, lat 5.026 ms stddev 2.434
progress:6.0 s, 9983.0 tps, lat 5.188 ms stddev 2.554
progress:7.0 s, 9892.0 tps, lat 5.277 ms stddev 2.674
progress:8.0 s, 10375.0 tps, lat 5.011 ms stddev 2.390
progress:9.0 s, 10059.0 tps, lat 5.165 ms stddev 2.678
progress:10.0 s, 10123.0 tps, lat 5.135 ms stddev 2.470
progress:11.0 s, 5129.5 tps, lat 4.941 ms stddev 2.311
progress:12.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress:13.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress:14.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress:15.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress:16.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress:17.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress:18.0 s, 0.0 tps, lat 0.000 ms stddev 0.000
Waiting for page lock:
[root@iZbp135pwcjjoxqgfpw9k1Z ~]# pstack 41065
#0 0x00007f5441b9ae43 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x000000000075888e in WaitEventSetWait ()
#2 0x0000000000758ce9 in WaitLatchOrSocket ()
#3 0x00000000007701df in ProcSleep ()
#4 0x0000000000764cef in WaitOnLock ()
#5 0x0000000000766076 in LockAcquireExtended ()
#6 0x0000000000763f46 in LockPage ()
#7 0x000000000052d0e3 in zsundo_trim ()
#8 0x000000000052e32e in zsundo_get_oldest_undo_ptr ()
#9 0x0000000000520b95 in zsbt_tid_begin_scan ()
#10 0x000000000052a54f in zedstoream_fetch_row ()
#11 0x000000000052a9c1 in zedstoream_index_fetch_tuple ()
#12 0x00000000004d29b2 in index_fetch_heap ()
#13 0x00000000004d2a1b in index_getnext_slot ()
#14 0x000000000062cd2b in check_exclusion_or_unique_constraint ()
#15 0x000000000062d73e in ExecCheckIndexConstraints ()
#16 0x0000000000653de6 in ExecInsert ()
#17 0x0000000000655219 in ExecModifyTable ()
#18 0x000000000062e092 in standard_ExecutorRun ()
#19 0x000000000077c02a in ProcessQuery ()
#20 0x000000000077c258 in PortalRunMulti ()
#21 0x000000000077cc6d in PortalRun ()
#22 0x000000000077a902 in PostgresMain ()
#23 0x0000000000482278 in ServerLoop ()
#24 0x0000000000709d63 in PostmasterMain ()
#25 0x0000000000482ebe in main ()
Wait for buffer_content lock:
[root@iZbp135pwcjjoxqgfpw9k1Z ~]# pstack 41063
#0 0x00007f544279aafb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1 0x00007f544279ab8f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2 0x00007f544279ac2b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3 0x00000000006f7f52 in PGSemaphoreLock ()
#4 0x0000000000769a5c in LWLockAcquire ()
#5 0x000000000051c8c3 in zsbt_descend ()
#6 0x0000000000521143 in zsbt_tid_multi_insert ()
#7 0x0000000000528010 in zedstoream_insert_speculative ()
#8 0x000000000065417f in ExecInsert ()
#9 0x0000000000655219 in ExecModifyTable ()
#10 0x000000000062e092 in standard_ExecutorRun ()
#11 0x000000000077c02a in ProcessQuery ()
#12 0x000000000077c258 in PortalRunMulti ()
#13 0x000000000077cc6d in PortalRun ()
#14 0x000000000077a902 in PostgresMain ()
#15 0x0000000000482278 in ServerLoop ()
#16 0x0000000000709d63 in PostmasterMain ()
#17 0x0000000000482ebe in main ()
Performance comparison:
Database Engine | Storage Capacity | OLTP KV Query | OLAP Column-Level Aggregate Statistics |
heap | 10 million 4GB | 0.061 milliseconds | 2.6 seconds |
zedstore | 10 million 2GB | 0.838 milliseconds | 1.3 seconds |
There is still a problem with the concurrent writing of zedstore. There is no problem with the query.
PostgreSQL Optimized Case – Where a Field Range Order by B Field Sort Limit X
ApsaraDB - August 4, 2023
digoal - April 12, 2019
digoal - May 28, 2021
zhuodao - July 30, 2020
digoal - April 12, 2019
digoal - July 24, 2019
An online MPP warehousing service based on the Greenplum Database open source program
Learn MoreAlibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreAn on-demand database hosting service for PostgreSQL with automated monitoring, backup and disaster recovery capabilities
Learn MoreAnalyticDB for MySQL is a real-time data warehousing service that can process petabytes of data with high concurrency and low latency.
Learn MoreMore Posts by digoal