Evaluating the performance of a database usually involves either industry standard testing or or modeling testing based on the business model.
For example, PostgreSQL pgbench supports both tpc-b testing and custom modeling testing; benchmarksql supports tpc-c testing; and gp_tpch supports tpc-h testing.
References :
TPC-H testing - PostgreSQL 10 vs Deepgreen(Greenplum)
Testing a sysbench test case with pgbench of PostgreSQL
Analysis of PostgreSQL pgbench SQL RT and transaction RT
Performance evaluation in the database industry_tpc.org
These testing methods can only be used after the database has been built. Is there any way to evaluate performance before building the database?
These indicators have the biggest influence on the performance of a database:
• CPU clock speed
• CPU instruction set
• Number of CPU cores
• Memory clock speed, bus bandwidth
• Hard drive capacity - random IOPS performance
• Hard drive capacity - sequential IOPS performance
• Hard drive bandwidth
• Network bandwidth
For a Greenplum database, the main influencers are:
1.CPU clock speed
Determines the calculation speed of the database. But what operations involve calculations? For example, WHERE clause filtering, operator calculations in a SELECT sub-statement, aggregate calculations, sorting, etc.
2.CPU instruction set
The instruction set determines the performance of certain optimizations in the database. For example, vector computation.
3.Number of CPU cores
CPU clock speed determines the computing capacity of a single CPU core, while the number of CPU cores determines the parallel computing capacity of the database.
4.Memory clock speed, bus bandwidth
When reading/writing in the memory, the memory clock speed and bus bandwidth determine the total read/write capacity, which is a very important factor.
For example, for DDR 2 667, the bandwidth is 64bit×667MHz÷8≈5.3GB/s. If it is a dual channel memory, we have to multiply by 2, so the memory data bandwidth of a dual channel DDR 2 667 is 10.6GB/s.
https://www.cyberciti.biz/faq/check-ram-speed-linux/
https://en.wikipedia.org/wiki/Memory_bandwidth
For example, the maximum memory read/write bandwidth is
64*2*2400/8/1024= 37.5 GB/s
dmidecode --type 17
Array Handle: 0x0034
Error Information Handle: Not Provided
Total Width: 72 bits ## ECC, 64+8
Data Width: 72 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: CPU0_A0
Bank Locator: NODE 1
Type: DDR4
Type Detail:
Speed: 2400 MHz
Manufacturer:
Serial Number:
Asset Tag:
Part Number:
Rank: 2
Configured Clock Speed: 2133 MHz
Note: This is the technical hard cap of the memory and will not typically be reached by a single CPU core.
What is the computing speed of a single CPU core? We can find out with a simple test.
Memory speed
#dd if=/dev/zero of=/dev/null bs=4k count=1024000000
^C68517474+0 records in
68517473+0 records out
280647569408 bytes (281 GB) copied, 34.1855 s, 8.2 GB/s
Block device speed
#dd if=/dev/Block device name of=/dev/null bs=4k count=102300000
^C2687957+0 records in
2687956+0 records out
11009867776 bytes (11 GB) copied, 4.6525 s, 2.4 GB/s
In reality, the speed of a single core can't possibly reach 8.2 GB/s when used in calculations for a database.
5.Hard drive capacity - random IOPS performance
Random IOPS performance will be involved in index access, and (concurrent) access to data in the same hard drive by multiple sessions or processes.
(Cached read can improve the performance of parallel access, nearing sequential IOPS performance.)
6.Hard drive capacity - sequential IOPS performance
Leaving alone concurrencies, an AP system generally reads/writes files in a sequential manner as long as it is not an index scan.
7.Hard drive bandwidth and interface speed
The bandwidth interface speeds of a hard drive determine its maximum data scanning speed.
For example, some manufacturers may present such data about read/write bandwidth:
http://www.shannon-sys.com/product_detail?id=4929256206666909936
Note: this is the technical limit of the hard drive. The computing capacity of a single CPU core cannot reach this limit in most of the cases.
8.Network bandwidth
Network bandwidth determines the data import speed, as well as the redistribution speed during JOIN operations.
A single host may have multiple NICs and data nodes. The network bandwidth is estimated based on the total output bandwidth. For example, if there are ten hosts in a GP cluster and each has two 10 GB NICs, then the total network bandwidth would be 200 GB.
9.Storage skew
The shortcoming of a distributed system is that the slowest node determines the overall processing time. This becomes a larger issue in case of data skew.
The above are the main influencing factors for the performance of a database. Then how can we evaluate the SQL response speed based on these factors?
The cost model of PostgreSQL includes some of these factors, so we can work out the final SQL running cost based on the cost calculation formula and statistical data. If we align the cost with the time, we will be able to see the SQL running time.
Verification of optimizer cost factors - aligning cost constants to timestamps in PostgreSQL
The above evaluation still requires a database, and data (or statistical information of data) to be imported into the database. How can we evaluate the SQL response time if we only have hardware and data indicators, instead of an actual database? We can take samples of the formula, and then use the database cluster and data indicators as well as our intended SQL requests to evaluate expected performance.
We can simplify the evaluation model, as the CPU has significant effect (e.g. LLVM, vector optimization, or other optimizations) on the results. Here, I choose to ignore the deviation introduced by the CPU. We will not take into account data skew either.
We will discuss how to conduct a performance evaluation with the following environment as an example.
• Hard drive – 2 hard drives, with a read/write bandwidth of 2 GB/s each, to be made into one hard drive through LVM. The total bandwidth is 4 GB/s
• Memory – 512 GB, with a read/write bandwidth of 37.5 GB/s
• CPU – 2.5 GHz, 32-Core
• NIC – Two 10 GB NICs
• Number of machines – 8
• Number of data nodes on each machine – 16 data nodes
Performance indicators obtained in another environment
Let’s take integer data for example:
GP column-store
postgres=# create table mmtest(id int) with (appendonly=true, blocksize=2097152, ORIENTATION=COLUMN);
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
postgres=# insert into mmtest select generate_series(1,100000);
INSERT 0 100000
insert into mmtest select * from mmtest ;
...
postgres=# insert into mmtest select * from mmtest ;
INSERT 0 409600000
postgres=# select pg_size_pretty(pg_total_relation_size('mmtest'));
pg_size_pretty
----------------
3133 MB
(1 row)
postgres=# select count(*) from mmtest ;
count
-----------
819200000
(1 row)
Time: 779.444 ms
postgres=# select * from mmtest where id=0;
id
----
(0 rows)
Time: 422.538 ms
GP row-store
postgres=# create table mmtest1(id int)
postgres-# ;
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
Time: 273.659 ms
postgres=# insert into mmtest1 select * from mmtest;
postgres=# select pg_size_pretty(pg_total_relation_size('mmtest1'));
pg_size_pretty
----------------
28 GB
(1 row)
postgres=# select count(*) from mmtest1 ;
count
-----------
819200000
(1 row)
Time: 1171.229 ms
postgres=# select * from mmtest1 where id=0;
id
----
(0 rows)
Time: 452.582 ms
PG row-store
create unlogged table mmtest(id int);
postgres=# insert into mmtest select generate_series(1,100000);
INSERT 0 100000
insert into mmtest select * from mmtest ;
...
postgres=# insert into mmtest select * from mmtest ;
INSERT 0 409600000
postgres=# select pg_size_pretty(pg_total_relation_size('mmtest'));
pg_size_pretty
----------------
28 GB
(1 row)
postgres=# select * from mmtest where id=0;
id
----
(0 rows)
Time: 56410.222 ms (00:56.410)
32 Parallel computing
3.02 seconds
Store Type | Details |
GP column-store | Single core integer data filtering speed of 40 million rows/s Entire machine integer data filtering speed of 1.88 billion rows/s (Including scanning time) |
GP row-store | Single core integer data filtering speed of 37 million rows/s Entire machine integer data filtering speed of 1.77 billion rows/s (Including scanning time) |
PG row-store | Single core integer data filtering speed of 15 million rows/s Entire machine integer data filtering speed of 264.9 million rows/s (Including scanning time) |
2,599 posts | 764 followers
FollowApsaraDB - August 1, 2022
Alibaba Cloud_Academy - September 4, 2023
ApsaraDB - June 15, 2023
Alibaba Clouder - December 12, 2017
ApsaraDB - September 19, 2022
Alibaba Clouder - February 12, 2021
2,599 posts | 764 followers
FollowLearn More
An online MPP warehousing service based on the Greenplum Database open source program
Learn MoreSupports data migration and data synchronization between data engines, such as relational database, NoSQL and OLAP
Learn MoreMore Posts by Alibaba Clouder
Raja_KT March 7, 2019 at 5:51 am
The key to MPP DBs is the even distribution based on key or index and data modeling....Whether Teradata, GP, Netezza, Vertica..... performance can be gauged based on work