
Running TPC-H Benchmark on AnalyticDB PostgreSQL

This tutorial shows you how to run the TPC-H benchmark on Alibaba Cloud AnalyticDB PostgreSQL.

This article provides a general overview of using Terraform to achieve the goal. If you prefer using a graphical interface, you can refer to the following article:

Starter Guide | AnalyticDB for PostgreSQL TPC-H Benchmark Testing in One Click

Overview

About TPC-H

As stated in the TPC Benchmark™ H (TPC-H) specification:

“TPC-H is a decision support benchmark. It consists of a suite of business-oriented ad hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.”

For more information, see TPC-H specifications.

  • Note:

This implementation of TPC-H is derived from the TPC-H Benchmark and is not comparable to published TPC-H Benchmark results, as this implementation does not comply with all the requirements of the TPC-H Benchmark.

This is the ER (Entity Relationship) diagram of the 8 tables in TPC-H.


(source: TPC Benchmark H Standard Specification)

About the TPC-H benchmark data

In TPC-H, the Scale Factor (SF) describes the amount of data: 1 SF corresponds to 1 GB, so 100 SF is 100 GB. The data volume for a given SF covers only the raw data of the 8 tables, excluding space taken by indexes and other overhead, so reserve extra space when preparing the data. The row counts of each table in the 100 GB data set are as follows:

Table Name Row Count
customer 15,000,000
lineitem 600,037,902
nation 25
orders 150,000,000
part 20,000,000
partsupp 80,000,000
region 5
supplier 1,000,000
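The sizing rule above can be sketched as a quick back-of-the-envelope calculation; the 2x headroom factor is an assumption for the generated flat files plus loading overhead, not a figure from the TPC-H specification:

```shell
# Rough disk sizing, assuming 1 SF = 1 GB of raw table data and 2x headroom
# (assumed) for the generated flat files plus loading overhead.
SF=100
RAW_GB=$SF
RESERVE_GB=$((RAW_GB * 2))
echo "raw=${RAW_GB}GB, reserve at least ${RESERVE_GB}GB"
```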

Deployment architecture: an ECS instance generates the TPC-H data set locally and uploads it to an OSS bucket, from which AnalyticDB PostgreSQL loads the data in parallel.

Step 1. Use Terraform to Provision ECS and Database on Alibaba Cloud

Run the Terraform script to provision the resources (in this tutorial, an ECS instance and an AnalyticDB PostgreSQL instance; an OSS bucket will also be used to store the large TPC-H data set, and we will create that bucket manually later). Specify the necessary information and the region to deploy.
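The provisioning step follows the standard Terraform workflow, printed here as a dry run; remove the `echo` and run the commands inside the directory holding the script's .tf files (variable names and prompts depend on the script):

```shell
# Standard Terraform sequence for provisioning, shown as a dry run.
for cmd in init plan apply; do
  echo "terraform $cmd"
done
```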

After the Terraform script finishes, the ECS and AnalyticDB PostgreSQL instance information is listed in the output.


Step 2. Configure and Mount Data Disk on ECS for TPC-H Data Set

Log on to the ECS instance using its EIP.

ssh root@<ECS_EIP>

Initialize and mount the data disk.

fdisk -u /dev/vdb

fdisk will show several interactive prompts; answer them to create a single primary partition covering the disk.

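The prompts can also be answered non-interactively. The keystroke sequence below is the typical one for a single primary partition (n = new partition, p = primary, 1 = partition number, two blank lines accept the default first/last sectors, w = write the table and exit); exact prompts may vary by fdisk version, so verify interactively first. To script the step, pipe the sequence into fdisk: `printf 'n\np\n1\n\n\nw\n' | fdisk -u /dev/vdb`.

```shell
# Keystrokes for the fdisk prompts: new primary partition 1, default
# first/last sectors, then write and exit.
printf 'n\np\n1\n\n\nw\n'
```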

Then run the following commands to finish mounting the data disk.

fdisk -lu /dev/vdb

mkfs -t ext4 /dev/vdb1
cp /etc/fstab /etc/fstab.bak
echo `blkid /dev/vdb1 | awk '{print $2}' | sed 's/\"//g'` /mnt ext4 defaults 0 0 >> /etc/fstab
mount /dev/vdb1 /mnt
df -h
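The `echo ... >> /etc/fstab` line above builds a persistent mount entry from the partition's UUID. This sketch simulates the blkid/awk/sed pipeline on a sample blkid output line (the UUID here is made up): field 2 is the quoted `UUID="..."` token, and sed strips the quotes so fstab gets `UUID=<uuid>`.

```shell
# Simulate the UUID extraction used in the fstab line, on a fake blkid line.
sample='/dev/vdb1: UUID="0d8a3b2e-1c4f-4b6a-9e7d-abcdef012345" TYPE="ext4"'
echo "$sample" | awk '{print $2}' | sed 's/"//g'
```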


Step 3. Generate TPC-H 100GB Data Set and Upload to OSS

Install Git, clone the GitHub project, and generate the TPC-H data set. The script https://github.com/alibabacloud-howto/solution-adbpg-labs/blob/master/benchmark-tpc-h/data_gen_100gb.sh generates a 100 GB data set; if you want to generate a different size, modify it accordingly.


yum install -y git
cd /mnt
git clone https://github.com/alibabacloud-howto/solution-adbpg-labs.git
sh /mnt/solution-adbpg-labs/benchmark-tpc-h/data_gen_100gb.sh

Since the script is configured to generate the TPC-H data in 8 partitions in parallel, running the top command shows 8 dbgen processes generating data.
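The parallel pattern the script relies on is simply launching the chunks as background jobs and waiting for all of them. In the real script each job is one dbgen chunk, roughly `./dbgen -s 100 -C 8 -S $i` (scale 100, 8 children, step i); `echo` stands in for dbgen in this sketch:

```shell
# Launch 8 background jobs (dbgen chunks in the real script), wait for all.
for i in $(seq 1 8); do
  echo "chunk $i" &
done
wait
```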


It will take a while for the dbgen processes to finish the data generation. When they disappear from the top view, the data generation has completed successfully.


Run the command df -h; it shows 100+ GB used under /mnt, which is the size of the generated 100 SF TPC-H data set.


Then create a bucket in OSS for the TPC-H data set.


Upload the TPC-H data files to the OSS bucket for parallel loading into AnalyticDB PostgreSQL later. Update the parameters accordingly in the file https://github.com/alibabacloud-howto/solution-adbpg-labs/blob/master/benchmark-tpc-h/upload_tpch_oss.sh:

  • OSS_ENDPOINT : the OSS endpoint of the bucket created for TPC-H data set
  • OSS_BUCKET : the bucket created for TPC-H data set
  • AK_ID : your Alibaba Cloud account access key
  • AK_SECRET : your Alibaba Cloud account access secret

Then run the following commands:

cd /mnt
wget http://gosspublic.alicdn.com/ossutil/1.7.3/ossutil64
chmod 755 ossutil64
sh /mnt/solution-adbpg-labs/benchmark-tpc-h/upload_tpch_oss.sh
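The upload script's core can be sketched as one ossutil copy per table folder. This is a dry run with placeholder paths and bucket name (the real script's layout may differ); drop the `echo` and supply your endpoint and credentials via ossutil's `-e`/`-i`/`-k` options to actually upload:

```shell
# Dry-run sketch: one OSS folder per TPC-H table (paths are assumptions).
for t in customer lineitem nation orders part partsupp region supplier; do
  echo ./ossutil64 cp -r "/mnt/${t}/" "oss://<your-bucket>/${t}/"
done
```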


After the script finishes, there are 8 folders in the OSS bucket, one for each table.


Step 4. Create TPC-H Schema in AnalyticDB PostgreSQL and Load Data from OSS

Create a user account in AnalyticDB PostgreSQL:

  • Name: adbpg
  • Password: N1cetest


Download and set up the AnalyticDB for PostgreSQL client.

Note:

Check whether the dependency library is present by running rpm -q compat-openssl10. If compat-openssl10 is installed, you can proceed; if it is missing, install it with yum install -y compat-openssl10.

cd /mnt
wget http://docs-aliyun.cn-hangzhou.oss.aliyun-inc.com/assets/attach/181125/cn_zh/1598426198114/adbpg_client_package.el7.x86_64.tar.gz
tar -xzvf adbpg_client_package.el7.x86_64.tar.gz
cd /mnt/adbpg_client_package/bin

vim ~/.pgpass

Add the following line to the ~/.pgpass file, where <AnalyticDB PostgreSQL connection string> is the connection string of the AnalyticDB PostgreSQL cluster.

Note:

Regarding .pgpass

Most access to the database, including the psql program, is done through the libpq library. This library includes a feature that allows you to place the passwords needed to connect as a user in a file called .pgpass (or the file referenced by PGPASSFILE).
The main purpose of this file is for password-free login.

<AnalyticDB PostgreSQL connection string>:5432:adbpg:adbpg:N1cetest
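As an illustration of the .pgpass format (fields are host:port:database:user:password), the sketch below writes an entry to a throwaway file instead of ~/.pgpass; the hostname is a made-up placeholder. Note that libpq ignores the file unless its mode is 0600, which is why the tutorial runs chmod 0600 in the next step.

```shell
# Write a sample .pgpass entry to a temp file and lock down permissions.
PGPASSFILE=$(mktemp)
echo 'gp-example.gpdb.rds.aliyuncs.com:5432:adbpg:adbpg:N1cetest' > "$PGPASSFILE"
chmod 0600 "$PGPASSFILE"
stat -c '%a' "$PGPASSFILE"   # shows the file mode
```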


Then run the following commands to create the TPC-H tables.

chmod 0600 ~/.pgpass
cd /mnt/adbpg_client_package/bin
./psql -h<AnalyticDB PostgreSQL connection string> -Uadbpg adbpg -f /mnt/solution-adbpg-labs/benchmark-tpc-h/tpch-ddl.sql


Load the TPC-H data set from OSS into AnalyticDB PostgreSQL. Update the parameters accordingly in the file https://github.com/alibabacloud-howto/solution-adbpg-labs/blob/master/benchmark-tpc-h/load_tpch_oss_data.sql before executing the following commands:

  • oss://adbpg-tpch-bechmark-hongkong : change to your target TPC-H bucket accordingly
  • <ACCESS KEY> : your Alibaba Cloud account access key
  • <ACCESS SECRET> : your Alibaba Cloud account access secret
  • oss-cn-hongkong-internal.aliyuncs.com : change to the endpoint of your target TPC-H bucket accordingly
cd /mnt/adbpg_client_package/bin
./psql -h<AnalyticDB PostgreSQL connection string> -Uadbpg adbpg -f /mnt/solution-adbpg-labs/benchmark-tpc-h/load_tpch_oss_data.sql


After loading finishes, run SELECT COUNT(*) on each of the 8 tables to verify the row counts against the table above.
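One way to sketch this check is to emit one COUNT(*) statement per table and pipe the output into psql, e.g. `... | ./psql -h<AnalyticDB PostgreSQL connection string> -Uadbpg adbpg`:

```shell
# Emit one COUNT(*) per TPC-H table; pipe into psql to execute.
for t in customer lineitem nation orders part partsupp region supplier; do
  echo "SELECT COUNT(*) FROM ${t};"
done
```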


Also run ANALYZE to collect table statistics so the optimizer can generate better query execution plans.
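As with the count check, the ANALYZE statements can be generated per table and piped into psql (a bare `ANALYZE;` on the whole database also works):

```shell
# Emit an ANALYZE per TPC-H table; pipe into psql to execute.
for t in customer lineitem nation orders part partsupp region supplier; do
  echo "ANALYZE ${t};"
done
```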


Step 5. Run TPC-H Query Benchmark

Please update the parameters accordingly in the file https://github.com/alibabacloud-howto/solution-adbpg-labs/blob/master/benchmark-tpc-h/query.sh before execution.

  • ADB_PG_URL : AnalyticDB PostgreSQL cluster connection string
  • ADB_PG_USER : AnalyticDB PostgreSQL cluster account user name (no need to change if you follow this guide to use adbpg)
sh /mnt/solution-adbpg-labs/benchmark-tpc-h/query.sh

All the TPC-H queries are located in https://github.com/alibabacloud-howto/solution-adbpg-labs/tree/master/benchmark-tpc-h/tpch_query.
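A benchmark driver like query.sh needs little more than a per-query timer. This is a minimal, self-contained sketch of such a helper (not the actual script's code); `sleep 0.2` stands in for a `./psql -f <query>.sql` invocation:

```shell
# Run a command and print elapsed wall-clock time in milliseconds.
run_timed() {
  local start end
  start=$(date +%s%N)
  "$@" > /dev/null
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

ms=$(run_timed sleep 0.2)   # stand-in for one TPC-H query
echo "elapsed ${ms} ms"
```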


ApsaraDB


Comments

5109060302059819 August 12, 2022 at 11:01 am

You did not create the relationships after table creation and data loading. Is that correct?

5126080747602209 March 13, 2024 at 8:26 am

It's correct. There is no need to define a foreign key, index, etc., because it uses similar names in DDL for the columns that refer to the same value. For example, 'NATIONKEY' is configured as 'c_nationkey' in the CUSTOMER table and as 'n_nationkey' in the NATION table. Then when running the TPC query workload (DML), it uses a WHERE condition like 'WHERE c_nationkey = n_nationkey' to filter.
