This topic uses the OnTime and Star Schema datasets as examples to describe how to import a test dataset to ApsaraDB for ClickHouse and run a performance test.
Prerequisites
An ApsaraDB for ClickHouse cluster is created. For more information, see Create a cluster.
A database account is created. For more information, see Create a cluster.
A Linux machine is prepared and the IP address of the Linux machine is added to the whitelist of the ApsaraDB for ClickHouse cluster. For more information, see Configure a whitelist.
clickhouse-client that corresponds to the version of the ApsaraDB for ClickHouse cluster is installed. For more information, see clickhouse-client.
Test description
All clickhouse-client commands in this test are run in the directory in which clickhouse-client is installed.
The OnTime and Star Schema datasets that are used in this test are referenced from the official website of open source ClickHouse. For more information, see Example Datasets.
Use the OnTime dataset to test performance
1. Connect to the cluster
Run the following command in clickhouse-client to connect to the ApsaraDB for ClickHouse cluster in a Linux environment:
./clickhouse-client --host=<Cluster endpoint> --port=<TCP port number> --user=<Database account username> --password=<Database account password>
2. Create a table
Download and execute the table creation statement based on the edition of the ApsaraDB for ClickHouse cluster.
Script for creating a table in an ApsaraDB for ClickHouse cluster of the Single-replica Edition
Script for creating a table in an ApsaraDB for ClickHouse cluster of the Double-replica Edition
3. Download the OnTime dataset
The OnTime dataset covers an extremely large time range and requires a long time to download. In addition, a data loss error is prompted during the download.
echo https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_{1987..2021}_{1..12}.zip | xargs -P10 wget --no-check-certificate --continue
4. Import the data to ApsaraDB for ClickHouse
Run the following command to configure the ck_url, ck_user, ck_pass, and ck_port environment variables and import the data:
export ck_url=<Cluster endpoint>
export ck_user=<Database account username>
export ck_pass=<Database account password>
export ck_port=<TCP port number>
for i in *.zip; do echo $i; unzip -cq $i '*.csv' | sed 's/\.00//g' | ./clickhouse-client -h $ck_url -u $ck_user --password $ck_pass --port $ck_port --query="INSERT INTO ontime FORMAT CSVWithNames"; done
5. Use the test dataset
The test dataset is referenced from the official website of open source ClickHouse. For more information, see OnTime.
Q0
SELECT avg(c1)
FROM
(
SELECT Year, Month, count(*) AS c1
FROM ontime
GROUP BY Year, Month
);
Q1
SELECT DayOfWeek, count(*) AS c
FROM ontime
WHERE Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC;
Q2
SELECT DayOfWeek, count(*) AS c
FROM ontime
WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC;
Q3
SELECT Origin, count(*) AS c
FROM ontime
WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
GROUP BY Origin
ORDER BY c DESC
LIMIT 10;
Q4
SELECT Carrier, count(*)
FROM ontime
WHERE DepDelay>10 AND Year=2007
GROUP BY Carrier
ORDER BY count(*) DESC;
Q5
set joined_subquery_requires_alias=0;
set any_join_distinct_right_table_keys=1;
SELECT Carrier, c, c2, c*100/c2 as c3
FROM
(
SELECT
Carrier,
count(*) AS c
FROM ontime
WHERE DepDelay>10
AND Year=2007
GROUP BY Carrier
)
ANY INNER JOIN
(
SELECT
Carrier,
count(*) AS c2
FROM ontime
WHERE Year=2007
GROUP BY Carrier
) USING Carrier
ORDER BY c3 DESC;
Q6
set joined_subquery_requires_alias=0;
SELECT Carrier, c, c2, c*100/c2 as c3
FROM
(
SELECT
Carrier,
count(*) AS c
FROM ontime
WHERE DepDelay>10
AND Year>=2000 AND Year<=2008
GROUP BY Carrier
)
ANY INNER JOIN
(
SELECT
Carrier,
count(*) AS c2
FROM ontime
WHERE Year>=2000 AND Year<=2008
GROUP BY Carrier
) USING Carrier
ORDER BY c3 DESC;
Q7
set joined_subquery_requires_alias=0;
SELECT Year, c1/c2
FROM
(
select
Year,
count(*)*100 as c1
from ontime
WHERE DepDelay>10
GROUP BY Year
)
ANY INNER JOIN
(
select
Year,
count(*) as c2
from ontime
GROUP BY Year
) USING (Year)
ORDER BY Year;
Q8
SELECT DestCityName, uniqExact(OriginCityName) AS u
FROM ontime
WHERE Year>=2000 and Year<=2010
GROUP BY DestCityName
ORDER BY u DESC
LIMIT 10;
Q9
SELECT Year, count(*) AS c1
FROM ontime
GROUP BY Year;
Q10
SELECT
min(Year), max(Year), Carrier, count(*) AS cnt,
sum(ArrDelayMinutes>30) AS flights_delayed,
round(sum(ArrDelayMinutes>30)/count(*),2) AS rate
FROM ontime
WHERE
DayOfWeek NOT IN (6,7) AND OriginState NOT IN ('AK', 'HI', 'PR', 'VI')
AND DestState NOT IN ('AK', 'HI', 'PR', 'VI')
AND FlightDate < '2010-01-01'
GROUP by Carrier
HAVING cnt>100000 and max(Year)>1990
ORDER by rate DESC
LIMIT 1000;
Use the Star Schema dataset to test performance
1. Connect to the cluster
Run the following command in clickhouse-client to connect to the ApsaraDB for ClickHouse cluster in a Linux environment:
./clickhouse-client --host=<Cluster endpoint> --port=<TCP port number> --user=<Database account username> --password=<Database account password>
2. Create a table
Download and execute the table creation statement based on the edition of the ApsaraDB for ClickHouse cluster.
Script for creating a table in an ApsaraDB for ClickHouse cluster of the Single-replica Edition
Script for creating a table in an ApsaraDB for ClickHouse cluster of the Double-replica Edition
3. Generate data files
Run the following Shell command to clone the data generation project ssb-dbgen, use Make to compile the project, and then generate data files. In this example, four data files are generated: customer.tbl, lineorder.tbl, part.tbl, and supplier.tbl.
The -s 100
parameter in the example specifies that approximately 600 million rows of data are generated after the command is run. The data is 67 GB in size. You can adjust the size of the data as needed.
git clone https://github.com/vadimtk/ssb-dbgen.git
cd ssb-dbgen
make
./dbgen -s 100 -T c
./dbgen -s 100 -T l
./dbgen -s 100 -T p
./dbgen -s 100 -T s
4. Import the data to ApsaraDB for ClickHouse
Run the following command to configure the ck_url, ck_user, ck_pass, and ck_port environment variables and import the data:
export ck_url=<Cluster endpoint>
export ck_user=<Database account username>
export ck_pass=<Database account password>
export ck_port=<TCP port number>
./clickhouse-client -h <Cluster endpoint> -u <Database account username> --password <Database account password> --port <TCP port number> --query "INSERT INTO customer FORMAT CSV" < ssb-dbgen/customer.tbl
./clickhouse-client -h <Cluster endpoint> -u <Database account username> --password <Database account password> --port <TCP port number> --query "INSERT INTO part FORMAT CSV" < ssb-dbgen/part.tbl
./clickhouse-client -h <Cluster endpoint> -u <Database account username> --password <Database account password> --port <TCP port number> --query "INSERT INTO supplier FORMAT CSV" < ssb-dbgen/supplier.tbl
./clickhouse-client -h <Cluster endpoint> -u <Database account username> --password <Database account password> --port <TCP port number> --query "INSERT INTO lineorder FORMAT CSV" < ssb-dbgen/lineorder.tbl
5. Use the test dataset
The test dataset is referenced from the official website of open source ClickHouse. For more information, see Star Schema Benchmark.
Q0
INSERT INTO lineorder_flat
SELECT
l.LO_ORDERKEY AS LO_ORDERKEY,
l.LO_LINENUMBER AS LO_LINENUMBER,
l.LO_CUSTKEY AS LO_CUSTKEY,
l.LO_PARTKEY AS LO_PARTKEY,
l.LO_SUPPKEY AS LO_SUPPKEY,
l.LO_ORDERDATE AS LO_ORDERDATE,
l.LO_ORDERPRIORITY AS LO_ORDERPRIORITY,
l.LO_SHIPPRIORITY AS LO_SHIPPRIORITY,
l.LO_QUANTITY AS LO_QUANTITY,
l.LO_EXTENDEDPRICE AS LO_EXTENDEDPRICE,
l.LO_ORDTOTALPRICE AS LO_ORDTOTALPRICE,
l.LO_DISCOUNT AS LO_DISCOUNT,
l.LO_REVENUE AS LO_REVENUE,
l.LO_SUPPLYCOST AS LO_SUPPLYCOST,
l.LO_TAX AS LO_TAX,
l.LO_COMMITDATE AS LO_COMMITDATE,
l.LO_SHIPMODE AS LO_SHIPMODE,
c.C_NAME AS C_NAME,
c.C_ADDRESS AS C_ADDRESS,
c.C_CITY AS C_CITY,
c.C_NATION AS C_NATION,
c.C_REGION AS C_REGION,
c.C_PHONE AS C_PHONE,
c.C_MKTSEGMENT AS C_MKTSEGMENT,
s.S_NAME AS S_NAME,
s.S_ADDRESS AS S_ADDRESS,
s.S_CITY AS S_CITY,
s.S_NATION AS S_NATION,
s.S_REGION AS S_REGION,
s.S_PHONE AS S_PHONE,
p.P_NAME AS P_NAME,
p.P_MFGR AS P_MFGR,
p.P_CATEGORY AS P_CATEGORY,
p.P_BRAND AS P_BRAND,
p.P_COLOR AS P_COLOR,
p.P_TYPE AS P_TYPE,
p.P_SIZE AS P_SIZE,
p.P_CONTAINER AS P_CONTAINER
FROM lineorder AS l
INNER JOIN customer AS c ON c.C_CUSTKEY = l.LO_CUSTKEY
INNER JOIN supplier AS s ON s.S_SUPPKEY = l.LO_SUPPKEY
INNER JOIN part AS p ON p.P_PARTKEY = l.LO_PARTKEY;
Q1
SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorder_flat
WHERE toYear(LO_ORDERDATE) = 1993 AND LO_DISCOUNT BETWEEN 1 AND 3 AND LO_QUANTITY < 25;
Q2
SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorder_flat
WHERE toYYYYMM(LO_ORDERDATE) = 199401 AND LO_DISCOUNT BETWEEN 4 AND 6 AND LO_QUANTITY BETWEEN 26 AND 35;
Q3
SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorder_flat
WHERE toISOWeek(LO_ORDERDATE) = 6 AND toYear(LO_ORDERDATE) = 1994
AND LO_DISCOUNT BETWEEN 5 AND 7 AND LO_QUANTITY BETWEEN 26 AND 35;
Q4
SELECT
sum(LO_REVENUE),
toYear(LO_ORDERDATE) AS year,
P_BRAND
FROM lineorder_flat
WHERE P_CATEGORY = 'MFGR#12' AND S_REGION = 'AMERICA'
GROUP BY
year,
P_BRAND
ORDER BY
year,
P_BRAND;
Q5
SELECT
sum(LO_REVENUE),
toYear(LO_ORDERDATE) AS year,
P_BRAND
FROM lineorder_flat
WHERE P_BRAND >= 'MFGR#2221' AND P_BRAND <= 'MFGR#2228' AND S_REGION = 'ASIA'
GROUP BY
year,
P_BRAND
ORDER BY
year,
P_BRAND;
Q6
SELECT
sum(LO_REVENUE),
toYear(LO_ORDERDATE) AS year,
P_BRAND
FROM lineorder_flat
WHERE P_BRAND = 'MFGR#2239' AND S_REGION = 'EUROPE'
GROUP BY
year,
P_BRAND
ORDER BY
year,
P_BRAND;
Q7
SELECT
C_NATION,
S_NATION,
toYear(LO_ORDERDATE) AS year,
sum(LO_REVENUE) AS revenue
FROM lineorder_flat
WHERE C_REGION = 'ASIA' AND S_REGION = 'ASIA' AND year >= 1992 AND year <= 1997
GROUP BY
C_NATION,
S_NATION,
year
ORDER BY
year ASC,
revenue DESC;
Q8
SELECT
C_CITY,
S_CITY,
toYear(LO_ORDERDATE) AS year,
sum(LO_REVENUE) AS revenue
FROM lineorder_flat
WHERE C_NATION = 'UNITED STATES' AND S_NATION = 'UNITED STATES' AND year >= 1992 AND year <= 1997
GROUP BY
C_CITY,
S_CITY,
year
ORDER BY
year ASC,
revenue DESC;
Q9
SELECT
C_CITY,
S_CITY,
toYear(LO_ORDERDATE) AS year,
sum(LO_REVENUE) AS revenue
FROM lineorder_flat
WHERE (C_CITY = 'UNITED KI1' OR C_CITY = 'UNITED KI5') AND (S_CITY = 'UNITED KI1' OR S_CITY = 'UNITED KI5') AND year >= 1992 AND year <= 1997
GROUP BY
C_CITY,
S_CITY,
year
ORDER BY
year ASC,
revenue DESC;
Q10
SELECT
C_CITY,
S_CITY,
toYear(LO_ORDERDATE) AS year,
sum(LO_REVENUE) AS revenue
FROM lineorder_flat
WHERE (C_CITY = 'UNITED KI1' OR C_CITY = 'UNITED KI5') AND (S_CITY = 'UNITED KI1' OR S_CITY = 'UNITED KI5') AND toYYYYMM(LO_ORDERDATE) = 199712
GROUP BY
C_CITY,
S_CITY,
year
ORDER BY
year ASC,
revenue DESC;
Q11
SELECT
toYear(LO_ORDERDATE) AS year,
C_NATION,
sum(LO_REVENUE - LO_SUPPLYCOST) AS profit
FROM lineorder_flat
WHERE C_REGION = 'AMERICA' AND S_REGION = 'AMERICA' AND (P_MFGR = 'MFGR#1' OR P_MFGR = 'MFGR#2')
GROUP BY
year,
C_NATION
ORDER BY
year ASC,
C_NATION ASC;
Q12
SELECT
toYear(LO_ORDERDATE) AS year,
S_NATION,
P_CATEGORY,
sum(LO_REVENUE - LO_SUPPLYCOST) AS profit
FROM lineorder_flat
WHERE C_REGION = 'AMERICA' AND S_REGION = 'AMERICA' AND (year = 1997 OR year = 1998) AND (P_MFGR = 'MFGR#1' OR P_MFGR = 'MFGR#2')
GROUP BY
year,
S_NATION,
P_CATEGORY
ORDER BY
year ASC,
S_NATION ASC,
P_CATEGORY ASC;
Q13
SELECT
toYear(LO_ORDERDATE) AS year,
S_CITY,
P_BRAND,
sum(LO_REVENUE - LO_SUPPLYCOST) AS profit
FROM lineorder_flat
WHERE S_NATION = 'UNITED STATES' AND (year = 1997 OR year = 1998) AND P_CATEGORY = 'MFGR#14'
GROUP BY
year,
S_CITY,
P_BRAND
ORDER BY
year ASC,
S_CITY ASC,
P_BRAND ASC;