TPC-DS is one of the most well-known and widely recognized benchmarks used to measure the performance and efficiency of big data systems. Alibaba Cloud E-MapReduce (EMR) is the first big data system certified as capable of running the TPC-DS 100 TB benchmark. This topic describes how to take full advantage of OSS-HDFS in an EMR cluster to successfully execute all 99 TPC-DS SQL statements and achieve the best performance through a series of optimizations.
Scenarios
Big data performance evaluation
If you use OSS-HDFS as the storage layer of Hadoop and want to evaluate big data processing performance, especially for large-scale data analysis and query workloads, you can use TPC-DS as a decision support benchmark.
Data lake architecture verification
If you are building or already have a data lake architecture based on Alibaba Cloud Object Storage Service (OSS) and want to verify its efficiency in complex query processing, extract, transform, and load (ETL) jobs, and data warehouse performance, you can run TPC-DS in EMR clusters to obtain objective and standardized results.
Understand system performance
When you scale out a cluster, upgrade hardware, or adjust storage policies, the TPC-DS benchmark can help you understand how overall system performance changes as data volume grows and computing resources vary.
Cost-effectiveness analysis
OSS has cost advantages over traditional Hadoop Distributed File System (HDFS). By measuring the TPC-DS performance of OSS-HDFS, you can combine the results with cost factors to determine whether OSS-HDFS is suitable for a specific business scenario and make a more economical choice.
Background information
TPC-DS is a standard benchmark designed and maintained by the Transaction Processing Performance Council (TPC), one of the most well-known organizations that define measurement benchmarks for data management systems. The official TPC-DS tools are mainly used for standalone data generation and SQL query execution and are not suitable for large-scale distributed scenarios. To adapt TPC-DS to big data analysis scenarios, you must prepare the following tool and EMR cluster:
Hive TPC-DS benchmark testing tool
The tool is developed by Hortonworks and is customized for components such as Hive and Spark in the Hadoop ecosystem. It can effectively simulate big data query challenges and supports the generation and execution of complex SQL statements of the TPC-DS and TPC-H standards in a cluster environment.
EMR cluster of version 5.15.1 or later
You must use an EMR cluster of version 5.15.1 or later. This version is compatible with and supports the Hortonworks Data Platform (HDP) 3 series, which corresponds to Hive 3.1.
Step 1: Create an EMR cluster and download the TPC-DS benchmark testing tool
Create an EMR cluster of version 5.15.1 or later.
When you create an EMR cluster, take note of the following parameters. For more information about other parameters, see Create a cluster.
Software configurations:
- Business Scenario: Select New Data Lake.
- Metadata: Select DLF Unified Metadata.
- Root Storage Directory of Cluster: Select a bucket for which OSS-HDFS is enabled.
Hardware configurations:
- Node Group: Select Assign Public Network IP. If you want to achieve the best performance, we recommend that you select a big data or local SSD instance type for the emr-core node. If you want to use a small amount of data to complete all procedures in a short period of time, you can also select a general-purpose instance type with 4 vCPUs and 16 GiB of memory for the emr-core node.
Important: You can determine the cluster size based on the dataset that you want to use. Make sure that the total capacity of the data disks of the core nodes is more than three times the size of the dataset. For more information about datasets, see Step 3: Generate and load data.
Log on to the emr-master node of your EMR cluster in SSH mode. For more information, see Log on to a cluster.
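Optional. Confirm the Hive version that the cluster provides, which should match the Hive 3.1 compatibility described above. A quick check; the exact build string varies by EMR version:
hive --version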
Install Git and Maven.
Run the following command to install Git:
sudo yum install -y git
On the Apache Maven Project download page, download the binary tar.gz archive of the latest version, such as apache-maven-3.9.6-bin.tar.gz.
Upload the archive to the emr-master node of your EMR cluster and decompress it:
tar zxf apache-maven-3.9.6-bin.tar.gz
Configure environment variables.
Run the following command to switch to the apache-maven-3.9.6 directory:
cd apache-maven-3.9.6
Run the following command to configure environment variables:
export MAVEN_HOME=`pwd`
export PATH=`pwd`/bin:$PATH
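To verify that Maven is configured correctly, run the following command. It prints the Maven and Java versions if the environment variables have taken effect:
mvn -v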
Download the Hive TPC-DS benchmark testing tool.
Download the Hive TPC-DS benchmark testing tool by using one of the following methods:
Download the tool from GitHub
If you run the following command on a node that resides in a region in the Chinese mainland, the download may be slow or fail. In that case, use the alternative method below.
git clone https://github.com/hortonworks/hive-testbench.git
Download the ZIP package
Download the hive-testbench-hdp3.zip package to your local computer.
Upload the ZIP package to the emr-master node of your EMR cluster.
Run the following command to decompress the uploaded ZIP package in the emr-master node of your EMR cluster:
unzip hive-testbench-hdp3.zip
Step 2: Compile and package a data generator
Optional. Configure an Alibaba Cloud Maven mirror.
You can use a mirror provided by Alibaba Cloud to accelerate Maven compilation in regions in the Chinese mainland. With the mirror, the data generator can be compiled and packaged in 2 to 3 minutes.
Run the following command to create a directory:
mkdir -p ~/.m2/
Run the following command to copy the configuration file of Maven to the new directory:
cp $MAVEN_HOME/conf/settings.xml ~/.m2/
Add the following mirror configuration to the <mirrors> element of the ~/.m2/settings.xml file:
<mirror>
    <id>aliyun</id>
    <mirrorOf>central</mirrorOf>
    <name>Nexus aliyun</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>
Switch to the hive-testbench-hdp3 directory.
cd hive-testbench-hdp3
Download the tpcds-extern.patch file, upload it to the current directory, and then copy it to the tpcds-gen/patches/all/ directory:
cp tpcds-extern.patch ./tpcds-gen/patches/all/
Use the toolset of TPC-DS to compile and package a data generator.
./tpcds-build.sh
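If the build succeeds, the compiled data generator JAR is placed in the tpcds-gen/target/ directory. A quick sanity check (the exact JAR file name may vary with the tool version):
ls tpcds-gen/target/*.jar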
Step 3: Generate and load data
Specify a scale factor (SF).
An SF specifies the size of a dataset in GB. For example, SF=1 indicates a 1 GB dataset, SF=100 indicates a 100 GB dataset, and SF=1000 indicates a 1 TB dataset. In this example, a small dataset is used and SF is set to 3. Command:
SF=3
Important: Make sure that the total capacity of the data disks of the core nodes is more than three times the size of the dataset. Otherwise, an error is reported in subsequent operations.
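One way to check the available capacity before you generate data, assuming the raw text data is staged on the cluster's local HDFS before it is converted to ORC:
# Show the total and remaining HDFS capacity of the cluster.
hdfs dfsadmin -report | grep -E 'Configured Capacity|DFS Remaining'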
Check and clean up the Hive database that you want to use.
Check whether the Hive database that you want to use exists.
hive -e "desc database tpcds_bin_partitioned_orc_$SF"
Optional. If the database exists, clean up the database.
Important: If the tpcds_bin_partitioned_orc_$SF database exists, you must run the following command to clean it up. Otherwise, an error is reported in subsequent operations. If the database does not exist, skip this step.
hive -e "drop database tpcds_bin_partitioned_orc_$SF cascade"
Configure a Hive service URL.
The default Hive service URL configured in the tpcds-setup.sh script file is inconsistent with the Hive service URL configured in the EMR cluster. You must run the following command to replace the default Hive service URL with the Hive service URL configured in the EMR cluster:
sed -i 's/localhost:2181\/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2?tez.queue.name=default/master-1-1:10000\//' tpcds-setup.sh
The default Hive service URL in the script file is jdbc:hive2://localhost:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2?tez.queue.name=default. After you run the preceding command, the Hive service URL is changed to jdbc:hive2://master-1-1:10000/.
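To confirm that the replacement took effect, you can search the script for the new URL:
grep -n 'master-1-1:10000' tpcds-setup.sh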
Fix configuration issues for the Hive TPC-DS benchmark testing tool.
Some parameters are not supported in specific versions of open source Hive, such as Hive 2 and Hive 3. If you keep these parameters when you use TPC-DS with Hive 2 or Hive 3, jobs may fail with errors. You must run the following command to replace the unsupported parameter:
sed -i 's/hive.optimize.sort.dynamic.partition.threshold=0/hive.optimize.sort.dynamic.partition=true/' settings/*.sql
Generate and load data.
If SF is set to 3, data can be generated and loaded in approximately 40 to 50 minutes. If this step is successful, the generated TPC-DS tables are loaded into the tpcds_bin_partitioned_orc_$SF database. The generated data is automatically saved to the root storage path that you specified when you created the EMR cluster, which is the root directory of the bucket for which OSS-HDFS is enabled. Run the following command to generate and load data:
./tpcds-setup.sh $SF
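To verify the load, you can list the tables in the new database. If the step succeeded, the 24 TPC-DS tables, such as store_sales and customer, appear in the output:
hive -e "use tpcds_bin_partitioned_orc_$SF; show tables;"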
Obtain Hive table statistics.
We recommend that you use the Hive SQL ANALYZE statement to collect Hive table statistics. This helps speed up subsequent SQL queries. If SF is set to 3, the statistics can be collected in approximately 20 to 30 minutes.
hive -f ./hive-testbench-hdp3/ddl-tpcds/bin_partitioned/analyze.sql \
  --hiveconf hive.execution.engine=tez \
  --database tpcds_bin_partitioned_orc_$SF
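To spot-check the collected statistics, you can inspect the table parameters of an unpartitioned dimension table such as customer. After ANALYZE, values such as numRows should be populated (for partitioned fact tables, statistics are recorded per partition):
hive -e "describe formatted tpcds_bin_partitioned_orc_$SF.customer;"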
Step 4: Execute TPC-DS SQL statements
You can use Hive or Spark to execute TPC-DS SQL statements.
Use Hive to execute TPC-DS SQL statements
Execute a single SQL statement.
TPC-DS has a total of 99 SQL files, such as query10.sql and query11.sql. All files are placed in the sample-queries-tpcds directory. If SF is set to 3, every TPC-DS SQL statement can return output within 5 minutes.
Important: TPC-DS queries and TPC-DS data are randomly generated. Therefore, some SQL statements may return no records.
cd sample-queries-tpcds
hive --database tpcds_bin_partitioned_orc_$SF
set hive.execution.engine=tez;
source query10.sql;
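If you prefer a non-interactive form, the same query can be executed in a single command. A sketch that is equivalent to the interactive session above:
cd ~/hive-testbench-hdp3/sample-queries-tpcds
hive --database tpcds_bin_partitioned_orc_$SF --hiveconf hive.execution.engine=tez -f query10.sql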
Execute all 99 SQL statements in sequence by using the script file that is provided in the toolset of TPC-DS. Example:
cd ~/hive-testbench-hdp3
# Generate a Hive configuration file and set the Hive execution engine to Tez.
echo 'set hive.execution.engine=tez;' > sample-queries-tpcds/testbench.settings
./runSuite.pl tpcds $SF
Use Spark to execute TPC-DS SQL statements
The toolset of TPC-DS provides some sample Spark SQL statements in the spark-queries-tpcds directory. You can use a command-line tool, such as Spark SQL or Spark Beeline, to execute the sample statements. In this example, Spark Beeline, which connects to Spark Thrift Server, is used to execute TPC-DS SQL statements against the TPC-DS dataset that is generated in Step 3.
EMR Spark allows you to store tables in multiple storage media, such as HDFS and OSS, and allows you to store metadata in DLF.
Run the Spark Beeline ANALYZE command to obtain Hive table statistics. This helps accelerate subsequent SQL queries.
cd ~/hive-testbench-hdp3
spark-beeline -u jdbc:hive2://master-1-1:10001/tpcds_bin_partitioned_orc_$SF \
  -f ./ddl-tpcds/bin_partitioned/analyze.sql
Switch to the directory in which the sample Spark SQL statements are placed.
cd spark-queries-tpcds/
Execute a single SQL statement.
spark-beeline -u jdbc:hive2://master-1-1:10001/tpcds_bin_partitioned_orc_$SF -f q1.sql
Execute all 99 SQL statements in sequence.
The toolset of TPC-DS does not contain a script file that can be used to execute all Spark SQL statements at a time. You can use the following simple script for reference:
for q in `ls *.sql`; do
    spark-beeline -u jdbc:hive2://master-1-1:10001/tpcds_bin_partitioned_orc_$SF -f $q > $q.out
done
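After the loop finishes, you can scan the output files for failures. A simple check, assuming error messages end up in the .out files (Beeline may also print errors to stderr, which the loop above does not redirect; append 2>&1 to the loop body if you want to capture them):
grep -il 'error' *.sql.out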
Important: In the q30.sql file, the column name c_last_review_date_sk is mistakenly written as c_last_review_date. As a result, the SQL statement in q30.sql fails.
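If you want all 99 statements to complete, you can patch the file before you run the loop. A minimal sketch, assuming the incorrect name appears only where c_last_review_date_sk is intended:
# \b prevents c_last_review_date_sk itself from being matched again.
sed -i 's/\bc_last_review_date\b/c_last_review_date_sk/g' q30.sql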