MaxCompute uses the TPC-DS official tool to generate 10-GB, 100-GB, 1-TB, and 10-TB TPC-DS datasets. You can use the TPC-DS datasets for product testing. This topic describes the basic information about TPC-DS datasets in MaxCompute public datasets and how to use MaxCompute to query data from the TPC-DS datasets.
Description
TPC-DS, short for TPC Benchmark™ DS, is a standard benchmark developed by the Transaction Processing Performance Council (TPC), the best-known organization that defines benchmarks for measuring the performance of data management systems. TPC also publishes the measurement results of the benchmark.
The TPC-DS datasets that MaxCompute generates are stored in separate schemas of the MaxCompute public project BIGDATA_PUBLIC_DATASET, one schema per data size. For more information about schemas, see Schema-related operations. After you activate MaxCompute and create a project, you can query the TPC-DS tables through cross-project access. The following table lists the project, schema, and tables for each data size.
Data size | Project name | Schema name | Table name |
10 GB | BIGDATA_PUBLIC_DATASET | TPCDS_10G | call_center, catalog_page, catalog_returns, catalog_sales, customer, customer_address, customer_demographics, date_dim, household_demographics, income_band, inventory, item, promotion, reason, ship_mode, store, store_returns, store_sales, tab_reducenum, tab_reducenum_100, time_dim, warehouse, web_page, web_returns, web_sales, web_site |
100 GB | BIGDATA_PUBLIC_DATASET | TPCDS_100G | Same tables as TPCDS_10G |
1 TB | BIGDATA_PUBLIC_DATASET | TPCDS_1T | Same tables as TPCDS_10G |
10 TB | BIGDATA_PUBLIC_DATASET | TPCDS_10T | Same tables as TPCDS_10G |
Table data is referenced from TPC-DS. For more information, see TPC.
For more information about the table schema and content, see TPC BENCHMARK™ DS.
Declaration
The TPC-DS data in the MaxCompute public datasets is generated and analyzed based on the TPC-DS benchmark. Test results obtained on these datasets differ from the TPC-DS benchmark results published by TPC because tests run on the MaxCompute public datasets cannot meet the requirements of the TPC-DS benchmark.
The TPC-DS datasets provided by MaxCompute can be used only for product testing. The data is not periodically updated. Therefore, we recommend that you do not use the TPC-DS datasets in production environments.
The TPC-DS data provided by MaxCompute is obtained from TPC. You can also generate your own TPC-DS data. For more information about how to generate TPC-DS test data, see TPC-DS documentation.
Supported regions
Region | Region ID |
China (Hangzhou) | cn-hangzhou |
China (Shanghai) | cn-shanghai |
China (Beijing) | cn-beijing |
China (Zhangjiakou) | cn-zhangjiakou |
China (Ulanqab) | cn-wulanchabu |
China (Shenzhen) | cn-shenzhen |
China (Chengdu) | cn-chengdu |
China (Hong Kong) | cn-hongkong |
Singapore | ap-southeast-1 |
Japan (Tokyo) | ap-northeast-1 |
Malaysia (Kuala Lumpur) | ap-southeast-3 |
Indonesia (Jakarta) | ap-southeast-5 |
US (Silicon Valley) | us-west-1 |
US (Virginia) | us-east-1 |
UK (London) | eu-west-1 |
Germany (Frankfurt) | eu-central-1 |
UAE (Dubai) | me-east-1 |
China East 2 Finance | cn-shanghai-finance-1 |
China North 2 Finance | cn-beijing-finance-1 |
China South 1 Finance | cn-shenzhen-finance-1 |
China North 2 Ali Gov 1 | cn-north-2-gov-1 |
Precautions
Public datasets are available to all MaxCompute users. When you use public datasets, take note of the following items:
All data is stored in the public MaxCompute project BIGDATA_PUBLIC_DATASET. No MaxCompute users belong to this project. Therefore, you need to access the data across projects. When you write an SQL script, you must specify the project name and schema name before the table name. If you do not enable the tenant-level schema syntax, you need to enable the session-level schema syntax before you execute a statement. Sample statements:
-- Enable the session-level schema syntax.
set odps.namespace.schema=true;
-- In this example, data in the tpcds_10g dataset is queried. To query another dataset, replace the schema name in the following statement with the name of the schema in which that dataset is stored.
select * from bigdata_public_dataset.tpcds_10g.store_sales limit 100;
Note: You do not need to pay for the storage of the data in the public datasets. However, you are charged computing fees that are generated when you execute statements. For more information about billing rules, see Computing pricing.
You cannot find the tables in the public datasets on the DataMap page of DataWorks because cross-project access is required.
TPC-DS datasets are stored in projects that support storage by schema. If you do not enable the tenant-level schema syntax, you cannot view the TPC-DS datasets in the public datasets provided by DataAnalysis of DataWorks, but you can query the TPC-DS datasets by using the SQL statements provided by MaxCompute.
Data is accessed across projects. To ensure that SQL statements are successfully executed, you need to run the following commands:
-- The table schemas of the TPC-DS datasets use data types such as DECIMAL and INT. To use these data types, run the following commands:
set odps.sql.hive.compatible=true;
set odps.sql.type.system.odps2=true;
set odps.sql.decimal.odps2=true;
-- The following flag values match the defaults for new projects and may differ from those of existing projects. Flag values of existing projects are left unchanged to avoid affecting existing queries.
-- We recommend that you use the setproject command to change the flags to these default values. Otherwise, an error may be reported when an SQL statement contains an ORDER BY clause without the LIMIT keyword, and the TPC-DS Q72 query may run slowly because of a poor join order.
set odps.sql.validate.orderby.limit=false;
set odps.optimizer.join.reorder.enable=true;
set odps.optimizer.column.stat.enable=true;
-- The TPC-DS Q77 query produces a Cartesian product. By default, MaxCompute does not allow Cartesian products in sort-merge join operations. To allow them, run the following command:
set odps.sql.allow.cartesian=true;
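Putting the settings above together, a complete session might look like the following sketch. The query is illustrative only and is not one of the 99 TPC-DS queries. It assumes the column names defined in the TPC-DS specification (ss_sold_date_sk, ss_net_paid, d_date_sk, d_year) and that the SET statements are submitted in the same session as the query.

```sql
-- Enable the session-level schema syntax and the data type flags described above.
set odps.namespace.schema=true;
set odps.sql.hive.compatible=true;
set odps.sql.type.system.odps2=true;
set odps.sql.decimal.odps2=true;

-- Illustrative query (assumption: standard TPC-DS column names).
-- Returns the total net amount paid in store sales for each calendar year
-- in the 10 GB dataset.
select d.d_year,
       sum(ss.ss_net_paid) as total_net_paid
from bigdata_public_dataset.tpcds_10g.store_sales ss
join bigdata_public_dataset.tpcds_10g.date_dim d
  on ss.ss_sold_date_sk = d.d_date_sk
group by d.d_year
order by d.d_year
limit 100;
```

Because the ORDER BY clause is paired with LIMIT, this query does not depend on the odps.sql.validate.orderby.limit flag.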
Perform a TPC-DS query
Prerequisites
MaxCompute is activated. A MaxCompute project is created. For more information about how to create a MaxCompute project, see Create a MaxCompute project.
Supported tools and platforms
Sample query files
MaxCompute provides sample query files for datasets of different sizes. Each file contains 99 queries, which vary greatly in complexity and in the amount of data scanned. To avoid unnecessary computing costs, we recommend that you select query files based on your business requirements. You can also use the tools in the TPC-DS benchmark suite to generate other versions of these queries with different parameter values. For more information, see TPC-DS official documentation.
Data size | Query file |
10 GB | |
100 GB | |
1 TB | |
10 TB |
The sample query files described in this topic are referenced from the TPC-DS benchmark test. The test results of the preceding files differ from the TPC-DS benchmark results published by TPC because the test that is performed on the MaxCompute public dataset cannot meet the requirements of the TPC-DS benchmark test. For more information, see TPC.
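One way to keep computing costs down is to validate a query against the smallest dataset before running it on a larger one, changing only the schema name. The following sketch is illustrative and is not taken from the sample query files. It assumes the column names defined in the TPC-DS specification (ss_item_sk, ss_ext_sales_price, i_item_sk, i_category).

```sql
-- Enable the session-level schema syntax.
set odps.namespace.schema=true;

-- Step 1: check the query logic on the 10 GB dataset.
select i.i_category,
       sum(ss.ss_ext_sales_price) as total_sales
from bigdata_public_dataset.tpcds_10g.store_sales ss
join bigdata_public_dataset.tpcds_10g.item i
  on ss.ss_item_sk = i.i_item_sk
group by i.i_category
order by total_sales desc
limit 20;

-- Step 2: after the result looks correct, rerun the same statement on a larger
-- dataset by replacing tpcds_10g with tpcds_100g, tpcds_1t, or tpcds_10t.
```

The amount of data scanned, and therefore the computing fee, grows with the dataset size, so a logic error caught in step 1 costs far less than one caught on the 10 TB dataset.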