TPC-DS data - MaxCompute - Alibaba Cloud Documentation Center

MaxCompute uses the TPC-DS official tool to generate 10-GB, 100-GB, 1-TB, and 10-TB TPC-DS datasets. You can use the TPC-DS datasets for product testing. This topic describes the basic information about TPC-DS datasets in MaxCompute public datasets and how to use MaxCompute to query data from the TPC-DS datasets.

Description

TPC-DS, short for TPC Benchmark™ DS, is a standard benchmark defined by the Transaction Processing Performance Council (TPC), the most well-known organization that defines benchmarks for measuring the performance of data management systems. TPC also publishes the measurement results of the benchmark.

The TPC-DS datasets are stored in separate schemas of the MaxCompute public project BIGDATA_PUBLIC_DATASET, one schema for each data size. For more information about schemas, see Schema-related operations. After you activate MaxCompute and create a project, you can query TPC-DS tables by performing cross-project access. The following table lists the project, schema, and table names for each dataset.

Data size	Project name	Schema name
10 GB	BIGDATA_PUBLIC_DATASET	TPCDS_10G
100 GB	BIGDATA_PUBLIC_DATASET	TPCDS_100G
1 TB	BIGDATA_PUBLIC_DATASET	TPCDS_1T
10 TB	BIGDATA_PUBLIC_DATASET	TPCDS_10T

Each schema contains the following tables:

call_center, catalog_page, catalog_returns, catalog_sales, customer, customer_address, customer_demographics, date_dim, household_demographics, income_band, inventory, item, promotion, reason, ship_mode, store, store_returns, store_sales, tab_reducenum, tab_reducenum_100, time_dim, warehouse, web_page, web_returns, web_sales, web_site

Note

Table data is referenced from TPC-DS. For more information, see TPC.
For more information about the table schema and content, see TPC BENCHMARK™ DS.

Declaration

The TPC-DS data is generated and analyzed in the MaxCompute public dataset based on the TPC-DS benchmark test. The test results of data in the MaxCompute public dataset differ from the TPC-DS benchmark results published by TPC because the test that is performed on the MaxCompute public dataset cannot meet the requirements of the TPC-DS benchmark test.
The TPC-DS datasets provided by MaxCompute can be used only for product testing. The data is not periodically updated. Therefore, we recommend that you do not use the TPC-DS datasets in production environments.
The TPC-DS data provided by MaxCompute is obtained from TPC. You can also generate your own TPC-DS data. For more information about how to generate TPC-DS test data, see TPC-DS documentation.

Supported regions

Region	Region ID
China (Hangzhou)	cn-hangzhou
China (Shanghai)	cn-shanghai
China (Beijing)	cn-beijing
China (Zhangjiakou)	cn-zhangjiakou
China (Ulanqab)	cn-wulanchabu
China (Shenzhen)	cn-shenzhen
China (Chengdu)	cn-chengdu
China (Hong Kong)	cn-hongkong
Singapore	ap-southeast-1
Japan (Tokyo)	ap-northeast-1
Malaysia (Kuala Lumpur)	ap-southeast-3
Indonesia (Jakarta)	ap-southeast-5
US (Silicon Valley)	us-west-1
US (Virginia)	us-east-1
UK (London)	eu-west-1
Germany (Frankfurt)	eu-central-1
UAE (Dubai)	me-east-1
China East 2 Finance	cn-shanghai-finance-1
China North 2 Finance	cn-beijing-finance-1
China South 1 Finance	cn-shenzhen-finance-1
China North 2 Ali Gov 1	cn-north-2-gov-1

Precautions

Public datasets are available to all MaxCompute users. When you use public datasets, take note of the following items:

All data is stored in the public MaxCompute project BIGDATA_PUBLIC_DATASET. No MaxCompute users belong to this project. Therefore, you need to access the data across projects. When you write an SQL script, you must specify the project name and schema name before the table name. If you do not enable the tenant-level schema syntax, you need to enable the session-level schema syntax before you execute a statement. Sample statements:
```
-- Enable the session-level schema syntax.
set odps.namespace.schema=true; 
-- In this example, data in the tpcds_10g dataset is queried. If you want to query data from another dataset, manually replace the schema name in the following statement with the name of the schema in which the dataset is stored. 
select * from bigdata_public_dataset.tpcds_10g.store_sales limit 100;
```
Note
You do not need to pay for the storage of the data in the public datasets. However, you are charged computing fees that are generated when you execute statements. For more information about billing rules, see Computing pricing.
You cannot find the tables in the public datasets on the DataMap page of DataWorks because cross-project access is required.
TPC-DS datasets are stored in projects that support storage by schema. If you do not enable the tenant-level schema syntax, you cannot view the TPC-DS datasets in the public datasets provided by DataAnalysis of DataWorks, but you can query the TPC-DS datasets by using the SQL statements provided by MaxCompute.
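
The naming rule described above (project name, then schema name, then table name) can be sketched in Python. This is a minimal illustration and not part of the MaxCompute tooling: the helper names `qualified_table` and `sample_query` are hypothetical, and the project and schema names are taken from the table in this topic. The script only builds SQL text. It does not connect to MaxCompute.

```python
# Sketch: build fully qualified TPC-DS table names for the public project.
# Project and schema names come from this topic; the helper names are
# hypothetical and exist only for illustration.
PROJECT = "bigdata_public_dataset"

# Data size -> schema that stores the dataset of that size.
SCHEMAS = {
    "10GB": "tpcds_10g",
    "100GB": "tpcds_100g",
    "1TB": "tpcds_1t",
    "10TB": "tpcds_10t",
}


def qualified_table(data_size: str, table: str) -> str:
    """Return project.schema.table for a data size such as '100 GB'."""
    key = data_size.replace(" ", "").upper()
    if key not in SCHEMAS:
        raise ValueError(f"unknown data size: {data_size!r}")
    return f"{PROJECT}.{SCHEMAS[key]}.{table}"


def sample_query(data_size: str, table: str, limit: int = 100) -> str:
    """Return a script that enables session-level schema syntax, then selects rows."""
    return (
        "set odps.namespace.schema=true;\n"
        f"select * from {qualified_table(data_size, table)} limit {limit};"
    )


if __name__ == "__main__":
    print(sample_query("100 GB", "store_sales"))
```

Keeping the size-to-schema mapping in one place means that switching a test from one dataset size to another changes only the `data_size` argument.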

Data is accessed across projects. To ensure that SQL statements are successfully executed, you need to run the following commands:

```
-- The table schemas of TPC-DS datasets use data types such as DECIMAL and INT. To support these data types, run the following commands:
set odps.sql.hive.compatible=true;
set odps.sql.type.system.odps2=true;
set odps.sql.decimal.odps2=true;
-- In the following commands, the flag values are the defaults for new projects and may differ from the values in existing projects. Flag values for existing projects remain unchanged to prevent impact on existing queries.
-- We recommend that you use the setproject command to change the flag values to these defaults. Otherwise, an error may be reported when an ORDER BY clause is not followed by a LIMIT clause, and the TPC-DS Q72 query may run slowly because of a poor join order.
set odps.sql.validate.orderby.limit=false;
set odps.optimizer.join.reorder.enable=true;
set odps.optimizer.column.stat.enable=true;
-- The TPC-DS Q77 query produces a Cartesian product. By default, MaxCompute does not support Cartesian products in sort-merge join operations. To allow them, run the following command:
set odps.sql.allow.cartesian=true;
```
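
The session flags listed above can also be prepended to a query programmatically, so that every submitted script carries the same settings. The following Python sketch shows one way to do this under stated assumptions: the helper name `with_session_flags` is hypothetical, the flag names and values are copied from this topic, and the script only builds SQL text without submitting it to MaxCompute.

```python
# Sketch: prepend the session-level flags from this topic to a query.
# The helper name is hypothetical; the flag names and values are copied
# from the commands in this topic.
SESSION_FLAGS = {
    # Needed only if tenant-level schema syntax is not enabled.
    "odps.namespace.schema": "true",
    # Data types such as DECIMAL and INT.
    "odps.sql.hive.compatible": "true",
    "odps.sql.type.system.odps2": "true",
    "odps.sql.decimal.odps2": "true",
    # Defaults for new projects: ORDER BY without LIMIT, join reordering.
    "odps.sql.validate.orderby.limit": "false",
    "odps.optimizer.join.reorder.enable": "true",
    "odps.optimizer.column.stat.enable": "true",
    # Cartesian products, as used by TPC-DS Q77.
    "odps.sql.allow.cartesian": "true",
}


def with_session_flags(query: str, flags: dict = SESSION_FLAGS) -> str:
    """Return a script with one SET statement per flag, followed by the query."""
    statements = [f"set {name}={value};" for name, value in flags.items()]
    body = query.strip()
    if not body.endswith(";"):
        body += ";"
    return "\n".join(statements + [body])


if __name__ == "__main__":
    print(with_session_flags(
        "select count(*) from bigdata_public_dataset.tpcds_10g.store_sales"
    ))
```

Passing a smaller `flags` dictionary produces a script with only the settings that a given query needs.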

Perform a TPC-DS query

Prerequisites

MaxCompute is activated. A MaxCompute project is created. For more information about how to create a MaxCompute project, see Create a MaxCompute project.

Supported tools and platforms

Sample query files

MaxCompute provides sample query files for datasets of different sizes. Each file contains 99 queries. For these queries, the complexity and the range of scanned data vary greatly. We recommend that you select query files based on your business requirements. This prevents additional computing costs from being generated. You can also use the tools in the TPC-DS benchmark suite to generate different versions of these queries, which vary based on parameter values. For more information, see TPC-DS official documentation.

Data size	Query file
10 GB	MaxCompute-TPCDS_10G-99-query
100 GB	MaxCompute-TPCDS_100G-99-query
1 TB	MaxCompute-TPCDS_1T-99-query
10 TB	MaxCompute-TPCDS_10T-99-query

Note

The sample query files described in this topic are referenced from the TPC-DS benchmark test. The test results of the preceding files differ from the TPC-DS benchmark results published by TPC because the test that is performed on the MaxCompute public dataset cannot meet the requirements of the TPC-DS benchmark test. For more information, see TPC.
