MaxCompute: TPC-DS data

Last Updated: Nov 15, 2024

MaxCompute uses the TPC-DS official tool to generate 10-GB, 100-GB, 1-TB, and 10-TB TPC-DS datasets. You can use the TPC-DS datasets for product testing. This topic describes the basic information about TPC-DS datasets in MaxCompute public datasets and how to use MaxCompute to query data from the TPC-DS datasets.

Description

TPC-DS, short for TPC Benchmark™ DS, is a standard benchmark formulated by the Transaction Processing Performance Council (TPC), the most well-known organization that defines benchmarks for measuring the performance of data management systems. TPC also publishes the measurement results of the benchmark.

MaxCompute uses the TPC-DS official tool to generate 10-GB, 100-GB, 1-TB, and 10-TB TPC-DS datasets. The datasets are stored in different schemas of the MaxCompute public project BIGDATA_PUBLIC_DATASET. For more information about schemas, see Schema-related operations. After you activate MaxCompute and create a project, you can query TPC-DS tables by performing cross-project access. The following table lists the data sizes, schema names, and table names.

Data size   Project name             Schema name
10 GB       BIGDATA_PUBLIC_DATASET   TPCDS_10G
100 GB      BIGDATA_PUBLIC_DATASET   TPCDS_100G
1 TB        BIGDATA_PUBLIC_DATASET   TPCDS_1T
10 TB       BIGDATA_PUBLIC_DATASET   TPCDS_10T

Table names:

  • call_center
  • catalog_page
  • catalog_returns
  • catalog_sales
  • customer
  • customer_address
  • customer_demographics
  • date_dim
  • household_demographics
  • income_band
  • inventory
  • item
  • promotion
  • reason
  • ship_mode
  • store
  • store_returns
  • store_sales
  • tab_reducenum
  • tab_reducenum_100
  • time_dim
  • warehouse
  • web_page
  • web_returns
  • web_sales
  • web_site

Note
  • Table data is referenced from TPC-DS. For more information, see TPC.

  • For more information about the table schema and content, see TPC BENCHMARK™ DS.
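As a minimal sketch of how the schema name selects the data size, the following statements count the rows of the same small dimension table in two of the schemas listed above. The choice of call_center and of these two schemas is only illustrative, and the first statement is needed only if the tenant-level schema syntax is not enabled for your account.

```sql
-- Enable three-part names (project.schema.table) for this session.
set odps.namespace.schema=true;
-- Row count of call_center in the 10 GB dataset.
select count(*) as row_cnt from bigdata_public_dataset.tpcds_10g.call_center;
-- Same table in the 100 GB dataset: only the schema name changes.
select count(*) as row_cnt from bigdata_public_dataset.tpcds_100g.call_center;
```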

Declaration

  • The TPC-DS data is generated and analyzed in the MaxCompute public dataset based on the TPC-DS benchmark test. The test results of data in the MaxCompute public dataset differ from the TPC-DS benchmark results published by TPC because the test that is performed on the MaxCompute public dataset cannot meet the requirements of the TPC-DS benchmark test.

  • The TPC-DS datasets provided by MaxCompute can be used only for product testing. The data is not periodically updated. Therefore, we recommend that you do not use the TPC-DS datasets in production environments.

  • The TPC-DS data provided by MaxCompute is obtained from TPC. You can also generate your own TPC-DS data. For more information about how to generate TPC-DS test data, see TPC-DS documentation.

Supported regions

Region                     Region ID
China (Hangzhou)           cn-hangzhou
China (Shanghai)           cn-shanghai
China (Beijing)            cn-beijing
China (Zhangjiakou)        cn-zhangjiakou
China (Ulanqab)            cn-wulanchabu
China (Shenzhen)           cn-shenzhen
China (Chengdu)            cn-chengdu
China (Hong Kong)          cn-hongkong
Singapore                  ap-southeast-1
Japan (Tokyo)              ap-northeast-1
Malaysia (Kuala Lumpur)    ap-southeast-3
Indonesia (Jakarta)        ap-southeast-5
US (Silicon Valley)        us-west-1
US (Virginia)              us-east-1
UK (London)                eu-west-1
Germany (Frankfurt)        eu-central-1
UAE (Dubai)                me-east-1
China East 2 Finance       cn-shanghai-finance-1
China North 2 Finance      cn-beijing-finance-1
China South 1 Finance      cn-shenzhen-finance-1
China North 2 Ali Gov 1    cn-north-2-gov-1

Precautions

Public datasets are available to all MaxCompute users. When you use public datasets, take note of the following items:

  • All data is stored in the public MaxCompute project BIGDATA_PUBLIC_DATASET. No MaxCompute users belong to this project. Therefore, you need to access the data across projects. When you write an SQL script, you must specify the project name and schema name before the table name. If you do not enable the tenant-level schema syntax, you need to enable the session-level schema syntax before you execute a statement. Sample statements:

    -- Enable the session-level schema syntax.
    set odps.namespace.schema=true;
    -- This example queries the tpcds_10g dataset. To query another dataset, replace the schema name in the following statement with the name of the schema in which that dataset is stored.
    select * from bigdata_public_dataset.tpcds_10g.store_sales limit 100;

    Note

    You do not need to pay for the storage of the data in the public datasets. However, you are charged computing fees that are generated when you execute statements. For more information about billing rules, see Computing pricing.

  • You cannot find the tables in the public datasets on the DataMap page of DataWorks because cross-project access is required.

  • TPC-DS datasets are stored in projects that support storage by schema. If you do not enable the tenant-level schema syntax, you cannot view the TPC-DS datasets in the public datasets provided by DataAnalysis of DataWorks, but you can query the TPC-DS datasets by using the SQL statements provided by MaxCompute.

  • Data is accessed across projects. To ensure that SQL statements are successfully executed, you need to run the following commands:

    -- The table schemas of TPC-DS datasets use data types such as DECIMAL and INT, which require the following settings:
    set odps.sql.hive.compatible=true;
    set odps.sql.type.system.odps2=true;
    set odps.sql.decimal.odps2=true;
    -- The following flag values match the defaults for new projects and may differ from those of existing projects. Flag values for existing projects are left unchanged so that existing queries are not affected.
    -- We recommend that you use the setproject commands to change the flag values to the default values. Otherwise, an error may be reported when an SQL statement contains an ORDER BY clause without the LIMIT keyword, and the TPC-DS Q72 query may run slowly because of an inefficient join order.
    set odps.sql.validate.orderby.limit=false;
    set odps.optimizer.join.reorder.enable=true;
    set odps.optimizer.column.stat.enable=true;
    -- The TPC-DS Q77 query produces Cartesian products. By default, MaxCompute does not support Cartesian products in sort-merge join operations. To allow them, run the following command:
    set odps.sql.allow.cartesian=true;
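
The precautions above can be combined into one session script. The following is a minimal sketch, not one of the 99 sample queries: it applies the session-level settings and then totals store revenue by year in the 10 GB dataset. The column names (ss_sold_date_sk, ss_net_paid, d_date_sk, d_year) come from the TPC-DS specification and are assumed to match the public tables. The Cartesian-product flag is left out because this query uses only an equi-join.

```sql
-- Session-level settings taken from the precautions above.
set odps.namespace.schema=true;
set odps.sql.hive.compatible=true;
set odps.sql.type.system.odps2=true;
set odps.sql.decimal.odps2=true;
set odps.sql.validate.orderby.limit=false;
set odps.optimizer.join.reorder.enable=true;
set odps.optimizer.column.stat.enable=true;

-- Illustrative query: yearly net revenue of store sales.
-- Every table name carries the project and schema prefix.
select d.d_year,
       sum(ss.ss_net_paid) as total_net_paid
from bigdata_public_dataset.tpcds_10g.store_sales ss
join bigdata_public_dataset.tpcds_10g.date_dim d
  on ss.ss_sold_date_sk = d.d_date_sk
group by d.d_year
order by d.d_year
limit 100;
```

The LIMIT clause is kept so that the statement should also run in projects where odps.sql.validate.orderby.limit is still set to true.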

Perform a TPC-DS query

Prerequisites

MaxCompute is activated. A MaxCompute project is created. For more information about how to create a MaxCompute project, see Create a MaxCompute project.

Supported tools and platforms

Sample query files

MaxCompute provides sample query files for datasets of different sizes. Each file contains 99 queries. The complexity of these queries and the amount of data that they scan vary greatly. To avoid unnecessary computing costs, we recommend that you select query files based on your business requirements. You can also use the tools in the TPC-DS benchmark suite to generate different versions of these queries, which vary based on parameter values. For more information, see TPC-DS official documentation.

Data size   Query file
10 GB       MaxCompute-TPCDS_10G-99-query
100 GB      MaxCompute-TPCDS_100G-99-query
1 TB        MaxCompute-TPCDS_1T-99-query
10 TB       MaxCompute-TPCDS_10T-99-query

Note

The sample query files described in this topic are referenced from the TPC-DS benchmark test. The test results of the preceding files differ from the TPC-DS benchmark results published by TPC because the test that is performed on the MaxCompute public dataset cannot meet the requirements of the TPC-DS benchmark test. For more information, see TPC.
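
Because computing fees depend on the statements you run, one reasonable approach is to try a query on the 10 GB schema first and switch to a larger schema only after it behaves as expected. The following sketch is a simplified query written in the general style of the TPC-DS workload. It is not taken from the sample query files, and its column names are assumed from the TPC-DS specification.

```sql
set odps.namespace.schema=true;
set odps.sql.hive.compatible=true;
set odps.sql.type.system.odps2=true;
set odps.sql.decimal.odps2=true;

-- Trial run on the smallest dataset. To scale up, replace tpcds_10g
-- with tpcds_100g, tpcds_1t, or tpcds_10t in all three table names.
select d.d_year,
       i.i_brand,
       sum(ss.ss_ext_sales_price) as brand_sales
from bigdata_public_dataset.tpcds_10g.store_sales ss
join bigdata_public_dataset.tpcds_10g.date_dim d
  on ss.ss_sold_date_sk = d.d_date_sk
join bigdata_public_dataset.tpcds_10g.item i
  on ss.ss_item_sk = i.i_item_sk
where d.d_moy = 11
group by d.d_year, i.i_brand
order by d.d_year, brand_sales desc
limit 100;
```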