All Products
Search
Document Center

MaxCompute:Overview of public datasets

Last Updated:Dec 20, 2024

After MaxCompute is activated, you can use MaxCompute SQL to query and analyze data in public datasets. This helps you quickly get started with MaxCompute. This topic describes the public datasets of MaxCompute and how to use MaxCompute SQL to query and analyze data in the public datasets.

Introduction

MaxCompute provides public datasets based on data categories such as GitHub public event data, national statistics data, TPC performance test data, digital business data, life service data, and financial stock data. All data is stored in different schemas in the public project BIGDATA_PUBLIC_DATASET in MaxCompute.

Category

Description

Dataset name

Schema name

GitHub public event data

A large number of developers develop open source projects on GitHub and generate a large number of events during the development process. GitHub records the information about each event, including the event type, event details, developer, and code repository. GitHub also exposes public events, such as the events of starring repositories and submitting code.

GitHub public event dataset

github_events

National statistics data

Includes annual gross domestic product (GDP) data for countries around the world and all provinces in the Chinese mainland.

National statistics dataset

national_data

TPC performance data

TPC-DS

TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. TPC-DS enables emerging technologies, such as big data systems, to perform benchmark tests.

  • TPC-DS 10-GB performance test dataset

  • TPC-DS 100-GB performance test dataset

  • TPC-DS 1-TB performance test dataset

  • TPC-DS 10-TB performance test dataset

  • tpcds_10g

  • tpcds_100g

  • tpcds_1t

  • tpcds_10t

TPC-H

TPC-H is a decision support benchmark. It consists of a suite of business-oriented ad hoc queries and concurrent data modifications. TPC-H illustrates decision support systems that perform highly complex queries on large amounts of data and provide answers to critical business questions.

  • TPC-H 10-GB performance test dataset

  • TPC-H 100-GB performance test dataset

  • TPC-H 1-TB performance test dataset

  • TPC-H 10-TB performance test dataset

  • tpch_10g

  • tpch_100g

  • tpch_1t

  • tpch_10t

TPCx-BB

TPCx-BB is a TPC Express benchmark, which is designed to measure the performance of Hadoop-based big data systems. TPCx-BB measures the performance of both hardware and software components by executing 30 frequently performed analytical queries.

  • TPCx-BB 10-GB performance test dataset

  • TPCx-BB 100-GB performance test dataset

  • TPCx-BB 1-TB performance test dataset

  • TPCx-BB 10-TB performance test dataset

  • tpcbb_10g

  • tpcbb_100g

  • tpcbb_1t

  • tpcbb_10t

Digital business data

Includes Taobao advertising data, Taobao shopping data, and e-commerce data of Alibaba Group.

Digital business dataset

commerce

Life service data

Includes data of pre-owned houses, movies and box office, mobile phone number attribution, and administrative, urban, and rural division code information.

Life service dataset

life_service

Financial stock data

Includes stock information.

Financial stock dataset

finance

Disclaimer

  • Data in the public datasets of MaxCompute is only for product testing. The data is not periodically updated and its accuracy is not ensured. Do not use the data in the production process.

  • TPC data in MaxCompute public datasets is generated and analyzed based on the TPC benchmark test. The test results differ from the released TPC benchmark test results. This is because the test that is performed based on MaxCompute public datasets does not meet all the requirements of the TPC benchmark test.

  • The TPC performance test data provided by MaxCompute is obtained from TPC. You can also generate TPC performance test data. For more information about how to generate TPC performance test data, see TPC documentation.

Precautions

Public datasets are available to all MaxCompute users. When you use public datasets, take note of the following items:

  • All data of public datasets is stored in the BIGDATA_PUBLIC_DATASET project in MaxCompute. However, no users are added to this project as members. In this case, you must access the data across projects. When you write an SQL script, specify the project name and schema name before the table name. If you do not enable the tenant-level schema syntax, enable the session-level schema syntax before you execute a statement. Sample statements:

    -- Enable the session-level schema syntax.
    set odps.namespace.schema=true; 
    -- Query 100 data records from the dwd_github_events_odps table.
    select * from bigdata_public_dataset.github_events.dwd_github_events_odps where ds='2024-05-10' limit 100;
    Important

    You are not charged for the storage of the data in the public datasets. However, you are charged computing fees if you execute query statements. For more information, see Computing pricing.

  • You cannot find the tables in the public datasets on the Data Map page of DataWorks because cross-project access is required.

  • Public datasets are stored by schema. If you do not enable the tenant-level schema syntax, you cannot view the public datasets in DataWorks DataAnalysis. In this case, you can query the public datasets only by executing SQL statements.

Table details

The following content describes the details of tables in each schema in the public project BIGDATA_PUBLIC_DATASET.

GitHub public event data

Project name

BIGDATA_PUBLIC_DATASET

Schema name

github_events

Supported regions

China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), and China (Chengdu)

Table name and description

A large number of developers develop open source projects on GitHub and generate a large number of events during the development process. GitHub records the information about each event, including the event type, event details, developer, and code repository. GitHub also exposes public events, such as the events of starring repositories and submitting code. For more information about event types, see GitHub event types.

MaxCompute batch processes and develops large amounts of public event data that is provided by GH Archive and generates the following tables:

  • dwd_github_events_odps: the fact table that stores GitHub public event data.

  • dws_overview_by_repo_month: the aggregate table that stores monthly metric data of GitHub public events.

Note

Data in the tables is obtained from GH Archive.

Update cycle

  • dwd_github_events_odps: updated every T+1 hours.

  • dws_overview_by_repo_month: updated every T+1 days.

Schema query

-- Enable session-level schema syntax. 
set odps.namespace.schema=true; 
-- Query the schema of the dwd_github_events_odps table. If you want to query the schema of another table, replace the schema name and table name in the following statement. 
desc bigdata_public_dataset.github_events.dwd_github_events_odps;

Query example

-- Enable session-level schema syntax. 
SET odps.namespace.schema=true; 
-- Obtain the rankings of starred repositories in the previous year. Note that the situations such as unstarring are not considered in this example.
SELECT
    repo_id,
    repo_name,
    COUNT(actor_login) total
FROM
    bigdata_public_dataset.github_events.dwd_github_events_odps
WHERE
    ds>=date_add(getdate(), -365)
    AND type = 'WatchEvent'
GROUP BY
    repo_id,
    repo_name
ORDER BY
    total DESC
LIMIT 10;

For more information about data and query examples, see GitHub public event data.

National statistics data

Project name

BIGDATA_PUBLIC_DATASET

Schema name

national_data

Supported regions

China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), and China (Chengdu)

Table name and description

  • annual_gdp_by_province: stores the annual GDP data of each province in the Chinese mainland.

  • annual_gdp_by_country: stores the annual GDP data of each country around the world.

Note

Data in the annual_gdp_by_province table is obtained from National Bureau of Statistics of China and data in the annual_gdp_by_country table is obtained from International Monetary Fund (IMF).

Update cycle

Fixed data is provided and is not updated.

Schema query

-- Enable session-level schema syntax. 
set odps.namespace.schema=true; 
-- Query the schema of the annual_gdp_by_province table. If you want to query the schema of another table, replace the schema name and table name in the following statement. 
desc bigdata_public_dataset.national_data.annual_gdp_by_province;

Query example

-- Enable session-level schema syntax. 
SET odps.namespace.schema=true; 
-- Query the GDP change trend of Beijing over the past 20 years. 
SELECT
    region,
    gdp,
    year
FROM
    bigdata_public_dataset.national_data.annual_gdp_by_province
WHERE
    region='Beijing'
ORDER BY
    year ASC
LIMIT 20;

TPC-DS data

Project name

BIGDATA_PUBLIC_DATASET

Schema name

tpcds_10g, tpcds_100g, tpcds_1t, and tpcds_10t

Supported regions

China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), US (Virginia), US (Silicon Valley), UK (London), Germany (Frankfurt), UAE (Dubai), China East 2 Finance, China North 2 Finance (Preview), China North 2 Ali Gov 1, and China South 1 Finance

Table name and description

The TPC-DS model simulates the sales system of a large-scale national chain retailer. The sales system involves three sales channels: store (brick-and-mortar store), web (online store), and catalog (telephone order). Each channel uses one table to store sales records, another table to store return records, and multiple dimension tables to store information such as product information, promotion information, and user information. Table details:

  • call_center: stores the information about the customer service center.

  • catalog_page: stores the information about catalogs.

  • catalog_returns: stores the return records of products sold by using telephone orders.

  • catalog_sales: stores the sales records of products sold by using telephone orders.

  • customer: stores the customer information.

  • customer_address: stores the customer address information.

  • customer_demographics: stores the basic credit information about customers.

  • date_dim: stores the time dimension information.

  • household_demographics: stores the basic household credit information.

  • income_band: stores the revenue information.

  • inventory: stores the inventory information.

  • item: stores the product information.

  • promotion: stores the product promotion information.

  • reason: stores the reasons for sales returns.

  • ship_mode: stores the product shipment information.

  • store: stores the merchant information.

  • store_returns: stores the product return records for store sales.

  • store_sales: stores the product sales records for store sales.

  • time_dim: stores the time dimension information.

  • warehouse: stores the warehouse information.

  • web_page: stores the product web page information.

  • web_returns: stores the product return records for web sales.

  • web_sales: stores the product sales records for web sales.

  • web_site: stores the basic information about the website.

Note

The data in the tables is obtained from TPC.

Update cycle

Fixed data is provided and is not updated.

Schema query

-- Enable session-level schema syntax. 
SET odps.namespace.schema=TRUE; 
-- Query the schema of the call_center table in tpcds_10g. If you want to query a table schema from another dataset, replace the schema name and table name in the following statement. 
DESC bigdata_public_dataset.tpcds_10g.call_center;

Query example

SET odps.namespace.schema=TRUE; 
SELECT dt.d_year ,
       item.i_brand_id brand_id ,
       item.i_brand brand ,
       SUM(ss_sales_price) sum_agg
FROM bigdata_public_dataset.tpcds_10g.date_dim dt ,
     bigdata_public_dataset.tpcds_10g.store_sales ,
     bigdata_public_dataset.tpcds_10g.item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
  AND store_sales.ss_item_sk = item.i_item_sk
  AND item.i_manufact_id = 190
  AND dt.d_moy = 12
GROUP BY dt.d_year ,
         item.i_brand ,
         item.i_brand_id
ORDER BY dt.d_year,
         sum_agg DESC,
         brand_id LIMIT 100;

For more query sample files of different data specifications, see TPC-DS data.

For more information about data, see TPC Benchmark DS Standard Specification.

TPC-H data

Project name

BIGDATA_PUBLIC_DATASET

Schema name

tpch_10g, tpch_100g, tpch_1t, and tpch_10t

Supported regions

China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), US (Virginia), US (Silicon Valley), UK (London), Germany (Frankfurt), UAE (Dubai), China East 2 Finance, China North 2 Finance (Preview), China North 2 Ali Gov 1, and China South 1 Finance

Table name and description

TPC-H is a benchmark that is used to evaluate online analysis and processing. TPC-H data simulates business behavior between a provider and a buyer. TPC-H data contains information such as order information, product information, and user information. Table details:

  • customer: stores the consumer information.

  • lineitem: stores the online product information.

  • nation: stores the country information.

  • orders: stores the order information.

  • part: stores the part information.

  • partsupp: stores the supplier part information.

  • region: stores the region information.

  • supplier: stores the supplier information.

Note

The data in the tables is obtained from TPC.

Update cycle

Fixed data is provided and is not updated.

Schema query

-- Enable session-level schema syntax. 
SET odps.namespace.schema=TRUE; 
-- Query the schema of the lineitem table in tpch_10g. If you want to query a table schema from another dataset, replace the schema name and table name in the following statement. 
DESC bigdata_public_dataset.tpch_10g.lineitem;

Query example

SET odps.namespace.schema=TRUE; 
SET odps.sql.validate.orderby.limit=FALSE;
SET odps.sql.hive.compatible=TRUE;
SELECT l_returnflag,
       l_linestatus,
       sum(l_quantity) AS sum_qty,
       sum(l_extendedprice) AS sum_base_price,
       sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
       sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
       avg(l_quantity) AS avg_qty,
       avg(l_extendedprice) AS avg_price,
       avg(l_discount) AS avg_disc,
       count(*) AS count_order
FROM bigdata_public_dataset.tpch_10g.lineitem
WHERE l_shipdate <= date'1998-12-01' - interval '90' DAY
GROUP BY l_returnflag,
         l_linestatus
ORDER BY l_returnflag,
         l_linestatus;

For more information about data and sample queries, see TPC Benchmark H Standard Specification.

TPCx-BB data

Project name

BIGDATA_PUBLIC_DATASET

Schema name

tpcxbb_10g, tpcxbb_100g, tpcxbb_1t, and tpcxbb_10t

Supported regions

China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), US (Virginia), US (Silicon Valley), UK (London), Germany (Frankfurt), UAE (Dubai), China East 2 Finance, China North 2 Finance (Preview), China North 2 Ali Gov 1, and China South 1 Finance

Table name and description

TPCx-BB is a big data benchmark test tool that simulates an online retail scenario. TPCx-BB data includes sales records, return records, product information, and promotion information. Table details:

  • customer: stores the customer information.

  • customer_address: stores the customer address information.

  • customer_demographics: stores the basic credit information about customers.

  • date_dim: stores the time dimension information.

  • household_demographics: stores the basic household credit information.

  • income_band: stores the revenue information.

  • inventory: stores the inventory information.

  • item: stores the product information.

  • item_marketprices: stores the product price information of competitors.

  • product_reviews: stores the product review information.

  • promotion: stores the product promotion information.

  • reason: stores the reasons for sales returns.

  • ship_mode: stores the product shipment information.

  • store: stores the store information.

  • store_returns: stores the product return records for store sales.

  • store_sales: stores the product sales records for store sales.

  • time_dim: stores the time dimension information.

  • warehouse: stores the warehouse information.

  • web_clickstreams: stores the web page click information.

  • web_page: stores the product web page information.

  • web_returns: stores the product return records for web sales.

  • web_sales: stores the product sales records for web sales.

  • web_site: stores the product website information.

Note

The data in the tables is obtained from TPC.

Update cycle

Fixed data is provided and is not updated.

Schema query

-- Enable session-level schema syntax. 
SET odps.namespace.schema=TRUE; 
-- Query the schema of the web_sales table in tpcxbb_10g. If you want to query a table schema from another dataset, replace the schema name and table name in the following statement. 
DESC bigdata_public_dataset.tpcxbb_10g.web_sales;

Query example

SET odps.namespace.schema=TRUE; 
select * FROM bigdata_public_dataset.tpcxbb_10g.web_sales limit 100;

For more information about data and query examples, see TPCx-BB Standard Specification.

Digital business data

Project name

BIGDATA_PUBLIC_DATASET

Schema name

commerce

Supported regions

China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), and China (Chengdu)

Table name and description

  • adv_raw_sample: stores the raw data that is randomly sampled from the logs for advertisement clicks of more than 1 million users within 8 days.

  • adv_ad_feature: stores the basic information about some advertisements in the raw_sample table.

  • user_profile: stores the basic information about all users in the raw_sample table.

  • behavior_log: stores the shopping behavior of all users in the raw_sample table within 22 days. The behavior includes browsing, adding to the shopping cart, adding to favorites, and purchasing.

Note

The data in the tables is obtained from Tianchi Lab - Ad Display/Click Data on Taobao.com.

Update cycle

Fixed data is provided and is no longer incrementally updated.

Schema query

-- Enable session-level schema syntax. 
SET odps.namespace.schema=TRUE; 
-- Query the schema of the behavior_log table. If you want to query the schema of another table, replace the table name in the following statement. 
DESC bigdata_public_dataset.commerce.behavior_log;

Query example

-- Enable session-level schema syntax. 
SET odps.namespace.schema=TRUE; 
-- Query the IDs of the top 3 product categories in the behavior_log table within 22 days based on product sales. 
SELECT cate,
       count(btag) sales
FROM behavior_log
WHERE btag='buy'
GROUP BY cate
ORDER BY sales DESC LIMIT 3;

Life service data

Project name

BIGDATA_PUBLIC_DATASET

Schema name

life_service

Supported regions

China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), and China (Chengdu)

Table name and description

  • movie_basic_info: stores the basic information about movies.

  • movie_box: stores the basic information about box office.

  • areacode_basic_info_2020: stores the basic information about administrative, urban, and rural division codes in 2020.

  • phoneno_basic_info_2020: stores the basic information about mobile phone number attribution in 2020.

Update cycle

  • movie_basic_info and movie_box: Data in date-specific partitions is provided and is no longer incrementally updated.

  • areacode_basic_info_2020 and phoneno_basic_info_2020: Fixed data is provided and is no longer incrementally updated.

Schema query

-- Enable session-level schema syntax. 
SET odps.namespace.schema=TRUE; 
-- Query the schema of the movie_box table. If you want to query the schema of another table, replace the table name in the following statement. 
DESC bigdata_public_dataset.life_service.movie_box;

Query example

-- Enable session-level schema syntax. 
SET odps.namespace.schema=TRUE;
-- Query the names of the top 10 most popular movies on January 14, 2017 based on the box office. 
SELECT moviename
FROM bigdata_public_dataset.life_service.movie_box
WHERE ds ='20170114'
ORDER BY rank ASC LIMIT 10;

Financial stock data

Project name

BIGDATA_PUBLIC_DATASET

Schema name

finance

Supported regions

China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), and China (Chengdu)

Table name and description

  • ods_enterprise_share_basic: stores the basic stock information.

  • ods_enterprise_share_quarter_cashflow: stores the quarterly cash flow information.

  • ods_enterprise_share_quarter_growth: stores the quarterly business growth data.

  • ods_enterprise_share_quarter_operation: stores the quarterly financial turnover data.

  • ods_enterprise_share_quarter_profit: stores the quarterly profit information.

  • ods_enterprise_share_quarter_report: stores the quarterly report information.

  • ods_enterprise_share_trade_h: stores the stock price information.

Update cycle

Data in date-specific partitions is provided and is no longer incrementally updated.

Schema query

-- Enable session-level schema syntax. 
SET odps.namespace.schema=TRUE; 
-- Query the schema of the ods_enterprise_share_basic table. If you want to query the schema of another table, replace the table name in the following statement. 
DESC bigdata_public_dataset.finance.ods_enterprise_share_basic;

Query example

-- Enable session-level schema syntax. 
SET odps.namespace.schema=TRUE;
-- Query the basic stock information on January 14, 2017. 
SELECT *
FROM bigdata_public_dataset.finance.ods_enterprise_share_basic
WHERE ds ='20170114' LIMIT 10;

Use public datasets

Prerequisites

MaxCompute is activated, and a MaxCompute project is created. For more information about how to create a MaxCompute project, see Create a MaxCompute project.

Supported tools or platforms

Procedure (use a DataWorks ODPS SQL node)

  1. Log on to the DataWorks console and create a workspace. For more information about how to create a workspace, see Create a workspace.

  2. Associate a MaxCompute data source with the workspace. For more information, see Add a data source or register a cluster to a workspace.

  3. Create an ODPS SQL node and enter the following SQL statements. For more information, see Develop a MaxCompute SQL task.

    -- Query the GDP change trend of each province in the Chinese mainland over the past 20 years. 
    SET odps.namespace.schema=true; 
    SET odps.sql.validate.orderby.limit = false;
    SELECT
        region,
        gdp,
        year
    FROM
        bigdata_public_dataset.national_data.annual_gdp_by_province
    ORDER BY
        year ASC;
  4. Click image.png, and view the output result.image.png

Note

The MAXCOMPUTE_PUBLIC_DATA project described in Public dataset reference is no longer maintained or updated. You can continue to use the public datasets of the project based on your business requirements.

References

Introduction to MaxCompute data export methods:

  • Download: Supports downloading data or the execution results of a specified Instance to the local environment.

  • UNLOAD: Supports exporting data to external storages such as Object Storage Service (OSS) and Hologres.