After MaxCompute is activated, you can use MaxCompute SQL to query and analyze data in public datasets. This helps you quickly get started with MaxCompute. This topic describes the public datasets of MaxCompute and how to use MaxCompute SQL to query and analyze data in the public datasets.
Introduction
MaxCompute provides public datasets based on data categories such as GitHub public event data, national statistics data, TPC performance test data, digital business data, life service data, and financial stock data. All data is stored in different schemas in the public project BIGDATA_PUBLIC_DATASET in MaxCompute.
Category | Description | Dataset name | Schema name | |
GitHub public event data | A large number of developers develop open source projects on GitHub and generate a large number of events during the development process. GitHub records the information about each event, including the event type, event details, developer, and code repository. GitHub also exposes public events, such as the events of starring repositories and submitting code. | GitHub public event dataset | github_events | |
National statistics data | Includes annual gross domestic product (GDP) data for countries around the world and all provinces in the Chinese mainland. | National statistics dataset | national_data | |
TPC performance data | TPC-DS | TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. TPC-DS enables emerging technologies, such as big data systems, to perform benchmark tests. |
|
|
TPC-H | TPC-H is a decision support benchmark. It consists of a suite of business-oriented ad hoc queries and concurrent data modifications. TPC-H illustrates decision support systems that perform highly complex queries on large amounts of data and provide answers to critical business questions. |
|
| |
TPCx-BB | TPCx-BB is a TPC Express benchmark, which is designed to measure the performance of Hadoop-based big data systems. TPCx-BB measures the performance of both hardware and software components by executing 30 frequently performed analytical queries. |
|
| |
Digital business data | Includes Taobao advertising data, Taobao shopping data, and e-commerce data of Alibaba Group. | Digital business dataset | commerce | |
Life service data | Includes data of pre-owned houses, movies and box office, mobile phone number attribution, and administrative, urban, and rural division code information. | Life service dataset | life_service | |
Financial stock data | Includes stock information. | Financial stock dataset | finance |
Disclaimer
Data in the public datasets of MaxCompute is only for product testing. The data is not periodically updated and its accuracy is not ensured. Do not use the data in the production process.
TPC data in MaxCompute public datasets is generated and analyzed based on the TPC benchmark test. The test results differ from the released TPC benchmark test results. This is because the test that is performed based on MaxCompute public datasets does not meet all the requirements of the TPC benchmark test.
The TPC performance test data provided by MaxCompute is obtained from TPC. You can also generate TPC performance test data. For more information about how to generate TPC performance test data, see TPC documentation.
Precautions
Public datasets are available to all MaxCompute users. When you use public datasets, take note of the following items:
All data of public datasets is stored in the
BIGDATA_PUBLIC_DATASET
project in MaxCompute. However, no users are added to this project as members. In this case, you must access the data across projects. When you write an SQL script, specify the project name and schema name before the table name. If you do not enable the tenant-level schema syntax, enable the session-level schema syntax before you execute a statement. Sample statements:-- Enable the session-level schema syntax. set odps.namespace.schema=true; -- Query 100 data records from the dwd_github_events_odps table. select * from bigdata_public_dataset.github_events.dwd_github_events_odps where ds='2024-05-10' limit 100;
ImportantYou are not charged for the storage of the data in the public datasets. However, you are charged computing fees if you execute query statements. For more information, see Computing pricing.
You cannot find the tables in the public datasets on the Data Map page of DataWorks because cross-project access is required.
Public datasets are stored by schema. If you do not enable the tenant-level schema syntax, you cannot view the public datasets in DataWorks DataAnalysis. In this case, you can query the public datasets only by executing SQL statements.
Table details
The following content describes the details of tables in each schema in the public project BIGDATA_PUBLIC_DATASET.
GitHub public event data
Project name | BIGDATA_PUBLIC_DATASET |
Schema name | github_events |
Supported regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), and China (Chengdu) |
Table name and description | A large number of developers develop open source projects on GitHub and generate a large number of events during the development process. GitHub records the information about each event, including the event type, event details, developer, and code repository. GitHub also exposes public events, such as the events of starring repositories and submitting code. For more information about event types, see GitHub event types. MaxCompute batch processes and develops large amounts of public event data that is provided by GH Archive and generates the following tables:
Note Data in the tables is obtained from GH Archive. |
Update cycle |
|
Schema query |
|
Query example |
|
For more information about data and query examples, see GitHub public event data. |
National statistics data
Project name | BIGDATA_PUBLIC_DATASET |
Schema name | national_data |
Supported regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), and China (Chengdu) |
Table name and description |
Note Data in the annual_gdp_by_province table is obtained from National Bureau of Statistics of China and data in the annual_gdp_by_country table is obtained from International Monetary Fund (IMF). |
Update cycle | Fixed data is provided and is not updated. |
Schema query |
|
Query example |
|
TPC-DS data
Project name | BIGDATA_PUBLIC_DATASET |
Schema name | tpcds_10g, tpcds_100g, tpcds_1t, and tpcds_10t |
Supported regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), US (Virginia), US (Silicon Valley), UK (London), Germany (Frankfurt), UAE (Dubai), China East 2 Finance, China North 2 Finance (Preview), China North 2 Ali Gov 1, and China South 1 Finance |
Table name and description | The TPC-DS model simulates the sales system of a large-scale national chain retailer. The sales system involves three sales channels: store (brick-and-mortar store), web (online store), and catalog (telephone order). Each channel uses one table to store sales records, another table to store return records, and multiple dimension tables to store information such as product information, promotion information, and user information. Table details:
Note The data in the tables is obtained from TPC. |
Update cycle | Fixed data is provided and is not updated. |
Schema query |
|
Query example |
|
For more query sample files of different data specifications, see TPC-DS data. For more information about data, see TPC Benchmark DS Standard Specification. |
TPC-H data
Project name | BIGDATA_PUBLIC_DATASET |
Schema name | tpch_10g, tpch_100g, tpch_1t, and tpch_10t |
Supported regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), US (Virginia), US (Silicon Valley), UK (London), Germany (Frankfurt), UAE (Dubai), China East 2 Finance, China North 2 Finance (Preview), China North 2 Ali Gov 1, and China South 1 Finance |
Table name and description | TPC-H is a benchmark that is used to evaluate online analysis and processing. TPC-H data simulates business behavior between a provider and a buyer. TPC-H data contains information such as order information, product information, and user information. Table details:
Note The data in the tables is obtained from TPC. |
Update cycle | Fixed data is provided and is not updated. |
Schema query |
|
Query example |
|
For more information about data and sample queries, see TPC Benchmark H Standard Specification. |
TPCx-BB data
Project name | BIGDATA_PUBLIC_DATASET |
Schema name | tpcxbb_10g, tpcxbb_100g, tpcxbb_1t, and tpcxbb_10t |
Supported regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), US (Virginia), US (Silicon Valley), UK (London), Germany (Frankfurt), UAE (Dubai), China East 2 Finance, China North 2 Finance (Preview), China North 2 Ali Gov 1, and China South 1 Finance |
Table name and description | TPCx-BB is a big data benchmark test tool that simulates an online retail scenario. TPCx-BB data includes sales records, return records, product information, and promotion information. Table details:
Note The data in the tables is obtained from TPC. |
Update cycle | Fixed data is provided and is not updated. |
Schema query |
|
Query example |
|
For more information about data and query examples, see TPCx-BB Standard Specification. |
Digital business data
Project name | BIGDATA_PUBLIC_DATASET |
Schema name | commerce |
Supported regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), and China (Chengdu) |
Table name and description |
Note The data in the tables is obtained from Tianchi Lab - Ad Display/Click Data on Taobao.com. |
Update cycle | Fixed data is provided and is no longer incrementally updated. |
Schema query |
|
Query example |
|
Life service data
Project name | BIGDATA_PUBLIC_DATASET |
Schema name | life_service |
Supported regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), and China (Chengdu) |
Table name and description |
|
Update cycle |
|
Schema query |
|
Query example |
|
Financial stock data
Project name | BIGDATA_PUBLIC_DATASET |
Schema name | finance |
Supported regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), and China (Chengdu) |
Table name and description |
|
Update cycle | Data in date-specific partitions is provided and is no longer incrementally updated. |
Schema query |
|
Query example |
|
Use public datasets
Prerequisites
MaxCompute is activated, and a MaxCompute project is created. For more information about how to create a MaxCompute project, see Create a MaxCompute project.
Supported tools or platforms
Procedure (use a DataWorks ODPS SQL node)
Log on to the DataWorks console and create a workspace. For more information about how to create a workspace, see Create a workspace.
Associate a MaxCompute compute engine with the workspace. For more information, see Add a data source or register a cluster to a workspace.
Create an ODPS SQL node and enter the following SQL statements. For more information, see Develop a MaxCompute SQL task.
-- Query the GDP change trend of each province in the Chinese mainland over the past 20 years. SET odps.namespace.schema=true; SET odps.sql.validate.orderby.limit = false; SELECT region, gdp, year FROM bigdata_public_dataset.national_data.annual_gdp_by_province ORDER BY year ASC;
Click , and view the output result.
The MAXCOMPUTE_PUBLIC_DATA project described in Public dataset reference is no longer maintained or updated. You can continue to use the public datasets of the project based on your business requirements.