This topic describes the GitHub public event data in the MaxCompute public dataset and how to query GitHub public event data in MaxCompute. This topic also provides query examples and data analysis results.
Overview
A large number of developers develop open source projects on GitHub and generate a large number of events during the development process. GitHub records the information about each event, including the event type, event details, the developer, and the code repository. GitHub also exposes public events, such as the events of starring repositories and submitting code. For more information about event types, see GitHub event types. The GH Archive project summarizes GitHub public events on an hourly basis and allows developers to access the events. For more information about the GH Archive project, see GH Archive.
MaxCompute batch processes and develops large amounts of public event data that is provided by GH Archive and generates two tables. The following table describes the information about the two tables.
Table name | Description | Update cycle |
dwd_github_events_odps | Fact table of GitHub public event data | T+1 hours |
dws_overview_by_repo_month | Aggregate table of monthly metrics of GitHub public event data | T+1 days |
The dwd_github_events_odps
and dws_overview_by_repo_month
tables are stored in the github_events
schema of the MaxCompute public project BIGDATA_PUBLIC_DATASET
. For more information about schemas, see Schema-related operations. If you have activated MaxCompute, you can access the tables across projects.
Table details
dwd_github_events_odps
The fact table stores the main information about each GitHub public event and is updated every T+1
hours. The following table describes the fields in the fact table.
Field name | Data type | Description |
id | BIGINT | The event ID. |
actor_id | BIGINT | The actor ID. |
actor_login | STRING | The username of the actor. |
repo_id | BIGINT | The repository ID. |
repo_name | STRING | The repository name in the owner/Repository_name format. |
org_id | BIGINT | The ID of the organization to which the repository belongs. |
org_login | STRING | The name of the organization to which the repository belongs. |
type | STRING | The type of the event. For more information about event types, see GitHub event types. |
created_at | DATETIME | The time when the event occurred. |
action | STRING | The event action. |
iss_or_pr_id | BIGINT | The ID of the issue or pull_request. |
number | BIGINT | The sequence number of the issue or pull_request. |
comment_id | BIGINT | The comment ID. |
commit_id | STRING | The commit ID. |
member_id | BIGINT | The member ID. |
rev_or_push_or_rel_id | BIGINT | The ID of the review, push, or release. |
ref | STRING | The name of the resource that you create or delete. |
ref_type | STRING | The type of the resource that you create or delete. |
state | STRING | The status of the issue, pull_request, or pull_request_review. |
author_association | STRING | The relationship between the actor and the repository. |
language | STRING | The programming language that is used to merge request code. |
merged | BOOLEAN | Specifies whether merge is allowed. |
merged_at | DATETIME | The time when code is merged. |
additions | BIGINT | The number of rows that are added to the code. |
deletions | BIGINT | The number of rows that are deleted from the code. |
changed_files | BIGINT | The number of files that are changed by the pull request. |
push_size | BIGINT | The push size. |
push_distinct_size | BIGINT | The different push sizes. |
hr | STRING | The hour in which the event occurred. For example, if the event occurred at |
month | STRING | The month in which the event occurred. For example, if the event occurred in October, 2015, the value of this parameter is |
year | STRING | The year in which the event occurred. For example, if the event occurred in 2015, the value of this parameter is |
ds | STRING | The date on which the event occurred. The value of this parameter is in the |
dws_overview_by_repo_month
The aggregate table stores monthly event metrics for a project and is updated every T+1
days. The following table describes the fields in the aggregate table.
Field name | Data type | Description |
repo_id | BIGINT | The repository ID. |
repo_name | STRING | The repository name in the owner/Repository_name format. |
stars | BIGINT | The number of stars in a repository. |
commits | BIGINT | The number of commits in a repository. |
pushes | BIGINT | The number of pushes in a repository. |
total_prs | BIGINT | The number of pull_requests in a repository. |
pr_creators | BIGINT | The number of creators that submit pull_requests in a repository. |
pr_reviews | BIGINT | The number of pr_reviews in a repository. |
pr_reviewers | BIGINT | The number of pr_reviewers in a repository. |
total_issues | BIGINT | The number of issues in a repository. |
forks | BIGINT | The number of forks in a repository. |
month | STRING | The month in which the event occurred. For example, if the event occurred in October 2015, the value of this parameter is |
For more information about how the data is generated, see Integration of offline and real-time processing of data in GitHub public event datasets.
MaxCompute provides the db_repos
table that stores the IDs and names of open source database projects, and the programming_language_repos
table that stores the IDs and names of projects developed in open source programming languages. You can use the tables to analyze specific objects. The project information is from ossinsight.
Supported regions
Region | Region ID |
China (Hangzhou) | cn-hangzhou |
China (Shanghai) | cn-shanghai |
China (Beijing) | cn-beijing |
China (Zhangjiakou) | cn-zhangjiakou |
China (Ulanqab) | cn-wulanchabu |
China (Shenzhen) | cn-shenzhen |
China (Chengdu) | cn-chengdu |
Disclaimer
You can use the GitHub public event data that is provided by MaxCompute only for product testing. Data accuracy is not ensured. Do not use the public event data for actual production.
Precautions
Public datasets are available to all MaxCompute users. When you use public datasets, take note of the following items:
All data of public datasets is stored in the
BIGDATA_PUBLIC_DATASET
project in MaxCompute. However, no users are added to this project as members. In this case, you must access the data across projects. When you write an SQL script, specify the project name and schema name before the table name. If you do not enable the tenant-level schema syntax, enable the session-level schema syntax before you execute a statement. Sample statements:-- Enable the session-level schema syntax. set odps.namespace.schema=true; -- Query 100 data records from the dwd_github_events_odps table. select * from bigdata_public_dataset.github_events.dwd_github_events_odps where ds='2024-05-10' limit 100;
ImportantYou are not charged for the storage of the data in the public datasets. However, you are charged computing fees if you execute query statements. For more information, see Computing pricing.
You cannot find the tables in the public datasets on the Data Map page of DataWorks because cross-project access is required.
Public datasets are stored by schema. If you do not enable the tenant-level schema syntax, you cannot view the public datasets in DataWorks DataAnalysis. In this case, you can query the public datasets only by executing SQL statements.
Explore GitHub public event data in MaxCompute
Prerequisites
MaxCompute is activated. A MaxCompute project is created. For more information about how to create a MaxCompute project, see Create a MaxCompute project.
Supported tools and platforms
Appendix: Query examples and analysis results
Explore open source databases
Query the top 10 most popular open source database projects in the previous five years (from 2018 to 2022)
Sample query statements:
SET odps.namespace.schema = TRUE; SELECT dws.repo_id AS repo_id, repos.name AS repo_name, SUM(dws.stars) AS stars FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws JOIN bigdata_public_dataset.github_events.db_repos repos ON dws.repo_id = repos.id WHERE MONTH >= '2015-01' AND MONTH <='2022-12' GROUP BY dws.repo_id, repos.name ORDER BY stars DESC LIMIT 10;
Analysis result: The most popular open source database projects in the previous five years are
elasticsearch
,redis
, andprometheus
in descending order.
Query the changes in the ranking of top 10 most popular open source database projects in the previous five years ( from 2018 to 2022)
Sample query statements:
SET odps.namespace.schema = TRUE; SET odps.sql.validate.orderby.limit=FALSE; WITH tmp as (SELECT dws.repo_id AS repo_id, repos.name AS repo_name, SUM(dws.stars) AS stars, SUBSTR(MONTH,1,4) AS YEAR, row_number() over (partition BY SUBSTR(MONTH,1,4) ORDER BY SUM(dws.stars) DESC) AS ranknum FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws JOIN bigdata_public_dataset.github_events.db_repos repos ON dws.repo_id = repos.id WHERE MONTH>='2018-01' GROUP BY dws.repo_id, repos.name,SUBSTR(MONTH,1,4)) SELECT repo_id, repo_name, stars, ranknum, YEAR FROM tmp WHERE YEAR<=2022 AND ranknum<=10 ORDER BY YEAR ASC,ranknum ASC;
Analysis result: In the previous five years, the top 10 most popular open source database projects frequently change. The
clickhouse
project has the fastest rise in ranking, from the 10th place in 2018 to the first place in 2021. In 2022, theredis
project surpassed the clickhouse project. Thetaosdata/TDengine
project ranked first in 2019 but dropped to the ninth place in 2020.
Query the monthly growth trend of the number of stars that were received by each of the top 10 most popular open source database projects in the previous five years (from 2018 to 2022)
Sample query statements:
set odps.namespace.schema = true; set odps.sql.validate.orderby.limit=false; WITH top_10_repos AS ( SELECT dws.repo_id as repo_id, repos.name as repo_name, SUM(dws.stars) as stars FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws join bigdata_public_dataset.github_events.db_repos repos on dws.repo_id = repos.id where month >= '2018-01' AND MONTH <='2022-12' group by dws.repo_id, repos.name ORDER BY stars DESC LIMIT 10 ), tmp AS ( SELECT month, repo_id, stars, SUM(stars) OVER (partition by repo_id ORDER BY month ASC ) AS total_stars FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month where month >= '2015-01' and stars is not null and repo_id in (select repo_id from top_10_repos) group by repo_id,month,stars ORDER BY month ASC,repo_id) SELECT tmp.month as month, top_10_repos.repo_name as repo_name, tmp.total_stars as total_stars from tmp join top_10_repos ON top_10_repos.repo_id = tmp.repo_id group by month,repo_name,total_stars ORDER BY month ASC,repo_name ;
Analysis result: The
elasticsearch
project is the most popular open source database in the top 10 open source database projects, and the number of stars for theclickhouse
project rapidly grew since 2021.
Query the most popular databases in the first half of 2023
Sample query statements:
set odps.namespace.schema = true; SELECT repos.name AS repo_name, sum (dws.stars) AS stars FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws JOIN bigdata_public_dataset.github_events.db_repos repos ON repos.id = dws.repo_id WHERE month >= '2023-01' AND month <= '2023-06' GROUP BY repo_name ORDER BY stars DESC LIMIT 10;
Analysis result: The most popular databases in the first half of 2023 were
clickhouse
,prometheus
, andredis
in descending order.
Query the databases that are actively maintained and updated in the first half of 2023
Sample query statements:
set odps.namespace.schema = true; SELECT repos.name AS repo_name, COUNT(dwd.id) AS num FROM bigdata_public_dataset.github_events.dwd_github_events_odps dwd JOIN bigdata_public_dataset.github_events.db_repos repos ON repos.id = dwd.repo_id WHERE type = 'PullRequestEvent' AND ds>='2023-01-01' and ds<='2023-06-30' AND action = 'opened' GROUP BY repos.name ORDER BY num DESC LIMIT 10;
Analysis result: In the first half of 2023, the
StarRocks
database is most actively maintained and updated. A database is considered active ifPullRequest
is enabled.
Query the leading individual contributor to the most active open source database project in the first half of 2023
Sample query statements:
SET odps.namespace.schema=true; WITH a AS ( SELECT repo_id ,repo_name ,actor_id ,actor_login ,COUNT(*) AS contribution ,ds FROM bigdata_public_dataset.github_events.dwd_github_events_odps WHERE ds >='2021-01-01' and ds<='2021-12-31' AND ( ( type = 'PullRequestEvent' AND action = 'opened' ) OR ( type = 'IssuesEvent' AND action = 'opened' ) OR ( type = 'IssueCommentEvent' AND action = 'created' ) OR ( type = 'PullRequestReviewEvent' AND action = 'created' ) OR ( type = 'PullRequestReviewCommentEvent' AND action = 'created' ) OR ( type = 'PushEvent' AND action IS NULL ) ) GROUP BY repo_id ,repo_name ,actor_id ,actor_login ,ds ) SELECT repo_name ,actor_login ,SUM(contribution) AS contribution FROM a WHERE repo_name = 'StarRocks/starrocks' AND actor_login NOT LIKE '%[bot]' AND actor_login NOT LIKE 'cockroach%' GROUP BY repo_name ,actor_login ORDER BY contribution DESC LIMIT 10 ;
Analysis result: In the first half of 2023,
kangkaisen
was the leading individual contributor to the most active open source database project.
Explore programming languages
Query the top 10 programming languages that were most commonly used in the previous year
Sample query statements:
SET odps.namespace.schema=true; SELECT language, count(*) total FROM bigdata_public_dataset.github_events.dwd_github_events_odps WHERE ds>=date_add(getdate(), -365) AND language IS NOT NULL GROUP BY language ORDER BY total DESC LIMIT 10;
Analysis result: The most commonly used programming languages were
JavaScript
,TypeScript
, andPython
in descending order.
Query the top 10 programming language projects that were most popular in the previous five years (from 2018 to 2022)
Sample query statements:
SET odps.namespace.schema = TRUE; SELECT dws.repo_id AS repo_id, repos.name AS repo_name, SUM(dws.stars) AS stars FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws JOIN bigdata_public_dataset.github_events.programming_language_repos repos ON dws.repo_id = repos.id WHERE MONTH >= '2015-01' GROUP BY dws.repo_id, repos.name ORDER BY stars DESC LIMIT 10;
Analysis result: In the previous five years, the most popular programming language projects were
Go
, with 81,642 stars,TypeScript
, andNode
in descending order.
Query the monthly growth trend of the number of stars that were received by each of the top 10 most popular programming language projects in the previous five years (from 2018 to 2022)
Sample query statements:
set odps.namespace.schema = true; set odps.sql.validate.orderby.limit=false; WITH top_10_repos AS ( SELECT dws.repo_id as repo_id, repos.name as repo_name, SUM(dws.stars) as stars FROM bigdata_public_dataset.github_events.dws_overview_by_repo_by_month dws join bigdata_public_dataset.github_events.programming_language_repos repos on dws.repo_id = repos.id where month >= '2018-01' and month <='2022-12' group by dws.repo_id, repos.name ORDER BY stars DESC LIMIT 10 ), tmp AS ( SELECT month, repo_id, stars, SUM(stars) OVER (partition by repo_id ORDER BY month ASC ) AS total_stars FROM bigdata_public_dataset.github_events.dws_overview_by_repo_by_month where month >= '2015-01' and stars is not null and repo_id in (select repo_id from top_10_repos) group by repo_id,month,stars ORDER BY month ASC,repo_id) SELECT tmp.month as month, top_10_repos.repo_name as repo_name, tmp.total_stars as total_stars from tmp join top_10_repos ON top_10_repos.repo_id = tmp.repo_id group by month,repo_name,total_stars ORDER BY month ASC,repo_name ;
Analysis result: In the previous five years,
Go
was the programming language whose number of stars increased the most rapidly.