All Products
Search
Document Center

MaxCompute:GitHub public event data

Last Updated:Dec 10, 2024

This topic describes the GitHub public event data in the MaxCompute public dataset and how to query GitHub public event data in MaxCompute. This topic also provides query examples and data analysis results.

Overview

A large number of developers develop open source projects on GitHub and generate a large number of events during the development process. GitHub records the information about each event, including the event type, event details, the developer, and the code repository. GitHub also exposes public events, such as the events of starring repositories and submitting code. For more information about event types, see GitHub event types. The GH Archive project summarizes GitHub public events on an hourly basis and allows developers to access the events. For more information about the GH Archive project, see GH Archive.

MaxCompute batch processes and develops large amounts of public event data that is provided by GH Archive and generates two tables. The following table describes the information about the two tables.

Table name

Description

Update cycle

dwd_github_events_odps

Fact table of GitHub public event data

T+1 hours

dws_overview_by_repo_month

Aggregate table of monthly metrics of GitHub public event data

T+1 days

The dwd_github_events_odps and dws_overview_by_repo_month tables are stored in the github_events schema of the MaxCompute public project BIGDATA_PUBLIC_DATASET. For more information about schemas, see Schema-related operations. If you have activated MaxCompute, you can access the tables across projects.

Table details

dwd_github_events_odps

The fact table stores the main information about each GitHub public event and is updated every T+1 hours. The following table describes the fields in the fact table.

Field name

Data type

Description

id

BIGINT

The event ID.

actor_id

BIGINT

The actor ID.

actor_login

STRING

The username of the actor.

repo_id

BIGINT

The repository ID.

repo_name

STRING

The repository name in the owner/Repository_name format.

org_id

BIGINT

The ID of the organization to which the repository belongs.

org_login

STRING

The name of the organization to which the repository belongs.

type

STRING

The type of the event.

For more information about event types, see GitHub event types.

created_at

DATETIME

The time when the event occurred.

action

STRING

The event action.

iss_or_pr_id

BIGINT

The ID of the issue or pull_request.

number

BIGINT

The sequence number of the issue or pull_request.

comment_id

BIGINT

The comment ID.

commit_id

STRING

The commit ID.

member_id

BIGINT

The member ID.

rev_or_push_or_rel_id

BIGINT

The ID of the review, push, or release.

ref

STRING

The name of the resource that you create or delete.

ref_type

STRING

The type of the resource that you create or delete.

state

STRING

The status of the issue, pull_request, or pull_request_review.

author_association

STRING

The relationship between the actor and the repository.

language

STRING

The programming language that is used to merge request code.

merged

BOOLEAN

Specifies whether merge is allowed.

merged_at

DATETIME

The time when code is merged.

additions

BIGINT

The number of rows that are added to the code.

deletions

BIGINT

The number of rows that are deleted from the code.

changed_files

BIGINT

The number of files that are changed by the pull request.

push_size

BIGINT

The push size.

push_distinct_size

BIGINT

The different push sizes.

hr

STRING

The hour in which the event occurred.

For example, if the event occurred at 00:23, the value of this parameter is 00.

month

STRING

The month in which the event occurred.

For example, if the event occurred in October, 2015, the value of this parameter is 2015-10.

year

STRING

The year in which the event occurred.

For example, if the event occurred in 2015, the value of this parameter is 2015.

ds

STRING

The date on which the event occurred.

The value of this parameter is in the yyyy-mm-dd format.

dws_overview_by_repo_month

The aggregate table stores monthly event metrics for a project and is updated every T+1 days. The following table describes the fields in the aggregate table.

Field name

Data type

Description

repo_id

BIGINT

The repository ID.

repo_name

STRING

The repository name in the owner/Repository_name format.

stars

BIGINT

The number of stars in a repository.

commits

BIGINT

The number of commits in a repository.

pushes

BIGINT

The number of pushes in a repository.

total_prs

BIGINT

The number of pull_requests in a repository.

pr_creators

BIGINT

The number of creators that submit pull_requests in a repository.

pr_reviews

BIGINT

The number of pr_reviews in a repository.

pr_reviewers

BIGINT

The number of pr_reviewers in a repository.

total_issues

BIGINT

The number of issues in a repository.

forks

BIGINT

The number of forks in a repository.

month

STRING

The month in which the event occurred.

For example, if the event occurred in October 2015, the value of this parameter is 2015-10.

For more information about how the data is generated, see Integration of offline and real-time processing of data in GitHub public event datasets.

MaxCompute provides the db_repos table that stores the IDs and names of open source database projects, and the programming_language_repos table that stores the IDs and names of projects developed in open source programming languages. You can use the tables to analyze specific objects. The project information is from ossinsight.

Supported regions

Region

Region ID

China (Hangzhou)

cn-hangzhou

China (Shanghai)

cn-shanghai

China (Beijing)

cn-beijing

China (Zhangjiakou)

cn-zhangjiakou

China (Ulanqab)

cn-wulanchabu

China (Shenzhen)

cn-shenzhen

China (Chengdu)

cn-chengdu

Disclaimer

You can use the GitHub public event data that is provided by MaxCompute only for product testing. Data accuracy is not ensured. Do not use the public event data for actual production.

Precautions

Public datasets are available to all MaxCompute users. When you use public datasets, take note of the following items:

  • All data of public datasets is stored in the BIGDATA_PUBLIC_DATASET project in MaxCompute. However, no users are added to this project as members. In this case, you must access the data across projects. When you write an SQL script, specify the project name and schema name before the table name. If you do not enable the tenant-level schema syntax, enable the session-level schema syntax before you execute a statement. Sample statements:

    -- Enable the session-level schema syntax.
    set odps.namespace.schema=true; 
    -- Query 100 data records from the dwd_github_events_odps table.
    select * from bigdata_public_dataset.github_events.dwd_github_events_odps where ds='2024-05-10' limit 100;
    Important

    You are not charged for the storage of the data in the public datasets. However, you are charged computing fees if you execute query statements. For more information, see Computing pricing.

  • You cannot find the tables in the public datasets on the Data Map page of DataWorks because cross-project access is required.

  • Public datasets are stored by schema. If you do not enable the tenant-level schema syntax, you cannot view the public datasets in DataWorks DataAnalysis. In this case, you can query the public datasets only by executing SQL statements.

Explore GitHub public event data in MaxCompute

Prerequisites

MaxCompute is activated. A MaxCompute project is created. For more information about how to create a MaxCompute project, see Create a MaxCompute project.

Supported tools and platforms

Appendix: Query examples and analysis results

Explore open source databases

Query the top 10 most popular open source database projects in the previous five years (from 2018 to 2022)

  • Sample query statements:

    SET odps.namespace.schema = TRUE;
    SELECT dws.repo_id AS repo_id,
     repos.name AS repo_name,
     SUM(dws.stars) AS stars
    FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws
    JOIN bigdata_public_dataset.github_events.db_repos repos ON dws.repo_id = repos.id
    WHERE MONTH >= '2015-01' AND MONTH <='2022-12'
    GROUP BY dws.repo_id,
     repos.name
    ORDER BY stars DESC LIMIT 10;
  • Analysis result: The most popular open source database projects in the previous five years are elasticsearch, redis, and prometheus in descending order.image.png

Query the changes in the ranking of top 10 most popular open source database projects in the previous five years ( from 2018 to 2022)

  • Sample query statements:

    SET odps.namespace.schema = TRUE;
    SET odps.sql.validate.orderby.limit=FALSE;
     WITH tmp as
     (SELECT dws.repo_id AS repo_id, repos.name AS repo_name, SUM(dws.stars) AS stars, SUBSTR(MONTH,1,4) AS YEAR, row_number() over (partition BY SUBSTR(MONTH,1,4)
     ORDER BY SUM(dws.stars) DESC) AS ranknum
     FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws
     JOIN bigdata_public_dataset.github_events.db_repos repos ON dws.repo_id = repos.id
     WHERE MONTH>='2018-01'
     GROUP BY dws.repo_id, repos.name,SUBSTR(MONTH,1,4))
    SELECT repo_id,
     repo_name,
     stars,
     ranknum,
     YEAR
    FROM tmp
    WHERE YEAR<=2022
     AND ranknum<=10
    ORDER BY YEAR ASC,ranknum ASC;
  • Analysis result: In the previous five years, the top 10 most popular open source database projects frequently change. The clickhouse project has the fastest rise in ranking, from the 10th place in 2018 to the first place in 2021. In 2022, the redis project surpassed the clickhouse project. The taosdata/TDengine project ranked first in 2019 but dropped to the ninth place in 2020.image.png

Query the monthly growth trend of the number of stars that were received by each of the top 10 most popular open source database projects in the previous five years (from 2018 to 2022)

  • Sample query statements:

    set odps.namespace.schema = true;
    set odps.sql.validate.orderby.limit=false;
    WITH top_10_repos AS (
     SELECT
     dws.repo_id as repo_id, 
     repos.name as repo_name, 
     SUM(dws.stars) as stars
     FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws
     join bigdata_public_dataset.github_events.db_repos repos on dws.repo_id = repos.id
     where month >= '2018-01' AND MONTH <='2022-12'
     group by dws.repo_id, repos.name
     ORDER BY stars DESC
     LIMIT 10
    ),
    tmp AS (
     SELECT
     month,
     repo_id,
     stars,
     SUM(stars) OVER (partition by repo_id ORDER BY month ASC ) AS total_stars
     FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month
     where month >= '2015-01' and stars is not null
     and repo_id in (select repo_id from top_10_repos)
     group by repo_id,month,stars
     ORDER BY month ASC,repo_id)
     
    SELECT
     tmp.month as month,
     top_10_repos.repo_name as repo_name,
     tmp.total_stars as total_stars
    
     from tmp
     join top_10_repos ON top_10_repos.repo_id = tmp.repo_id
     group by month,repo_name,total_stars
     ORDER BY month ASC,repo_name
    ;
  • Analysis result: The elasticsearch project is the most popular open source database in the top 10 open source database projects, and the number of stars for the clickhouse project rapidly grew since 2021.

    image.png

Query the most popular databases in the first half of 2023

  • Sample query statements:

    set odps.namespace.schema = true;
    SELECT
     repos.name AS repo_name,
     sum (dws.stars) AS stars
    FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws
    JOIN bigdata_public_dataset.github_events.db_repos repos ON repos.id = dws.repo_id
    WHERE month >= '2023-01' AND month <= '2023-06'
    GROUP BY repo_name
    ORDER BY stars DESC
    LIMIT 10;
  • Analysis result: The most popular databases in the first half of 2023 were clickhouse, prometheus, and redis in descending order.image.png

Query the databases that are actively maintained and updated in the first half of 2023

  • Sample query statements:

    set odps.namespace.schema = true;
    SELECT
     repos.name AS repo_name,
     COUNT(dwd.id) AS num
    FROM bigdata_public_dataset.github_events.dwd_github_events_odps dwd
     JOIN bigdata_public_dataset.github_events.db_repos repos ON repos.id = dwd.repo_id
    WHERE type = 'PullRequestEvent'
     AND ds>='2023-01-01' and ds<='2023-06-30'
     AND action = 'opened'
    GROUP BY repos.name
    ORDER BY num DESC
    LIMIT 10;
  • Analysis result: In the first half of 2023, the StarRocks database is most actively maintained and updated. A database is considered active if PullRequest is enabled.

    image.png

Query the leading individual contributor to the most active open source database project in the first half of 2023

  • Sample query statements:

    SET odps.namespace.schema=true; 
    WITH a AS 
    (
     SELECT repo_id
     ,repo_name
     ,actor_id
     ,actor_login
     ,COUNT(*) AS contribution
     ,ds
     FROM bigdata_public_dataset.github_events.dwd_github_events_odps
     WHERE ds >='2021-01-01' and ds<='2021-12-31'
     AND (
     (
     type = 'PullRequestEvent'
     AND action = 'opened'
     )
     OR (
     type = 'IssuesEvent'
     AND action = 'opened'
     )
     OR (
     type = 'IssueCommentEvent'
     AND action = 'created'
     )
     OR (
     type = 'PullRequestReviewEvent'
     AND action = 'created'
     )
     OR (
     type = 'PullRequestReviewCommentEvent'
     AND action = 'created'
     )
     OR (
     type = 'PushEvent'
     AND action IS NULL
     )
     )
     GROUP BY repo_id
     ,repo_name
     ,actor_id
     ,actor_login
     ,ds
    )
    SELECT repo_name
     ,actor_login
     ,SUM(contribution) AS contribution
    FROM a
    WHERE repo_name = 'StarRocks/starrocks'
    AND actor_login NOT LIKE '%[bot]'
    AND actor_login NOT LIKE 'cockroach%'
    
    GROUP BY repo_name
     ,actor_login
    ORDER BY contribution DESC
    LIMIT 10
    ;
  • Analysis result: In the first half of 2023, kangkaisen was the leading individual contributor to the most active open source database project.image.png

Explore programming languages

Query the top 10 programming languages that were most commonly used in the previous year

  • Sample query statements:

    SET odps.namespace.schema=true; 
    SELECT
     language,
     count(*) total
    FROM
     bigdata_public_dataset.github_events.dwd_github_events_odps
    WHERE
     ds>=date_add(getdate(), -365)
     AND language IS NOT NULL
    GROUP BY
     language
    ORDER BY
     total DESC
    LIMIT 10;
  • Analysis result: The most commonly used programming languages were JavaScript, TypeScript, and Python in descending order.image.png

Query the top 10 programming language projects that were most popular in the previous five years (from 2018 to 2022)

  • Sample query statements:

    SET odps.namespace.schema = TRUE;
    
    SELECT dws.repo_id AS repo_id,
     repos.name AS repo_name,
     SUM(dws.stars) AS stars
    FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws
    JOIN bigdata_public_dataset.github_events.programming_language_repos repos ON dws.repo_id = repos.id
    WHERE MONTH >= '2015-01'
    GROUP BY dws.repo_id,
     repos.name
    ORDER BY stars DESC LIMIT 10;
  • Analysis result: In the previous five years, the most popular programming language projects were Go, with 81,642 stars, TypeScript, and Node in descending order.

    image.png

Query the monthly growth trend of the number of stars that were received by each of the top 10 most popular programming language projects in the previous five years (from 2018 to 2022)

  • Sample query statements:

    set odps.namespace.schema = true;
    set odps.sql.validate.orderby.limit=false;
    WITH top_10_repos AS (
     SELECT
     dws.repo_id as repo_id, 
     repos.name as repo_name, 
     SUM(dws.stars) as stars
     FROM bigdata_public_dataset.github_events.dws_overview_by_repo_by_month dws
     join bigdata_public_dataset.github_events.programming_language_repos repos on dws.repo_id = repos.id
     where month >= '2018-01' and month <='2022-12'
     group by dws.repo_id, repos.name
     ORDER BY stars DESC
     LIMIT 10
    ),
    tmp AS (
     SELECT
     month,
     repo_id,
     stars,
     SUM(stars) OVER (partition by repo_id ORDER BY month ASC ) AS total_stars
     FROM bigdata_public_dataset.github_events.dws_overview_by_repo_by_month
     where month >= '2015-01' and stars is not null
     and repo_id in (select repo_id from top_10_repos)
     group by repo_id,month,stars
     ORDER BY month ASC,repo_id)
     
    SELECT
     tmp.month as month,
     top_10_repos.repo_name as repo_name,
     tmp.total_stars as total_stars
    
     from tmp
     join top_10_repos ON top_10_repos.repo_id = tmp.repo_id
     group by month,repo_name,total_stars
     ORDER BY month ASC,repo_name
    ;
  • Analysis result: In the previous five years, Go was the programming language whose number of stars increased the most rapidly.image.png