
🛠️ What’s in the data?

Each dataset is a SQLite database with GitHub metadata for a Jupyter organization. The data is collected using github-to-sqlite, which captures comprehensive information about repositories, issues, pull requests, users, and comments.

The rest of this page shows example analyses on the jupyter-book repository.

All time-filtered analyses on this page cover the last 6 months of activity.

Download and load data

Use pooch to grab the latest data from GitHub releases:

import sqlite3

import pooch

db_path = pooch.retrieve(
    url="https://github.com/jupyter/github-data/releases/download/latest/jupyter-book.db",
    known_hash=None,
)
conn = sqlite3.connect(db_path)
Downloading data from 'https://github.com/jupyter/github-data/releases/download/latest/jupyter-book.db' to file '/home/runner/.cache/pooch/344f4530266be39c90499b249209bff4-jupyter-book.db'.
SHA256 hash of downloaded file: 3d9f89ad4b9c2457cb0b9705466c996bff604390731a3bc532f6d79df69bba93
Use this value as the 'known_hash' argument of 'pooch.retrieve' to ensure that the file hasn't changed if it is downloaded again in the future.
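To make the download reproducible, pin that hash on later runs. A minimal sketch using the hash printed above (the release is rebuilt daily, so the hash will change and need updating):

db_path = pooch.retrieve(
    url="https://github.com/jupyter/github-data/releases/download/latest/jupyter-book.db",
    known_hash="sha256:3d9f89ad4b9c2457cb0b9705466c996bff604390731a3bc532f6d79df69bba93",
)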

The database contains five main tables: repos, issues, users, labels, and issue_comments.
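You can confirm the table names directly from SQLite's schema catalog:

conn.execute("SELECT name FROM sqlite_master WHERE type = 'table';").fetchall()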

Load them into pandas DataFrames:

from datetime import datetime, timedelta, timezone

import pandas as pd

repos = pd.read_sql("SELECT * FROM repos;", conn).set_index("id")
issues = pd.read_sql("SELECT * FROM issues;", conn)
users = pd.read_sql("SELECT * FROM users;", conn).set_index("id")
labels = pd.read_sql("SELECT * FROM labels;", conn)
comments = pd.read_sql("SELECT * FROM issue_comments;", conn)

# Keep only issues and PRs created in the last 6 months
six_months_ago = datetime.now(timezone.utc) - timedelta(days=180)
issues['created_at'] = pd.to_datetime(issues['created_at'])
issues = issues[issues['created_at'] >= six_months_ago]

print(f"{len(repos)} repos, {len(issues)} issues+PRs, {len(users)} users, {len(comments)} comments")
30 repos, 566 issues+PRs, 1407 users, 17606 comments

Repository analysis

Note: Repository metadata (stars, open issue counts) reflects current GitHub totals, not filtered to 6 months.

top_repos = repos[['name', 'stargazers_count', 'open_issues_count']].sort_values(
    'stargazers_count', ascending=False
).head(10)
top_repos

Which repositories have the most issue activity in the last 6 months?

import matplotlib.pyplot as plt

# Count total issues (open + closed) per repository from the last 6 months
issue_counts = issues.groupby('repo').size().sort_values(ascending=False).head(10)

# Get repo names
repo_names = repos.loc[issue_counts.index, 'name']

# Create bar plot
fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(range(len(issue_counts)), issue_counts.values, color='steelblue')
ax.set_yticks(range(len(issue_counts)))
ax.set_yticklabels(repo_names)
ax.set_xlabel('Issues + PRs (last 6 months)')
ax.set_title('Top 10 Repositories by Recent Activity')
ax.invert_yaxis()
plt.tight_layout()
plt.show()

What labels are defined across repositories?

# Show a sample of label definitions (the first 10 rows)
top_labels = labels[['name', 'color', 'description']].head(10)
top_labels

Issue analysis

# Separate issues from PRs
issue_only = issues[issues['type'] == 'issue']
prs = issues[issues['type'] == 'pull'].copy()

print(f"{len(issue_only)} issues, {len(prs)} PRs")
305 issues, 261 PRs

Who opens the most issues?

top_issue_authors = issue_only['user'].value_counts().head(10)

# Join with user info
issue_authors_df = pd.DataFrame({'issues_opened': top_issue_authors}).join(users[['login', 'type']])
issue_authors_df

Which issues get the most discussion?

top_discussed = issue_only.nlargest(5, 'comments')[['number', 'title', 'state', 'comments']]
top_discussed

What reactions do issues receive?

import json

# Tally reactions from the JSON-encoded reactions column
reaction_counts = {'+1': 0, 'heart': 0, 'hooray': 0, 'rocket': 0, 'eyes': 0}

for reaction_str in issues['reactions'].dropna():
    try:
        reactions = json.loads(reaction_str)
    except (TypeError, json.JSONDecodeError):
        continue
    for key in reaction_counts:
        reaction_counts[key] += reactions.get(key, 0)

reactions_df = pd.DataFrame([reaction_counts], index=['Total']).T.sort_values('Total', ascending=False)
reactions_df

Which repos have the most open issues from the last 6 months?

open_issues = issue_only[issue_only['state'] == 'open'].copy()

# Add repo names
open_issues = open_issues.merge(
    repos[['name']], left_on='repo', right_index=True, how='left'
)

# Count by repo
issues_per_repo = open_issues.groupby('name').size().sort_values(ascending=False).head(10)
open_issues_df = pd.DataFrame({'open_issues_last_6mo': issues_per_repo})
open_issues_df

Pull request analysis

All PR metrics below use the 6-month filtered data.
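A quick sanity check of that filter (a sketch; created_at was parsed to datetimes during loading, and prs inherits the filter from issues):

assert prs['created_at'].min() >= six_months_ago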

Who opens the most PRs?

top_pr_authors = prs['user'].value_counts().head(10)

# Join with user info
pr_authors_df = pd.DataFrame({'prs_opened_last_6mo': top_pr_authors}).join(users[['login', 'type']])
pr_authors_df

Which PRs get the most discussion?

most_discussed_prs = prs.nlargest(5, 'comments')[['number', 'title', 'state', 'comments']]
most_discussed_prs

Who comments on PRs the most?

Note: comments themselves are not filtered by date; this counts every comment ever made on PRs opened in the last 6 months.
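If you also want to restrict to recent comments, the same cutoff can be applied to the comments table first. A sketch, assuming its created_at column uses the same ISO timestamp format as the issues table:

comments['created_at'] = pd.to_datetime(comments['created_at'])
recent_comments = comments[comments['created_at'] >= six_months_ago]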

# Get all PR IDs
pr_ids = set(prs['id'])

# Filter comments to only those on PRs
pr_comments = comments[comments['issue'].isin(pr_ids)]

# Count comments by user
top_pr_commenters = pr_comments['user'].value_counts().head(10)

# Join with user info
pr_commenters_df = pd.DataFrame({'pr_comments': top_pr_commenters}).join(users[['login', 'type']])
pr_commenters_df

How long do PRs stay open?

prs['created'] = pd.to_datetime(prs['created_at'])
prs['closed'] = pd.to_datetime(prs['closed_at'])
prs['days_open'] = (prs['closed'] - prs['created']).dt.days

# Closed PRs (merged or not) that stayed open the longest
closed_prs = prs[prs['state'] == 'closed'].nlargest(5, 'days_open')
longest_open_prs = closed_prs[['number', 'title', 'days_open']]
longest_open_prs
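The extremes alone can mislead; summary statistics over the days_open column (NaN for still-open PRs, so only closed ones are counted) give a sense of typical turnaround:

prs['days_open'].describe()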

Which repos have the most PR activity in the last 6 months?

# Add repo names to PRs
prs_with_repos = prs.merge(
    repos[['name']], left_on='repo', right_index=True, how='left'
)

# Count PRs per repo
prs_per_repo = prs_with_repos.groupby('name').size().sort_values(ascending=False).head(10)
pr_activity_df = pd.DataFrame({'prs_last_6mo': prs_per_repo})
pr_activity_df

Which repos have the most PR discussion in the last 6 months?

# Join PR comments with repo info
pr_comment_repos = pr_comments.merge(
    issues[['id', 'repo']], left_on='issue', right_on='id', how='left'
).merge(
    repos[['name']], left_on='repo', right_index=True, how='left'
)

# Count PR comments by repo
comments_per_repo = pr_comment_repos.groupby('name').size().sort_values(ascending=False).head(10)
pr_discussion_df = pd.DataFrame({'pr_comments': comments_per_repo})
pr_discussion_df
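When you are done, close the SQLite connection:

conn.close()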

All datasets are updated daily at: https://github.com/jupyter/github-data/releases/tag/latest