This repository provides a complete pipeline for analyzing collaboration and code structure patterns in GitHub repositories.
It uses the GitHub API and LLM-based analysis (Llama 3.2 3B, served via Ollama) to extract, enrich, clean, and visualize repository pull request (PR) and code structure data.
The workflow consists of several stages, from data extraction to graph generation and statistical analysis.
- Data Extraction: Automatically pulls PR and repository metadata via the GitHub API.
- Data Enrichment: Enhances raw PR data with communication and structure insights.
- Data Cleaning & Preprocessing: Standardizes data for analysis.
- Bot Filtering: Removes automated bot accounts from the analysis.
- Graph Generation: Visualizes PR networks, branching behavior, and collaboration patterns.
- Statistical Analysis: Provides summary statistics of repository activity.
```
git clone https://github.com/<your-username>/collabAnalysis.git
cd collabAnalysis
python -m venv venv
source venv/bin/activate   # Mac/Linux
venv\Scripts\activate      # Windows
```

Install the Python dependencies listed in requirements.txt (if the file is not present, manually install the project's libraries such as pandas, requests, tqdm, ollama, etc.):

```
pip install -r requirements.txt
```

This project uses Llama 3.2 3B (served via Ollama) to analyze PR communications and code structure. Install Ollama following the instructions on its official site, then pull the model and start the server:

```
ollama pull llama3.2:3b
ollama serve
```

Keep the server running in the background while executing scripts that depend on Ollama.
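As an optional convenience, scripts can verify that the server is reachable before making model calls. The helper below is an illustrative sketch (it is not part of this repository) and assumes Ollama's default port, 11434:

```python
import urllib.request
import urllib.error

def ollama_running(host: str = "localhost", port: int = 11434, timeout: float = 2.0) -> bool:
    """Return True if a server responds at the given host/port (Ollama's root
    endpoint answers plain HTTP GET requests when the server is up)."""
    try:
        with urllib.request.urlopen(f"http://{host}:{port}/", timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False
```

Calling `ollama_running()` before the labeling scripts gives a clearer error than a mid-run connection failure.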
Below is the recommended execution order:
1. Run `app.py` to pull data for the selected repositories (update the repositories list before running):

   ```
   python scripts/app.py
   ```

2. Enrich the extracted data:

   ```
   python enrich_output/overwrite_files.py
   ```

3. Label PR communications (requires a running Ollama instance):

   ```
   python event_labelling/Utility/pr_communication_label.py
   ```

4. Analyze code structure and branching:

   ```
   python event_labelling/CodeStructure_Branching/code_structure_and_branching.py
   ```

5. Repair CSVs and clean the data (keep only one section active if working with split datasets):

   ```
   python event_labelling/Utility/csvFix.py
   python process_model/clean.py
   ```

6. Preprocess the data and generate graphs:

   ```
   python process_model/preprocessing.py
   python process_model/graphing.py
   ```

7. To get general repository statistics:

   ```
   python event_labelling/analysis.py
   ```

The bot filter (event_labelling/Utility/bot_filter.py) is a reusable utility module for filtering bot accounts from GitHub data.
Features:
- Detects 20+ common bot patterns (dependabot, renovate, GitHub Actions, etc.)
- Flexible filtering functions for DataFrames
- Extensible with custom bot patterns
- Includes verbose logging for transparency
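Conceptually, the detection boils down to pattern matching on usernames. The sketch below is a simplified illustration of that idea, not the module's actual code (the real pattern list covers 20+ bots):

```python
import re

# Illustrative subset of bot patterns (the shipped module has many more).
BOT_PATTERNS = [
    r".*\[bot\]$",         # e.g. "dependabot[bot]", "github-actions[bot]"
    r"^dependabot.*",
    r"^renovate.*",
    r"^github-actions.*",
]

def is_bot_username(username: str) -> bool:
    """Return True if the username matches a known bot pattern."""
    if not username:
        return False
    name = username.lower()
    return any(re.match(pattern, name) for pattern in BOT_PATTERNS)
```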
Usage Example:

```python
from event_labelling.Utility.bot_filter import remove_bot_prs, remove_bot_commits

# Filter bot PRs
clean_prs_df = remove_bot_prs(prs_df)

# Filter bot commits
clean_commits_df = remove_bot_commits(commits_df)

# Custom filtering
from event_labelling.Utility.bot_filter import filter_bots_from_dataframe
clean_df = filter_bots_from_dataframe(df, username_column='reviewer')
```

Available Functions:
- `is_bot_username()` - Check if a username is a bot
- `filter_bots_from_dataframe()` - Filter bots from any DataFrame
- `remove_bot_prs()` - Convenience function for PR data
- `remove_bot_commits()` - Convenience function for commit data
- `get_bot_usernames()` - List all bot usernames found
- `filter_bots_from_multiple_columns()` - Filter based on multiple columns
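To make the DataFrame-level filtering concrete, here is a minimal, self-contained sketch of what a function like `filter_bots_from_dataframe` might do, assuming pandas and a single simplified bot rule (the module's real logic is richer):

```python
import pandas as pd

BOT_SUFFIX = "[bot]"  # illustrative rule only; the real module uses a pattern list

def filter_bots_from_dataframe(df: pd.DataFrame, username_column: str = "author") -> pd.DataFrame:
    """Drop rows whose username column looks like a bot account."""
    mask = df[username_column].str.lower().str.endswith(BOT_SUFFIX)
    return df[~mask].reset_index(drop=True)

prs = pd.DataFrame({
    "author": ["alice", "dependabot[bot]", "bob"],
    "pr_number": [101, 102, 103],
})
human_prs = filter_bots_from_dataframe(prs, username_column="author")
```

Filtering on a boolean mask keeps the operation vectorized, which matters when PR datasets grow to thousands of rows.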
The scripts produce:
- Cleaned CSV files with enriched PR and code structure data.
- Graph visualizations showing collaboration patterns, code structure networks, and branching metrics.
- Summary statistics in CSV or plotted form.
- Bot-filtered datasets for accurate human collaboration analysis.
```
collabAnalysis/
├── documentation/
│   ├── analysis.md                         # Analysis documentation
│   ├── app.md                              # App usage guide
│   └── csvFix.md                           # CSV fixing documentation
├── scripts/
│   └── app.py                              # Fetches data via GitHub API
├── enrich_output/
│   └── overwrite_files.py                  # Data enrichment step
├── event_labelling/
│   ├── CodeStructure_Branching/
│   │   └── code_structure_and_branching.py # Code structure analysis
│   ├── Utility/                            # Utility modules
│   │   ├── bot_filter.py                   # Bot filtering utility
│   │   ├── csvFix.py                       # CSV repair utilities
│   │   └── pr_communication_label.py       # PR communication labeling
│   ├── analysis.py                         # Repository statistics
│   └── relabelling.py                      # Data relabeling utilities
├── process_model/
│   ├── clean.py                            # Data cleaning
│   ├── preprocessing.py                    # Data preprocessing
│   └── graphing.py                         # Graph generation
├── test/
│   ├── testApp.py                          # App testing
│   ├── testBot_filter.py                   # Bot filter tests
│   └── testClean.py                        # Clean module tests
├── data/
│   └── csv/                                # Output CSVs and processed data
├── confidential/                           # Sensitive or anonymized data (e.g., usernames)
└── README.md                               # This file
```
Set your GitHub API token as an environment variable:

```
export GITHUB_TOKEN='your_token_here'
```

Update the following variables in scripts/app.py:

- `REPOSITORIES` - List of repositories to analyze
- `REPO_OWNER` - Repository owner/username
- `ORG_NAME` - Organization name (if applicable)
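For reference, a GitHub token set this way is typically read from the environment and sent as an Authorization header. The helper below is a generic sketch of that pattern (the function name is illustrative, not taken from app.py):

```python
import os

def github_headers(token: str = "") -> dict:
    """Build request headers for the GitHub REST API, falling back to the
    GITHUB_TOKEN environment variable when no token is passed explicitly."""
    token = token or os.environ.get("GITHUB_TOKEN", "")
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

Unauthenticated requests work but are rate-limited far more aggressively, so the token is effectively required for real data collection.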
If using anonymized data, create a mapping file at confidential/anonymized_usernames.json:

```json
{
    "real_username_1": "Anon_User_1",
    "real_username_2": "Anon_User_2"
}
```

Then set ANONYMIZE = True in the relevant scripts.
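Applying such a mapping amounts to a dictionary lookup per username. A minimal sketch (the function names here are hypothetical, not the scripts' actual API):

```python
import json
from pathlib import Path

def load_mapping(path: str = "confidential/anonymized_usernames.json") -> dict:
    """Load the real-name -> alias mapping from the JSON file."""
    return json.loads(Path(path).read_text())

def anonymize(usernames: list, mapping: dict) -> list:
    """Replace real usernames with aliases; unmapped names pass through unchanged."""
    return [mapping.get(name, name) for name in usernames]
```

Letting unmapped names pass through keeps the pipeline running on partially anonymized data, though it also means a missing entry silently leaks a real username, so the mapping should be kept complete.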
- GitHub API Token: Required for data collection. Set as the GITHUB_TOKEN environment variable.
- Bot Filtering: Automatically applied during data processing. Customize patterns in bot_filter.py if needed.
- Anonymization: Update paths in scripts if using anonymized datasets.
- Working Directory: All scripts assume execution from the project root directory.
- Ollama Dependency: Code structure analysis requires a running Ollama server.
Run the unit tests to verify functionality:

```
# Test bot filtering
python test/testBot_filter.py

# Test data cleaning
python test/testClean.py

# Test app functionality
python test/testApp.py
```