This repository provides a complete pipeline for analyzing collaboration and code structure patterns in GitHub repositories.
It uses the GitHub API and LLM-based analysis (Llama 3.2 3B, served via Ollama) to extract, enrich, clean, and visualize repository pull request (PR) and code structure data.
The workflow consists of several stages, from data extraction to graph generation and statistical analysis.
- Data Extraction: Automatically pulls PR and repository metadata via the GitHub API.
- Data Enrichment: Enhances raw PR data with communication and structure insights.
- Data Cleaning & Preprocessing: Standardizes data for analysis.
- Bot Filtering: Removes automated bot accounts from the analysis.
- Graph Generation: Visualizes PR networks, branching behavior, and collaboration patterns.
- Statistical Analysis: Provides summary statistics of repository activity.
```
git clone https://github.com/<your-username>/collabAnalysis.git
cd collabAnalysis
python -m venv venv
source venv/bin/activate   # Mac/Linux
venv\Scripts\activate      # Windows
```

Install the Python dependencies listed in requirements.txt (if the file is not present, manually install the project's libraries such as pandas, requests, tqdm, ollama, etc.):

```
pip install -r requirements.txt
```

This project uses Llama 3.2 3B (served via Ollama) to analyze PR communications and code structure. Install Ollama following the instructions on its official site, then pull the model and start the server:

```
ollama pull llama3.2:3b
ollama serve
```

Keep the server running in the background while executing scripts that depend on Ollama.
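As an optional convenience, scripts can verify that the server is reachable before making model calls. The helper below is an illustrative sketch (it is not part of this repository) and assumes Ollama's default port, 11434:

```python
import urllib.request
import urllib.error

def ollama_running(host: str = "localhost", port: int = 11434, timeout: float = 2.0) -> bool:
    """Return True if a server responds at the given host/port (Ollama's root
    endpoint answers plain HTTP GET requests when the server is up)."""
    try:
        with urllib.request.urlopen(f"http://{host}:{port}/", timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False
```

Calling `ollama_running()` before the labeling scripts gives a clearer error than a mid-run connection failure.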
Below is the recommended execution order:
1. Run `app.py` to pull data for the selected repositories (update the repositories list before running):

   ```
   python scripts/app.py
   ```

2. Enrich the extracted data:

   ```
   python enrich_output/overwrite_files.py
   ```

3. Label PR communications (requires a running Ollama instance):

   ```
   python event_labelling/Utility/pr_communication_label.py
   ```

4. Analyze code structure and branching:

   ```
   python event_labelling/CodeStructure_Branching/code_structure_and_branching.py
   ```

5. Repair CSVs and clean the data (keep only one section active if working with split datasets):

   ```
   python event_labelling/Utility/csvFix.py
   python process_model/clean.py
   ```

6. Preprocess the data and generate graphs:

   ```
   python process_model/preprocessing.py
   python process_model/graphing.py
   ```

7. To get general repository statistics:

   ```
   python event_labelling/analysis.py
   ```

The bot filter (event_labelling/Utility/bot_filter.py) is a reusable utility module for filtering bot accounts from GitHub data.
Features:
- Detects 20+ common bot patterns (dependabot, renovate, GitHub Actions, etc.)
- Flexible filtering functions for DataFrames
- Extensible with custom bot patterns
- Includes verbose logging for transparency
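Conceptually, the detection boils down to pattern matching on usernames. The sketch below is a simplified illustration of that idea, not the module's actual code (the real pattern list covers 20+ bots):

```python
import re

# Illustrative subset of bot patterns (the shipped module has many more).
BOT_PATTERNS = [
    r".*\[bot\]$",         # e.g. "dependabot[bot]", "github-actions[bot]"
    r"^dependabot.*",
    r"^renovate.*",
    r"^github-actions.*",
]

def is_bot_username(username: str) -> bool:
    """Return True if the username matches a known bot pattern."""
    if not username:
        return False
    name = username.lower()
    return any(re.match(pattern, name) for pattern in BOT_PATTERNS)
```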
Usage Example:

```python
from event_labelling.Utility.bot_filter import remove_bot_prs, remove_bot_commits

# Filter bot PRs
clean_prs_df = remove_bot_prs(prs_df)

# Filter bot commits
clean_commits_df = remove_bot_commits(commits_df)

# Custom filtering
from event_labelling.Utility.bot_filter import filter_bots_from_dataframe
clean_df = filter_bots_from_dataframe(df, username_column='reviewer')
```

Available Functions:
- `is_bot_username()` - Check if a username is a bot
- `filter_bots_from_dataframe()` - Filter bots from any DataFrame
- `remove_bot_prs()` - Convenience function for PR data
- `remove_bot_commits()` - Convenience function for commit data
- `get_bot_usernames()` - List all bot usernames found
- `filter_bots_from_multiple_columns()` - Filter based on multiple columns
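To make the DataFrame-level filtering concrete, here is a minimal, self-contained sketch of what a function like `filter_bots_from_dataframe` might do, assuming pandas and a single simplified bot rule (the module's real logic is richer):

```python
import pandas as pd

BOT_SUFFIX = "[bot]"  # illustrative rule only; the real module uses a pattern list

def filter_bots_from_dataframe(df: pd.DataFrame, username_column: str = "author") -> pd.DataFrame:
    """Drop rows whose username column looks like a bot account."""
    mask = df[username_column].str.lower().str.endswith(BOT_SUFFIX)
    return df[~mask].reset_index(drop=True)

prs = pd.DataFrame({
    "author": ["alice", "dependabot[bot]", "bob"],
    "pr_number": [101, 102, 103],
})
human_prs = filter_bots_from_dataframe(prs, username_column="author")
```

Filtering on a boolean mask keeps the operation vectorized, which matters when PR datasets grow to thousands of rows.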
The scripts produce:
- Cleaned CSV files with enriched PR and code structure data.
- Graph visualizations showing collaboration patterns, code structure networks, and branching metrics.
- Summary statistics in CSV or plotted form.
- Bot-filtered datasets for accurate human collaboration analysis.
```
collabAnalysis/
├── documentation/
│   ├── analysis.md                         # Analysis documentation
│   ├── app.md                              # App usage guide
│   └── csvFix.md                           # CSV fixing documentation
├── scripts/
│   └── app.py                              # Fetches data via GitHub API
├── enrich_output/
│   └── overwrite_files.py                  # Data enrichment step
├── event_labelling/
│   ├── CodeStructure_Branching/
│   │   └── code_structure_and_branching.py # Code structure analysis
│   ├── Utility/                            # Utility modules
│   │   ├── bot_filter.py                   # Bot filtering utility
│   │   ├── csvFix.py                       # CSV repair utilities
│   │   └── pr_communication_label.py       # PR communication labeling
│   ├── analysis.py                         # Repository statistics
│   └── relabelling.py                      # Data relabeling utilities
├── process_model/
│   ├── clean.py                            # Data cleaning
│   ├── preprocessing.py                    # Data preprocessing
│   └── graphing.py                         # Graph generation
├── test/
│   ├── testApp.py                          # App testing
│   ├── testBot_filter.py                   # Bot filter tests
│   └── testClean.py                        # Clean module tests
├── data/
│   └── csv/                                # Output CSVs and processed data
├── confidential/                           # Sensitive or anonymized data (e.g., usernames)
└── README.md                               # This file
```
Set your GitHub API token as an environment variable:

```
export GITHUB_TOKEN='your_token_here'
```

Update the following variables in scripts/app.py:

- `REPOSITORIES` - List of repositories to analyze
- `REPO_OWNER` - Repository owner/username
- `ORG_NAME` - Organization name (if applicable)
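For reference, a GitHub token set this way is typically read from the environment and sent as an Authorization header. The helper below is a generic sketch of that pattern (the function name is illustrative, not taken from app.py):

```python
import os

def github_headers(token: str = "") -> dict:
    """Build request headers for the GitHub REST API, falling back to the
    GITHUB_TOKEN environment variable when no token is passed explicitly."""
    token = token or os.environ.get("GITHUB_TOKEN", "")
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

Unauthenticated requests work but are rate-limited far more aggressively, so the token is effectively required for real data collection.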
If using anonymized data, create a mapping file at confidential/anonymized_usernames.json:

```json
{
    "real_username_1": "Anon_User_1",
    "real_username_2": "Anon_User_2"
}
```

Then set ANONYMIZE = True in the relevant scripts.
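Applying such a mapping amounts to a dictionary lookup per username. A minimal sketch (the function names here are hypothetical, not the scripts' actual API):

```python
import json
from pathlib import Path

def load_mapping(path: str = "confidential/anonymized_usernames.json") -> dict:
    """Load the real-name -> alias mapping from the JSON file."""
    return json.loads(Path(path).read_text())

def anonymize(usernames: list, mapping: dict) -> list:
    """Replace real usernames with aliases; unmapped names pass through unchanged."""
    return [mapping.get(name, name) for name in usernames]
```

Letting unmapped names pass through keeps the pipeline running on partially anonymized data, though it also means a missing entry silently leaks a real username, so the mapping should be kept complete.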
- GitHub API Token: Required for data collection. Set as the GITHUB_TOKEN environment variable.
- Bot Filtering: Automatically applied during data processing. Customize patterns in bot_filter.py if needed.
- Anonymization: Update paths in scripts if using anonymized datasets.
- Working Directory: All scripts assume execution from the project root directory.
- Ollama Dependency: Code structure analysis requires a running Ollama server.
Run the unit tests to verify functionality:

```
# Test bot filtering
python test/testBot_filter.py

# Test data cleaning
python test/testClean.py

# Test app functionality
python test/testApp.py
```