bohuie/processAnalysis

🧠 Collaboration Analysis – Replication Package

This repository provides a complete pipeline for analyzing collaboration and code structure patterns in GitHub repositories.
It uses the GitHub API and LLM-based analysis (the Llama 3.2 3B model, served locally through Ollama) to extract, enrich, clean, and visualize pull request (PR) and code structure data from those repositories.


📂 Overview

The workflow consists of several stages, from data extraction to graph generation and statistical analysis.

Main Features

  • 🔍 Data Extraction: Automatically pulls PR and repository metadata via GitHub API.
  • 🧩 Data Enrichment: Enhances raw PR data with communication and structure insights.
  • 🧼 Data Cleaning & Preprocessing: Standardizes data for analysis.
  • 🤖 Bot Filtering: Removes automated bot accounts from analysis.
  • 📊 Graph Generation: Visualizes PR networks, branching behavior, and collaboration patterns.
  • 📈 Statistical Analysis: Provides summary statistics of repository activity.

⚙️ Setup Instructions

1. Clone the Repository

git clone https://github.com/<your-username>/collabAnalysis.git
cd collabAnalysis

2. Create a Virtual Environment

python -m venv venv
source venv/bin/activate    # (Mac/Linux)
venv\Scripts\activate       # (Windows)

3. Install Dependencies

Install the Python dependencies listed in requirements.txt. If that file is missing, install the libraries the scripts import (such as pandas, requests, tqdm, and ollama) manually with pip.

pip install -r requirements.txt

🤖 Ollama Setup

This project uses the Llama 3.2 3B model (llama3.2:3b), served through Ollama, to analyze PR communications and code structure.

1. Install Ollama

Follow the installation instructions on Ollama's official site.

2. Pull the required model

ollama pull llama3.2:3b

3. Start the Ollama server

ollama serve

Keep this running in the background while executing scripts that depend on Ollama.
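
Before starting the Ollama-dependent steps, it can help to confirm that the server is actually reachable. The sketch below is not part of this repository; it assumes Ollama is listening on its default local address (http://localhost:11434), and the helper name ollama_is_up is made up for illustration.

```python
import urllib.error
import urllib.request

# Assumption: Ollama is listening on its default local address and port.
OLLAMA_URL = "http://localhost:11434"


def ollama_is_up(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx response.
        return False
```

Calling ollama_is_up() at the top of a long-running script lets it stop early with a clear message instead of failing partway through the analysis.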


🚀 How to Run

Below is the recommended execution order:

1. Data Collection

Run app.py to pull data for selected repositories.

python scripts/app.py

⚠️ Before running, make sure your target repositories are listed in the REPOSITORIES list in scripts/app.py.
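
For orientation, the sketch below shows the kind of request a collection step like this typically sends to the GitHub REST API to list pull requests. It is an illustration only, not the code in scripts/app.py: the function name build_pulls_request and its parameters are assumptions, and the request is built but never sent.

```python
import urllib.parse
import urllib.request
from typing import Optional

API_ROOT = "https://api.github.com"


def build_pulls_request(owner: str, repo: str, page: int = 1,
                        token: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a request for one page of pull requests."""
    # state=all includes open, closed, and merged PRs; 100 is the page-size cap.
    query = urllib.parse.urlencode({"state": "all", "per_page": 100, "page": page})
    url = f"{API_ROOT}/repos/{owner}/{repo}/pulls?{query}"
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)
```

A caller would request page 1, 2, 3, and so on until a page comes back empty, which is one simple way to walk through all PRs of a repository.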

2. Data Enrichment

python enrich_output/overwrite_files.py
python event_labelling/Utility/pr_communication_label.py

3. Code Structure & Branching Analysis

Requires a running Ollama instance.

python event_labelling/CodeStructure_Branching/code_structure_and_branching.py
python event_labelling/Utility/csvFix.py

4. Data Cleaning

python process_model/clean.py

5. Preprocessing (for Graphs)

Keep only one section active if working with split datasets.

python process_model/preprocessing.py

6. Graph Generation

python process_model/graphing.py
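
As a rough idea of what a collaboration graph is built from, the sketch below turns PR records into weighted author-to-reviewer edges. It is a hypothetical illustration, not the logic of graphing.py: the record keys "author" and "reviewers" and the function name collaboration_edges are assumptions.

```python
from collections import Counter


def collaboration_edges(pull_requests):
    """Count how many PRs link each (author, reviewer) pair."""
    edges = Counter()
    for pr in pull_requests:
        author = pr["author"]
        # set() ensures a reviewer is counted once per PR; self-reviews are skipped.
        for reviewer in set(pr.get("reviewers", [])):
            if reviewer != author:
                edges[(author, reviewer)] += 1
    return edges


# Hypothetical sample data for illustration only.
sample_prs = [
    {"author": "alice", "reviewers": ["bob", "carol"]},
    {"author": "alice", "reviewers": ["bob", "alice"]},
    {"author": "bob"},
]
sample_edges = collaboration_edges(sample_prs)
```

The resulting counts can be fed to any graph library as edge weights, so heavier edges mark pairs who collaborate more often.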

7. Statistical Analysis (Optional)

To get general repository statistics:

python event_labelling/analysis.py

🛠️ Utility Modules

Bot Filter (event_labelling/Utility/botFilter.py)

A reusable utility module for filtering bot accounts from GitHub data.

Features:

  • Detects 20+ common bot patterns (dependabot, renovate, GitHub Actions, etc.)
  • Flexible filtering functions for DataFrames
  • Extensible with custom bot patterns
  • Includes verbose logging for transparency

Usage Example:

# The module name matches the file event_labelling/Utility/botFilter.py
from event_labelling.Utility.botFilter import (
    filter_bots_from_dataframe,
    remove_bot_commits,
    remove_bot_prs,
)

# Filter bot PRs
clean_prs_df = remove_bot_prs(prs_df)

# Filter bot commits
clean_commits_df = remove_bot_commits(commits_df)

# Custom filtering on any username column
clean_df = filter_bots_from_dataframe(df, username_column='reviewer')

Available Functions:

  • is_bot_username() - Check if a username is a bot
  • filter_bots_from_dataframe() - Filter bots from any DataFrame
  • remove_bot_prs() - Convenience function for PR data
  • remove_bot_commits() - Convenience function for commit data
  • get_bot_usernames() - List all bot usernames found
  • filter_bots_from_multiple_columns() - Filter based on multiple columns
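
To make the idea of pattern-based bot detection concrete, here is a minimal standard-library sketch. It is not the implementation in botFilter.py: the pattern list is a small illustrative subset, the names looks_like_bot and drop_bot_rows are hypothetical, and it works on plain dictionaries rather than DataFrames.

```python
import re

# Illustrative patterns only; the module's own list is longer.
BOT_PATTERNS = [
    r"\[bot\]$",         # GitHub App accounts, e.g. "dependabot[bot]"
    r"^dependabot",
    r"^renovate",
    r"^github-actions",
]
_BOT_RE = re.compile("|".join(BOT_PATTERNS), re.IGNORECASE)


def looks_like_bot(username):
    """Return True if the username matches any illustrative bot pattern."""
    return bool(username) and _BOT_RE.search(username) is not None


def drop_bot_rows(rows, username_key="author"):
    """Keep only the rows whose username does not look like a bot."""
    return [row for row in rows if not looks_like_bot(row.get(username_key))]
```

Matching on the "[bot]" suffix catches most GitHub App accounts, while the prefix patterns cover bots that run under ordinary user accounts.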

📊 Output

The scripts produce:

  • Cleaned CSV files with enriched PR and code structure data.
  • Graph visualizations showing collaboration patterns, code structure networks, and branching metrics.
  • Summary statistics in CSV or plotted form.
  • Bot-filtered datasets for accurate human collaboration analysis.

🧱 Project Structure

collabAnalysis/
├── documentation/
│   ├── analysis.md             # Analysis documentation
│   ├── app.md                  # App usage guide
│   └── csvFix.md               # CSV fixing documentation
├── scripts/
│   └── app.py                  # Fetches data via GitHub API
├── enrich_output/
│   └── overwrite_files.py      # Data enrichment step
├── event_labelling/
│   ├── CodeStructure&Branching/
│   │   └── code_structure_and_branching.py  # Code structure analysis
│   └── Utility/                # 🆕 Utility modules
│       ├── botFilter.py        # 🤖 Bot filtering utility
│       ├── csvFix.py           # CSV repair utilities
│       ├── pr_communication_label.py  # PR communication labeling
│       └── relabelling.py      # Data relabeling utilities
├── process_model/
│   ├── clean.py                # Data cleaning
│   ├── preprocessing.py        # Data preprocessing
│   └── graphing.py             # Graph generation
├── test/
│   ├── testApp.py              # App testing
│   ├── testBot_filter.py       # 🆕 Bot filter tests
│   └── testClean.py            # Clean module tests
├── data/
│   └── csv/                    # Output CSVs and processed data
├── confidential/               # Sensitive or anonymized data (e.g., usernames)
└── README.md                   # This file

🔧 Configuration

GitHub API Token

Set your GitHub API token as an environment variable:

export GITHUB_TOKEN='your_token_here'
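
A script can read this variable at startup and stop with a clear message when it is missing. The helper below is a sketch under that assumption, not code from this repository; the name require_github_token is hypothetical.

```python
import os


def require_github_token(env=os.environ):
    """Return the GitHub token from the environment or raise a clear error."""
    token = env.get("GITHUB_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "GITHUB_TOKEN is not set; export it before running data collection."
        )
    return token
```

Failing early this way avoids unauthenticated requests, which GitHub rate-limits far more strictly than authenticated ones.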

Repository Configuration

Update the following variables in scripts/app.py:

  • REPOSITORIES - List of repositories to analyze
  • REPO_OWNER - Repository owner/username
  • ORG_NAME - Organization name (if applicable)

Anonymization (Optional)

If using anonymized data, create a mapping file at confidential/anonymized_usernames.json that maps each real username to its pseudonym:

{
  "real_username_1": "Anon_User_1",
  "real_username_2": "Anon_User_2"
}

Then set ANONYMIZE = True in the relevant scripts.
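
How such a mapping might be applied can be sketched in a few lines. This is an assumption about usage, not the repository's code: load_mapping and anonymize are hypothetical names, and the choice to pass unknown usernames through unchanged is one option among several.

```python
import json
from pathlib import Path


def load_mapping(path):
    """Load the real-username to pseudonym mapping from a JSON file."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def anonymize(username, mapping):
    """Return the pseudonym for username, or the username itself if unmapped."""
    return mapping.get(username, username)


# Hypothetical in-memory mapping with the same shape as the JSON file.
example_mapping = {"real_username_1": "Anon_User_1"}
example_result = anonymize("real_username_1", example_mapping)
```

Passing unmapped names through keeps the pipeline running but can leak real usernames; a stricter variant could raise an error on any username missing from the mapping.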


🧩 Notes

  • GitHub API Token: Required for data collection. Set as GITHUB_TOKEN environment variable.
  • Bot Filtering: Automatically applied during data processing. Customize patterns in botFilter.py if needed.
  • Anonymization: Update paths in scripts if using anonymized datasets.
  • Working Directory: All scripts assume execution from the project root directory.
  • Ollama Dependency: Code structure analysis requires a running Ollama server.

🐛 Testing

Run unit tests to verify functionality:

# Test bot filtering
python test/testBot_filter.py

# Test data cleaning
python test/testClean.py

# Test app functionality
python test/testApp.py
