
Contributing to WorkRB

Thank you for your interest in contributing to WorkRB! We're building a community-driven benchmark for work domain AI evaluation, and your contributions help make it better for everyone.


Ways to Contribute

We welcome contributions of all kinds:

  • 🐛 Report bugs – Found an issue? Let us know in GitHub Issues
  • 📊 Add new tasks – Extend WorkRB with new evaluation tasks
  • 🤖 Add new models – Implement state-of-the-art models or baselines
  • 📈 Add new metrics – Contribute evaluation metrics relevant to the work domain
  • 📚 Improve documentation – Help make WorkRB easier to use
  • ✨ Suggest features – Share ideas for improvements

Development Setup

Prerequisites

Setup Instructions

  1. Fork and clone the repository:

    git clone https://github.com/YOUR_USERNAME/workrb.git
    cd workrb
  2. Install dependencies:

     # Create a virtual environment and install all dependencies, including dev extras
     uv sync --all-extras
    
     # Activate the virtual environment (venv)
     source .venv/bin/activate
    
     # Install the pre-commit hooks in the venv
     pre-commit install --install-hooks
  3. Verify installation:

    # Run example script
    uv run python examples/usage_example.py
  4. Create a new branch for your changes:

    git checkout -b feature/my-new-feature

Contributing Process

1. Open an issue with your proposal

Before starting any significant work (new feature, task, model, or refactor), please open a proposal issue first. This helps us align on scope and approach before you invest time in an implementation.

  • Open an issue at https://github.com/techwolf-ai/workrb/issues describing your proposal. Select the 'Feature Request' template to provide additional context.
  • Maintainers will triage and respond in the issue with feedback and next steps
  • Once there’s agreement on the direction, proceed to the implementation with a Pull Request referencing the issue

2. Start implementing (local)

  1. Fork the repository and create a feature branch from main
  2. Implement your code. Later sections of this guide explain how to add new tasks, models, or metrics.
  3. Ensure all linting and tests complete successfully locally before creating a PR:
    uv run poe lint
    uv run pytest tests/my_task_tests.py  # Just your tests
    uv run poe test                       # Test suite (excludes model benchmarks)
    uv run poe test-benchmark             # Model benchmark tests only
  4. Have questions? Add them to your GitHub issue.

3. Submit Your PR

Make a pull request (PR) from your fork into the main branch of WorkRB, following:

  1. Push your branch to your fork:

    git push origin feature/my-new-feature
  2. Open a Pull Request to main branch on WorkRB's GitHub with:

    • A clear title describing the change

    • A link to the related issue using its number (e.g. #123 refers to issue 123)

    • The following template filled in:

      ## Description
      - Description of what changed and why
      - References to any related issues (use #)
      - Screenshots/examples if relevant
      
      ## Checklist
      - [ ] Added new tests for new functionality
      - [ ] Tested locally with example tasks
      - [ ] Code follows project style guidelines
      - [ ] Documentation updated
      - [ ] No new warnings introduced

4. Review Process

  1. The Test CI workflow (test.yml) runs automatically on your PR — linting and the full test suite must pass before merging. Fix any failures before requesting review.
  2. Maintainers will review your PR
  3. Address any feedback or requested changes
  4. Once approved, a maintainer will merge your PR

5. (Optional) Updating your fork when main has changed

While you've been working on your feature branch, the main branch of the original repo may have moved ahead. Before your PR can be merged, you need to merge the latest upstream changes into your fork's feature branch. To do this, run the following from your local clone, on the branch you're working on:

# Add the upstream remote (one-time)
git remote add upstream https://github.com/techwolf-ai/workrb.git

# Fetch latest changes and merge into your branch
git fetch upstream
git merge upstream/main

# Push to your fork
git push origin feature/my-new-feature

Merge commits on your feature branch are fine: PRs are squash-merged into main by the maintainers, so the final history stays clean.

Adding a New Task

Tasks are the core evaluation units in WorkRB. Follow these steps to add a new task:

Step 1: Choose the Task Type

Step 2: Create Your Task Class

Create a new file in src/workrb/tasks/ranking/ or src/workrb/tasks/classification/ based on the task type. For a full example, see also examples/custom_task_example.py.

# src/workrb/tasks/ranking/my_task.py

from workrb.types import ModelInputType
from workrb.registry import register_task
from workrb.tasks.abstract.base import DatasetSplit, Language
from workrb.tasks.abstract.ranking_base import RankingDataset, RankingTask, RankingTaskGroup


@register_task()
class MyCustomRankingTask(RankingTask):
    """
    Description of your task.
    
    This task evaluates models on [specific capability].
    Dataset: [dataset name and source]
    """
    
    @property
    def name(self) -> str:
        return "MyCustomRankingTask"
    
    @property
    def description(self) -> str:
        return "Detailed description of what this task evaluates"
    
    @property
    def task_group(self) -> RankingTaskGroup:
        # Choose appropriate group or add new one to RankingTaskGroup enum
        return RankingTaskGroup.JOB2SKILL
    
    @property
    def query_input_type(self) -> ModelInputType:
        """Type of query texts (e.g., JOB_TITLE, SKILL_NAME, etc.)"""
        return ModelInputType.JOB_TITLE
    
    @property
    def target_input_type(self) -> ModelInputType:
        """Type of target texts"""
        return ModelInputType.SKILL_NAME
    
    @property
    def default_metrics(self) -> list[str]:
        """Override default metrics if needed"""
        return ["map", "mrr", "recall@5", "recall@10"]
    
    def load_dataset(self, dataset_id: str, split: DatasetSplit) -> RankingDataset:
        """
        Load dataset for a specific dataset ID and split.

        Parameters
        ----------
        dataset_id : str
            Dataset identifier. For monolingual tasks, the base class
            automatically uses the language code as dataset_id
            (e.g. "en", "de"), so you can use ``Language(dataset_id)``
            to resolve the language when loading data.
        split : DatasetSplit
            Data split to load.

        Returns
        -------
        RankingDataset
            Dataset with query_texts, target_indices, and target_space.
        """
        # Load your data here (from files, HuggingFace datasets, etc.)
        # Example:
        query_texts = ["Software Engineer", "Data Scientist"]
        target_space = ["Python", "Machine Learning", "SQL"]
        target_indices = [
            [0, 2],  # Software Engineer -> Python, SQL
            [0, 1],  # Data Scientist -> Python, Machine Learning
        ]

        return RankingDataset(
            query_texts=query_texts,
            target_indices=target_indices,
            target_space=target_space,
            dataset_id=dataset_id,
        )
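The shape contract between query_texts, target_indices, and target_space can be sanity-checked with plain Python. This self-contained sketch mirrors the example data above and needs no workrb imports:

```python
# Example data mirroring the load_dataset example above
query_texts = ["Software Engineer", "Data Scientist"]
target_space = ["Python", "Machine Learning", "SQL"]
target_indices = [
    [0, 2],  # Software Engineer -> Python, SQL
    [0, 1],  # Data Scientist -> Python, Machine Learning
]

# One list of positive target indices per query
assert len(target_indices) == len(query_texts)
# Every index must point into the shared target space
assert all(0 <= i < len(target_space) for row in target_indices for i in row)

# Human-readable view of the labels
for query, idxs in zip(query_texts, target_indices):
    print(query, "->", [target_space[i] for i in idxs])
```

If either assertion fails, RankingDataset construction or evaluation will not behave as expected.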

Advanced: Cross-lingual and multi-dataset tasks

The default Task base class assumes a 1:1 mapping between languages and datasets (each language code is the dataset ID). For tasks that have multiple datasets per language or cross-lingual evaluation pairs, override two methods:

  • languages_to_dataset_ids(languages) — return all dataset IDs that should be loaded for the given languages. For example, a cross-lingual task might return ["ita_q_it_c_en", "ita_q_it_c_de"] for [Language.IT].
  • get_dataset_languages(dataset_id) — return a DatasetLanguages named tuple specifying the input_languages and output_languages frozensets for a dataset. This controls how results are grouped during per-language aggregation.
For example:

from workrb.types import DatasetLanguages, Language

def get_dataset_languages(self, dataset_id: str) -> DatasetLanguages:
    # Example: Italian queries, English targets
    return DatasetLanguages(
        input_languages=frozenset({Language.IT}),
        output_languages=frozenset({Language.EN}),
    )
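A matching languages_to_dataset_ids override could look like the following self-contained sketch. Note the Language enum here is a local stand-in for workrb.types.Language, and the dataset IDs and mapping are purely illustrative:

```python
from enum import Enum


class Language(Enum):  # local stand-in for workrb.types.Language
    IT = "it"
    EN = "en"


# Hypothetical mapping: cross-lingual dataset IDs available per input language
CROSS_LINGUAL_DATASETS: dict[Language, list[str]] = {
    Language.IT: ["ita_q_it_c_en", "ita_q_it_c_de"],
}


def languages_to_dataset_ids(languages: list[Language]) -> list[str]:
    """Return all dataset IDs that should be loaded for the given languages."""
    dataset_ids: list[str] = []
    for lang in languages:
        dataset_ids.extend(CROSS_LINGUAL_DATASETS.get(lang, []))
    return dataset_ids


print(languages_to_dataset_ids([Language.IT]))  # ['ita_q_it_c_en', 'ita_q_it_c_de']
```

In a real task this would be a method on your task class and would derive the IDs from your dataset layout rather than a hard-coded dict.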

By default, per-language aggregation only includes monolingual datasets (LanguageAggregationMode.MONOLINGUAL_ONLY). Cross-lingual results can be aggregated using CROSSLINGUAL_GROUP_INPUT_LANGUAGES or CROSSLINGUAL_GROUP_OUTPUT_LANGUAGES — see the Results & Aggregation section in the README.

For a real-world cross-lingual task implementation, see src/workrb/tasks/ranking/melo.py which overrides both methods for multi-region, cross-lingual evaluation.

Step 3: Add to Module Exports

Update src/workrb/tasks/__init__.py:

from .ranking.my_task import MyCustomRankingTask

__all__ = [
    # ... existing tasks
    "MyCustomRankingTask",
]

Step 4: Create Tests

Create tests/test_my_task.py:

import pytest
import workrb
from workrb.tasks.abstract.base import Language


def test_my_custom_task_loads():
    """Test that task loads without errors"""
    task = workrb.tasks.MyCustomRankingTask(split="val", languages=["en"])
    dataset_id = Language.EN.value
    dataset = task.datasets[dataset_id]

    assert len(dataset.query_texts) > 0
    assert len(dataset.target_space) > 0
    assert len(dataset.target_indices) == len(dataset.query_texts)

Step 5: Test Your Task

# Run your specific test
uv run pytest tests/test_my_task.py -v

# Run all tests to ensure no regressions
uv run poe test

Step 6: Document Your Task

Add documentation to your task class docstring:

  • Dataset source and version
  • Task description and motivation
  • Expected model behavior
  • Any special considerations

See examples/custom_task_example.py for a complete reference implementation.

Adding a New Model

Models in WorkRB implement the ModelInterface for unified evaluation.

Step 1: Implement ModelInterface

Create a new file in src/workrb/models/:

# src/workrb/models/my_model.py

import torch
from sentence_transformers import SentenceTransformer

from workrb.types import ModelInputType
from workrb.models.base import ModelInterface
from workrb.registry import register_model


@register_model()
class MyCustomModel(ModelInterface):
    """
    Description of your model.
    
    This model uses [architecture/approach] for [task types].
    """
    
    def __init__(self, model_name_or_path: str = "default-model"):
        """
        Initialize the model.
        
        Args:
            model_name_or_path: Model identifier or path
        """
        self.model = SentenceTransformer(model_name_or_path)
        self.model_name_or_path = model_name_or_path
    
    @property
    def name(self) -> str:
        """Return model name for tracking/logging"""
        return f"MyCustomModel-{self.model_name_or_path}"
    
    @property
    def description(self) -> str:
        """Add description for your model."""
        return "MyCustomModel is a bi-encoder based on..."

    def _compute_rankings(
        self,
        queries: list[str],
        targets: list[str],
        query_input_type: ModelInputType,
        target_input_type: ModelInputType,
    ) -> torch.Tensor:
        """
        Compute similarity scores between queries and targets.
        
        Args:
            queries: List of query strings
            targets: List of target strings
            query_input_type: Type of query (JOB_TITLE, SKILL_NAME, etc.)
            target_input_type: Type of target
        
        Returns:
            Similarity matrix of shape [n_queries, n_targets]
            Higher scores indicate better matches
        """
        # Encode queries and targets
        query_embeddings = self.model.encode(queries, convert_to_tensor=True)
        target_embeddings = self.model.encode(targets, convert_to_tensor=True)
        
        # Compute cosine similarity
        similarity_matrix = torch.nn.functional.cosine_similarity(
            query_embeddings.unsqueeze(1),
            target_embeddings.unsqueeze(0),
            dim=2
        )
        
        return similarity_matrix
    
    def _compute_classification(
        self,
        texts: list[str],
        targets: list[str],
        input_type: ModelInputType,
        target_input_type: ModelInputType | None = None,
    ) -> torch.Tensor:
        """
        Compute classification scores.
        
        For ranking-based classification, compute similarity to each class label.
        For true classifiers, return logits from classification head.
        
        Args:
            texts: List of input texts to classify
            targets: List of class labels
            input_type: Type of input
            target_input_type: Type of targets (class labels)
        
        Returns:
            Tensor of shape [n_texts, n_classes] with class scores
        """
        # For embedding models, use similarity to class labels
        text_embeddings = self.model.encode(texts, convert_to_tensor=True)
        target_embeddings = self.model.encode(targets, convert_to_tensor=True)
        
        scores = torch.nn.functional.cosine_similarity(
            text_embeddings.unsqueeze(1),
            target_embeddings.unsqueeze(0),
            dim=2
        )
        
        return scores
    
    @property
    def classification_label_space(self) -> list[str] | None:
        """
        Return list of class labels if model has a classification head.
        
        For embedding-based models, return None (labels provided at inference time).
        For true classifiers, return the ordered list of labels.
        """
        return None
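For intuition, the broadcasted cosine-similarity computation used in both methods above is equivalent to this small NumPy sketch, where random vectors stand in for real sentence embeddings:

```python
import numpy as np


def cosine_similarity_matrix(queries: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity: [n_queries, dim] x [n_targets, dim] -> [n_queries, n_targets]."""
    # Normalize each row to unit length, then a matrix product gives cosine similarities
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    return q @ t.T


rng = np.random.default_rng(0)
q_emb = rng.normal(size=(2, 8))  # stand-ins for 2 encoded queries
t_emb = rng.normal(size=(3, 8))  # stand-ins for 3 encoded targets
sims = cosine_similarity_matrix(q_emb, t_emb)
print(sims.shape)  # (2, 3)
```

The torch version in the class does the same thing via unsqueeze/broadcast; normalizing once and taking a matrix product is typically cheaper for large target spaces.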

Step 2: Add to Module Exports

Update src/workrb/models/__init__.py:

from .my_model import MyCustomModel

__all__ = [
    # ... existing models
    "MyCustomModel",
]

Step 3: Test Your Model

Create a test file in tests/test_models/. A single file holds both the unit tests and (optionally) the benchmark validation tests:

# tests/test_models/test_my_model.py

import pytest
from workrb.models.my_model import MyCustomModel
from workrb.tasks import TechSkillExtractRanking
from workrb.tasks.abstract.base import DatasetSplit, Language
from workrb.types import ModelInputType


class TestMyCustomModelLoading:
    """Test model loading and basic properties."""

    def test_model_initialization(self):
        """Test model initialization"""
        model = MyCustomModel()
        assert model.name is not None

    def test_model_ranking(self):
        """Test ranking computation"""
        model = MyCustomModel()
        queries = ["Software Engineer", "Data Scientist"]
        targets = ["Python", "Machine Learning", "SQL"]

        scores = model._compute_rankings(
            queries=queries,
            targets=targets,
            query_input_type=ModelInputType.JOB_TITLE,
            target_input_type=ModelInputType.SKILL_NAME,
        )

        assert scores.shape == (len(queries), len(targets))

Step 4: Validate Model Performance (if prior paper results available)

If your model has published benchmark results and a compatible (ideally small) dataset is available in WorkRB, add a benchmark validation test in the same test file. Mark the benchmark class with @pytest.mark.model_performance:

# tests/test_models/test_my_model.py (continued)

@pytest.mark.model_performance
class TestMyCustomModelBenchmark:
    """Validate MyCustomModel against paper-reported metrics."""

    def test_benchmark_metrics(self):
        """
        Verify model achieves results close to paper-reported metrics.

        Paper: "Title" (Venue Year)
        Reported on [dataset] test set:
        - MRR: 0.XX
        - RP@5: XX.X%
        """
        model = MyCustomModel()
        task = TechSkillExtractRanking(split=DatasetSplit.TEST, languages=[Language.EN.value])

        results = task.evaluate(model=model, metrics=["mrr", "rp@5"], language=Language.EN.value)

        # Paper-reported values (allow tolerance for minor differences)
        expected_mrr = 0.55
        expected_rp5 = 0.60

        assert results["mrr"] == pytest.approx(expected_mrr, abs=0.05)
        assert results["rp@5"] == pytest.approx(expected_rp5, abs=0.05)

See tests/test_models/test_contextmatch_model.py for a complete example.

Tests marked with @pytest.mark.model_performance are excluded from poe test by default. To run them:

  • Locally: uv run poe test-benchmark
  • In CI: Contributors can trigger the Model Benchmarks workflow manually from GitHub Actions (Actions → Model Benchmarks → Run workflow)

Step 5: Register Your Model

Make sure to use the @register_model() decorator (shown in Step 1); this makes your model discoverable via ModelRegistry.list_available().

Step 6: Document Your Model

Add your model to the Models table in README.md. You can either:

  1. Manually add a row to the table with your model's name, description, and whether it supports adaptive targets
  2. Generate a table of all registered models using the helper script:
    uv run python examples/list_available_tasks_and_models.py

Adding New Metrics

To add new evaluation metrics:

Step 1: Implement Metric Function

Add to src/workrb/metrics/ranking.py or classification.py:

def my_custom_metric(
    prediction_matrix: np.ndarray,
    pos_label_idxs: list[list[int]],
) -> float:
    """
    Calculate my custom metric.
    
    Args:
        prediction_matrix: Scores of shape [n_queries, n_targets]
        pos_label_idxs: List of lists of positive target indices per query
    
    Returns:
        Metric value (higher is better)
    """
    # Your metric implementation
    pass
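As a concrete illustration of the expected signature, here is a self-contained mean-reciprocal-rank implementation in the same shape. This is a sketch for illustration, not WorkRB's internal mrr implementation:

```python
import numpy as np


def mean_reciprocal_rank(
    prediction_matrix: np.ndarray,
    pos_label_idxs: list[list[int]],
) -> float:
    """Mean reciprocal rank of the best-ranked positive target per query."""
    reciprocal_ranks = []
    for scores, positives in zip(prediction_matrix, pos_label_idxs):
        ranking = np.argsort(-scores)  # target indices, best score first
        # 1-based rank of the best-ranked positive target
        best_rank = min(int(np.where(ranking == p)[0][0]) for p in positives) + 1
        reciprocal_ranks.append(1.0 / best_rank)
    return float(np.mean(reciprocal_ranks))


scores = np.array([[0.9, 0.1, 0.3], [0.2, 0.8, 0.5]])
pos_labels = [[2], [1]]
print(mean_reciprocal_rank(scores, pos_labels))  # 0.75
```

Whatever your metric computes, keep the same contract: a [n_queries, n_targets] score matrix in, a single float out, higher is better.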

Step 2: Register in Metric Calculator

Update the metric calculation function to include your metric:

# In calculate_ranking_metrics() or calculate_classification_metrics()
if "my_custom_metric" in metrics:
    results["my_custom_metric"] = my_custom_metric(prediction_matrix, pos_label_idxs)

Step 3: Add Tests

def test_my_custom_metric():
    scores = np.array([[0.9, 0.1], [0.2, 0.8]])
    pos_labels = [[0], [1]]
    
    result = my_custom_metric(scores, pos_labels)
    assert 0 <= result <= 1  # Adjust based on metric range

Code Standards

We use automated tools to maintain code quality:

Formatting & Linting

  • Formatting: ruff (automatic)
  • Linter: ruff (uv run poe lint)
  • Docstring style: numpy

# Run all checks & auto-fix where possible
uv run poe lint

Testing Requirements

  • All new code must have tests
  • Tests must pass before merging
  • Aim for >80% code coverage

# Run your specific tests only
uv run pytest tests/my_tests.py

# Run tests with coverage (excludes model benchmarks)
uv run poe test

# Run model benchmark tests only
uv run poe test-benchmark

Model Performance Tests: Benchmark tests in tests/test_models/ that are marked with @pytest.mark.model_performance validate model scores against paper-reported results. These are excluded from poe test by default.

Documentation Standards

  • All public functions/classes must have docstrings
  • Use numpy docstring format
  • Include:
    • Brief description
    • Args/Parameters
    • Returns
    • Raises (if applicable)
    • Examples (for complex functions)

Example:

def my_function(arg1: str, arg2: int = 5) -> list[str]:
    """
    Brief one-line description.
    
    Longer description if needed, explaining what the function does
    and any important details.
    
    Parameters
    ----------
    arg1 : str
        Description of arg1
    arg2 : int, optional
        Description of arg2, by default 5
    
    Returns
    -------
    list[str]
        Description of return value
    
    Examples
    --------
    >>> my_function("test", 10)
    ['result1', 'result2']
    """
    pass

Commit Messages & Versioning

This project uses Conventional Commits enforced by a pre-commit hook (commitizen). All commit messages must follow the format:

<type>: <description>

Common types: feat, fix, docs, chore, refactor, test. For example: feat: add SkillSkape ranking task.

Versioning and the CHANGELOG.md are managed automatically by commitizen (cz bump). You don't need to update the changelog manually; maintainers handle versioning and new package releases when your PR is merged.
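As a rough illustration of the expected message shape, the format can be checked with a simple regex. Note this is a simplification; commitizen's actual validation rules are stricter:

```python
import re

# Simplified Conventional Commits shape: "<type>: <description>" with optional scope
COMMIT_RE = re.compile(r"^(feat|fix|docs|chore|refactor|test)(\(.+\))?: .+")

print(bool(COMMIT_RE.match("feat: add SkillSkape ranking task")))  # True
print(bool(COMMIT_RE.match("Added a new task")))                   # False
```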

CI/CD Workflows

The repository uses the following GitHub Actions workflows:

  • Test (test.yml) — runs on pushes and PRs to main. Runs linting and the full test suite (poe test) on Python 3.10 with both highest and lowest dependency resolutions.
  • Model Benchmarks (benchmark.yml) — manual trigger. Runs the model performance tests (poe test-benchmark); contributors can start it from Actions → Model Benchmarks → Run workflow.
  • Publish (publish.yml) — runs on GitHub Release creation. Publishes the package to PyPI (maintainers only).

Questions & Support


Thank you for contributing to WorkRB! Your efforts help make AI evaluation in the work domain more accessible and transparent for everyone. 🎉