Thank you for your interest in contributing to WorkRB! We're building a community-driven benchmark for work domain AI evaluation, and your contributions help make it better for everyone.
- Ways to Contribute
- Development Setup
- Contributing Process
- Adding a New Task
- Adding a New Model
- Adding New Metrics
- Code Standards
- CI/CD Workflows
- Questions & Support
We welcome contributions of all kinds:
- 🐛 Report bugs – Found an issue? Let us know in GitHub Issues
- 📊 Add new tasks – Extend WorkRB with new evaluation tasks
- 🤖 Add new models – Implement state-of-the-art models or baselines
- 📈 Add new metrics – Contribute evaluation metrics relevant to the work domain
- 📚 Improve documentation – Help make WorkRB easier to use
- ✨ Suggest features – Share ideas for improvements
- Install uv
- Git
- Fork and clone the repository:
  git clone https://github.com/YOUR_USERNAME/workrb.git
  cd workrb
- Install dependencies:
  # Create and install a virtual environment, including dev
  uv sync --all-extras
  # Activate the virtual environment (venv)
  source .venv/bin/activate
  # Install the pre-commit hooks in the venv
  pre-commit install --install-hooks
- Verify installation:
  # Run example script
  uv run python examples/usage_example.py
- Create a new branch for your changes:
  git checkout -b feature/my-new-feature
Before starting any significant work (new feature, task, model, or refactor), please open a proposal issue first. This helps us align on scope and approach before you invest time in an implementation.
- Open an issue at https://github.com/techwolf-ai/workrb/issues describing your proposal. Select the 'Feature Request' template to provide additional context.
- Maintainers will triage and respond in the issue with feedback and next steps
- Once there's agreement on the direction, proceed with the implementation and open a Pull Request referencing the issue
Overall workflow:
- Make a fork of the main branch into your own repo
- Implement your code. See the later sections of this guide for how to add new tasks, models, or metrics.
- Ensure all linting and tests complete successfully locally before creating a PR:
  uv run poe lint
  uv run pytest tests/my_task_tests.py  # Just your tests
  uv run poe test  # Test suite (excludes model benchmarks)
  uv run poe test-benchmark  # Model benchmark tests only
- Have questions? Add them to your GitHub issue.
Make a pull request (PR) from your fork into the main branch of WorkRB, following these steps:
- Push your branch to your fork:
  git push origin feature/my-new-feature
- Open a Pull Request to the main branch on WorkRB's GitHub with:
  - A clear title describing the change
  - A link to the related issue using its hashtag identifier (e.g. #123 refers to issue 123)
  - The following template filled in:
    ## Description
    - Description of what changed and why
    - References to any related issues (use #)
    - Screenshots/examples if relevant
    ## Checklist
    - [ ] Added new tests for new functionality
    - [ ] Tested locally with example tasks
    - [ ] Code follows project style guidelines
    - [ ] Documentation updated
    - [ ] No new warnings introduced
- The Test CI workflow (test.yml) runs automatically on your PR; linting and the full test suite must pass before merging. Fix any failures before requesting review.
- Maintainers will review your PR
- Address any feedback or requested changes
- Once approved, a maintainer will merge your PR
While you've been working on your fork, the main branch in the original repo may have moved ahead. Before we can merge your PR, you need to merge the latest upstream changes into your fork's feature branch. To do this, run the following from your local fork repository, on the branch you're working on:
# Add the upstream remote (one-time)
git remote add upstream https://github.com/techwolf-ai/workrb.git
# Fetch latest changes and merge into your branch
git fetch upstream
git merge upstream/main
# Push to your fork
git push origin feature/my-new-feature

Merge commits on your feature branch are fine: PRs are squash-merged into main by the maintainers, so the final history stays clean.
Tasks are the core evaluation units in WorkRB. Follow these steps to add a new task:
Choose the appropriate base class for your task type:
- RankingTask in src/workrb/tasks/abstract/ranking_base.py
- ClassificationTask in src/workrb/tasks/abstract/classification_base.py
Create a new file in src/workrb/tasks/ranking/ or src/workrb/tasks/classification/ based on the task type.
For a full example, see also examples/custom_task_example.py.
# src/workrb/tasks/ranking/my_task.py
from workrb.types import ModelInputType
from workrb.registry import register_task
from workrb.tasks.abstract.base import DatasetSplit, Language
from workrb.tasks.abstract.ranking_base import RankingDataset, RankingTask, RankingTaskGroup
@register_task()
class MyCustomRankingTask(RankingTask):
"""
Description of your task.
This task evaluates models on [specific capability].
Dataset: [dataset name and source]
"""
@property
def name(self) -> str:
return "MyCustomRankingTask"
@property
def description(self) -> str:
return "Detailed description of what this task evaluates"
@property
def task_group(self) -> RankingTaskGroup:
# Choose appropriate group or add new one to RankingTaskGroup enum
return RankingTaskGroup.JOB2SKILL
@property
def query_input_type(self) -> ModelInputType:
"""Type of query texts (e.g., JOB_TITLE, SKILL_NAME, etc.)"""
return ModelInputType.JOB_TITLE
@property
def target_input_type(self) -> ModelInputType:
"""Type of target texts"""
return ModelInputType.SKILL_NAME
@property
def default_metrics(self) -> list[str]:
"""Override default metrics if needed"""
return ["map", "mrr", "recall@5", "recall@10"]
def load_dataset(self, dataset_id: str, split: DatasetSplit) -> RankingDataset:
"""
Load dataset for a specific dataset ID and split.
Parameters
----------
dataset_id : str
Dataset identifier. For monolingual tasks, the base class
automatically uses the language code as dataset_id
(e.g. "en", "de"), so you can use ``Language(dataset_id)``
to resolve the language when loading data.
split : DatasetSplit
Data split to load.
Returns
-------
RankingDataset
Dataset with query_texts, target_indices, and target_space.
"""
# Load your data here (from files, HuggingFace datasets, etc.)
# Example:
query_texts = ["Software Engineer", "Data Scientist"]
target_space = ["Python", "Machine Learning", "SQL"]
target_indices = [
[0, 2], # Software Engineer -> Python, SQL
[0, 1], # Data Scientist -> Python, Machine Learning
]
return RankingDataset(
query_texts=query_texts,
target_indices=target_indices,
target_space=target_space,
dataset_id=dataset_id,
)

The default Task base class assumes a 1:1 mapping between languages and datasets (each language code is the dataset ID). For tasks that have multiple datasets per language or cross-lingual evaluation pairs, override two methods:
- languages_to_dataset_ids(languages): return all dataset IDs that should be loaded for the given languages. For example, a cross-lingual task might return ["ita_q_it_c_en", "ita_q_it_c_de"] for [Language.IT] (see the sketch below).
- get_dataset_languages(dataset_id): return a DatasetLanguages named tuple specifying the input_languages and output_languages frozensets for a dataset. This controls how results are grouped during per-language aggregation.
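For illustration, a minimal sketch of the first override is shown here. The dataset ID strings simply reuse the example IDs above; check the Task base class for the exact signature before copying it:

```python
from workrb.tasks.abstract.base import Language


def languages_to_dataset_ids(self, languages: list[Language]) -> list[str]:
    """Return every dataset ID that should be loaded for the requested languages."""
    dataset_ids: list[str] = []
    if Language.IT in languages:
        # Italian queries evaluated against English and German target spaces
        dataset_ids += ["ita_q_it_c_en", "ita_q_it_c_de"]
    return dataset_ids
```

The second override declares the input and output language sets for each dataset ID: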
from workrb.types import DatasetLanguages, Language
def get_dataset_languages(self, dataset_id: str) -> DatasetLanguages:
# Example: Italian queries, English targets
return DatasetLanguages(
input_languages=frozenset({Language.IT}),
output_languages=frozenset({Language.EN}),
)

By default, per-language aggregation only includes monolingual datasets (LanguageAggregationMode.MONOLINGUAL_ONLY). Cross-lingual results can be aggregated using CROSSLINGUAL_GROUP_INPUT_LANGUAGES or CROSSLINGUAL_GROUP_OUTPUT_LANGUAGES; see the Results & Aggregation section in the README.
For a real-world cross-lingual task implementation, see src/workrb/tasks/ranking/melo.py which overrides both methods for multi-region, cross-lingual evaluation.
Update src/workrb/tasks/__init__.py:
from .ranking.my_task import MyCustomRankingTask
__all__ = [
# ... existing tasks
"MyCustomRankingTask",
]

Create tests/test_my_task.py:
import pytest
import workrb
from workrb.tasks.abstract.base import Language
def test_my_custom_task_loads():
"""Test that task loads without errors"""
task = workrb.tasks.MyCustomRankingTask(split="val", languages=["en"])
dataset_id = Language.EN.value
dataset = task.datasets[dataset_id]
assert len(dataset.query_texts) > 0
assert len(dataset.target_space) > 0
assert len(dataset.target_indices) == len(dataset.query_texts)

# Run your specific test
uv run pytest tests/test_my_task.py -v
# Run all tests to ensure no regressions
uv run poe test

Add documentation to your task class docstring (see the example after this list):
- Dataset source and version
- Task description and motivation
- Expected model behavior
- Any special considerations
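For illustration, a docstring covering these points could look like the sketch below; the dataset name, URL, and other details are placeholders, not a real dataset:

```python
from workrb.tasks.abstract.ranking_base import RankingTask


class MyCustomRankingTask(RankingTask):
    """
    Rank relevant skills for job vacancy titles.

    Dataset: JobSkills-1k v1.0 (https://example.com/jobskills) - placeholder
    source and version; cite the real dataset here.
    Motivation: evaluates whether models capture which skills a job requires.
    Expected model behavior: required skills should be ranked above unrelated
    skills for each job title query.
    Special considerations: note anything reviewers should know, e.g. licensing,
    machine-translated splits, or very small validation sets.
    """
```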
See examples/custom_task_example.py for a complete reference implementation.
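Once the task is registered and tested, you can sanity-check it end to end. The snippet below is a sketch: MyCustomModel is the example model from the "Adding a New Model" section further down, and any registered ModelInterface implementation can be used in its place:

```python
import workrb
from workrb.models import MyCustomModel  # example model from the model section below

# Load the validation split of the new task for English
task = workrb.tasks.MyCustomRankingTask(split="val", languages=["en"])

# Evaluate with the task's default metrics and inspect the per-metric scores
model = MyCustomModel()
results = task.evaluate(model=model, metrics=task.default_metrics, language="en")
print(results)
```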
Models in WorkRB implement the ModelInterface for unified evaluation.
Create a new file in src/workrb/models/:
# src/workrb/models/my_model.py
import torch
from sentence_transformers import SentenceTransformer
from workrb.types import ModelInputType
from workrb.models.base import ModelInterface
from workrb.registry import register_model
@register_model()
class MyCustomModel(ModelInterface):
"""
Description of your model.
This model uses [architecture/approach] for [task types].
"""
def __init__(self, model_name_or_path: str = "default-model"):
"""
Initialize the model.
Args:
model_name_or_path: Model identifier or path
"""
self.model = SentenceTransformer(model_name_or_path)
self.model_name_or_path = model_name_or_path
@property
def name(self) -> str:
"""Return model name for tracking/logging"""
return f"MyCustomModel-{self.model_name_or_path}"
@property
def description(self) -> str:
"""Add description for your model."""
return f"MyCustomModel is BiEncoder based on..."
def _compute_rankings(
self,
queries: list[str],
targets: list[str],
query_input_type: ModelInputType,
target_input_type: ModelInputType,
) -> torch.Tensor:
"""
Compute similarity scores between queries and targets.
Args:
queries: List of query strings
targets: List of target strings
query_input_type: Type of query (JOB_TITLE, SKILL_NAME, etc.)
target_input_type: Type of target
Returns:
Similarity matrix of shape [n_queries, n_targets]
Higher scores indicate better matches
"""
# Encode queries and targets
query_embeddings = self.model.encode(queries, convert_to_tensor=True)
target_embeddings = self.model.encode(targets, convert_to_tensor=True)
# Compute cosine similarity
similarity_matrix = torch.nn.functional.cosine_similarity(
query_embeddings.unsqueeze(1),
target_embeddings.unsqueeze(0),
dim=2
)
return similarity_matrix
def _compute_classification(
self,
texts: list[str],
targets: list[str],
input_type: ModelInputType,
target_input_type: ModelInputType | None = None,
) -> torch.Tensor:
"""
Compute classification scores.
For ranking-based classification, compute similarity to each class label.
For true classifiers, return logits from classification head.
Args:
texts: List of input texts to classify
targets: List of class labels
input_type: Type of input
target_input_type: Type of targets (class labels)
Returns:
Tensor of shape [n_texts, n_classes] with class scores
"""
# For embedding models, use similarity to class labels
text_embeddings = self.model.encode(texts, convert_to_tensor=True)
target_embeddings = self.model.encode(targets, convert_to_tensor=True)
scores = torch.nn.functional.cosine_similarity(
text_embeddings.unsqueeze(1),
target_embeddings.unsqueeze(0),
dim=2
)
return scores
@property
def classification_label_space(self) -> list[str] | None:
"""
Return list of class labels if model has a classification head.
For embedding-based models, return None (labels provided at inference time).
For true classifiers, return the ordered list of labels.
"""
return None

Update src/workrb/models/__init__.py:
from .my_model import MyCustomModel
__all__ = [
# ... existing models
"MyCustomModel",
]

Create a test file in tests/test_models/. This file contains both unit tests and (optionally) benchmark validation tests:
# tests/test_models/test_my_model.py
import pytest
from workrb.models.my_model import MyCustomModel
from workrb.tasks import TechSkillExtractRanking
from workrb.tasks.abstract.base import DatasetSplit, Language
from workrb.types import ModelInputType
class TestMyCustomModelLoading:
"""Test model loading and basic properties."""
def test_model_initialization(self):
"""Test model initialization"""
model = MyCustomModel()
assert model.name is not None
def test_model_ranking(self):
"""Test ranking computation"""
model = MyCustomModel()
queries = ["Software Engineer", "Data Scientist"]
targets = ["Python", "Machine Learning", "SQL"]
scores = model._compute_rankings(
queries=queries,
targets=targets,
query_input_type=ModelInputType.JOB_TITLE,
target_input_type=ModelInputType.SKILL_NAME,
)
assert scores.shape == (len(queries), len(targets))

If your model has published benchmark results and a compatible (ideally small) dataset is available in WorkRB, add a benchmark validation test in the same test file. Mark the benchmark class with @pytest.mark.model_performance:
# tests/test_models/test_my_model.py (continued)
@pytest.mark.model_performance
class TestMyCustomModelBenchmark:
"""Validate MyCustomModel against paper-reported metrics."""
def test_benchmark_metrics(self):
"""
Verify model achieves results close to paper-reported metrics.
Paper: "Title" (Venue Year)
Reported on [dataset] test set:
- MRR: 0.XX
- RP@5: XX.X%
"""
model = MyCustomModel()
task = TechSkillExtractRanking(split=DatasetSplit.TEST, languages=[Language.EN.value])
results = task.evaluate(model=model, metrics=["mrr", "rp@5"], language=Language.EN.value)
# Paper-reported values (allow tolerance for minor differences)
expected_mrr = 0.55
expected_rp5 = 0.60
assert results["mrr"] == pytest.approx(expected_mrr, abs=0.05)
assert results["rp@5"] == pytest.approx(expected_rp5, abs=0.05)See tests/test_models/test_contextmatch_model.py for a complete example.
Tests marked with @pytest.mark.model_performance are excluded from poe test by default. To run them:
- Locally: uv run poe test-benchmark
- In CI: Contributors can trigger the Model Benchmarks workflow manually from GitHub Actions (Actions → Model Benchmarks → Run workflow)
Make sure to use the @register_model() decorator (shown in Step 1); this makes your model discoverable via ModelRegistry.list_available().
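As a quick sanity check, something along these lines should show your model among the registered names. This is a sketch: it assumes ModelRegistry lives in workrb.registry alongside register_model, and that importing workrb.models triggers registration:

```python
import workrb.models  # importing the models package runs the @register_model() decorators
from workrb.registry import ModelRegistry  # assumption: ModelRegistry sits next to register_model

# Your model's name should appear in the list of discoverable models
print(ModelRegistry.list_available())
```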
Add your model to the Models table in README.md. You can either:
- Manually add a row to the table with your model's name, description, and whether it supports adaptive targets
- Generate a table of all registered models using the helper script:
uv run python examples/list_available_tasks_and_models.py
To add new evaluation metrics:
Add to src/workrb/metrics/ranking.py or classification.py:
def my_custom_metric(
prediction_matrix: np.ndarray,
pos_label_idxs: list[list[int]],
) -> float:
"""
Calculate my custom metric.
Args:
prediction_matrix: Scores of shape [n_queries, n_targets]
pos_label_idxs: List of lists of positive target indices per query
Returns:
Metric value (higher is better)
"""
# Your metric implementation
pass

Update the metric calculation function to include your metric:
# In calculate_ranking_metrics() or calculate_classification_metrics()
if "my_custom_metric" in metrics:
results["my_custom_metric"] = my_custom_metric(prediction_matrix, pos_label_idxs)

Add a test for your metric:

import numpy as np

def test_my_custom_metric():
scores = np.array([[0.9, 0.1], [0.2, 0.8]])
pos_labels = [[0], [1]]
result = my_custom_metric(scores, pos_labels)
assert 0 <= result <= 1  # Adjust based on metric range

We use automated tools to maintain code quality:
- Formatting: ruff (automatic)
- Linter: ruff (uv run poe lint)
- Docstring style: numpy

# Run all checks & auto-fix where possible
uv run poe lint

- All new code must have tests
- Tests must pass before merging
- Aim for >80% code coverage
# Run your specific tests only
uv run pytest tests/my_tests.py
# Run tests with coverage (excludes model benchmarks)
uv run poe test
# Run model benchmark tests only
uv run poe test-benchmark

Model Performance Tests: Benchmark tests in tests/test_models/ that are marked with @pytest.mark.model_performance validate model scores against paper-reported results. These are excluded from poe test by default.
- All public functions/classes must have docstrings
- Use numpy docstring format
- Include:
- Brief description
- Args/Parameters
- Returns
- Raises (if applicable)
- Examples (for complex functions)
Example:
def my_function(arg1: str, arg2: int = 5) -> list[str]:
"""
Brief one-line description.
Longer description if needed, explaining what the function does
and any important details.
Parameters
----------
arg1 : str
Description of arg1
arg2 : int, optional
Description of arg2, by default 5
Returns
-------
list[str]
Description of return value
Examples
--------
>>> my_function("test", 10)
['result1', 'result2']
"""
pass

This project uses Conventional Commits enforced by a pre-commit hook (commitizen). All commit messages must follow the format:
<type>: <description>
Common types: feat, fix, docs, chore, refactor, test. For example: feat: add SkillSkape ranking task.
Versioning and the CHANGELOG.md are managed automatically by commitizen (cz bump). You don't need to update the changelog manually; maintainers will handle your PR and new package releases.
The repository uses the following GitHub Actions workflows:
| Workflow | Trigger | What it does |
|---|---|---|
| Test (test.yml) | Push to main or PR to main | Runs linting and the full test suite (poe test) on Python 3.10 with both highest and lowest dependency resolutions |
| Model Benchmarks (benchmark.yml) | Manual trigger from Actions UI | Runs model performance tests (poe test-benchmark). Contributors can trigger this manually via Actions → Model Benchmarks → Run workflow |
| Publish (publish.yml) | GitHub Release creation | Publishes the package to PyPI (maintainers only) |
- 🐛 Bug reports: For problems and bugs, use GitHub Issues
- 💡 Feature requests: For new ideas or additions, use GitHub Issues
- 📧 Email: For other matters, contact maintainers: [email protected]
Thank you for contributing to WorkRB! Your efforts help make AI evaluation in the work domain more accessible and transparent for everyone. 🎉