Split Benchmark

This repository contains a small, self-contained benchmark kit for evaluating Retab document splitting on the public PoliTax-Split dataset.

It includes saved Retab split outputs, a metric script, and a Streamlit viewer for comparing predicted segments with the ground truth.

benchmark_config.json: shared benchmark metadata, model mapping, instructions, and subdocument definitions.
inputs/: one JSON input per benchmark PDF, including the public PDF URL and ground-truth segments.
results/article_snapshot/: saved SDK-shaped split JSONs for the bundled benchmark snapshot, one file per document/model pair.
run_retab_splits.py: calls client.splits.create(...) and writes the SDK response as JSON.
compute_metrics.py: downloads the public PoliTax-Split annotations from Hugging Face when missing and computes benchmark metrics from saved split JSON outputs.
streamlit_viewer.py: visualizes saved split JSONs against the ground truth.
requirements.txt: Python dependencies for the scripts, tests, and viewer.

No PDFs are copied into this repository. The scripts use the public PDF URLs in inputs/*.json.

Setup

Install the dependencies with your preferred Python environment or package manager:

pip install -r requirements.txt

Set your Retab API key before running fresh split jobs:

export RETAB_API_KEY=sk_...

You do not need a Retab API key to inspect the bundled snapshot, run metrics, or launch the Streamlit viewer.

Reproduce Splits

The runner is intentionally self-contained: it has no command-line arguments. It reads benchmark_config.json and every file under inputs/, runs the fixed document/model configuration declared at the top of run_retab_splits.py, and writes the live outputs.

python run_retab_splits.py

The runner writes:

results/live/<run_id>/<model>/<document>.json

Each JSON file is the direct return value of splits.create. There is exactly one result JSON per document/model pair; no separate get call is needed because splits.create already returns the split object.
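As a rough illustration of that layout, the sketch below indexes one run directory by model and document. The helper name collect_results is hypothetical and is not part of this repository; it only assumes the results/live/<run_id>/<model>/<document>.json layout described above.

```python
from pathlib import Path


def collect_results(run_dir):
    """Index result files under one run directory.

    Assumes the layout <run_dir>/<model>/<document>.json and returns
    {(model, document): path}. Hypothetical helper, not part of the kit.
    """
    run_dir = Path(run_dir)
    return {
        (path.parent.name, path.stem): path
        for path in sorted(run_dir.glob("*/*.json"))
    }
```

For example, collect_results("results/live/<run_id>") would return one entry per document/model pair, which makes it easy to check that no pair is missing after a run.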

Compute Metrics

python compute_metrics.py

On first run, the script downloads annotations.jsonl, taxonomy.json, and metadata.csv from Extend-AI/PoliTax-Split into huggingface/. Those Hugging Face files provide the benchmark ground truth; the Retab predictions are the saved split JSON results under results/article_snapshot/ and results/live/<run_id>/.

The script writes:

metrics/politaxsplit_metrics.json
metrics/politaxsplit_metrics_aggregate.csv
metrics/politaxsplit_metrics_per_document.csv

The five reported metrics are:

page_level_accuracy: every page receives one predicted subdocument type and one ground-truth type. The score is the fraction of pages where the two labels are exactly equal. It ignores segment boundaries except through their effect on page labels.
typed_iou_f1: each predicted segment can match at most one ground-truth segment. A match requires the same subdocument type and page-span IoU >= 0.8, where IoU is overlapping_pages / union_pages. The final score is F1 over matched predicted and ground-truth segments.
boundary_f1: compares internal segment start pages. The first page is not counted as a boundary. A predicted boundary matches one ground-truth boundary if it is within +/-1 page, and each ground-truth boundary can be matched once. The final score is F1 over matched boundaries.
oversegmentation: max(0, predicted_segments - ground_truth_segments) / ground_truth_segments. This measures extra predicted pieces. Lower is better; 0 means the prediction did not create more segments than the ground truth.
undersegmentation: 1 - max(0, ground_truth_segments - predicted_segments) / ground_truth_segments. This measures missing predicted pieces as a score. Higher is better; 1 means the prediction did not create fewer segments than the ground truth.
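The definitions above can be sketched for a single document as follows. This is a minimal illustration, not the code in compute_metrics.py. It assumes each segment is a (type, pages) tuple with a non-empty page list, that the document starts on page 1, that matching is greedy, and that F1 is 1.0 when both sides are empty. The subdocument type names are made up for the example.

```python
def page_labels(segments):
    """Map each page number to its subdocument type."""
    return {page: name for name, pages in segments for page in pages}


def f1(matched, n_pred, n_truth):
    """F1 = 2*TP / (n_pred + n_truth); 1.0 if both are empty (assumption)."""
    total = n_pred + n_truth
    return 2 * matched / total if total else 1.0


def page_level_accuracy(pred, truth):
    pred_labels, true_labels = page_labels(pred), page_labels(truth)
    hits = sum(pred_labels.get(p) == name for p, name in true_labels.items())
    return hits / len(true_labels)


def typed_iou_f1(pred, truth, threshold=0.8):
    used, matched = set(), 0
    for name, pages in pred:
        for i, (true_name, true_pages) in enumerate(truth):
            if i in used or name != true_name:
                continue
            overlap = len(set(pages) & set(true_pages))
            union = len(set(pages) | set(true_pages))
            if overlap / union >= threshold:
                used.add(i)  # each ground-truth segment is matched once
                matched += 1
                break
    return f1(matched, len(pred), len(truth))


def boundaries(segments, first_page=1):
    """Internal segment start pages; the first page is not a boundary."""
    return sorted({min(pages) for _, pages in segments} - {first_page})


def boundary_f1(pred, truth, tolerance=1):
    pred_b, unused, matched = boundaries(pred), boundaries(truth), 0
    for p in pred_b:
        hit = next((t for t in unused if abs(t - p) <= tolerance), None)
        if hit is not None:
            unused.remove(hit)  # each ground-truth boundary is matched once
            matched += 1
    return f1(matched, len(pred_b), len(boundaries(truth)))


def oversegmentation(pred, truth):
    return max(0, len(pred) - len(truth)) / len(truth)


def undersegmentation(pred, truth):
    return 1 - max(0, len(truth) - len(pred)) / len(truth)


# Hypothetical 6-page packet; type names are illustrative only.
truth = [("form_a", [1, 2]), ("form_b", [3]), ("form_c", [4, 5, 6])]
pred = [("form_a", [1, 2]), ("form_b", [3, 4]), ("form_c", [5, 6])]

print(page_level_accuracy(pred, truth))  # 5 of 6 pages agree
print(typed_iou_f1(pred, truth))         # only form_a reaches IoU >= 0.8
print(boundary_f1(pred, truth))          # both boundaries match within 1 page
```

In this example a single boundary placed one page late still earns a perfect boundary_f1, while typed_iou_f1 drops to 1/3 because two of the three segments fall below the 0.8 IoU threshold.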

For the aggregate table, page_level_accuracy, typed_iou_f1, and boundary_f1 are micro-aggregated from corpus-level counts. oversegmentation and undersegmentation are averaged across documents so one long PDF does not dominate the instance-count diagnostics.
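The difference between the two aggregation modes can be sketched like this. The function names and the sample counts are hypothetical and only illustrate the arithmetic; they are not taken from compute_metrics.py or from the benchmark results.

```python
from statistics import mean


def micro_accuracy(counts):
    """counts: list of (correct_pages, total_pages), one entry per document."""
    return sum(c for c, _ in counts) / sum(t for _, t in counts)


def micro_f1(counts):
    """counts: list of (matched, n_predicted, n_ground_truth) per document."""
    matched = sum(m for m, _, _ in counts)
    total = sum(p + t for _, p, t in counts)
    return 2 * matched / total if total else 1.0  # empty case is an assumption


def macro_average(scores):
    """Plain mean of per-document scores; every document weighs the same."""
    return mean(scores)


# Hypothetical corpus: one 100-page PDF and one 4-page PDF.
page_counts = [(90, 100), (2, 4)]
print(micro_accuracy(page_counts))   # 92 / 104, dominated by the long PDF
print(macro_average([0.0, 0.5]))     # per-document oversegmentation, averaged
```

With micro-aggregation the long PDF contributes 100 of the 104 pages, so it drives the score. Averaging per document gives the 4-page PDF the same weight as the 100-page one, which is the behavior described for oversegmentation and undersegmentation.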

View Results

streamlit run streamlit_viewer.py

The viewer loads the bundled article_snapshot by default. If you run fresh split jobs, new result sets appear as live/<run_id>.

Run Tests

pytest -q

Result Shape

Every split JSON follows the public SDK Split resource shape:

{
  "id": "split_...",
  "file": { "id": "file_...", "filename": "...pdf", "mime_type": "application/pdf" },
  "model": "retab-large",
  "subdocuments": [{ "name": "...", "description": "...", "allow_multiple_instances": true }],
  "n_consensus": 1,
  "instructions": "Split this PoliTax packet into the listed tax subdocuments.",
  "output": [{ "name": "...", "pages": [1, 2, 3] }]
}

The bundled article snapshot IDs are deterministic placeholders. Fresh live runs contain real Retab split IDs and file IDs.
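To compare a result with the ground truth, the output list can be flattened into one label per page. The sketch below assumes only the output field shown above; the inline JSON and its subdocument names are hypothetical stand-ins for a real result file.

```python
import json

# Hypothetical, trimmed-down result; real files carry the full Split shape.
raw = """
{
  "model": "retab-large",
  "output": [
    {"name": "form_a", "pages": [1, 2]},
    {"name": "form_b", "pages": [3, 4, 5]}
  ]
}
"""

split = json.loads(raw)

# One (type, pages) tuple per predicted segment.
segments = [(item["name"], item["pages"]) for item in split["output"]]

# One label per page, the form used by page-level comparisons.
labels = {page: name for name, pages in segments for page in pages}

print(labels[3])  # form_b
```

For a saved result, replace the inline string with the contents of the JSON file, for example via pathlib.Path(...).read_text().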
