This repository contains a small, self-contained benchmark kit for evaluating Retab document splitting on the public PoliTax-Split dataset.
It includes saved Retab split outputs, a metric script, and a Streamlit viewer for comparing predicted segments with the ground truth.
- benchmark_config.json: shared benchmark metadata, model mapping, instructions, and subdocument definitions.
- inputs/: one JSON input per benchmark PDF, including the public PDF URL and ground-truth segments.
- results/article_snapshot/: saved SDK-shaped split JSONs for the bundled benchmark snapshot, one file per document/model pair.
- run_retab_splits.py: calls client.splits.create(...) and writes the SDK response as JSON.
- compute_metrics.py: downloads the public PoliTax-Split annotations from Hugging Face when missing and computes benchmark metrics from saved split JSON outputs.
- streamlit_viewer.py: visualizes saved split JSONs against the ground truth.
- requirements.txt: Python dependencies for the scripts, tests, and viewer.
No PDFs are copied into this repository. The scripts use the public PDF URLs in inputs/*.json.
Install the dependencies with your preferred Python environment or package manager:
pip install -r requirements.txt
Set your Retab API key before running fresh split jobs:
export RETAB_API_KEY=sk_...
You do not need a Retab API key to inspect the bundled snapshot, run metrics, or launch the Streamlit viewer.
The runner is intentionally self-contained: it has no command-line arguments. It reads benchmark_config.json and every file under inputs/, runs the fixed document/model configuration declared at the top of run_retab_splits.py, and writes the live outputs.
python run_retab_splits.py
The runner writes:
results/live/<run_id>/<model>/<document>.json
The JSON file is the direct return value from splits.create. There is exactly one result JSON per document/model pair; there is no separate get output because splits.create already returns the split object.
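The output layout above can be sketched with a small path helper. This is only an illustration: result_path and save_result are hypothetical names, not functions taken from run_retab_splits.py, and the sketch assumes the split response has already been converted to a plain dict.

```python
import json
from pathlib import Path


def result_path(root: Path, run_id: str, model: str, document: str) -> Path:
    """Build results/live/<run_id>/<model>/<document>.json under a root directory."""
    return root / "results" / "live" / run_id / model / f"{document}.json"


def save_result(root: Path, run_id: str, model: str, document: str, split: dict) -> Path:
    """Write one split result (assumed to be a plain dict) to its JSON file."""
    path = result_path(root, run_id, model, document)
    # Create the <run_id>/<model>/ directories on first use.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(split, indent=2), encoding="utf-8")
    return path
```

Because each document/model pair maps to exactly one path, rerunning the same pair within a run would overwrite the earlier file in this sketch.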
Compute the metrics with:
python compute_metrics.py
On first run, the script downloads annotations.jsonl, taxonomy.json, and metadata.csv from Extend-AI/PoliTax-Split into huggingface/. Those Hugging Face files provide the benchmark ground truth; the Retab predictions are the saved split JSON results under results/article_snapshot/ and results/live/<run_id>/.
The script writes:
metrics/politaxsplit_metrics.json
metrics/politaxsplit_metrics_aggregate.csv
metrics/politaxsplit_metrics_per_document.csv
The five reported metrics are:
- page_level_accuracy: every page receives one predicted subdocument type and one ground-truth type. This is the fraction of pages where those labels are exactly equal. It ignores segment boundaries except through their effect on page labels.
- typed_iou_f1: each predicted segment can match at most one ground-truth segment. A match requires the same subdocument type and page-span IoU >= 0.8, where IoU is overlapping_pages / union_pages. The final score is F1 over matched predicted and ground-truth segments.
- boundary_f1: compares internal segment start pages. The first page is not counted as a boundary. A predicted boundary matches one ground-truth boundary if it is within +/-1 page, and each ground-truth boundary can be matched once. The final score is F1 over matched boundaries.
- oversegmentation: max(0, predicted_segments - ground_truth_segments) / ground_truth_segments. This measures extra predicted pieces. Lower is better; 0 means the prediction did not create more segments than the ground truth.
- undersegmentation: 1 - max(0, ground_truth_segments - predicted_segments) / ground_truth_segments. This measures missing predicted pieces as a score. Higher is better; 1 means the prediction did not create fewer segments than the ground truth.
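The segment-level definitions above can be sketched for a single document as follows. This is a minimal illustration, not the code in compute_metrics.py. It assumes segments are (type, set of pages) pairs, pages are 1-based, and matching is greedy in list order; the type names in the example are made up.

```python
from typing import List, Set, Tuple

Segment = Tuple[str, Set[int]]  # (subdocument type, set of 1-based page numbers)


def f1(matched: int, n_pred: int, n_true: int) -> float:
    """F1 from match counts; returns 0.0 when nothing matched (an assumption)."""
    if matched == 0 or n_pred == 0 or n_true == 0:
        return 0.0
    precision, recall = matched / n_pred, matched / n_true
    return 2 * precision * recall / (precision + recall)


def typed_iou_counts(pred: List[Segment], true: List[Segment], threshold: float = 0.8):
    """Greedy one-to-one matching on equal type and page-set IoU >= threshold."""
    used, matched = set(), 0
    for p_type, p_pages in pred:
        for i, (t_type, t_pages) in enumerate(true):
            if i in used or p_type != t_type:
                continue
            union = p_pages | t_pages
            if union and len(p_pages & t_pages) / len(union) >= threshold:
                used.add(i)
                matched += 1
                break
    return matched, len(pred), len(true)


def boundary_counts(pred: List[Segment], true: List[Segment], tolerance: int = 1):
    """Match internal start pages within +/- tolerance; page 1 is not a boundary."""
    def starts(segments: List[Segment]) -> List[int]:
        return sorted({min(pages) for _, pages in segments if pages and min(pages) > 1})

    pred_b, true_b = starts(pred), starts(true)
    used, matched = set(), 0
    for b in pred_b:
        for i, t in enumerate(true_b):
            if i not in used and abs(b - t) <= tolerance:
                used.add(i)
                matched += 1
                break
    return matched, len(pred_b), len(true_b)


def oversegmentation(n_pred: int, n_true: int) -> float:
    return max(0, n_pred - n_true) / n_true


def undersegmentation(n_pred: int, n_true: int) -> float:
    return 1 - max(0, n_true - n_pred) / n_true


# Hypothetical 10-page packet; the type names are placeholders.
true = [("type_a", {1, 2}), ("type_b", {3, 4, 5}), ("type_c", {6, 7, 8, 9, 10})]
pred = [("type_a", {1, 2}), ("type_b", {3, 4}), ("type_c", {5, 6, 7, 8, 9, 10})]

iou_counts = typed_iou_counts(pred, true)  # type_b has IoU 2/3, so it does not match
bnd_counts = boundary_counts(pred, true)   # starts 3 and 5 match 3 and 6 within one page
```

In this example the misplaced page 5 costs one typed IoU match while both boundaries still count, which shows why the two scores are reported separately. Returning counts rather than scores keeps the functions usable for corpus-level aggregation.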
For the aggregate table, page_level_accuracy, typed_iou_f1, and boundary_f1 are micro-aggregated from corpus-level counts. oversegmentation and undersegmentation are averaged across documents so one long PDF does not dominate the instance-count diagnostics.
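The difference between the two aggregation modes can be illustrated with a short sketch. The function names and the per-document numbers below are hypothetical; the sketch only assumes that each document contributes a (matched, predicted, ground_truth) count triple for the F1-style metrics and one score for the segmentation diagnostics.

```python
from statistics import mean
from typing import List, Tuple

Counts = Tuple[int, int, int]  # (matched, predicted, ground_truth)


def f1(matched: int, n_pred: int, n_true: int) -> float:
    """F1 from match counts; returns 0.0 when nothing matched (an assumption)."""
    if matched == 0 or n_pred == 0 or n_true == 0:
        return 0.0
    precision, recall = matched / n_pred, matched / n_true
    return 2 * precision * recall / (precision + recall)


def micro_f1(per_document: List[Counts]) -> float:
    """Sum the counts over the corpus first, then compute a single F1."""
    matched = sum(c[0] for c in per_document)
    n_pred = sum(c[1] for c in per_document)
    n_true = sum(c[2] for c in per_document)
    return f1(matched, n_pred, n_true)


def document_mean(scores: List[float]) -> float:
    """Average per-document scores so every document has equal weight."""
    return mean(scores) if scores else 0.0


# Two made-up documents: a short packet and a longer one.
counts = [(2, 3, 3), (8, 10, 10)]
micro = micro_f1(counts)                           # 10 matches over 13 and 13
per_document = document_mean([f1(*c) for c in counts])

# Per-document oversegmentation scores are averaged, not pooled.
mean_over = document_mean([0.0, 0.5])
```

With pooled counts the longer document contributes most of the weight, which is the behaviour the averaged diagnostics avoid.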
Launch the viewer with:
streamlit run streamlit_viewer.py
The viewer loads the bundled article_snapshot by default. If you run fresh split jobs, new result sets appear as live/<run_id>.
Run the tests with:
pytest -q
Every split JSON follows the public SDK Split resource shape:
{
  "id": "split_...",
  "file": { "id": "file_...", "filename": "...pdf", "mime_type": "application/pdf" },
  "model": "retab-large",
  "subdocuments": [{ "name": "...", "description": "...", "allow_multiple_instances": true }],
  "n_consensus": 1,
  "instructions": "Split this PoliTax packet into the listed tax subdocuments.",
  "output": [{ "name": "...", "pages": [1, 2, 3] }]
}
The bundled article snapshot IDs are deterministic placeholders. Fresh live runs contain real Retab split IDs and file IDs.
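As an illustration of how the output field in the shape above could be consumed, the sketch below turns it into one label per page and computes page-level accuracy. It is not the logic of compute_metrics.py: the helper names and the sample data are hypothetical, and when two segments claim the same page the sketch assumes the first one wins.

```python
import json
from typing import Dict

# Hypothetical split result reduced to the field this sketch needs.
SAMPLE = '{"output": [{"name": "form_a", "pages": [1, 2]}, {"name": "form_b", "pages": [3]}]}'


def page_labels(split: dict) -> Dict[int, str]:
    """Map each page number to the name of the segment that contains it."""
    labels: Dict[int, str] = {}
    for segment in split.get("output", []):
        for page in segment["pages"]:
            # Assumption: if segments overlap, keep the first label seen.
            labels.setdefault(page, segment["name"])
    return labels


def page_level_accuracy(pred: Dict[int, str], true: Dict[int, str]) -> float:
    """Fraction of ground-truth pages whose predicted label is exactly equal."""
    if not true:
        return 0.0
    correct = sum(1 for page, label in true.items() if pred.get(page) == label)
    return correct / len(true)


pred = page_labels(json.loads(SAMPLE))
true = {1: "form_a", 2: "form_a", 3: "form_a"}  # made-up ground truth
accuracy = page_level_accuracy(pred, true)       # pages 1 and 2 agree, page 3 does not
```

A page missing from the prediction counts as wrong here, since pred.get returns None for it.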