tos-kamiya/pdfhl

pdfhl

PDF text highlighter with progressive, tolerant phrase matching.

Designed for AI agents (such as codex or gemini-cli): pdfhl was created to be easily used inside LLM-driven workflows. The tool’s robust matching (progressive phrase matching, tolerant normalization) ensures that AI agents can highlight text reliably without breaking their workflow.

While you can of course run pdfhl-cli manually, its main purpose is to serve as a reliable building block for automated reading tasks, such as academic paper highlighting.

Overview

pdfhl searches for phrases in a PDF and highlights matches. Its search is robust to common PDF quirks and formatting variations.

  • Progressive Phrase Matching: Finds phrases even when words are separated by line breaks or other text. It works by matching chunks of words (e.g., 3-word, then 2-word chunks) and chaining them together.
  • mt5-based Segmentation (default): Query phrases are split into multilingual SentencePiece subwords using google/mt5-base, which significantly improves matching for Japanese and mixed-language text. Minimum coverage defaults to 3 subwords.
  • Tolerant Normalization: Normalizes ligatures (e.g., ﬁ becomes f + i), character width, hyphens, and quotes.
  • Selection Control: Choose to highlight only the most compact (shortest) match in the document or all possible matches.
  • Safe by Design: Never overwrites the input PDF; always writes to a new file.
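
The tolerant normalization described above can be approximated with the standard library. The following is a minimal sketch, not pdfhl's actual implementation: it assumes Unicode NFKC folding covers ligatures and character width, and the punctuation table (`_PUNCT_MAP`) is a hypothetical, deliberately small mapping chosen for illustration.

```python
import unicodedata

# Hypothetical mapping for illustration; pdfhl's real table may differ.
_PUNCT_MAP = str.maketrans({
    "\u2010": "-",   # hyphen
    "\u2013": "-",   # en dash
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})


def normalize(text: str) -> str:
    """Fold ligatures and full-width forms via NFKC, then unify hyphens and quotes."""
    return unicodedata.normalize("NFKC", text).translate(_PUNCT_MAP)


# "\ufb01" is the single-character "fi" ligature; "\u2010" is a Unicode hyphen.
print(normalize("\ufb01ne\u2010tuning"))
```

Applying the same function to both the query and the page text is what lets a plain ASCII query match text that a PDF stores with ligatures or typographic quotes.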

Installation

Install via pipx from GitHub (recommended):

pipx install git+https://github.com/tos-kamiya/pdfhl@v0.4.0

To track the latest main branch instead of the tagged release:

pipx install --force git+https://github.com/tos-kamiya/pdfhl

Requires Python 3.10+. These commands install the pdfhl-cli executable alongside the importable library.

pdfhl uses google/mt5-base for subword segmentation by default. The tokenizer itself is fetched on first use; once cached, later runs work offline. Advanced users can point to a different tokenizer via the PDFHL_MT5_MODEL environment variable.
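
If you launch pdfhl-cli from a script, one way to select a different tokenizer is to pass a modified environment to the child process. This is only a sketch: the helper name `pdfhl_env` is hypothetical, and which values `PDFHL_MT5_MODEL` accepts (for example, a model name versus a local path) is not stated here, so check that before relying on it.

```python
import os


def pdfhl_env(model_name: str) -> dict:
    """Return a copy of the current environment with PDFHL_MT5_MODEL set."""
    env = dict(os.environ)  # copy, so the parent process environment is untouched
    env["PDFHL_MT5_MODEL"] = model_name
    return env


# Usage (not executed here):
#   subprocess.run(["pdfhl-cli", "input.pdf", "--text", "Deep Learning"],
#                  env=pdfhl_env("google/mt5-base"))
```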

Quick Start

Highlight a single phrase. By default, finds the shortest match, ignoring case.

pdfhl-cli input.pdf --text "Deep Learning" -o output.pdf

Highlight all occurrences of a phrase:

pdfhl-cli input.pdf --text "Deep Learning" --all -o output.pdf

Style the highlight and add a label:

pdfhl-cli input.pdf --text "evaluation metrics" --color "#FFEE00" --opacity 0.2 --label "Metrics" -o out.pdf

Apply multiple highlights in one pass via a JSON recipe:

pdfhl-cli input.pdf --recipe recipe.json -o out.pdf

See more in examples/:

  • examples/simple-recipe-array.json
  • examples/recipe-with-items-object.json
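
When an agent or script produces the phrases, it can be convenient to generate the recipe file programmatically. The sketch below writes a top-level array of items using only field names that appear in this README (`text`, `pattern`, `color`, `label`); the helper name `write_recipe` is hypothetical.

```python
import json
from pathlib import Path


def write_recipe(path: Path, items: list) -> Path:
    """Validate that each item has a query, then write the recipe as a JSON array."""
    for i, item in enumerate(items):
        if not (item.get("text") or item.get("pattern")):
            raise ValueError(f"item {i} needs 'text' or 'pattern'")
    path.write_text(json.dumps(items, ensure_ascii=False, indent=2), encoding="utf-8")
    return path


# Usage (not executed here):
#   write_recipe(Path("recipe.json"), [{"text": "Deep Learning", "color": "mint"}])
#   then: pdfhl-cli input.pdf --recipe recipe.json -o out.pdf
```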

Use Case: Academic Paper Highlighting

One key use case of this tool is to highlight important parts of academic papers (approach, experiment, threats to validity) for quick review.

We provide helper resources under extra-utils/:

  • extra-utils/pdftopages — splits a PDF into per-page text (Markdown)
  • extra-utils/prompt-paper-highlights.txt — a prompt that guides the LLM to identify important sentences and apply highlights

How to Use (Interactive)

Just paste the prompt into your LLM CLI (codex, gemini-cli, etc.), replace the file name, and run interactively.

  1. Copy the contents of extra-utils/prompt-paper-highlights.txt.
  2. Replace the placeholder {file.pdf} at the top with the actual filename of your paper.
  3. Paste it into codex or gemini-cli.
  4. Make sure pdftopages is either in a directory on your $PATH, or adjust the prompt to call it using its full path.

This interactive style lets you “poke” the CLI with the prompt, tweak the filename or script path as needed, and let the LLM guide the process (splitting pages, finding important sentences, applying highlights with pdfhl-cli). No additional scripting is required.

Color Convention

The provided prompt uses three colors:

  • Blue — Approach
  • Green — Experiment
  • Red — Threats to validity
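
If you post-process the LLM's output yourself, the color convention maps naturally onto recipe items. This is a sketch under assumptions: the category keys (`approach`, `experiment`, `threats`) and the helper `to_recipe_items` are hypothetical names, while the color names come from the list of supported colors in the CLI reference.

```python
# Category key -> (pdfhl color name, annotation label). Keys are illustrative.
CATEGORY_STYLE = {
    "approach": ("blue", "Approach"),
    "experiment": ("green", "Experiment"),
    "threats": ("red", "Threats to validity"),
}


def to_recipe_items(sentences: list) -> list:
    """Turn (category, sentence) pairs into recipe items for --recipe."""
    items = []
    for category, sentence in sentences:
        color, label = CATEGORY_STYLE[category]
        items.append({"text": sentence, "color": color, "label": label})
    return items


print(to_recipe_items([("approach", "We propose a progressive matcher.")]))
```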

CLI Reference

pdfhl-cli PDF [--text TEXT | --recipe JSON] [options]

Mode

  • Single-item mode:
    • --text TEXT or --pattern TEXT (alias for --text)
  • Recipe mode:
    • --recipe path/to/recipe.json

Common Options

  • --ignore-case / --case-sensitive
    • Default: ignore case.
  • --best / --all / --single
    • Controls match selection via SelectionMode. Default is --best, which keeps only the best (most compact) match. --all keeps every match. --single fails if more than one match is found.
  • --dry-run
    • Search only; do not write an output file.
  • -o, --output PATH
    • Output PDF path. Default is <input>.highlighted.pdf.
  • --label STR
    • Annotation title/content label.
  • --color VAL
    • Color by name (yellow|mint|violet|red|green|blue), hex #RRGGBB, or r,g,b (each 0..1).
  • --opacity FLOAT
    • Highlight opacity (0..1), default 0.3.
  • --report json
    • Emit a JSON report to stdout.
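
To validate a --color value before invoking the tool, the three accepted forms can be checked as follows. This is a sketch, not pdfhl's parser: the function name `parse_color` is hypothetical, and because the RGB values behind the named colors are not listed here, names are returned unchanged rather than converted.

```python
import re

NAMED_COLORS = {"yellow", "mint", "violet", "red", "green", "blue"}
_HEX = re.compile(r"#([0-9a-fA-F]{6})")


def parse_color(value: str):
    """Return a known color name, or an (r, g, b) tuple of floats in 0..1."""
    v = value.strip()
    if v.lower() in NAMED_COLORS:
        return v.lower()
    m = _HEX.fullmatch(v)
    if m:
        h = m.group(1)
        return tuple(int(h[i:i + 2], 16) / 255 for i in (0, 2, 4))
    parts = v.split(",")
    if len(parts) == 3:
        rgb = tuple(float(p) for p in parts)  # float() raises ValueError on junk
        if all(0.0 <= c <= 1.0 for c in rgb):
            return rgb
    raise ValueError(f"unrecognized color: {value!r}")
```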

Notes

  • By default, queries are segmented into mt5 subwords and then matched with \s* between subwords to tolerate missing spaces/line breaks.
  • Minimum coverage is 3 subwords; tune chunking via progressive_kmax (default 3) in recipes.
  • First run requires network access to download google/mt5-base; cached copies are reused afterwards.
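
The idea of joining subwords with \s* can be illustrated with a few lines of regex code. This is a simplified sketch: the token split is written by hand, whereas pdfhl derives it from the mt5 tokenizer, and the helper name `tolerant_pattern` is hypothetical.

```python
import re


def tolerant_pattern(tokens: list, ignore_case: bool = True) -> "re.Pattern":
    """Join escaped tokens with \\s* so missing or extra whitespace still matches."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(r"\s*".join(re.escape(t) for t in tokens), flags)


# Hand-written split for illustration only.
pat = tolerant_pattern(["Deep", "Learn", "ing"])
m = pat.search("recent deep\nlearning methods")
print(repr(m.group(0)) if m else None)
```

Because every token is passed through `re.escape`, punctuation in the query is matched literally rather than interpreted as regex syntax.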

Output Policy

  • The input PDF is never modified. The tool refuses to overwrite the input path; specify a different -o/--output.
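
A wrapper script can apply the same policy before calling the tool. The sketch below is an assumption-laden illustration: it reads the default `<input>.highlighted.pdf` as "input stem plus .highlighted.pdf" (matching the sample.pdf and sample.highlighted.pdf pair used in the library examples), and `resolve_output` is a hypothetical helper, not part of pdfhl.

```python
from pathlib import Path
from typing import Optional


def resolve_output(input_pdf: Path, output: Optional[Path] = None) -> Path:
    """Pick an output path and refuse one that points at the input file."""
    if output is None:
        # Assumption: paper.pdf -> paper.highlighted.pdf
        output = input_pdf.with_name(input_pdf.stem + ".highlighted.pdf")
    if output.resolve() == input_pdf.resolve():
        raise ValueError("refusing to overwrite the input PDF")
    return output
```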

Exit Codes

  • 0: OK
  • 1: Not found (any recipe item had zero matches)
  • 2: (Unused)
  • 3: Error (open/save failures, invalid paths, overwrite refusal, etc.)
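
In an automated workflow, the exit code is usually the first thing to branch on. The following sketch builds a command from the flags documented above and maps the exit code to a short status; the helper names (`build_command`, `run_pdfhl`) are hypothetical, and it assumes pdfhl-cli is on your PATH.

```python
import subprocess

EXIT_MEANINGS = {0: "ok", 1: "not found", 3: "error"}


def build_command(pdf, text, output, select_all=False):
    """Assemble a pdfhl-cli invocation that also emits a JSON report."""
    cmd = ["pdfhl-cli", str(pdf), "--text", text, "-o", str(output), "--report", "json"]
    if select_all:
        cmd.append("--all")
    return cmd


def run_pdfhl(cmd):
    """Run the command and return (status, stdout); stdout holds the JSON report."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return EXIT_MEANINGS.get(proc.returncode, "unknown"), proc.stdout
```

Treating "not found" separately from "error" lets an agent retry with a shorter or rephrased query instead of aborting.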

JSON Recipe Format (--recipe)

You can pass either a top-level array of items or an object with an items: [...] key. All searches use the progressive matching algorithm.

Top-level array example:

[
  {"text": "Introduction", "color": "mint"},
  {"text": "Threats to Validity", "color": "red", "select_shortest": false}
]

Object with items example:

{
  "items": [
    {"text": "Experiment", "color": "green", "label": "Experiment"},
    {"text": "Conclusion", "color": "violet", "opacity": 0.25}
  ]
}
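
Code that reads or rewrites recipes has to accept both shapes. A minimal sketch of normalizing them to a plain list follows; `load_recipe_items` is a hypothetical helper and does not reflect how pdfhl parses recipes internally.

```python
import json


def load_recipe_items(raw: str) -> list:
    """Accept a top-level array or an object with an 'items' array."""
    data = json.loads(raw)
    if isinstance(data, dict):
        data = data.get("items")
    if not isinstance(data, list):
        raise ValueError("recipe must be an array or an object with an 'items' array")
    return data
```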

Per-item fields:

  • text or pattern (string, required): The phrase to search for.
  • ignore_case (bool, default true): Whether to perform a case-insensitive search.
  • select_shortest (bool, default true): If true, highlights only the single best match. If false, highlights all found matches (equivalent to --all).
  • label (string | null): Annotation label.
  • color (string | null): Color name, hex, or r,g,b float values.
  • opacity (float | null): Highlight opacity (0..1). Falls back to the CLI default if not set.
  • progressive_kmax (int, default 3): (Advanced) Max chunk size for a search chunk, counted in subwords under the default mt5 segmentation.
  • progressive_max_gap_chars (int, default 200): (Advanced) Max characters allowed in a gap between matched chunks.

Additional notes

  • Query tokenization uses mt5 subwords. Coverage checks are in subwords; default minimum is 3.
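
To give a feel for how progressive_kmax and progressive_max_gap_chars interact, here is a heavily simplified, greedy sketch of chunk chaining. It is not pdfhl's algorithm: it splits the query on whitespace instead of mt5 subwords, always takes the first occurrence of a chunk, falls back to single-word chunks, and skips the minimum-coverage check. The function names are hypothetical.

```python
import re


def find_chunk(text, words, start):
    """Search for the words in order, tolerating any whitespace between them."""
    pat = re.compile(r"\s*".join(map(re.escape, words)), re.IGNORECASE)
    return pat.search(text, start)


def progressive_match(text, query, kmax=3, max_gap_chars=200):
    """Chain chunk matches left to right; return (start, end) segments or None."""
    words = query.split()
    segments = []
    i, pos = 0, 0
    while i < len(words):
        # Try the largest chunk first, then progressively smaller ones.
        for k in range(min(kmax, len(words) - i), 0, -1):
            m = find_chunk(text, words[i:i + k], pos)
            if m and (not segments or m.start() - segments[-1][1] <= max_gap_chars):
                segments.append((m.start(), m.end()))
                i, pos = i + k, m.end()
                break
        else:
            return None  # no chunk size could be chained at this position
    return segments


text = "We use deep\nlearning [12] based models"
print(progressive_match(text, "deep learning based models"))
```

In this example the citation marker "[12]" interrupts the phrase, so the 3-word chunk fails and the match is assembled from two 2-word segments separated by a small gap.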

JSON Report (--report json)

Single-item mode output:

{
  "input": "input.pdf",
  "output": "output.pdf",
  "matches": 1,
  "exit_code": 0,
  "dry_run": false,
  "hits": [
    {
      "page_index": 0,
      "page_number": 1,
      "start": 123,
      "end": 135,
      "rects": [[x0, y0, x1, y1], ...]
    }
  ],
  "context": {
    "query": "...",
    "ignore_case": true,
    "selection_mode": "best"
  }
}

Recipe mode output (aggregate):

{
  "input": "input.pdf",
  "output": "output.pdf",
  "items": [
    {
      "index": 0,
      "query": "Introduction",
      "matches": 1,
      "hits": [
        {"page_index": 0, "page_number": 1, "start": 10, "end": 22, "rects": [[...]]}
      ],
      "progressive_search": true,
      "progressive_kmax": 3,
      "progressive_max_gap_chars": 200,
      "selection_mode": "best"
    }
  ]
}

Notes

  • page_index is zero-based; page_number is one-based.
  • start/end are indices into the normalized text stream of a page.
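
A consumer of the report typically needs a short per-query summary. The sketch below handles both report shapes shown above, using only the keys that appear in those examples; `summarize_report` is a hypothetical helper.

```python
def summarize_report(report: dict) -> list:
    """Return one summary line per query for single-item or recipe reports."""
    if "items" in report:
        groups = [(item["query"], item["hits"]) for item in report["items"]]
    else:
        groups = [(report["context"]["query"], report["hits"])]
    lines = []
    for query, hits in groups:
        # page_number is one-based, so it can be shown to users directly.
        pages = sorted({hit["page_number"] for hit in hits})
        lines.append(f"{query!r}: {len(hits)} hit(s) on page(s) {pages}")
    return lines
```

Usage: parse the stdout of `pdfhl-cli ... --report json` with `json.loads` and pass the resulting dict to `summarize_report`.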

Library API

pdfhl now exposes a lightweight importable API for scripting. A tiny sample PDF is bundled at examples/sample.pdf so you can run the snippets below immediately.

Highlight a single phrase and save in one call:

from pathlib import Path
from pdfhl import SelectionMode, highlight_text

outcome = highlight_text(
    Path("examples/sample.pdf"),
    "pdfhl sample document",
    output=Path("examples/sample.highlighted.pdf"),
    color="#ffeb3b",
    label="Example",
    selection_mode=SelectionMode.BEST,
)
print(outcome.highlight_count, outcome.segment_matches, outcome.saved_path)

outcome.highlight_count reports the number of unique highlight ranges, while outcome.segment_matches retains the raw segment count produced by the progressive matcher.

For batch scenarios, open a document once and apply multiple queries:

from pdfhl import PdfHighlighter, SelectionMode

with PdfHighlighter.open("examples/sample.pdf") as hl:
    single = hl.highlight_text("pdfhl sample document", selection_mode=SelectionMode.BEST, dry_run=True)  # dry-run to inspect matches only
    multi = hl.highlight_text("progressive highlight example", color="violet", selection_mode=SelectionMode.ALL)
    summary = hl.save("examples/sample.highlighted.pdf")

print(single.highlight_count, single.segment_matches)
print(multi.highlight_count, multi.segment_matches)

Context managers are optional. You can manage the lifecycle explicitly if you prefer:

from pdfhl import PdfHighlighter, SelectionMode

hl = PdfHighlighter.open("examples/sample.pdf")
try:
    hl.highlight_text("multiple times", color="#ff9800", label="Sample", selection_mode=SelectionMode.ALL)
    hl.highlight_text("pdfhl sample document", selection_mode=SelectionMode.SINGLE)
    outcome = hl.save("examples/sample.highlighted.pdf")
finally:
    hl.close()

print(outcome.highlight_count, outcome.segment_matches)

highlight_text arguments

pdfhl.highlight_text() returns a HighlightOutcome and accepts the following top-level parameters:

  • pdf_path (str | Path): Path to the input PDF.
  • text (str): Query to search and highlight.
  • output (Path | None, default None): Output PDF path.
  • dry_run (bool, default False): Return matches without modifying or saving the PDF.

PdfHighlighter.highlight_text() shares the same keyword arguments below; the standalone function forwards them as **kwargs.

  • color (str | Sequence[float] | None): Color name, hex, or r,g,b float values.
  • label (str | None, default None): Annotation label.
  • selection_mode (SelectionMode, default SelectionMode.BEST): Choose how to resolve multiple matches (SINGLE, BEST, ALL).
  • ignore_case (bool, default True): Case-insensitive matching when True.
  • literal_whitespace (bool, default False): Treat the query's whitespace literally when building regex patterns.
  • regex (bool, default False): Interpret text as a regex pattern instead of a literal phrase.
  • progressive (bool, default True): Use tolerant progressive search (True) or literal regex search (False).
  • progressive_kmax (int, default 3): Maximum subword chunk size for progressive search.
  • progressive_max_gap_chars (int, default 200): Max allowed character gap between progressive segments.
  • progressive_min_total_words (int, default 3): Minimum matched subwords required when using progressive search.
  • opacity (float, default 0.3): Highlight opacity (0..1).
  • dry_run (bool, default False): Inspect matches without applying annotations (same effect as the top-level flag on the free function).
  • page_filter (Callable[[PageInfo], bool] | None, default None)

Development

Running Tests

You can run the test suite without installing the package itself. Choose either uv or plain venv/pip.

Option A — uv (recommended):

# Create and activate a virtualenv in .venv/
uv venv
source .venv/bin/activate

# Install pytest only (tests avoid heavy deps)
uv pip install pytest

# Run tests
pytest -q

Option B — Python built-in venv/pip:

python -m venv .venv
source .venv/bin/activate
pip install -U pip pytest
pytest -q

Notes

  • Tests focus on pure logic and do not require PyMuPDF or transformers. You may see harmless warnings if those libraries are not installed.
  • If you prefer, you can run without activating the venv by using the full path, e.g. .venv/bin/pytest -q.

License

pdfhl is distributed under the terms of the MIT license.
