Phase A reflow may shift HTML-comment line numbers used by math-enhance replacement step #97

@fffoivos

Context

corpus.extract --math-enhance emits JSON describing the positions of math-equation placeholders in the extracted markdown. Each placeholder lives in the MD as an HTML comment (e.g. <!-- formula -->, <!-- equation:NNN -->). A later step uses that JSON to:

  1. Locate the placeholder in the MD,
  2. OCR the corresponding equation from the PDF using a math model,
  3. Replace the placeholder comment with the OCR'd LaTeX/MathML.

corpus.clean is being updated to optionally run Phase A reformatting (PhaseAMode::ParserSurgicalVerified, the parser-backed surgical rewriter from md_format_surgical) BEFORE the destructive cleaning passes. The full design is in rust/glossapi_rs_cleaner/docs/PHASE_A_PARSER_BACKED_*.md.

Concern

Phase A reflow can shift the line number of an HTML comment that sits inline within a paragraph. Concretely: if a 5-line soft-wrapped paragraph contains an inline <!-- formula --> on line 4, paragraph reflow joins those 5 lines into 1, so the comment lands on a different line in the output even though its bytes are preserved exactly.

Verified properties of Phase A on HTML comments:

  • Comment bytes are preserved byte-exact (rewrite only touches \n at paragraph soft-break boundaries; never modifies comment payload).
  • Comment block structure is preserved (HtmlBlock nodes pass through verbatim; inline comments inside Paragraph content are kept inline).
  • Comment line number is NOT preserved when an enclosing paragraph reflows.
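
To make the concern concrete, here is a minimal Python sketch of the line shift. `naive_reflow` is a hypothetical stand-in that simply joins soft-wrapped lines per paragraph; it is not the real Phase A rewriter from md_format_surgical, and the sample document is invented for illustration.

```python
import re

# Matches any HTML comment, including multi-line ones.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def naive_reflow(md: str) -> str:
    """Hypothetical stand-in for Phase A: join soft-wrapped lines per paragraph."""
    paragraphs = md.split("\n\n")
    return "\n\n".join(" ".join(p.split("\n")) for p in paragraphs)


def comment_lines(md: str) -> list[tuple[int, str]]:
    """Return (1-based line number, comment text) for every HTML comment."""
    return [
        (md.count("\n", 0, m.start()) + 1, m.group(0))
        for m in COMMENT_RE.finditer(md)
    ]


# A 5-line soft-wrapped paragraph with the placeholder on its 4th line
# (line 6 of the whole document).
doc = (
    "Intro paragraph.\n"
    "\n"
    "The energy of the system\n"
    "is given by the relation\n"
    "shown here, where\n"
    "<!-- formula --> holds for\n"
    "all admissible states."
)

before = comment_lines(doc)
after = comment_lines(naive_reflow(doc))
print(before)  # [(6, '<!-- formula -->')]
print(after)   # [(3, '<!-- formula -->')]
```

The comment text is identical in both results; only the line number changes, which is exactly the case a (line, column) lookup would miss.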

Action items (low priority, future work)

  1. Audit how the math-enhance replacement step locates placeholders:

    • If it greps by comment text / stable ID (e.g. <!-- formula:abc123 -->) → no impact, safe to enable Phase A before it.
    • If it locates by (line, column) from the extract-time JSON → would break under Phase A reflow. Two options:
      a. Run math-enhance replacement BEFORE Phase A (current default ordering — keep this).
      b. Switch math-enhance to use stable text-anchor IDs and re-emit positions after Phase A.
  2. Document the chosen ordering / contract in the cleaner config + the math-enhance pipeline docs.

  3. Add a regression test: a doc with inline math-equation comments → run through Phase A → assert each comment text still appears in output (byte-exact, position may differ).
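
A rough Python sketch of action items 1b and 3, under stated assumptions: `stand_in_reflow` is a hypothetical placeholder for the real Phase A call, `relocate` and `comments_preserved` are illustrative helper names, and the `equation:001` style IDs are invented sample values following the <!-- equation:NNN --> pattern above. The actual test would presumably live alongside the Rust cleaner and call the real rewriter.

```python
import re
from collections import Counter

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def stand_in_reflow(md: str) -> str:
    """Hypothetical placeholder for Phase A: join soft-wrapped lines per paragraph."""
    return "\n\n".join(" ".join(p.split("\n")) for p in md.split("\n\n"))


def comments_preserved(before: str, after: str) -> bool:
    """True if every comment appears byte-exact, the same number of times."""
    return Counter(COMMENT_RE.findall(before)) == Counter(COMMENT_RE.findall(after))


def relocate(md: str, anchor: str) -> tuple[int, int]:
    """Re-emit the 1-based (line, column) of a comment that must be unique."""
    if md.count(anchor) != 1:
        raise ValueError(f"anchor must occur exactly once: {anchor!r}")
    offset = md.index(anchor)
    line = md.count("\n", 0, offset) + 1
    column = offset - (md.rfind("\n", 0, offset) + 1) + 1
    return line, column


source = (
    "Let x satisfy\n"
    "<!-- equation:001 --> and let\n"
    "y satisfy <!-- equation:002 -->.\n"
    "\n"
    "<!-- formula -->"
)
reflowed = stand_in_reflow(source)

# Regression assertion: comment text survives, position may differ.
assert comments_preserved(source, reflowed)

print(relocate(source, "<!-- equation:002 -->"))    # line 3 before reflow
print(relocate(reflowed, "<!-- equation:002 -->"))  # line 1 after reflow
```

Looking up by unique comment text only works if each placeholder carries a stable ID; bare <!-- formula --> comments repeated in one document would need an ID added at extract time before `relocate` could disambiguate them.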

References

  • Phase A implementation: `rust/glossapi_rs_cleaner/src/md_format_surgical.rs`
  • Verified safety on 240 hardest-altered docs: 0 comment-loss bugs.
  • Full architecture: `rust/glossapi_rs_cleaner/docs/PHASE_A_PARSER_BACKED_INDEX.md`

🤖 Generated with Claude Code
