Context
corpus.extract --math-enhance emits a JSON describing positions of math-equation placeholders in the extracted markdown. Each placeholder lives in the MD as an HTML comment (e.g. <!-- formula -->, <!-- equation:NNN -->). A later step uses that JSON to:
- Locate the placeholder in the MD,
- OCR the corresponding equation from the PDF using a math model,
- Replace the placeholder comment with the OCR'd LaTeX/MathML.
corpus.clean is being updated to optionally run Phase A reformatting (PhaseAMode::ParserSurgicalVerified, the parser-backed surgical rewriter from md_format_surgical) BEFORE the destructive cleaning passes. The full design is in rust/glossapi_rs_cleaner/docs/PHASE_A_PARSER_BACKED_*.md.
Concern
Phase A reflow can shift the LINE NUMBER of an HTML comment that's inline within a paragraph. Concretely: if a 5-line soft-wrapped paragraph contains an inline <!-- formula --> on line 4, paragraph reflow joins those 5 lines into 1, so the comment lands on a different line (and column) in the output, even though its bytes are preserved exactly.
Verified properties of Phase A on HTML comments:
- Comment bytes are preserved byte-exact (rewrite only touches `\n` at paragraph soft-break boundaries; never modifies comment payload).
- Comment block structure is preserved (`HtmlBlock` nodes pass through verbatim; inline comments inside `Paragraph` content are kept inline).
- Comment line number is NOT preserved when an enclosing paragraph reflows.
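The line shift is easy to reproduce in isolation. The sketch below is illustrative only: `toy_reflow` is a hypothetical stand-in that joins soft-wrapped lines with spaces, not the `md_format_surgical` rewriter, and `line_col_of` is a helper invented for this example.

```rust
/// Returns the 1-based (line, column) of the first occurrence of `needle`
/// in `text`, or None if it is absent. Columns count bytes, not characters.
fn line_col_of(text: &str, needle: &str) -> Option<(usize, usize)> {
    let offset = text.find(needle)?;
    let before = &text[..offset];
    let line = before.matches('\n').count() + 1;
    let col = match before.rfind('\n') {
        Some(nl) => offset - nl,
        None => offset + 1,
    };
    Some((line, col))
}

/// Hypothetical stand-in for Phase A paragraph reflow: joins the soft-wrapped
/// lines of each paragraph with single spaces and keeps blank-line paragraph
/// breaks. This is NOT the real md_format_surgical implementation.
fn toy_reflow(md: &str) -> String {
    md.split("\n\n")
        .map(|para| para.lines().map(str::trim).collect::<Vec<_>>().join(" "))
        .collect::<Vec<_>>()
        .join("\n\n")
}
```

For the 5-line paragraph `"line one\nline two\nline three\nsee <!-- formula --> here\nline five"`, `line_col_of` reports the comment at (4, 5) before `toy_reflow` and at (1, 34) after it, while the comment text itself is unchanged.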
Action items (low priority, future work)
- Audit how the math-enhance replacement step locates placeholders:
  - If it greps by comment text / stable ID (e.g. `<!-- formula:abc123 -->`) → no impact, safe to enable Phase A before it.
  - If it locates by (line, column) from the extract-time JSON → would break under Phase A reflow. Two options:
    a. Run math-enhance replacement BEFORE Phase A (current default ordering; keep this).
    b. Switch math-enhance to use stable text-anchor IDs and re-emit positions after Phase A.
- Document the chosen ordering / contract in the cleaner config + the math-enhance pipeline docs.
- Add a regression test: a doc with inline math-equation comments → run through Phase A → assert each comment text still appears in output (byte-exact, position may differ).
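A minimal sketch of the check such a regression test could make, assuming comments are plain `<!-- ... -->` spans. The helper names `html_comments` and `comments_preserved` are hypothetical, and requiring the same order is a stricter assumption than the item above states. A real test would feed the actual Phase A output in as `after`.

```rust
/// Collects every HTML comment (`<!-- ... -->`) in `md`, in document order,
/// as byte-exact slices. An unterminated `<!--` ends the scan.
fn html_comments(md: &str) -> Vec<&str> {
    let mut found = Vec::new();
    let mut rest = md;
    while let Some(start) = rest.find("<!--") {
        let tail = &rest[start..];
        // Look for the terminator after the 4-byte opener.
        match tail[4..].find("-->") {
            Some(end) => {
                let len = 4 + end + 3;
                found.push(&tail[..len]);
                rest = &tail[len..];
            }
            None => break,
        }
    }
    found
}

/// True when `after` contains the same comments as `before`, byte for byte
/// and in the same order. Line and column positions are deliberately ignored.
fn comments_preserved(before: &str, after: &str) -> bool {
    html_comments(before) == html_comments(after)
}
```

Because the comparison uses comment text and ignores position, it keeps passing when a paragraph reflows and fails only if a comment is dropped, altered, or reordered.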
References
- Phase A implementation: `rust/glossapi_rs_cleaner/src/md_format_surgical.rs`
- Verified safety on 240 hardest-altered docs: 0 comment-loss bugs.
- Full architecture: `rust/glossapi_rs_cleaner/docs/PHASE_A_PARSER_BACKED_INDEX.md`
🤖 Generated with Claude Code