
Refactor: extract DocumentConverter layer #5065

Open
TingquanGao wants to merge 12 commits into PaddlePaddle:develop from TingquanGao:clean-refactor

Conversation

@TingquanGao
Collaborator

@TingquanGao TingquanGao commented Mar 25, 2026

Summary

Refactor the document format conversion logic for PP-StructureV3 and PaddleOCR-VL pipelines by introducing a unified converter/ module.

Changes

  • Extract converter/ module with MarkdownConverter, WordConverter, LatexConverter
  • Unify build_handle_funcs_dict() across both pipelines, eliminating ~90 lines of duplication
  • Consolidate format functions into format_funcs.py
  • Extract build_word_blocks() to eliminate _to_word() duplication between Result classes
  • Add save_to_word() support for PaddleOCR-VL pipeline
  • Remove inspect.signature hack from mixin.py

Bug Fixes

  • Fix Word output containing LaTeX placeholder text (% [Image not found])
  • Fix LaTeX reference block returning a tuple instead of str
  • Fix missing use_layout_detection check in PaddleOCRVLPagesResult._to_markdown()
  • Fix block.content in-place mutation side effect

Architecture

Before: Format conversion logic was inlined in mixin.py (900+ lines), result_v2.py (700+ lines), and paddleocr_vl/result.py (600+ lines) with significant duplication.

After: Clean converter/ module with single-responsibility classes. mixin.py reduced to ~130 lines (save logic only).
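
The split described above can be pictured roughly as follows. This is only a minimal sketch, not the code in this PR: the block shape (a dict with `label` and `content`), the two formatter functions, and every `*_sketch` name are hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical block shape used only for this sketch.
Block = Dict[str, str]
Handler = Callable[[Block], str]


def format_title_sketch(block: Block) -> str:
    # Stand-in for a function that would live in format_funcs.py.
    return "# " + block["content"]


def format_text_sketch(block: Block) -> str:
    return block["content"]


def build_handle_funcs_dict_sketch() -> Dict[str, Handler]:
    # One shared label -> handler table, built in a single place so that
    # both pipelines can reuse it instead of each building its own copy.
    return {"title": format_title_sketch, "text": format_text_sketch}


class MarkdownConverterSketch:
    """Single responsibility: turn parsed blocks into a Markdown string."""

    def __init__(self, handlers: Optional[Dict[str, Handler]] = None):
        self.handlers = (
            build_handle_funcs_dict_sketch() if handlers is None else handlers
        )

    def convert(self, blocks: List[Block]) -> str:
        parts = []
        for block in blocks:
            handler = self.handlers.get(block["label"])
            if handler is not None:  # labels without a handler are skipped
                parts.append(handler(block))
        return "\n\n".join(parts)
```

With this shape, a Result class only needs to call `convert()` and write the returned string, which is the kind of "save logic only" role the description assigns to mixin.py.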

Testing

  • 54 unit tests covering all Converter classes and format functions
  • PP-StructureV3 E2E: 10/10 scenarios PASS (including word/latex output verification)
  • PaddleOCR-VL E2E: functional verification complete

🤖 Generated with Claude Code

Known Issue (pre-existing in develop)

use_chart_recognition=True triggers a NameError: name 'PretrainedConfig' is not defined in paddlex/inference/models/common/transformers/transformers/conversion_utils.py:104. This is caused by a missing runtime import (only imported under TYPE_CHECKING) introduced in develop commit 3a94e07ef (PR #5058). This issue is unrelated to this refactoring PR and should be fixed separately in develop.
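
For readers unfamiliar with this failure mode, the snippet below reproduces the general pattern in isolation. It does not use the paddlex code: `Decimal` is an arbitrary stand-in for `PretrainedConfig`, and `is_config` and `demo` are hypothetical names.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # This branch is only seen by static type checkers; TYPE_CHECKING is
    # False at runtime, so the import below never executes.
    from decimal import Decimal  # stand-in for PretrainedConfig


def is_config(obj) -> bool:
    # Runtime use of a name that was only imported under TYPE_CHECKING.
    return isinstance(obj, Decimal)


def demo() -> str:
    try:
        is_config(1)
    except NameError as exc:
        return str(exc)
    return "no error"
```

Calling `demo()` returns the NameError message rather than "no error". The usual remedy for this pattern is to import the name unconditionally wherever it is needed at runtime; whether that is the right fix for conversion_utils.py is for the separate develop fix to decide.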

- Extract MarkdownConverter, WordConverter, LatexConverter from mixin.py
  into dedicated modules under common/result/converter/
- Extract format functions into format_funcs.py with clear naming
- Eliminate handle_funcs_dict build duplication via build_handle_funcs_dict()
- Remove inspect.signature hack; use explicit converter APIs
- Remove pass-through fields from MarkdownConverter.convert()
- Slim down result_v2.py and pp_doctranslation/result.py by delegating
  to converter layer

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@paddle-bot

paddle-bot bot commented Mar 25, 2026

Thanks for your contribution!

TingquanGao and others added 11 commits March 26, 2026 07:23
Replace _detect_columns / _segment_page / _sort_dual_column_blocks with
XY-Cut projection-based segmentation:

- Add _find_projection_gaps(): projects blocks onto X or Y axis using
  an occupation array and returns unoccupied gap intervals.
- Add _xy_cut_segment(): recursively splits a page into horizontal
  strips via Y-axis gaps, then detects columns within each strip via
  X-axis gaps. Supports single / dual / triple column layouts and merges
  adjacent same-type segments.
- Update convert_v2(): add original_image_height param, call
  _xy_cut_segment() instead of old two-step detect+segment; extend
  section writing to handle triple-column (3 cols + 2 column breaks).
- result.py _to_word(): include original_image_height in returned dict.
- mixin.py save_to_word(): pass original_image_height to convert_v2().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
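
The projection step in the commit above can be sketched as follows. This is an illustrative reconstruction under assumptions, not the PR's `_find_projection_gaps()`: the signature, the integer pixel bboxes in `(x0, y0, x1, y1)` order, and the half-open gap intervals are all choices made for this example.

```python
from typing import List, Sequence, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def find_projection_gaps(
    bboxes: Sequence[BBox], axis: int, extent: int
) -> List[Tuple[int, int]]:
    """Return half-open [start, end) intervals along one axis that no bbox covers.

    axis=0 projects onto X, axis=1 projects onto Y; extent is the page size
    along that axis.
    """
    occupied = [False] * extent  # the occupation array
    for box in bboxes:
        lo = max(0, box[axis])
        hi = min(extent, box[axis + 2])
        for i in range(lo, hi):
            occupied[i] = True

    gaps = []
    start = None
    for i, occ in enumerate(occupied):
        if not occ and start is None:
            start = i  # a gap opens
        elif occ and start is not None:
            gaps.append((start, i))  # the gap closes
            start = None
    if start is not None:
        gaps.append((start, extent))
    return gaps
```

For two side-by-side blocks on a 100 px wide page, the X projection yields the two page-edge gaps plus the gutter between the blocks; a recursive XY-cut would split on the Y gaps first and then look for such gutters inside each strip.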
When use_chart_recognition=True, chart blocks have VLM text content but
no image. The previous code skipped them (Word) or crashed (LaTeX) by
assuming block.image is always set.

- build_word_blocks(): separate chart from image/seal; if chart has
  VLM text content, convert pipe-delimited text to tab-delimited and
  set label to "table" to reuse existing table rendering
- _to_latex() in result_v2.py: same logic, avoids TypeError on None

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
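
A rough sketch of the pipe-to-tab conversion mentioned in the commit above. The exact format of the VLM chart text is not shown in this PR, so the input below (one row per line, cells separated by `|`) is an assumption, and `pipe_to_tab` is a hypothetical helper name.

```python
def pipe_to_tab(text: str) -> str:
    """Convert pipe-delimited rows into tab-delimited rows."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # drop blank lines between rows
        # Tolerate optional leading/trailing pipes, then trim each cell.
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append("\t".join(cells))
    return "\n".join(rows)
```

Once the text is tab-delimited, relabeling the block as "table" lets the existing table rendering handle it, as the commit describes.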
…tection

- Extend LAYOUT_EXCLUDE_LABELS to include aside_text, seal, number, and
  formula_number, preventing narrow marginal blocks from creating false
  column gaps during X-axis projection
- Re-insert seal and formula_number blocks into the correct segment after
  column detection (body content, just excluded from the detection pass)
- Add _classify_number_position() to remap 'number' labels to header /
  footer / aside_text based on bbox position (top 10%, bottom 10%, side 15%)
- Add _write_aside_text() to output aside_text as a framed paragraph with
  w:framePr, positioned in left or right margin based on x_center vs page_width/2
- Filter page-edge X-axis gaps (within 8% of page left/right boundary) before
  column-count detection; prevents page margins from being mistaken for column
  dividers, fixing the bug where single-column pages were detected as triple
  and dual-column pages collapsed to single

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
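
The remapping of 'number' labels described above could look roughly like this. Only the three thresholds (top 10%, bottom 10%, side 15%) come from the commit message; using the bbox center, checking the vertical bands first, and leaving other blocks as "number" are assumptions of this sketch.

```python
def classify_number_position(bbox, page_width, page_height):
    """Map a 'number' block to header / footer / aside_text by position."""
    x0, y0, x1, y1 = bbox
    x_center = (x0 + x1) / 2
    y_center = (y0 + y1) / 2

    if y_center < 0.10 * page_height:  # top 10% of the page
        return "header"
    if y_center > 0.90 * page_height:  # bottom 10% of the page
        return "footer"
    # Outer 15% on either side of the page.
    if x_center < 0.15 * page_width or x_center > 0.85 * page_width:
        return "aside_text"
    return "number"  # assumption: blocks elsewhere keep their label
```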
python-docx w:framePr support causes anchor symbols and title
displacement. aside_text blocks are now silently discarded in
convert_v2(); they remain in LAYOUT_EXCLUDE_LABELS and
HEADER_FOOTER_LABELS to avoid polluting column detection and
body segmentation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- build_word_blocks(): when include_bbox=True, also inject page_index
  from block attribute so convert_v2() can group blocks by page
- restructure_pages(): assign page_index to each block before merging,
  enabling per-page layout detection in multi-page documents
- _write_block(): calculate image width proportionally from bbox/page
  width ratio instead of fixed 5-inch fallback

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
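
The proportional image width mentioned above amounts to a simple ratio. In the sketch below, `target_width_emu` (the width available for content in the Word page) is a hypothetical parameter, and clamping the ratio to [0, 1] is an added safeguard rather than something stated in the commit.

```python
EMU_PER_INCH = 914400  # English Metric Units per inch


def image_width_emu(bbox, page_width_px, target_width_emu):
    """Scale an image so it keeps its share of the original page width."""
    x0, _, x1, _ = bbox
    ratio = (x1 - x0) / page_width_px
    ratio = min(max(ratio, 0.0), 1.0)  # guard against out-of-page bboxes
    return int(round(ratio * target_width_emu))
```

For example, a block spanning half of a 1000 px page gets half of a 6 inch content width, i.e. 3 inches (2743200 EMU).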
Add pixel-to-EMU layout metrics, vertical spacing, horizontal indent,
proportional table width, and unequal column width support.

- _build_page_metrics(): compute A4 scale factors, content bbox,
  margins (clamped 0.3-2.0 inch), usable_width_emu per page
- _compute_vertical_spacing(): y-gap → space_before (EMU), 3pt quantized,
  capped at 1 inch; first block always 0
- _compute_horizontal_indent(): single-column left indent from bbox offset,
  skips centered blocks; >3% page width threshold
- _write_block(): new params space_before_emu, left_indent_emu,
  usable_width_emu; applied to text/image/table paragraphs
- _set_section_columns(): new params col_widths_twips / gap_widths_twips
  for unequal-width columns via individual w:col XML elements
- _xy_cut_segment(): store _x_gaps in multi-column segments
- convert_v2() main loop: apply page margins to section, compute
  col_widths_twips from _x_gaps, pass spacing/indent/usable_width
  to each _write_block() call

All changes are convert_v2()-only; convert() and old callers unaffected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
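
The vertical-spacing rule in the commit above (y-gap to EMU, quantized to 3pt, capped at 1 inch, first block 0) can be sketched as follows. The function signature, the `emu_per_px` scale argument, and rounding to the nearest 3pt step are assumptions; the PR may quantize differently.

```python
EMU_PER_INCH = 914400
EMU_PER_PT = 12700
STEP_EMU = 3 * EMU_PER_PT  # 3pt quantization step
CAP_EMU = EMU_PER_INCH     # never add more than 1 inch before a block


def compute_vertical_spacing(bboxes, emu_per_px):
    """Return one space_before value (EMU) per block, in reading order."""
    spacings = []
    prev_bottom = None
    for _, y0, _, y1 in bboxes:
        if prev_bottom is None:
            spacings.append(0)  # first block always gets 0
        else:
            gap_emu = max(0, y0 - prev_bottom) * emu_per_px
            quantized = int(round(gap_emu / STEP_EMU)) * STEP_EMU
            spacings.append(min(quantized, CAP_EMU))
        prev_bottom = y1
    return spacings
```

With a scale of 9525 EMU per pixel (96 dpi), a 12 px gap becomes 114300 EMU (9pt), a 300 px gap is capped at 914400 EMU, and overlapping blocks get 0.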
…vert_v2()

Three root causes of ~1.5" vertical overflow are addressed:

1. Page size mismatch: set section.page_width/page_height to A4 (7560820 /
   10693400 EMU) so Word page matches the A4-based _build_page_metrics()
   scale. Previously python-docx defaulted to US Letter (11.00") causing
   ~0.69" overflow.

2. Default 1.15x line spacing: override pPrDefault w:line to 240 (1.0x) and
   w:after to 0 immediately after Document() creation. This eliminates ~0.5"
   of unbudgeted vertical space from python-docx's template default.

3. Column break paragraph spacing: explicitly set space_before=0 /
   space_after=0 on the empty paragraph holding the column break, preventing
   ~0.3" of inherited default spacing per break.

Defense-in-depth: _set_paragraph_style() now also forces line_spacing=1.0.
Table spacer paragraphs get line_spacing=Pt(1) to minimize their footprint.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
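
The figures quoted in the commit above can be checked with a little arithmetic. The EMU values are the ones from the commit message; `line_units` is a hypothetical helper that relies on the OOXML convention that `w:line` is expressed in 240ths of a line when the line rule is automatic.

```python
EMU_PER_INCH = 914400

A4_WIDTH_EMU = 7560820    # from the commit message
A4_HEIGHT_EMU = 10693400  # from the commit message
LETTER_HEIGHT_IN = 11.0   # US Letter page height

a4_height_in = A4_HEIGHT_EMU / EMU_PER_INCH
# Content laid out for A4 but rendered on a Letter page overflows by this much.
overflow_in = a4_height_in - LETTER_HEIGHT_IN


def line_units(multiple: float) -> int:
    """Convert a line-spacing multiple to w:line units (240 per line)."""
    return round(multiple * 240)
```

Here `a4_height_in` is about 11.69 and `overflow_in` about 0.69, matching the first root cause, while `line_units(1.0)` gives the 240 that the commit writes into `w:line` and `line_units(1.15)` gives 276 for the 1.15x default.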
…vert_v2()

Add height estimation and proportional compression to prevent Word reflow from
causing single-page content to overflow onto a second page.

Changes:
- _build_page_metrics(): add usable_height_emu to returned metrics
- _estimate_block_height(): estimate rendered EMU height per block type
  (image via PIL aspect ratio, table with 1.3x safety, text with bbox inflation
  + LINE_HEIGHT_FACTOR 1.2, no-bbox fallback via char-count estimate)
- _estimate_page_content_height(): sum all segments (single/multi-column),
  taking max column height for multi-col segments + section break overhead
- convert_v2(): compute v_scale (SAFETY_MARGIN=0.95) before writing; apply
  v_scale to all space_before_emu; scale images via max_height_emu when
  overflow is severe (v_scale < 0.85)
- _write_block(): add max_height_emu param; constrain image height while
  preserving aspect ratio via PIL natural dimensions
- convert_v2(): minimize section break paragraph height (font=1pt,
  line_spacing=1pt) to reduce CONTINUOUS section break overhead

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
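
One plausible reading of the v_scale logic above is sketched here. Only the two constants (0.95 and 0.85) come from the commit message; the formula, the function names, and the decision to leave content untouched when it already fits are assumptions.

```python
SAFETY_MARGIN = 0.95     # keep 5% of the usable height as headroom
SEVERE_OVERFLOW = 0.85   # below this, spacing alone is assumed insufficient


def compute_v_scale(estimated_height_emu, usable_height_emu):
    """Return a factor in (0, 1] to shrink vertical spacing by."""
    budget = usable_height_emu * SAFETY_MARGIN
    if estimated_height_emu <= budget:
        return 1.0  # content fits; leave spacing as measured
    return budget / estimated_height_emu


def should_constrain_images(v_scale):
    """Images are also capped in height only when overflow is severe."""
    return v_scale < SEVERE_OVERFLOW
```

Under this reading, content estimated at 1.9x the usable height yields a v_scale of 0.5, which scales every space_before value and additionally triggers the image height cap.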
- Add _IMAGE_LABELS constant (replaces 4 inline tuple literals)
- Add _get_image_size() helper (deduplicates PIL open/close in _write_block
  and _estimate_block_height)
- Add _col_widths_emu_from_gaps() helper for x_gaps → col_widths_emu conversion
- Fix _estimate_block_height() image ratio: use original_image_width (page px
  width) as denominator instead of back-converting column_width_emu to pixels,
  matching _write_block() semantics; add original_image_width param
- Pass original_image_width through _estimate_page_content_height()
- Remove unused x_gap_cols parameter and seg_idx loop variable
- Move import math to module level; move docx.shared.Inches import to function
  top instead of inside try block

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…h_px arg

- _estimate_page_content_height() now returns (height, spacings_cache) so the
  write loop in convert_v2() can reuse pre-computed _compute_vertical_spacing
  results instead of calling them a second time (2-6x duplicate work per page)
- Extract _minimize_section_break_para(para) from convert_v2() inline block;
  keeps the helper independently testable and reduces convert_v2() clutter
- Fix _col_widths_emu_from_gaps() called with original_image_width as
  page_width_px; now _estimate_page_content_height derives page_width_px
  from original_image_width itself, consistent with convert_v2() semantics
- convert_v2() write loop iterates with enumerate(segments) to index into
  spacings_cache; falls back to _compute_vertical_spacing on cache miss

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
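
The caching arrangement described above can be illustrated with a simplified pair of functions. The block dicts, the `compute_spacing` callback, and the dict-based cache are hypothetical; the point is only the pattern of returning the cache from the estimation pass and falling back to recomputation on a miss.

```python
def estimate_page_content_height(segments, compute_spacing):
    """Return (total_height, spacings_cache) so callers can reuse the spacings."""
    cache = {}
    total = 0
    for idx, seg in enumerate(segments):
        spacings = compute_spacing(seg)
        cache[idx] = spacings
        total += sum(spacings) + sum(block["height"] for block in seg)
    return total, cache


def write_segments(segments, cache, compute_spacing):
    """Pair each block with its spacing, reusing cached values when present."""
    written = []
    for idx, seg in enumerate(segments):
        spacings = cache.get(idx)
        if spacings is None:  # cache miss: recompute
            spacings = compute_spacing(seg)
        written.append(list(zip((block["name"] for block in seg), spacings)))
    return written
```

When the cache from the estimation pass is handed to the write pass, `compute_spacing` runs once per segment instead of twice.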
word_converter.py:
- Add module-level _HEADER_FOOTER_LABELS constant (was inlined inside
  convert_v2() body on every call); convert_v2() now references it directly
- Remove deepcopy(content) in build_word_blocks(): content is always str
  (immutable), deep-copying it was a no-op cost
- _xy_cut_segment(): replace two-pass list comprehensions for full_span /
  narrow with a single loop (halves iterations per strip)
- _build_page_metrics(): replace four separate bbox coordinate list
  comprehensions with a single zip(*...) unpack

latex_converter.py:
- _generate_table_latex(): eliminate double td.get_text(strip=True) call
  per cell; cache text in local variable before passing to _escape_latex

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
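
The `zip(*...)` change mentioned for `_build_page_metrics()` follows a common idiom, shown here in isolation. `content_bbox` is a hypothetical helper and is not taken from the PR.

```python
def content_bbox(bboxes):
    """Return the bounding box enclosing all (x0, y0, x1, y1) boxes.

    A single zip(*bboxes) transposes the list once, instead of building
    four separate coordinate lists with four comprehensions.
    Raises ValueError if bboxes is empty.
    """
    x0s, y0s, x1s, y1s = zip(*bboxes)
    return min(x0s), min(y0s), max(x1s), max(y1s)
```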
