Refactor: extract DocumentConverter layer by TingquanGao · Pull Request #5065 · PaddlePaddle/PaddleX

TingquanGao · 2026-03-25T15:02:09Z

Summary

Refactor the document format conversion logic for PP-StructureV3 and PaddleOCR-VL pipelines by introducing a unified converter/ module.

Changes

- Extract converter/ module with MarkdownConverter, WordConverter, LatexConverter
- Unify build_handle_funcs_dict() across both pipelines, eliminating ~90 lines of duplication
- Consolidate format functions into format_funcs.py
- Extract build_word_blocks() to eliminate _to_word() duplication between Result classes
- Add save_to_word() support for PaddleOCR-VL pipeline
- Remove inspect.signature hack from mixin.py

Bug Fixes

- Fix Word output containing LaTeX placeholder text (% [Image not found])
- Fix LaTeX reference block returning a tuple instead of str
- Fix missing use_layout_detection check in PaddleOCRVLPagesResult._to_markdown()
- Fix block.content in-place mutation side effect
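
The last fix above concerns a formatter writing its output back onto the block it reads from. The following is a minimal sketch of that class of bug, not the PaddleX code: the Block dataclass and both render functions are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for a parsed layout block."""
    label: str
    content: str


def render_title_mutating(block: Block) -> str:
    # Anti-pattern: the formatted text is stored back on the block,
    # so a second conversion sees already-formatted content.
    block.content = "# " + block.content
    return block.content


def render_title_pure(block: Block) -> str:
    # Safer: derive the output from the block without modifying it.
    return "# " + block.content


block = Block(label="doc_title", content="Results")
first = render_title_pure(block)
second = render_title_pure(block)  # same result, block is untouched
```

With the pure variant, converting the same result to Markdown and then to Word starts from identical block contents each time.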

Architecture

Before: Format conversion logic was inlined in mixin.py (900+ lines), result_v2.py (700+ lines), and paddleocr_vl/result.py (600+ lines) with significant duplication.

After: Clean converter/ module with single-responsibility classes. mixin.py reduced to ~130 lines (save logic only).

Testing

- 54 unit tests covering all Converter classes and format functions
- PP-StructureV3 E2E: 10/10 scenarios PASS (including word/latex output verification)
- PaddleOCR-VL E2E: functional verification complete

🤖 Generated with Claude Code

Known Issue (pre-existing in develop)

use_chart_recognition=True triggers a NameError: name 'PretrainedConfig' is not defined in paddlex/inference/models/common/transformers/transformers/conversion_utils.py:104. This is caused by a missing runtime import (only imported under TYPE_CHECKING) introduced in develop commit 3a94e07ef (PR #5058). This issue is unrelated to this refactoring PR and should be fixed separately in develop.
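
The failure mode described above (a name imported only under TYPE_CHECKING but used at runtime) can be reproduced in isolation. This is a generic sketch using decimal.Decimal as a stand-in for PretrainedConfig; it does not reproduce the actual conversion_utils.py code, and the function names are hypothetical.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated by static type checkers only; skipped at runtime.
    from decimal import Decimal


def is_decimal_broken(value: object) -> bool:
    # Decimal was never imported at runtime, so this lookup fails.
    return isinstance(value, Decimal)


def is_decimal_fixed(value: object) -> bool:
    # A real runtime import makes the name available.
    from decimal import Decimal

    return isinstance(value, Decimal)


try:
    is_decimal_broken(1)
    error_message = ""
except NameError as exc:
    error_message = str(exc)
```

The annotation-only import is fine as long as the name appears only in type hints; any runtime use such as isinstance() or a call needs a real import.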

- Extract MarkdownConverter, WordConverter, LatexConverter from mixin.py into dedicated modules under common/result/converter/
- Extract format functions into format_funcs.py with clear naming
- Eliminate handle_funcs_dict build duplication via build_handle_funcs_dict()
- Remove inspect.signature hack; use explicit converter APIs
- Remove pass-through fields from MarkdownConverter.convert()
- Slim down result_v2.py and pp_doctranslation/result.py by delegating to converter layer

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

paddle-bot · 2026-03-25T15:02:16Z

Thanks for your contribution!

Replace _detect_columns / _segment_page / _sort_dual_column_blocks with XY-Cut projection-based segmentation:

- Add _find_projection_gaps(): projects blocks onto X or Y axis using an occupation array and returns unoccupied gap intervals.
- Add _xy_cut_segment(): recursively splits a page into horizontal strips via Y-axis gaps, then detects columns within each strip via X-axis gaps. Supports single / dual / triple column layouts and merges adjacent same-type segments.
- Update convert_v2(): add original_image_height param, call _xy_cut_segment() instead of old two-step detect+segment; extend section writing to handle triple-column (3 cols + 2 column breaks).
- result.py _to_word(): include original_image_height in returned dict.
- mixin.py save_to_word(): pass original_image_height to convert_v2().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
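
The projection step described for _find_projection_gaps() can be sketched as below. This is an illustration under assumptions, not the PR's implementation: the function name, signature, the (x0, y0, x1, y1) pixel bbox layout, and the min_gap parameter are all hypothetical.

```python
from typing import List, Sequence, Tuple

BBox = Tuple[int, int, int, int]  # assumed layout: (x0, y0, x1, y1) in pixels


def find_projection_gaps(
    bboxes: Sequence[BBox], axis: str, extent: int, min_gap: int = 1
) -> List[Tuple[int, int]]:
    """Return half-open [start, end) intervals along `axis` covered by no bbox."""
    occupied = [False] * extent  # the occupation array
    for box in bboxes:
        lo, hi = (box[0], box[2]) if axis == "x" else (box[1], box[3])
        for i in range(max(lo, 0), min(hi, extent)):
            occupied[i] = True

    gaps: List[Tuple[int, int]] = []
    start = None
    for i, is_occupied in enumerate(occupied):
        if not is_occupied and start is None:
            start = i  # a gap opens
        elif is_occupied and start is not None:
            if i - start >= min_gap:
                gaps.append((start, i))  # the gap closes
            start = None
    if start is not None and extent - start >= min_gap:
        gaps.append((start, extent))  # gap running to the page edge
    return gaps


# Two side-by-side blocks on a 1000 px wide, 1000 px tall page.
two_columns = [(50, 100, 450, 900), (550, 100, 950, 900)]
x_gaps = find_projection_gaps(two_columns, axis="x", extent=1000)
y_gaps = find_projection_gaps(two_columns, axis="y", extent=1000)
```

Here x_gaps contains the two page margins plus the interior gap (450, 550), which is the interval a column detector would treat as a divider between the two columns.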

When use_chart_recognition=True, chart blocks have VLM text content but no image. The previous code skipped them (Word) or crashed (LaTeX) by assuming block.image is always set.

- build_word_blocks(): separate chart from image/seal; if chart has VLM text content, convert pipe-delimited text to tab-delimited and set label to "table" to reuse existing table rendering
- _to_latex() in result_v2.py: same logic, avoids TypeError on None

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
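
The pipe-to-tab conversion mentioned for chart blocks might look roughly like this. The helper name and the exact shape of the VLM output (one row per line, cells separated by "|") are assumptions; the actual format handled by build_word_blocks() is not shown in this PR description.

```python
def pipe_to_tab(text: str) -> str:
    """Convert pipe-delimited rows to tab-delimited rows (hypothetical helper)."""
    rows = []
    for line in text.strip().splitlines():
        # Drop optional outer pipes, then split into cells and trim whitespace.
        # Note: this also drops empty leading or trailing cells.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        rows.append("\t".join(cells))
    return "\n".join(rows)


chart_text = "Year | Sales\n2023 | 10\n2024 | 12"
table_text = pipe_to_tab(chart_text)
```

Once the text is tab-delimited, the block can be relabeled as "table" and handed to the existing table renderer, as the commit message describes.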

…tection

- Extend LAYOUT_EXCLUDE_LABELS to include aside_text, seal, number, and formula_number, preventing narrow marginal blocks from creating false column gaps during X-axis projection
- Re-insert seal and formula_number blocks into the correct segment after column detection (body content, just excluded from the detection pass)
- Add _classify_number_position() to remap 'number' labels to header / footer / aside_text based on bbox position (top 10%, bottom 10%, side 15%)
- Add _write_aside_text() to output aside_text as a framed paragraph with w:framePr, positioned in left or right margin based on x_center vs page_width/2
- Filter page-edge X-axis gaps (within 8% of page left/right boundary) before column-count detection; prevents page margins from being mistaken for column dividers, fixing the bug where single-column pages were detected as triple and dual-column pages collapsed to single

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
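
The remapping rule for 'number' blocks can be sketched from the thresholds quoted above (top 10%, bottom 10%, side 15%). Everything beyond those three percentages is an assumption here: the function name and signature, the use of the bbox center rather than its edges, the order of the checks, and the fallback of keeping the label as "number".

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def classify_number_position(
    bbox: BBox, page_width: float, page_height: float
) -> str:
    """Remap a 'number' block by where its center sits on the page."""
    x0, y0, x1, y1 = bbox
    x_center = (x0 + x1) / 2
    y_center = (y0 + y1) / 2

    if y_center < 0.10 * page_height:  # top 10% of the page
        return "header"
    if y_center > 0.90 * page_height:  # bottom 10% of the page
        return "footer"
    if x_center < 0.15 * page_width or x_center > 0.85 * page_width:
        return "aside_text"  # left or right 15% band
    return "number"  # assumed fallback: leave the label unchanged


page_w, page_h = 1000, 1400
top = classify_number_position((480, 20, 520, 40), page_w, page_h)
bottom = classify_number_position((480, 1350, 520, 1380), page_w, page_h)
side = classify_number_position((20, 600, 60, 640), page_w, page_h)
body = classify_number_position((480, 600, 520, 640), page_w, page_h)
```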

python-docx w:framePr support causes anchor symbols and title displacement. aside_text blocks are now silently discarded in convert_v2(); they remain in LAYOUT_EXCLUDE_LABELS and HEADER_FOOTER_LABELS to avoid polluting column detection and body segmentation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- build_word_blocks(): when include_bbox=True, also inject page_index from block attribute so convert_v2() can group blocks by page
- restructure_pages(): assign page_index to each block before merging, enabling per-page layout detection in multi-page documents
- _write_block(): calculate image width proportionally from bbox/page width ratio instead of fixed 5-inch fallback

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
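
The proportional image width from the last bullet can be illustrated as follows. This is a sketch under assumptions: the function name, the choice to scale against a usable width given in EMU, and the clamping of the ratio to [0, 1] are not taken from the PR.

```python
from typing import Tuple

EMU_PER_INCH = 914400  # standard Office unit: 914400 EMU per inch

BBox = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1) in px


def image_width_emu(bbox: BBox, page_width_px: float, usable_width_emu: int) -> int:
    """Scale an image so it occupies the same share of the width as on the page."""
    x0, _, x1, _ = bbox
    ratio = (x1 - x0) / page_width_px
    ratio = min(max(ratio, 0.0), 1.0)  # guard against out-of-range bboxes
    return round(ratio * usable_width_emu)


# An image spanning half of a 1000 px wide page, written into a 6-inch text area.
width = image_width_emu((250, 100, 750, 400), 1000, 6 * EMU_PER_INCH)
```

In this example the image comes out 3 inches wide (2743200 EMU) rather than a fixed 5 inches.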

Add pixel-to-EMU layout metrics, vertical spacing, horizontal indent, proportional table width, and unequal column width support.

- _build_page_metrics(): compute A4 scale factors, content bbox, margins (clamped 0.3-2.0 inch), usable_width_emu per page
- _compute_vertical_spacing(): y-gap → space_before (EMU), 3pt quantized, capped at 1 inch; first block always 0
- _compute_horizontal_indent(): single-column left indent from bbox offset, skips centered blocks; >3% page width threshold
- _write_block(): new params space_before_emu, left_indent_emu, usable_width_emu; applied to text/image/table paragraphs
- _set_section_columns(): new params col_widths_twips / gap_widths_twips for unequal-width columns via individual w:col XML elements
- _xy_cut_segment(): store _x_gaps in multi-column segments
- convert_v2() main loop: apply page margins to section, compute col_widths_twips from _x_gaps, pass spacing/indent/usable_width to each _write_block() call

All changes are convert_v2()-only; convert() and old callers unaffected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
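
The vertical spacing rule (y-gap converted to EMU, quantized to 3pt, capped at 1 inch, first block 0) can be sketched as below. The function signature, the integer emu_per_px scale factor, and the choice to round down when quantizing are assumptions; the commit message does not state the rounding direction.

```python
from typing import List, Sequence, Tuple

EMU_PER_INCH = 914400
EMU_PER_PT = 12700
QUANTUM_EMU = 3 * EMU_PER_PT  # 3pt steps
MAX_SPACE_EMU = EMU_PER_INCH  # cap at 1 inch

BBox = Tuple[int, int, int, int]  # assumed layout: (x0, y0, x1, y1) in pixels


def compute_vertical_spacing(bboxes: Sequence[BBox], emu_per_px: int) -> List[int]:
    """Return space_before (EMU) for each block, in reading order."""
    spacings: List[int] = []
    prev_bottom = None
    for _, y0, _, y1 in bboxes:
        if prev_bottom is None:
            spacings.append(0)  # first block always gets 0
        else:
            gap_emu = max(y0 - prev_bottom, 0) * emu_per_px
            quantized = (gap_emu // QUANTUM_EMU) * QUANTUM_EMU  # round down (assumed)
            spacings.append(min(quantized, MAX_SPACE_EMU))
        prev_bottom = y1
    return spacings


# 9525 EMU per pixel corresponds to a 96 DPI page image.
blocks = [(0, 0, 100, 50), (0, 60, 100, 100), (0, 400, 100, 450)]
spacings = compute_vertical_spacing(blocks, emu_per_px=9525)
```

The 10 px gap becomes 6pt (76200 EMU) after quantization, and the 300 px gap is clipped to the 1-inch cap.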

…vert_v2()

Three root causes of ~1.5" vertical overflow are addressed:

1. Page size mismatch: set section.page_width/page_height to A4 (7560820 / 10693400 EMU) so Word page matches the A4-based _build_page_metrics() scale. Previously python-docx defaulted to US Letter (11.00") causing ~0.69" overflow.
2. Default 1.15x line spacing: override pPrDefault w:line to 240 (1.0x) and w:after to 0 immediately after Document() creation. This eliminates ~0.5" of unbudgeted vertical space from python-docx's template default.
3. Column break paragraph spacing: explicitly set space_before=0 / space_after=0 on the empty paragraph holding the column break, preventing ~0.3" of inherited default spacing per break.

Defense-in-depth: _set_paragraph_style() now also forces line_spacing=1.0. Table spacer paragraphs get line_spacing=Pt(1) to minimize their footprint.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
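
The ~0.69" figure for the page size mismatch can be checked with a few lines of arithmetic. The EMU values and the 11.00" Letter height are the ones quoted in the commit message; 914400 EMU per inch is the standard Office conversion factor. The variable names are illustrative only.

```python
EMU_PER_INCH = 914400  # standard Office conversion factor

A4_WIDTH_EMU = 7560820    # value quoted in the commit message
A4_HEIGHT_EMU = 10693400  # value quoted in the commit message
LETTER_HEIGHT_IN = 11.00  # US Letter height, as stated in the commit message

a4_width_in = A4_WIDTH_EMU / EMU_PER_INCH
a4_height_in = A4_HEIGHT_EMU / EMU_PER_INCH

# Content laid out against an A4-tall page but written onto a Letter-tall
# page has this much less vertical room than the metrics assumed.
overflow_in = a4_height_in - LETTER_HEIGHT_IN

print(f"A4: {a4_width_in:.2f} x {a4_height_in:.2f} in, overflow {overflow_in:.2f} in")
```

This prints an A4 size of 8.27 x 11.69 in and an overflow of 0.69 in, matching the first root cause.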

…vert_v2()

Add height estimation and proportional compression to prevent Word reflow from causing single-page content to overflow onto a second page.

Changes:

- _build_page_metrics(): add usable_height_emu to returned metrics
- _estimate_block_height(): estimate rendered EMU height per block type (image via PIL aspect ratio, table with 1.3x safety, text with bbox inflation + LINE_HEIGHT_FACTOR 1.2, no-bbox fallback via char-count estimate)
- _estimate_page_content_height(): sum all segments (single/multi-column), taking max column height for multi-col segments + section break overhead
- convert_v2(): compute v_scale (SAFETY_MARGIN=0.95) before writing; apply v_scale to all space_before_emu; scale images via max_height_emu when overflow is severe (v_scale < 0.85)
- _write_block(): add max_height_emu param; constrain image height while preserving aspect ratio via PIL natural dimensions
- convert_v2(): minimize section break paragraph height (font=1pt, line_spacing=1pt) to reduce CONTINUOUS section break overhead

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
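
One plausible reading of the v_scale computation is sketched below. Only SAFETY_MARGIN=0.95 and the 0.85 severity threshold come from the commit message; the formula itself, the clamp to 1.0, and the function names are assumptions and may differ from what convert_v2() actually does.

```python
SAFETY_MARGIN = 0.95     # from the commit message
SEVERE_THRESHOLD = 0.85  # from the commit message


def compute_v_scale(estimated_height_emu: float, usable_height_emu: float) -> float:
    """Assumed formula: shrink factor so estimated content fits the usable height."""
    if estimated_height_emu <= 0:
        return 1.0
    # Never stretch content; only compress when the estimate exceeds the budget.
    return min(1.0, usable_height_emu * SAFETY_MARGIN / estimated_height_emu)


def needs_image_scaling(v_scale: float) -> bool:
    # Spacing alone absorbs mild overflow; images shrink only when it is severe.
    return v_scale < SEVERE_THRESHOLD


usable = 9_000_000
mild = compute_v_scale(10_000_000, usable)    # about 0.855: spacing only
severe = compute_v_scale(12_000_000, usable)  # about 0.7125: images too
fits = compute_v_scale(5_000_000, usable)     # content already fits
```

Under this reading, each space_before_emu would be multiplied by v_scale, and image heights would additionally be constrained when needs_image_scaling() is true.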

- Add _IMAGE_LABELS constant (replaces 4 inline tuple literals)
- Add _get_image_size() helper (deduplicates PIL open/close in _write_block and _estimate_block_height)
- Add _col_widths_emu_from_gaps() helper for x_gaps → col_widths_emu conversion
- Fix _estimate_block_height() image ratio: use original_image_width (page px width) as denominator instead of back-converting column_width_emu to pixels, matching _write_block() semantics; add original_image_width param
- Pass original_image_width through _estimate_page_content_height()
- Remove unused x_gap_cols parameter and seg_idx loop variable
- Move import math to module level; move docx.shared.Inches import to function top instead of inside try block

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…h_px arg

- _estimate_page_content_height() now returns (height, spacings_cache) so the write loop in convert_v2() can reuse pre-computed _compute_vertical_spacing results instead of calling them a second time (2-6x duplicate work per page)
- Extract _minimize_section_break_para(para) from convert_v2() inline block; keeps the helper independently testable and reduces convert_v2() clutter
- Fix _col_widths_emu_from_gaps() called with original_image_width as page_width_px; now _estimate_page_content_height derives page_width_px from original_image_width itself, consistent with convert_v2() semantics
- convert_v2() write loop iterates with enumerate(segments) to index into spacings_cache; falls back to _compute_vertical_spacing on cache miss

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

word_converter.py:

- Add module-level _HEADER_FOOTER_LABELS constant (was inlined inside convert_v2() body on every call); convert_v2() now references it directly
- Remove deepcopy(content) in build_word_blocks(): content is always str (immutable), deep-copying it was a no-op cost
- _xy_cut_segment(): replace two-pass list comprehensions for full_span / narrow with a single loop (halves iterations per strip)
- _build_page_metrics(): replace four separate bbox coordinate list comprehensions with a single zip(*...) unpack

latex_converter.py:

- _generate_table_latex(): eliminate double td.get_text(strip=True) call per cell; cache text in local variable before passing to _escape_latex

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

TingquanGao and others added 11 commits March 26, 2026 07:23

TingquanGao wants to merge 12 commits into PaddlePaddle:develop from TingquanGao:clean-refactor
