Refactor: extract DocumentConverter layer #5065
Open
TingquanGao wants to merge 12 commits into PaddlePaddle:develop
- Extract MarkdownConverter, WordConverter, LatexConverter from mixin.py into dedicated modules under common/result/converter/
- Extract format functions into format_funcs.py with clear naming
- Eliminate handle_funcs_dict build duplication via build_handle_funcs_dict()
- Remove inspect.signature hack; use explicit converter APIs
- Remove pass-through fields from MarkdownConverter.convert()
- Slim down result_v2.py and pp_doctranslation/result.py by delegating to converter layer

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace _detect_columns / _segment_page / _sort_dual_column_blocks with XY-Cut projection-based segmentation:
- Add _find_projection_gaps(): projects blocks onto the X or Y axis using an occupation array and returns unoccupied gap intervals.
- Add _xy_cut_segment(): recursively splits a page into horizontal strips via Y-axis gaps, then detects columns within each strip via X-axis gaps. Supports single / dual / triple column layouts and merges adjacent same-type segments.
- Update convert_v2(): add original_image_height param, call _xy_cut_segment() instead of the old two-step detect+segment; extend section writing to handle triple-column (3 cols + 2 column breaks).
- result.py _to_word(): include original_image_height in returned dict.
- mixin.py save_to_word(): pass original_image_height to convert_v2().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
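The projection step described in this commit can be illustrated with a minimal sketch. It is not the PR's _find_projection_gaps(); the function name, signature, and the min_gap parameter below are assumptions made for illustration. Each block's bbox marks the pixels it covers along one axis, and the unmarked runs are returned as gaps.

```python
from typing import List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def find_projection_gaps(
    bboxes: Sequence[BBox], axis: int, extent: int, min_gap: int = 1
) -> List[Tuple[int, int]]:
    """Return [start, end) intervals along `axis` that no bbox covers.

    axis=0 projects onto X, axis=1 onto Y; `extent` is the page size in
    pixels along that axis. Hypothetical helper, not the PR's code.
    """
    # Occupation array: one flag per pixel along the chosen axis.
    occupied = [False] * extent
    for box in bboxes:
        lo = max(0, int(box[axis]))
        hi = min(extent, int(box[axis + 2]))
        for i in range(lo, hi):
            occupied[i] = True

    # Collect maximal runs of unoccupied pixels.
    gaps: List[Tuple[int, int]] = []
    start = None
    for i, used in enumerate(occupied):
        if not used and start is None:
            start = i
        elif used and start is not None:
            if i - start >= min_gap:
                gaps.append((start, i))
            start = None
    if start is not None and extent - start >= min_gap:
        gaps.append((start, extent))
    return gaps
```

With two side-by-side blocks, the X projection yields a left margin gap, a gap between the blocks, and a right margin gap; an XY-Cut style segmenter would treat the middle one as a column divider candidate.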
When use_chart_recognition=True, chart blocks have VLM text content but no image. The previous code skipped them (Word) or crashed (LaTeX) by assuming block.image is always set.
- build_word_blocks(): separate chart from image/seal; if chart has VLM text content, convert pipe-delimited text to tab-delimited and set label to "table" to reuse existing table rendering
- _to_latex() in result_v2.py: same logic, avoids TypeError on None

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
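As a rough illustration of the pipe-to-tab conversion mentioned above, the sketch below assumes the VLM output is one row per line with cells separated by "|". The exact text format and the helper name pipe_table_to_tsv are assumptions, not taken from the PR.

```python
def pipe_table_to_tsv(text: str) -> str:
    """Convert pipe-delimited rows into tab-delimited rows.

    Hypothetical helper: assumes one table row per line and '|' as the
    cell separator, with optional leading/trailing pipes.
    """
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        # Drop outer pipes, then split and trim each cell.
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append("\t".join(cells))
    return "\n".join(rows)
```

The resulting tab-delimited text could then be handed to whatever already renders blocks labelled "table".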
…tection
- Extend LAYOUT_EXCLUDE_LABELS to include aside_text, seal, number, and formula_number, preventing narrow marginal blocks from creating false column gaps during X-axis projection
- Re-insert seal and formula_number blocks into the correct segment after column detection (body content, just excluded from the detection pass)
- Add _classify_number_position() to remap 'number' labels to header / footer / aside_text based on bbox position (top 10%, bottom 10%, side 15%)
- Add _write_aside_text() to output aside_text as a framed paragraph with w:framePr, positioned in the left or right margin based on x_center vs page_width/2
- Filter page-edge X-axis gaps (within 8% of page left/right boundary) before column-count detection; prevents page margins from being mistaken for column dividers, fixing the bug where single-column pages were detected as triple and dual-column pages collapsed to single

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
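The two threshold rules in this commit (10% / 10% / 15% for 'number' blocks, 8% for page-edge gaps) might look roughly like the sketch below. The function names, the use of the bbox centre, the fallback of leaving the label unchanged, and the exact edge test are all assumptions; only the percentages come from the commit message.

```python
from typing import List, Tuple


def classify_number_position(
    bbox: Tuple[float, float, float, float], page_width: float, page_height: float
) -> str:
    """Remap a 'number' block by where its centre sits on the page."""
    x0, y0, x1, y1 = bbox
    x_center = (x0 + x1) / 2
    y_center = (y0 + y1) / 2
    if y_center < 0.10 * page_height:  # top 10%
        return "header"
    if y_center > 0.90 * page_height:  # bottom 10%
        return "footer"
    if x_center < 0.15 * page_width or x_center > 0.85 * page_width:  # side 15%
        return "aside_text"
    return "number"  # assumption: label is left unchanged otherwise


def drop_page_edge_gaps(
    gaps: List[Tuple[float, float]], page_width: float, edge_ratio: float = 0.08
) -> List[Tuple[float, float]]:
    """Discard X-axis gaps that begin in the left band or end in the right band."""
    band = edge_ratio * page_width
    return [(s, e) for s, e in gaps if s >= band and e <= page_width - band]
```

Filtering the edge gaps first means that only interior gaps are counted when deciding between single, dual, and triple column layouts.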
python-docx w:framePr support causes anchor symbols and title displacement. aside_text blocks are now silently discarded in convert_v2(); they remain in LAYOUT_EXCLUDE_LABELS and HEADER_FOOTER_LABELS to avoid polluting column detection and body segmentation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- build_word_blocks(): when include_bbox=True, also inject page_index from block attribute so convert_v2() can group blocks by page
- restructure_pages(): assign page_index to each block before merging, enabling per-page layout detection in multi-page documents
- _write_block(): calculate image width proportionally from bbox/page width ratio instead of fixed 5-inch fallback

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add pixel-to-EMU layout metrics, vertical spacing, horizontal indent, proportional table width, and unequal column width support.
- _build_page_metrics(): compute A4 scale factors, content bbox, margins (clamped 0.3-2.0 inch), usable_width_emu per page
- _compute_vertical_spacing(): y-gap → space_before (EMU), 3pt quantized, capped at 1 inch; first block always 0
- _compute_horizontal_indent(): single-column left indent from bbox offset, skips centered blocks; >3% page width threshold
- _write_block(): new params space_before_emu, left_indent_emu, usable_width_emu; applied to text/image/table paragraphs
- _set_section_columns(): new params col_widths_twips / gap_widths_twips for unequal-width columns via individual w:col XML elements
- _xy_cut_segment(): store _x_gaps in multi-column segments
- convert_v2() main loop: apply page margins to section, compute col_widths_twips from _x_gaps, pass spacing/indent/usable_width to each _write_block() call

All changes are convert_v2()-only; convert() and old callers unaffected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
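A small sketch of the spacing arithmetic described above, assuming the standard unit sizes of 914400 EMU per inch and 12700 EMU per point. The helper names, the emu_per_px scale argument, and the choice to round to the nearest 3pt step are assumptions; the 0.3-2.0 inch clamp, the 3pt quantization, the 1 inch cap, and the zero spacing for the first block come from the commit message.

```python
from typing import Optional

EMU_PER_INCH = 914400
EMU_PER_PT = 12700


def clamp_margin_emu(margin_px: float, emu_per_px: float) -> int:
    """Convert a pixel margin to EMU, clamped to the 0.3-2.0 inch range."""
    lo = EMU_PER_INCH * 3 // 10  # 0.3 inch
    hi = EMU_PER_INCH * 2        # 2.0 inch
    return max(lo, min(hi, int(margin_px * emu_per_px)))


def vertical_spacing_emu(
    prev_bottom_px: Optional[float], top_px: float, emu_per_px: float
) -> int:
    """Map the y-gap between two blocks to a space_before value in EMU."""
    if prev_bottom_px is None:
        return 0  # first block on the page gets no extra spacing
    gap_px = max(0.0, top_px - prev_bottom_px)
    step = 3 * EMU_PER_PT  # quantize to 3pt steps (rounding mode assumed)
    quantized = int(round(gap_px * emu_per_px / step)) * step
    return min(quantized, EMU_PER_INCH)  # cap at 1 inch
```

Quantizing keeps near-identical gaps from producing slightly different paragraph spacings, and the cap keeps one large gap in the source image from pushing content off the page.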
…vert_v2()
Three root causes of ~1.5" vertical overflow are addressed:
1. Page size mismatch: set section.page_width/page_height to A4 (7560820 / 10693400 EMU) so the Word page matches the A4-based _build_page_metrics() scale. Previously python-docx defaulted to US Letter (11.00"), causing ~0.69" overflow.
2. Default 1.15x line spacing: override pPrDefault w:line to 240 (1.0x) and w:after to 0 immediately after Document() creation. This eliminates ~0.5" of unbudgeted vertical space from python-docx's template default.
3. Column break paragraph spacing: explicitly set space_before=0 / space_after=0 on the empty paragraph holding the column break, preventing ~0.3" of inherited default spacing per break.

Defense-in-depth: _set_paragraph_style() now also forces line_spacing=1.0. Table spacer paragraphs get line_spacing=Pt(1) to minimize their footprint.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…vert_v2()
Add height estimation and proportional compression to prevent Word reflow from causing single-page content to overflow onto a second page.

Changes:
- _build_page_metrics(): add usable_height_emu to returned metrics
- _estimate_block_height(): estimate rendered EMU height per block type (image via PIL aspect ratio, table with 1.3x safety, text with bbox inflation + LINE_HEIGHT_FACTOR 1.2, no-bbox fallback via char-count estimate)
- _estimate_page_content_height(): sum all segments (single/multi-column), taking max column height for multi-col segments + section break overhead
- convert_v2(): compute v_scale (SAFETY_MARGIN=0.95) before writing; apply v_scale to all space_before_emu; scale images via max_height_emu when overflow is severe (v_scale < 0.85)
- _write_block(): add max_height_emu param; constrain image height while preserving aspect ratio via PIL natural dimensions
- convert_v2(): minimize section break paragraph height (font=1pt, line_spacing=1pt) to reduce CONTINUOUS section break overhead

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
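The compression logic above can be sketched as follows. The constants SAFETY_MARGIN=0.95 and the 0.85 threshold come from the commit message; the exact formula for v_scale, the segment representation (a list of columns, each a list of block heights), and the function names are assumptions for illustration.

```python
from typing import List

SAFETY_MARGIN = 0.95
SEVERE_OVERFLOW_THRESHOLD = 0.85


def estimate_page_height(
    segments: List[List[List[int]]], section_break_overhead: int = 0
) -> int:
    """Sum segment heights; a multi-column segment is as tall as its tallest column."""
    total = 0
    for columns in segments:
        total += max((sum(col) for col in columns), default=0)
        if len(columns) > 1:
            total += section_break_overhead  # assumed: charged per multi-column segment
    return total


def compute_v_scale(estimated_height_emu: float, usable_height_emu: float) -> float:
    """Scale factor (<= 1.0) that fits the estimate into the usable height."""
    if estimated_height_emu <= 0:
        return 1.0
    return min(1.0, usable_height_emu * SAFETY_MARGIN / estimated_height_emu)


def should_scale_images(v_scale: float) -> bool:
    """Images are only shrunk when spacing compression alone is not enough."""
    return v_scale < SEVERE_OVERFLOW_THRESHOLD
```

Under this reading, mild overflow is absorbed by scaling space_before values only, and image heights are constrained only once v_scale drops below 0.85.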
- Add _IMAGE_LABELS constant (replaces 4 inline tuple literals)
- Add _get_image_size() helper (deduplicates PIL open/close in _write_block and _estimate_block_height)
- Add _col_widths_emu_from_gaps() helper for x_gaps → col_widths_emu conversion
- Fix _estimate_block_height() image ratio: use original_image_width (page px width) as denominator instead of back-converting column_width_emu to pixels, matching _write_block() semantics; add original_image_width param
- Pass original_image_width through _estimate_page_content_height()
- Remove unused x_gap_cols parameter and seg_idx loop variable
- Move import math to module level; move docx.shared.Inches import to function top instead of inside try block

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
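One plausible shape for the x_gaps → column width conversion is sketched below. It is not the PR's _col_widths_emu_from_gaps(); the signature, the content_left / content_right arguments, and the returned pair are assumptions. The idea is that the spans between consecutive gaps are the columns, and the gaps themselves become the inter-column spacing.

```python
from typing import List, Sequence, Tuple


def col_widths_from_gaps(
    x_gaps: Sequence[Tuple[int, int]],
    content_left: int,
    content_right: int,
    emu_per_px: int,
) -> Tuple[List[int], List[int]]:
    """Return (column widths, gap widths) in EMU for interior X-axis gaps."""
    col_widths: List[int] = []
    gap_widths: List[int] = []
    cursor = content_left
    for gap_start, gap_end in sorted(x_gaps):
        col_widths.append((gap_start - cursor) * emu_per_px)   # column before this gap
        gap_widths.append((gap_end - gap_start) * emu_per_px)  # the gap itself
        cursor = gap_end
    col_widths.append((content_right - cursor) * emu_per_px)    # last column
    return col_widths, gap_widths
```

A single interior gap produces two columns and one gap width; two gaps produce the triple-column case.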
…h_px arg
- _estimate_page_content_height() now returns (height, spacings_cache) so the write loop in convert_v2() can reuse pre-computed _compute_vertical_spacing results instead of calling them a second time (2-6x duplicate work per page)
- Extract _minimize_section_break_para(para) from convert_v2() inline block; keeps the helper independently testable and reduces convert_v2() clutter
- Fix _col_widths_emu_from_gaps() called with original_image_width as page_width_px; now _estimate_page_content_height derives page_width_px from original_image_width itself, consistent with convert_v2() semantics
- convert_v2() write loop iterates with enumerate(segments) to index into spacings_cache; falls back to _compute_vertical_spacing on cache miss

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
word_converter.py:
- Add module-level _HEADER_FOOTER_LABELS constant (was inlined inside convert_v2() body on every call); convert_v2() now references it directly
- Remove deepcopy(content) in build_word_blocks(): content is always str (immutable), deep-copying it was a no-op cost
- _xy_cut_segment(): replace two-pass list comprehensions for full_span / narrow with a single loop (halves iterations per strip)
- _build_page_metrics(): replace four separate bbox coordinate list comprehensions with a single zip(*...) unpack

latex_converter.py:
- _generate_table_latex(): eliminate double td.get_text(strip=True) call per cell; cache text in local variable before passing to _escape_latex

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
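The zip(*...) unpack mentioned for _build_page_metrics() is a standard Python idiom for transposing a list of tuples. A minimal sketch, with a hypothetical content_bbox() helper that is not from the PR:

```python
from typing import Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def content_bbox(bboxes: Sequence[BBox]) -> BBox:
    """Bounding box enclosing all block bboxes (input must be non-empty)."""
    # One pass transposes the list into four coordinate tuples,
    # replacing four separate list comprehensions.
    x0s, y0s, x1s, y1s = zip(*bboxes)
    return min(x0s), min(y0s), max(x1s), max(y1s)
```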
Summary
Refactor the document format conversion logic for PP-StructureV3 and PaddleOCR-VL pipelines by introducing a unified converter/ module.

Changes
- converter/ module with MarkdownConverter, WordConverter, LatexConverter
- build_handle_funcs_dict() across both pipelines, eliminating ~90 lines of duplication
- format_funcs.py
- build_word_blocks() to eliminate _to_word() duplication between Result classes
- save_to_word() support for PaddleOCR-VL pipeline
- inspect.signature hack from mixin.py

Bug Fixes
- % [Image not found])
- reference block returning a tuple instead of str
- use_layout_detection check in PaddleOCRVLPagesResult._to_markdown()
- block.content in-place mutation side effect

Architecture
Before: Format conversion logic was inlined in mixin.py (900+ lines), result_v2.py (700+ lines), and paddleocr_vl/result.py (600+ lines) with significant duplication.
After: Clean converter/ module with single-responsibility classes. mixin.py reduced to ~130 lines (save logic only).

Testing
🤖 Generated with Claude Code
Known Issue (pre-existing in develop)
use_chart_recognition=True triggers a NameError: name 'PretrainedConfig' is not defined in paddlex/inference/models/common/transformers/transformers/conversion_utils.py:104. This is caused by a missing runtime import (only imported under TYPE_CHECKING) introduced in develop commit 3a94e07ef (PR #5058). This issue is unrelated to this refactoring PR and should be fixed separately in develop.
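The failure mode described here can be reproduced in isolation. The sketch below does not use the paddlex code; fractions.Fraction is a stand-in for PretrainedConfig, and both function names are hypothetical. A name imported only under TYPE_CHECKING exists for static type checkers but is never bound at runtime, so any runtime use raises NameError.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated by static type checkers only; skipped at runtime.
    from fractions import Fraction


def is_fraction_broken(value: object) -> bool:
    # Fraction was never imported at runtime, so this raises NameError.
    return isinstance(value, Fraction)


def is_fraction_fixed(value: object) -> bool:
    # A real runtime import binds the name before it is used.
    from fractions import Fraction

    return isinstance(value, Fraction)
```

The usual fixes are either a real import (module-level or local, as above) or restricting the name to annotations that are never evaluated at runtime.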