[Feat] Add PaddleOCRLoader for local OCR LangChain integration#17813

Open
Ihebdhouibi wants to merge 4 commits into PaddlePaddle:main from Ihebdhouibi:feat/paddleocr-loader

Conversation

@Ihebdhouibi
Contributor

Add PaddleOCRLoader that wraps the local PaddleOCR library (PP-OCRv5 and PP-StructureV3) to produce LangChain Document objects.

New files:

  • langchain_paddleocr/document_loaders/paddleocr.py: PaddleOCRLoader, PaddleOCRConfig dataclass, custom exception hierarchy

  • tests/unit_tests/document_loaders/test_paddleocr_loader.py: 29 unit tests

  • tests/integration_tests/document_loaders/test_paddleocr_loader.py: Integration tests

Modified files:

  • langchain_paddleocr/__init__.py: Add PaddleOCRLoader export (lazy import for PaddleOCRVLLoader)

  • langchain_paddleocr/document_loaders/__init__.py: Same

  • README.md / README_cn.md: Add PaddleOCRLoader usage docs
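
For readers skimming the description, the shape of such a loader can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SketchConfig`, `SketchOCRLoader`, `UnsupportedFileTypeError`, the `recognize` callable, and the local `Document` stand-in are all hypothetical names, and the real `PaddleOCRConfig` fields and exception classes may differ. The OCR engine is injected as a plain callable so the sketch runs without PaddleOCR or LangChain installed.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Iterator, List, Optional, Tuple


class PaddleOCRLoaderError(Exception):
    """Hypothetical base class for all loader errors."""


class UnsupportedFileTypeError(PaddleOCRLoaderError):
    """Hypothetical error for file extensions outside the allow-list."""


@dataclass(frozen=True)
class SketchConfig:
    """Assumed options; the PR's PaddleOCRConfig may expose different fields."""
    lang: str = "en"
    allowed_suffixes: Tuple[str, ...] = (".png", ".jpg", ".jpeg", ".pdf")


@dataclass
class Document:
    """Stand-in for LangChain's Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class SketchOCRLoader:
    def __init__(
        self,
        file_path: str,
        recognize: Callable[[str, SketchConfig], List[str]],
        config: Optional[SketchConfig] = None,
    ) -> None:
        self.file_path = Path(file_path)
        self.recognize = recognize  # injected OCR engine: returns one string per page
        self.config = config or SketchConfig()

    def lazy_load(self) -> Iterator[Document]:
        # Validate the extension before running the (expensive) OCR engine.
        suffix = self.file_path.suffix.lower()
        if suffix not in self.config.allowed_suffixes:
            raise UnsupportedFileTypeError(f"unsupported file type: {suffix!r}")
        pages = self.recognize(str(self.file_path), self.config)
        for index, text in enumerate(pages):
            yield Document(
                page_content=text,
                metadata={"source": str(self.file_path), "page": index},
            )

    def load(self) -> List[Document]:
        return list(self.lazy_load())
```

With a fake engine such as `lambda path, config: ["page one", "page two"]`, calling `SketchOCRLoader("scan.pdf", engine).load()` yields one document per page, each carrying the source path and page index in its metadata.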

@paddle-bot

paddle-bot bot commented Mar 15, 2026

Thanks for your contribution!

Member

@Bobholamovic Bobholamovic left a comment


Thank you very much for your contribution! I'd like to clarify a point to avoid any potential confusion: within the project, PaddleOCR typically refers to the general OCR capability, whereas the PP-Structure series is intended for more complex document parsing. Since they differ quite a bit in terms of scope and design goals, it might be better to keep them separated. From an architectural perspective, I would gently suggest not combining them into a single class, so that the design remains clearer and easier to maintain.
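
One way to read this suggestion is a small shared base class with one subclass per pipeline. The sketch below is only an illustration under assumptions: the class names `BaseLocalLoader`, `GeneralOCRLoader`, and `StructureLoader` are hypothetical, the raw result shapes are invented for the example and are not the actual output formats of PP-OCRv5 or PP-StructureV3, and plain dicts stand in for LangChain Document objects.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List


class BaseLocalLoader(ABC):
    """Shared plumbing; each subclass owns exactly one pipeline."""

    pipeline: str = "base"

    def __init__(self, file_path: str, engine: Callable[[str], Any]) -> None:
        self.file_path = file_path
        self._engine = engine  # injected callable standing in for a local pipeline

    @abstractmethod
    def _to_texts(self, raw: Any) -> List[str]:
        """Convert pipeline-specific raw output into one string per page."""

    def load(self) -> List[Dict[str, Any]]:
        raw = self._engine(self.file_path)
        return [
            {
                "page_content": text,
                "metadata": {"source": self.file_path, "pipeline": self.pipeline},
            }
            for text in self._to_texts(raw)
        ]


class GeneralOCRLoader(BaseLocalLoader):
    """General OCR: assumed raw shape is a list of text lines per page."""

    pipeline = "ocr"

    def _to_texts(self, raw: List[List[str]]) -> List[str]:
        return ["\n".join(lines) for lines in raw]


class StructureLoader(BaseLocalLoader):
    """Document parsing: assumed raw shape is a dict with a 'markdown' key per page."""

    pipeline = "structure"

    def _to_texts(self, raw: List[Dict[str, str]]) -> List[str]:
        return [page["markdown"] for page in raw]
```

Keeping the two loaders separate means options that only make sense for document parsing never leak into the general OCR class, while the conversion to documents stays in one place.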

@Ihebdhouibi
Contributor Author

> Thank you very much for your contribution! I'd like to clarify a point to avoid any potential confusion: within the project, PaddleOCR typically refers to the general OCR capability, whereas the PP-Structure series is intended for more complex document parsing. Since they differ quite a bit in terms of scope and design goals, it might be better to keep them separated. From an architectural perspective, I would gently suggest not combining them into a single class, so that the design remains clearer and easier to maintain.

Absolutely spot on. I'll fix that and update the PR.

Add PaddleOCRLoader that wraps the local PaddleOCR library (PP-OCRv5 and PP-StructureV3) to produce LangChain Document objects without requiring any cloud API.

New files:

- langchain_paddleocr/document_loaders/paddleocr.py: PaddleOCRLoader, PaddleOCRConfig dataclass, custom exception hierarchy

- tests/unit_tests/document_loaders/test_paddleocr_loader.py: 29 unit tests

- tests/integration_tests/document_loaders/test_paddleocr_loader.py: Integration tests

Modified files:

- langchain_paddleocr/__init__.py: Add PaddleOCRLoader export (lazy import for PaddleOCRVLLoader)

- langchain_paddleocr/document_loaders/__init__.py: Same

- README.md / README_cn.md: Add PaddleOCRLoader usage docs
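
The "lazy import" mentioned for the `__init__.py` files can be done with a module-level `__getattr__` (PEP 562), which defers importing a heavy dependency until the exported name is first accessed. The sketch below is an assumption about the technique, not the code in this commit: `make_lazy_getattr` is a hypothetical helper, and a standard-library class stands in for the loader so the example runs on its own.

```python
import importlib
import types
from typing import Callable, Dict, Optional, Tuple


def make_lazy_getattr(
    exports: Dict[str, Tuple[str, str]], package: Optional[str] = None
) -> Callable[[str], object]:
    """Build a module-level __getattr__ that imports exported names on first access."""

    def __getattr__(name: str) -> object:
        try:
            module_name, attr = exports[name]
        except KeyError:
            raise AttributeError(f"module has no attribute {name!r}") from None
        # The import happens here, not when the package itself is imported.
        module = importlib.import_module(module_name, package)
        return getattr(module, attr)

    return __getattr__


# Demo on a throwaway module object. In a real package __init__.py, the mapping
# would point at the loader's submodule (a relative name plus package=__name__).
demo_pkg = types.ModuleType("demo_pkg")
demo_pkg.__getattr__ = make_lazy_getattr({"Fraction": ("fractions", "Fraction")})

half = demo_pkg.Fraction(1, 2)  # triggers the deferred import of `fractions`
```

In an actual `__init__.py`, the function would simply be defined at the top level as `def __getattr__(name): ...`, so that importing the package does not import the optional dependency until the loader is requested.
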
@Ihebdhouibi Ihebdhouibi force-pushed the feat/paddleocr-loader branch from d4cb412 to 433f158 Compare March 21, 2026 11:11
@Ihebdhouibi
Contributor Author

@Bobholamovic The requested changes are done and ready for review.

3 participants