[Feat] Add PaddleOCRLoader for local OCR LangChain integration#17813
Ihebdhouibi wants to merge 4 commits into PaddlePaddle:main
Thanks for your contribution!
Bobholamovic left a comment
Thank you very much for your contribution! I'd like to clarify a point to avoid any potential confusion: within the project, PaddleOCR typically refers to the general OCR capability, whereas the PP-Structure series is intended for more complex document parsing. Since they differ quite a bit in terms of scope and design goals, it might be better to keep them separated. From an architectural perspective, I would gently suggest not combining them into a single class, so that the design remains clearer and easier to maintain.
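The separation suggested above could look roughly like the following. This is only a minimal sketch, not the PR's code: `PPStructureLoader` is a hypothetical name, `Document` is a stand-in for LangChain's `Document` (same `page_content` and `metadata` fields), and the OCR and parsing engines are injected as plain callables so the example runs without PaddleOCR or LangChain installed.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterator, List


@dataclass
class Document:
    """Stand-in for LangChain's Document: text plus metadata."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class PaddleOCRLoader:
    """General OCR only: returns the recognized text of one file."""

    def __init__(self, file_path: str, recognize: Callable[[str], List[str]]):
        self.file_path = file_path
        # Assumption: `recognize` wraps the real OCR pipeline and
        # returns one string per recognized text line.
        self._recognize = recognize

    def lazy_load(self) -> Iterator[Document]:
        lines = self._recognize(self.file_path)
        yield Document(
            page_content="\n".join(lines),
            metadata={"source": self.file_path, "pipeline": "ocr"},
        )


class PPStructureLoader:
    """Document parsing only (hypothetical class): one Document per page."""

    def __init__(self, file_path: str, parse: Callable[[str], List[str]]):
        self.file_path = file_path
        # Assumption: `parse` wraps the real parsing pipeline and
        # returns one Markdown string per page.
        self._parse = parse

    def lazy_load(self) -> Iterator[Document]:
        for page_number, markdown in enumerate(self._parse(self.file_path), start=1):
            yield Document(
                page_content=markdown,
                metadata={
                    "source": self.file_path,
                    "page": page_number,
                    "pipeline": "structure",
                },
            )
```

Keeping the two loaders apart means each one only carries the configuration and output shape of its own pipeline, which is the maintainability point the review raises.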
Absolutely spot on, I'll fix that and update the PR.
Add PaddleOCRLoader that wraps the local PaddleOCR library (PP-OCRv5 and PP-StructureV3) to produce LangChain Document objects without requiring any cloud API.

New files:
- langchain_paddleocr/document_loaders/paddleocr.py: PaddleOCRLoader, PaddleOCRConfig dataclass, custom exception hierarchy
- tests/unit_tests/document_loaders/test_paddleocr_loader.py: 29 unit tests
- tests/integration_tests/document_loaders/test_paddleocr_loader.py: Integration tests

Modified files:
- langchain_paddleocr/__init__.py: Add PaddleOCRLoader export (lazy import for PaddleOCRVLLoader)
- langchain_paddleocr/document_loaders/__init__.py: Same
- README.md / README_cn.md: Add PaddleOCRLoader usage docs
d4cb412 to 433f158
@Bobholamovic Changes made as requested; pending review.
Add PaddleOCRLoader that wraps the local PaddleOCR library (PP-OCRv5 and PP-StructureV3) to produce LangChain Document objects.
New files:
langchain_paddleocr/document_loaders/paddleocr.py: PaddleOCRLoader, PaddleOCRConfig dataclass, custom exception hierarchy
tests/unit_tests/document_loaders/test_paddleocr_loader.py: 29 unit tests
tests/integration_tests/document_loaders/test_paddleocr_loader.py: Integration tests
Modified files:
langchain_paddleocr/__init__.py: Add PaddleOCRLoader export (lazy import for PaddleOCRVLLoader)
langchain_paddleocr/document_loaders/__init__.py: Same
README.md / README_cn.md: Add PaddleOCRLoader usage docs
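For readers unfamiliar with lazy exports in `__init__.py`, one common approach is a module-level `__getattr__` (PEP 562). The sketch below is an assumption about how such an export could be written, not the PR's actual code: `make_lazy_getattr` is a hypothetical helper, and the module path in the commented usage is illustrative only.

```python
import importlib
from typing import Any, Callable, Dict


def make_lazy_getattr(exports: Dict[str, str]) -> Callable[[str], Any]:
    """Build a PEP 562 module-level __getattr__ (hypothetical helper).

    `exports` maps each public name to the absolute path of the
    module that defines it.
    """

    def __getattr__(name: str) -> Any:
        try:
            module_path = exports[name]
        except KeyError:
            raise AttributeError(f"module has no attribute {name!r}") from None
        # The defining module is imported only when the name is first
        # accessed, so heavy dependencies are not loaded at package import.
        module = importlib.import_module(module_path)
        return getattr(module, name)

    return __getattr__


# In a package __init__.py this might be used as follows
# (the module path is an assumption, not taken from the PR):
#
# __getattr__ = make_lazy_getattr(
#     {"PaddleOCRVLLoader": "langchain_paddleocr.document_loaders.paddleocr_vl"}
# )
```

With this pattern, `from package import Name` still works, but a user who only needs one loader does not pay the import cost, or hit a missing optional dependency, of the other.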