57 changes: 57 additions & 0 deletions langchain-paddleocr/README.md
@@ -41,6 +41,63 @@ for doc in docs[:2]:
    print("---")
```

### `PaddleOCRLoader`

The `PaddleOCRLoader` wraps the **local** PaddleOCR library to extract text from PDF and image files — no cloud API or access token required.

It supports two modes:

- **Basic OCR** (default) — fast text extraction using PP-OCRv5.
- **Structure mode** — layout-aware extraction (tables, titles, figures) using PP-StructureV3.

#### Basic OCR

```python
from langchain_paddleocr import PaddleOCRLoader

loader = PaddleOCRLoader(file_path="path/to/document.pdf")
docs = loader.load()

for doc in docs:
    print(f"Page {doc.metadata['page']}: {doc.page_content[:100]}...")
    print(f"Confidence: {doc.metadata['confidence']:.2f}")
```
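
Because each document carries a `confidence` value in its metadata, low-quality pages can be dropped before indexing. The following is a minimal sketch, not part of the package: `PageDoc` is a hypothetical stand-in for the loaded documents so the snippet runs without PaddleOCR installed, `keep_confident` is an illustrative helper, and the `0.8` threshold and sample values are arbitrary assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Hypothetical stand-in for a loaded document (not the LangChain class)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def keep_confident(docs, min_confidence=0.8):
    """Return only documents whose metadata confidence meets the threshold."""
    # Documents without a confidence value are treated as 0.0 and dropped.
    return [d for d in docs if d.metadata.get("confidence", 0.0) >= min_confidence]


# Made-up sample data; real values would come from loader.load().
docs = [
    PageDoc("Invoice 2024-001", {"page": 0, "confidence": 0.97}),
    PageDoc("smudged text", {"page": 1, "confidence": 0.41}),
]

kept = keep_confident(docs)
print([d.metadata["page"] for d in kept])
```

The same helper should work on the real loader output as long as each document exposes `metadata["confidence"]` as a float, which is what the formatting in the example above suggests.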

#### Structure mode

```python
from langchain_paddleocr import PaddleOCRLoader
from langchain_paddleocr.document_loaders.paddleocr import PaddleOCRConfig

config = PaddleOCRConfig(lang="en", use_table_recognition=True)
loader = PaddleOCRLoader(
    file_path=["page1.png", "page2.png"],
    use_structure=True,
    config=config,
)

for doc in loader.lazy_load():
    print(doc.page_content)
    print(doc.metadata["layout_blocks"])
```

#### Configuration

Use `PaddleOCRConfig` to pass engine parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| `lang` | `str` | Language code (`"ch"`, `"en"`, `"fr"`, etc.) |
| `ocr_version` | `str` | Pipeline version (`"PP-OCRv3"`, `"PP-OCRv4"`, `"PP-OCRv5"`) |
| `use_doc_orientation_classify` | `bool` | Enable document orientation classification |
| `use_doc_unwarping` | `bool` | Enable document de-warping |
| `text_det_thresh` | `float` | Detection confidence threshold |
| `text_rec_score_thresh` | `float` | Recognition confidence threshold |
| `use_table_recognition` | `bool` | Enable table recognition (structure mode) |
| `use_chart_recognition` | `bool` | Enable chart recognition (structure mode) |

See the full list in `PaddleOCRConfig`.

## 📖 Documentation

For full documentation, see the [LangChain Docs](https://docs.langchain.com/oss/python/integrations/providers/baidu).
57 changes: 57 additions & 0 deletions langchain-paddleocr/README_cn.md
@@ -42,6 +42,63 @@ for doc in docs[:2]:
```


### `PaddleOCRLoader`

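`PaddleOCRLoader` wraps the **local** PaddleOCR library to extract text from PDF and image files, with no cloud API or access token required.

It supports two modes:

- **Basic OCR** (default): fast text extraction using PP-OCRv5.
- **Structure mode**: layout-aware extraction (tables, titles, figures, etc.) using PP-StructureV3.

#### Basic OCR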

```python
from langchain_paddleocr import PaddleOCRLoader

loader = PaddleOCRLoader(file_path="path/to/document.pdf")
docs = loader.load()

for doc in docs:
    print(f"Page {doc.metadata['page']}: {doc.page_content[:100]}...")
    print(f"Confidence: {doc.metadata['confidence']:.2f}")
```

#### Structure mode

```python
from langchain_paddleocr import PaddleOCRLoader
from langchain_paddleocr.document_loaders.paddleocr import PaddleOCRConfig

config = PaddleOCRConfig(lang="ch", use_table_recognition=True)
loader = PaddleOCRLoader(
    file_path=["page1.png", "page2.png"],
    use_structure=True,
    config=config,
)

for doc in loader.lazy_load():
    print(doc.page_content)
    print(doc.metadata["layout_blocks"])
```

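#### Configuration

Use `PaddleOCRConfig` to pass engine parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| `lang` | `str` | Language code (`"ch"`, `"en"`, `"fr"`, etc.) |
| `ocr_version` | `str` | Pipeline version (`"PP-OCRv3"`, `"PP-OCRv4"`, `"PP-OCRv5"`) |
| `use_doc_orientation_classify` | `bool` | Enable document orientation classification |
| `use_doc_unwarping` | `bool` | Enable document de-warping |
| `text_det_thresh` | `float` | Detection confidence threshold |
| `text_rec_score_thresh` | `float` | Recognition confidence threshold |
| `use_table_recognition` | `bool` | Enable table recognition (structure mode) |
| `use_chart_recognition` | `bool` | Enable chart recognition (structure mode) |

For the complete list of parameters, see `PaddleOCRConfig`.

## 📖 Documentation

For full documentation, see the [LangChain Docs](https://docs.langchain.com/oss/python/integrations/providers/baidu).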
13 changes: 11 additions & 2 deletions langchain-paddleocr/langchain_paddleocr/__init__.py
@@ -1,3 +1,12 @@
-from .document_loaders import PaddleOCRVLLoader
+from .document_loaders import PaddleOCRLoader

-__all__ = ["PaddleOCRVLLoader"]
+__all__ = ["PaddleOCRLoader", "PaddleOCRVLLoader"]
+
+
+def __getattr__(name: str) -> object:
+    if name == "PaddleOCRVLLoader":
+        from .document_loaders import PaddleOCRVLLoader
+
+        return PaddleOCRVLLoader
+    msg = f"module {__name__!r} has no attribute {name!r}"
+    raise AttributeError(msg)
@@ -1,3 +1,12 @@
-from .paddleocr_vl import PaddleOCRVLLoader
+from .paddleocr import PaddleOCRLoader

-__all__ = ["PaddleOCRVLLoader"]
+__all__ = ["PaddleOCRLoader", "PaddleOCRVLLoader"]
+
+
+def __getattr__(name: str) -> object:
+    if name == "PaddleOCRVLLoader":
+        from .paddleocr_vl import PaddleOCRVLLoader
+
+        return PaddleOCRVLLoader
+    msg = f"module {__name__!r} has no attribute {name!r}"
+    raise AttributeError(msg)
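
Both `__init__.py` changes rely on a module-level `__getattr__` (PEP 562), which Python calls only when normal attribute lookup on the module fails. Presumably the goal is to defer importing `PaddleOCRVLLoader` until it is first requested. The sketch below illustrates that mechanism in isolation; `demo_pkg`, `HeavyLoader`, and `load_log` are hypothetical names built in memory for the demonstration and are not part of this package.

```python
import types

# Hypothetical in-memory module standing in for a package's __init__.py.
demo = types.ModuleType("demo_pkg")

# Records each time the lazy hook runs, so the deferral is observable.
load_log = []


def _lazy_getattr(name: str) -> object:
    """Resolve 'HeavyLoader' on demand; reject every other missing name."""
    if name == "HeavyLoader":
        load_log.append(name)

        # In a real package this would be a deferred `from .x import HeavyLoader`.
        class HeavyLoader:
            pass

        return HeavyLoader
    msg = f"module 'demo_pkg' has no attribute {name!r}"
    raise AttributeError(msg)


# Placing the hook in the module namespace enables PEP 562 lookup.
demo.__getattr__ = _lazy_getattr

print(load_log)  # the hook has not run yet
loader_cls = demo.HeavyLoader  # first access triggers the hook
print(load_log)
```

Note that this hook does not cache its result, so it runs on every access. In the diff above, repeated calls stay cheap because the deferred import is served from `sys.modules` after the first load.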