A pure-Python library for extracting text and structured content from files commonly found in SharePoint ecosystems:
- Microsoft Office (modern and legacy)
- OpenDocument
- Email formats
- Plain text and config formats
- HTML/EPUB/MHTML
- Archives containing supported files
It also includes an optional SharePoint Graph client (sharepoint_io) for listing/downloading files before extraction.
## Contents

- Why Use This Library
- Install
- Quick Start
- Core Interface
- CLI
- Optional SharePoint Integration
- Supported Formats
- Archive Processing and Security
- Limitations and Caveats
- API Cheat Sheet
- Exceptions
- License
- Disclaimer
- More Usage Examples
## Why Use This Library

- Pure Python (no Java runtime, no LibreOffice subprocesses)
- Unified extraction interface across many file types
- Works with file paths and in-memory bytes
- Suitable for RAG/indexing pipelines where chunking and metadata matter
- Handles both modern and legacy Office formats in one API
## Install

```bash
uv add sharepoint-to-text
```

Optional PDF crypto acceleration:

```bash
uv add "sharepoint-to-text[pdf-crypto]"
```

From source:

```bash
git clone https://github.com/Horsmann/sharepoint-to-text.git
cd sharepoint-to-text
uv sync --all-groups
```

## Quick Start

```python
import sharepoint2text

result = next(sharepoint2text.read_file("document.docx"))
print(result.get_full_text())
```

`read_file(...)` returns a generator. Most files produce one result, but archives and `.mbox` can produce multiple.
Read from in-memory bytes:

```python
import sharepoint2text

payload = b"hello from memory"
result = next(sharepoint2text.read_bytes(payload, extension="txt"))
print(result.get_full_text())
```

Access the full text or structured units:

```python
import sharepoint2text

result = next(sharepoint2text.read_file("report.pdf"))

# Single text blob
full_text = result.get_full_text()

# Structured chunks (page/slide/sheet depending on format)
for unit in result.iterate_units():
    print(unit.get_text())
    print(unit.get_metadata())
```

Serialize to JSON:

```python
import json

import sharepoint2text

result = next(sharepoint2text.read_file("document.docx"))
print(json.dumps(result.to_json()))
```

Restore from JSON:

```python
from sharepoint2text.parsing.extractors.data_types import ExtractionInterface

restored = ExtractionInterface.from_json(result.to_json())
```

## Core Interface

All extracted results implement a common interface (`ExtractionInterface`):

- `get_full_text()`
- `iterate_units()`
- `iterate_images()`
- `iterate_tables()`
- `get_metadata()`
- `to_json()` / `from_json(...)`
Use this interface when you want one pipeline that works across formats.
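Because every result exposes the same method names, downstream code can be written once and reused across formats. A minimal duck-typed sketch (the `FakeUnit`/`FakeResult` stubs below are hypothetical stand-ins for real extraction results; only the method names come from the interface described above):

```python
from dataclasses import dataclass, field


@dataclass
class FakeUnit:
    """Hypothetical stand-in for a structural unit (page/slide/sheet)."""
    text: str

    def get_text(self) -> str:
        return self.text


@dataclass
class FakeResult:
    """Hypothetical stub exposing ExtractionInterface-style method names."""
    units: list = field(default_factory=list)

    def get_full_text(self) -> str:
        return "\n".join(u.get_text() for u in self.units)

    def iterate_units(self):
        yield from self.units


def collect_chunks(result) -> list[str]:
    """Format-agnostic pipeline step: works on anything that
    exposes iterate_units() yielding objects with get_text()."""
    return [u.get_text().strip() for u in result.iterate_units() if u.get_text().strip()]


result = FakeResult(units=[FakeUnit("page one"), FakeUnit("  "), FakeUnit("page two")])
print(collect_chunks(result))  # ['page one', 'page two']
```

The same `collect_chunks` would accept a real extraction result, since it only relies on the shared method names.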
| Goal | Method |
|---|---|
| One string per document | `get_full_text()` |
| Chunk by structure (RAG/citations) | `iterate_units()` |
| All images in a file | `iterate_images()` |
| All tables in a file | `iterate_tables()` |
| Format family | Units yielded |
|---|---|
| Word / text docs (`.docx`, `.doc`, `.odt`, plain text, config files) | Usually one unit |
| Spreadsheets (`.xlsx`, `.xls`, `.ods`) | One unit per sheet |
| Presentations (`.pptx`, `.ppt`, `.odp`) | One unit per slide |
| PDF (`.pdf`) | One unit per page |
| Email (`.eml`, `.msg`) | One unit per email |
| Mailbox (`.mbox`) | Multiple extraction results (one per email) |
Notes:

- Word formats do not store reliable page boundaries, so units are document-level.
- `iterate_units(ignore_images=True)` skips image payloads in unit objects for better performance.
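Units map to natural citation boundaries, but RAG pipelines often cap chunk sizes further. A library-agnostic sketch of overlapping character chunking that could be applied to each unit's text (the sizes are illustrative, not defaults from this library):

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into windows of at most `size` characters,
    each overlapping the previous window by `overlap` characters."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks


pieces = chunk_text("abcdefghij", size=4, overlap=2)
print(pieces)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

In practice you would call this on `unit.get_text()` and keep the unit metadata alongside each chunk for citations.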
## CLI

After installation, the `sharepoint2text` command is available.

Plain text output:

```bash
sharepoint2text --file /path/to/file.docx > extraction.txt
```

JSON output:

```bash
sharepoint2text --file /path/to/file.docx --json > extraction.json
```

| Option | Description |
|---|---|
| `--file FILE`, `-f FILE` | Required input file |
| `--output FILE`, `-o FILE` | Write output to file (default: stdout) |
| `--json`, `-j` | Emit `list[extraction_object]` |
| `--json-unit`, `-u` | Emit `list[unit_object]` |
| `--include-images`, `-i` | Include binary image payloads as base64 in JSON output |
| `--no-attachments`, `-n` | Exclude email attachments from CLI extraction output |
| `--max-file-size-mb`, `-m` | Maximum input size in MiB (default: 100, use 0 to disable) |
| `--version`, `-v` | Print CLI version |
Rules:

- `--json` and `--json-unit` are mutually exclusive.
- `--include-images` requires `--json` or `--json-unit`.
- The CLI enforces a configurable input file limit (default 100 MiB; override with `--max-file-size-mb`/`-m`).
## Optional SharePoint Integration

`sharepoint_io` is optional. It helps list/download files from SharePoint, while extraction still runs through `sharepoint2text`.

```python
import io

import sharepoint2text
from sharepoint2text.sharepoint_io import (
    EntraIDAppCredentials,
    SharePointRestClient,
)

credentials = EntraIDAppCredentials(
    tenant_id="your-tenant-id",
    client_id="your-client-id",
    client_secret="your-client-secret",
)
client = SharePointRestClient(
    site_url="https://contoso.sharepoint.com/sites/Documents",
    credentials=credentials,
)

for file_meta in client.list_all_files():
    data = client.download_file(file_meta.id)
    extractor = sharepoint2text.get_extractor(file_meta.name)
    for result in extractor(io.BytesIO(data), path=file_meta.name):
        print(result.get_full_text()[:200])
```

Setup details: `sharepoint2text/sharepoint_io/SETUP.md`
## Supported Formats

Microsoft Office:

- Modern: `.docx`, `.docm`, `.xlsx`, `.xlsm`, `.xlsb`, `.pptx`, `.pptm`
- Legacy: `.doc`, `.dot`, `.xls`, `.xlt`, `.ppt`, `.pot`, `.pps`, `.rtf`
- Template/show aliases are auto-mapped (for example `.dotx` -> `.docx`, `.ppsx` -> `.pptx`)

OpenDocument:

- `.odt`, `.ods`, `.odp`, `.odg`, `.odf`
- Template aliases supported (`.ott`, `.ots`, `.otp`)

Email:

- `.eml`, `.msg`, `.mbox`
- Email extraction includes sender/recipient metadata, subject, and body (`body_plain`/`body_html`).
- `.eml` and `.msg` parse attachments and store them on `EmailContent.attachments`.
- `.mbox` extraction currently focuses on message headers/body and does not parse or store attachments.
- Parsed supported attachments can be extracted via `EmailContent.iterate_supported_attachments()`.
- If supported-attachment extraction fails, the default behavior is to raise; use `skip_failed=True` to continue.

Plain text and config:

- `.txt`, `.md`, `.csv`, `.tsv`, `.json`
- `.yaml`, `.yml`, `.xml`, `.log`, `.ini`, `.cfg`, `.conf`, `.properties`

HTML/EPUB/MHTML:

- `.html`, `.htm`, `.mhtml`, `.mht`, `.epub`

PDF:

- `.pdf`

Archives:

- `.zip`, `.tar`, `.7z`
- Compressed tar aliases: `.tar.gz`/`.tgz`, `.tar.bz2`/`.tbz2`, `.tar.xz`/`.txz`
- `.gz`, `.bz2`, `.xz` are routed as compressed tar variants
## Archive Processing and Security

Archives are processed one level deep. Supported non-archive files inside the archive can yield extraction results. Nested archives are intentionally skipped as a safety guard.

Built-in safeguards include zip-bomb protections and file size limits. For 7z, extraction is limited to 100 MB archives. Archive entries may also be skipped when they exceed internal per-entry size limits or fail extraction.
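The library's internal thresholds are not part of its public API, but a zip-bomb guard in the same spirit can be sketched with the standard library (the ratio and total-size limits below are illustrative, not the library's actual values):

```python
import io
import zipfile

MAX_RATIO = 100       # illustrative: reject > 100x declared expansion
MAX_TOTAL = 1 << 30   # illustrative: reject > 1 GiB declared uncompressed size


def looks_like_zip_bomb(data: bytes) -> bool:
    """Heuristic guard: sum the declared uncompressed sizes of all
    entries and compare against the archive's compressed size."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        total = sum(info.file_size for info in zf.infolist())
    ratio = total / max(len(data), 1)
    return total > MAX_TOTAL or ratio > MAX_RATIO


# A small, benign archive passes the check
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", "hello world")
print(looks_like_zip_bomb(buf.getvalue()))  # False
```

Real protections also need per-entry limits during decompression, since declared sizes in the central directory can lie.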
## Limitations and Caveats

- No OCR. Scanned-image PDFs may return empty text.
- Structured table extraction is not implemented for PDF (`iterate_tables()` is empty).
- Password-protected PDFs (non-empty password) raise `ExtractionFileEncryptedError`.
- Some JBIG2 images need `jbig2dec` installed for image decoding.
- Inputs are expected to be already decrypted. If a file has encryption, DRM, password protection, or similar security controls, remove/unlock those before calling `sharepoint2text`.
- Very large or highly compressed files may hit protection limits.
- Raise limits only for trusted inputs.
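Rather than raising the limit, you can also pre-check sizes yourself and skip oversized files. A stdlib-only sketch (`within_limit` is a hypothetical helper; the 100 MiB default mirrors the documented one):

```python
import os
import tempfile

DEFAULT_LIMIT = 100 * 1024 * 1024  # 100 MiB, the documented default


def within_limit(path: str, limit: int = DEFAULT_LIMIT) -> bool:
    """Return True if the file fits under the size limit, so it can be
    handed to the extractor without tripping the size safeguard."""
    return os.path.getsize(path) <= limit


# Demonstrate with a 1 KiB temp file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 1024)
print(within_limit(tmp.name))             # True: 1 KiB is under 100 MiB
print(within_limit(tmp.name, limit=512))  # False: over a 512-byte limit
os.unlink(tmp.name)
```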
## API Cheat Sheet

```python
import sharepoint2text

sharepoint2text.read_file(
    path,
    max_file_size=100 * 1024 * 1024,
    ignore_images=False,
    force_plain_text=False,
)

sharepoint2text.read_bytes(
    data,
    extension="pdf",   # or ".pdf"
    mime_type=None,    # e.g. "application/pdf"
    max_file_size=100 * 1024 * 1024,
    ignore_images=False,
    force_plain_text=False,
)

sharepoint2text.is_supported_file(path)
sharepoint2text.get_extractor(path)
```

Format-specific extractors:

- Office/OpenDocument: `read_docx`, `read_doc`, `read_xlsx`, `read_xls`, `read_pptx`, `read_ppt`, `read_rtf`, `read_odt`, `read_ods`, `read_odp`, `read_odg`, `read_odf`
- Other documents: `read_pdf`, `read_html`, `read_epub`, `read_mhtml`, `read_plain_text`
- Email: `read_eml_email`, `read_msg_email`, `read_mbox_email`

All extractor functions accept a binary stream plus an optional path and return generators.

Email helper API:

- `EmailContent.iterate_supported_attachments(skip_failed=False)` extracts supported parsed attachments on demand (primarily from `.eml`/`.msg`).
## Exceptions

Common exceptions:

- `ExtractionFileFormatNotSupportedError`
- `ExtractionFileEncryptedError`
- `ExtractionFileTooLargeError`
- `ExtractionLegacyMicrosoftParsingError`
- `ExtractionZipBombError`
- `ExtractionFailedError`
## License

Apache 2.0. See `LICENSE`.

## Disclaimer

This project is not affiliated with, endorsed by, or sponsored by Microsoft.
## More Usage Examples

Email with attachments (`.eml`):

```python
import sharepoint2text

email = next(sharepoint2text.read_file("message-with-attachments.eml"))
print(email.subject)
print(email.get_full_text())  # plain body if available, otherwise HTML body
print(f"Attachment count: {len(email.attachments)}")

# Extract supported attachment types (pdf, docx, pptx, etc.)
for attachment_result in email.iterate_supported_attachments():
    print(type(attachment_result).__name__)
    print(attachment_result.get_full_text()[:200])
```

Outlook message (`.msg`), skipping attachments that fail extraction:

```python
import sharepoint2text

email = next(sharepoint2text.read_file("message-with-attachments.msg"))
for attachment_result in email.iterate_supported_attachments(skip_failed=True):
    print(attachment_result.get_metadata().filename)
```

Mailbox (`.mbox`), one result per email:

```python
import sharepoint2text

for email in sharepoint2text.read_file("team-archive.mbox"):
    print(f"Subject: {email.subject}")
    print(email.get_full_text()[:200])
```

Indexing a directory tree:

```python
from pathlib import Path

import sharepoint2text

for path in Path("docs").rglob("*"):
    if not path.is_file() or not sharepoint2text.is_supported_file(path):
        continue
    for result in sharepoint2text.read_file(path):
        meta = result.get_metadata()
        for unit in result.iterate_units(ignore_images=True):
            chunk = unit.get_text().strip()
            if chunk:
                payload = {
                    "text": chunk,
                    "source": str(path),
                    "filename": meta.filename,
                    "unit_number": getattr(unit.get_metadata(), "unit_number", None),
                }
                # store payload in your index/vector DB
```

Extracting from raw bytes:

```python
import sharepoint2text

# Example: bytes from HTTP response
data = get_file_bytes_somehow()
result = next(
    sharepoint2text.read_bytes(
        data,
        mime_type="application/pdf",
        ignore_images=True,
    )
)
print(result.get_full_text()[:500])
```