sharepoint-to-text

A pure-Python library for extracting text and structured content from files commonly found in SharePoint ecosystems:

  • Microsoft Office (modern and legacy)
  • OpenDocument
  • PDF
  • Email formats
  • Plain text and config formats
  • HTML/EPUB/MHTML
  • Archives containing supported files

It also includes an optional SharePoint Graph client (sharepoint_io) for listing/downloading files before extraction.


Why Use This Library

  • Pure Python (no Java runtime, no LibreOffice subprocesses)
  • Unified extraction interface across many file types
  • Works with file paths and in-memory bytes
  • Suitable for RAG/indexing pipelines where chunking and metadata matter
  • Handles both modern and legacy Office formats in one API

Install

uv add sharepoint-to-text

Optional PDF crypto acceleration:

uv add "sharepoint-to-text[pdf-crypto]"

From source:

git clone https://github.com/Horsmann/sharepoint-to-text.git
cd sharepoint-to-text
uv sync --all-groups

Quick Start

1) Read any supported local file

import sharepoint2text

result = next(sharepoint2text.read_file("document.docx"))
print(result.get_full_text())

read_file(...) returns a generator. Most files produce one result, but archives and .mbox can produce multiple.

2) Read bytes already in memory

import sharepoint2text

payload = b"hello from memory"
result = next(sharepoint2text.read_bytes(payload, extension="txt"))
print(result.get_full_text())

3) Choose chunking strategy

import sharepoint2text

result = next(sharepoint2text.read_file("report.pdf"))

# Single text blob
full_text = result.get_full_text()

# Structured chunks (page/slide/sheet depending on format)
for unit in result.iterate_units():
    print(unit.get_text())
    print(unit.get_metadata())

4) Serialize results

import json
import sharepoint2text

result = next(sharepoint2text.read_file("document.docx"))
print(json.dumps(result.to_json()))

Restore from JSON:

from sharepoint2text.parsing.extractors.data_types import ExtractionInterface

restored = ExtractionInterface.from_json(result.to_json())

Core Interface

All extracted results implement a common interface (ExtractionInterface):

  • get_full_text()
  • iterate_units()
  • iterate_images()
  • iterate_tables()
  • get_metadata()
  • to_json() / from_json(...)

Use this interface when you want one pipeline that works across formats.
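Because every result exposes the same methods, downstream code can be written once against the interface. A minimal duck-typed sketch (the stub classes below are illustrative stand-ins, not classes from this library):

```python
# Works with any object exposing the ExtractionInterface methods.
def collect_chunks(result):
    """Return non-empty text chunks paired with their unit metadata."""
    chunks = []
    for unit in result.iterate_units():
        text = unit.get_text().strip()
        if text:
            chunks.append((text, unit.get_metadata()))
    return chunks

# Illustrative stand-ins for a real extraction result:
class StubUnit:
    def __init__(self, text, meta):
        self._text, self._meta = text, meta
    def get_text(self):
        return self._text
    def get_metadata(self):
        return self._meta

class StubResult:
    def iterate_units(self):
        yield StubUnit("Page one", {"unit_number": 1})
        yield StubUnit("  ", {"unit_number": 2})  # blank units are dropped

print(collect_chunks(StubResult()))  # [('Page one', {'unit_number': 1})]
```

The same collect_chunks works unchanged for PDFs, spreadsheets, or slide decks, because it touches only the interface methods.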

Which text method should you use?

Goal                                 Method
One string per document              get_full_text()
Chunk by structure (RAG/citations)   iterate_units()
All images in a file                 iterate_images()
All tables in a file                 iterate_tables()

What iterate_units() means by format

Format family                                                    Units yielded
Word / text docs (.docx, .doc, .odt, plain text, config files)   Usually one unit
Spreadsheets (.xlsx, .xls, .ods)                                 One unit per sheet
Presentations (.pptx, .ppt, .odp)                                One unit per slide
PDF                                                              One unit per page
Email (.eml, .msg)                                               One unit per email
Mailbox (.mbox)                                                  Multiple extraction results (one per email)

Notes:

  • Word formats do not store reliable page boundaries, so units are document-level.
  • iterate_units(ignore_images=True) skips image payloads in unit objects for better performance.

CLI

After installation, the sharepoint2text command is available.

Plain text output:

sharepoint2text --file /path/to/file.docx > extraction.txt

JSON output:

sharepoint2text --file /path/to/file.docx --json > extraction.json

Options

Option                   Description
--file FILE, -f FILE     Required input file
--output FILE, -o FILE   Write output to a file (default: stdout)
--json, -j               Emit list[extraction_object]
--json-unit, -u          Emit list[unit_object]
--include-images, -i     Include binary image payloads as base64 in JSON output
--no-attachments, -n     Exclude email attachments from CLI extraction output
--max-file-size-mb, -m   Maximum input size in MiB (default: 100; use 0 to disable)
--version, -v            Print the CLI version

Rules:

  • --json and --json-unit are mutually exclusive.
  • --include-images requires --json or --json-unit.
  • CLI enforces a configurable input file limit (default 100 MiB; override with --max-file-size-mb / -m).
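The options combine as you would expect; for example, per-unit JSON with image payloads written to a file (file names illustrative):

```shell
sharepoint2text --file slides.pptx --json-unit --include-images --output units.json
```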

Optional SharePoint Integration

sharepoint_io is optional. It helps list/download files from SharePoint, while extraction still runs through sharepoint2text.

import io
import sharepoint2text
from sharepoint2text.sharepoint_io import (
    EntraIDAppCredentials,
    SharePointRestClient,
)

credentials = EntraIDAppCredentials(
    tenant_id="your-tenant-id",
    client_id="your-client-id",
    client_secret="your-client-secret",
)

client = SharePointRestClient(
    site_url="https://contoso.sharepoint.com/sites/Documents",
    credentials=credentials,
)

for file_meta in client.list_all_files():
    data = client.download_file(file_meta.id)
    extractor = sharepoint2text.get_extractor(file_meta.name)
    for result in extractor(io.BytesIO(data), path=file_meta.name):
        print(result.get_full_text()[:200])

Setup details: sharepoint2text/sharepoint_io/SETUP.md

Supported Formats

Microsoft Office

  • Modern: .docx, .docm, .xlsx, .xlsm, .xlsb, .pptx, .pptm
  • Legacy: .doc, .dot, .xls, .xlt, .ppt, .pot, .pps, .rtf
  • Template/show aliases are auto-mapped (for example .dotx -> .docx, .ppsx -> .pptx)

OpenDocument

  • .odt, .ods, .odp, .odg, .odf
  • Template aliases supported (.ott, .ots, .otp)

Email

  • .eml, .msg, .mbox
  • Email extraction includes sender/recipient metadata, subject, and body (body_plain / body_html).
  • .eml and .msg parse attachments and store them on EmailContent.attachments.
  • .mbox extraction currently focuses on message headers/body and does not parse/store attachments.
  • Parsed supported attachments can be extracted via EmailContent.iterate_supported_attachments().
  • If supported-attachment extraction fails, the default behavior is to raise; use skip_failed=True to continue.

Plain text and config/data

  • .txt, .md, .csv, .tsv, .json
  • .yaml, .yml, .xml, .log, .ini, .cfg, .conf, .properties

Web and ebook

  • .html, .htm, .mhtml, .mht, .epub

PDF

  • .pdf

Archives

  • .zip, .tar, .7z
  • Compressed tar aliases: .tar.gz/.tgz, .tar.bz2/.tbz2, .tar.xz/.txz
  • .gz, .bz2, .xz are routed as compressed tar variants

Archive Processing and Security

Archives are processed one level deep. Supported non-archive files inside the archive can yield extraction results. Nested archives are intentionally skipped as a safety guard.

Built-in safeguards include zip-bomb protections and file size limits. For 7z, extraction is limited to archives of 100 MB or less. Archive entries may also be skipped when they exceed internal per-entry size limits or when their extraction fails.
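Conceptually, a zip-bomb guard compares an archive's uncompressed size against its compressed size. The sketch below illustrates that general technique only; the names, thresholds, and logic here are illustrative and are not this library's actual internals:

```python
import io
import zipfile

# Illustrative thresholds, not the library's real limits.
MAX_RATIO = 100                 # reject if uncompressed/compressed exceeds this
MAX_TOTAL = 100 * 1024 * 1024   # cap on total uncompressed bytes

def looks_like_zip_bomb(data: bytes) -> bool:
    """Flag archives whose declared sizes suggest a decompression bomb."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        total_uncompressed = sum(i.file_size for i in zf.infolist())
        total_compressed = sum(i.compress_size for i in zf.infolist())
    if total_uncompressed > MAX_TOTAL:
        return True
    if total_compressed and total_uncompressed / total_compressed > MAX_RATIO:
        return True
    return False

# A small archive of ordinary text passes the check.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("note.txt", "hello " * 10)
print(looks_like_zip_bomb(buf.getvalue()))  # False
```

Note that this only inspects the sizes declared in the archive directory; a robust guard also enforces limits while actually streaming the decompressed bytes.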

Limitations and Caveats

PDF

  • No OCR. Scanned-image PDFs may return empty text.
  • Structured table extraction is not implemented for PDF (iterate_tables() is empty).
  • Password-protected PDFs (non-empty password) raise ExtractionFileEncryptedError.
  • Some JBIG2 images need jbig2dec installed for image decoding.

General

  • Inputs are expected to be already decrypted. If a file has encryption, DRM, password protection, or similar security controls, remove/unlock those before calling sharepoint2text.
  • Very large or highly compressed files may hit protection limits.
  • Raise limits only for trusted inputs.
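read_file accepts a max_file_size parameter in bytes (see the API cheat sheet), so raising the limit for a vetted input looks like this (the file name is illustrative):

```python
import sharepoint2text

# Trusted input only: raise the cap from the 100 MiB default to 500 MiB.
result = next(
    sharepoint2text.read_file(
        "vetted-large-export.xlsx",
        max_file_size=500 * 1024 * 1024,
    )
)
```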

API Cheat Sheet

Main entry points

import sharepoint2text

sharepoint2text.read_file(
    path,
    max_file_size=100 * 1024 * 1024,
    ignore_images=False,
    force_plain_text=False,
)

sharepoint2text.read_bytes(
    data,
    extension="pdf",      # or ".pdf"
    mime_type=None,        # e.g. "application/pdf"
    max_file_size=100 * 1024 * 1024,
    ignore_images=False,
    force_plain_text=False,
)

sharepoint2text.is_supported_file(path)
sharepoint2text.get_extractor(path)

Format-specific extractors (selected)

  • Office/OpenDocument: read_docx, read_doc, read_xlsx, read_xls, read_pptx, read_ppt, read_rtf, read_odt, read_ods, read_odp, read_odg, read_odf
  • Other documents: read_pdf, read_html, read_epub, read_mhtml, read_plain_text
  • Email: read_eml_email, read_msg_email, read_mbox_email

All extractor functions accept a binary stream plus optional path and return generators.

Email helper API:

  • EmailContent.iterate_supported_attachments(skip_failed=False) extracts supported parsed attachments on demand (primarily from .eml/.msg).

Exceptions

Common exceptions:

  • ExtractionFileFormatNotSupportedError
  • ExtractionFileEncryptedError
  • ExtractionFileTooLargeError
  • ExtractionLegacyMicrosoftParsingError
  • ExtractionZipBombError
  • ExtractionFailedError
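A batch pipeline typically catches these and moves on. The sketch below assumes the exception classes can be imported from the package root; check your installed version for the actual import path:

```python
import sharepoint2text
# Assumed import location; adjust to where the exceptions live in your install.
from sharepoint2text import (
    ExtractionFailedError,
    ExtractionFileEncryptedError,
    ExtractionFileTooLargeError,
)

def extract_or_none(path):
    """Return the first extraction result, or None if the file is skipped."""
    try:
        return next(sharepoint2text.read_file(path))
    except ExtractionFileEncryptedError:
        print(f"skipped (encrypted): {path}")
    except ExtractionFileTooLargeError:
        print(f"skipped (too large): {path}")
    except ExtractionFailedError:
        print(f"skipped (extraction failed): {path}")
    return None
```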

License

Apache 2.0. See LICENSE.

Disclaimer

This project is not affiliated with, endorsed by, or sponsored by Microsoft.

More Usage Examples

Extract email body plus supported attachments

import sharepoint2text

email = next(sharepoint2text.read_file("message-with-attachments.eml"))

print(email.subject)
print(email.get_full_text())  # plain body if available, otherwise HTML body
print(f"Attachment count: {len(email.attachments)}")

# Extract supported attachment types (pdf, docx, pptx, etc.)
for attachment_result in email.iterate_supported_attachments():
    print(type(attachment_result).__name__)
    print(attachment_result.get_full_text()[:200])

Continue even if a supported attachment fails to extract

import sharepoint2text

email = next(sharepoint2text.read_file("message-with-attachments.msg"))

for attachment_result in email.iterate_supported_attachments(skip_failed=True):
    print(attachment_result.get_metadata().filename)

Process a mailbox (.mbox) and read message bodies

import sharepoint2text

for email in sharepoint2text.read_file("team-archive.mbox"):
    print(f"Subject: {email.subject}")
    print(email.get_full_text()[:200])

Batch-extract units for RAG-style chunking

from pathlib import Path
import sharepoint2text

for path in Path("docs").rglob("*"):
    if not path.is_file() or not sharepoint2text.is_supported_file(path):
        continue
    for result in sharepoint2text.read_file(path):
        meta = result.get_metadata()
        for unit in result.iterate_units(ignore_images=True):
            chunk = unit.get_text().strip()
            if chunk:
                payload = {
                    "text": chunk,
                    "source": str(path),
                    "filename": meta.filename,
                    "unit_number": getattr(unit.get_metadata(), "unit_number", None),
                }
                # store payload in your index/vector DB

Extract from API bytes when you only know MIME type

import sharepoint2text

# Example: bytes from HTTP response
data = get_file_bytes_somehow()

result = next(
    sharepoint2text.read_bytes(
        data,
        mime_type="application/pdf",
        ignore_images=True,
    )
)
print(result.get_full_text()[:500])
