Passionate about space exploration and the Apollo missions, I wanted to create something meaningful while learning how to work with modern AI coding agents. The long-term goal of this project is to digitize the transcripts of all Apollo missions with the highest possible quality, and to make them accessible, modern, and easily searchable for everyone.
In hindsight, I would probably have spent less time transcribing the files by hand — but why do things manually when you can automate them in twice the time?
As of early 2026, I do not think 100% fidelity is achievable with my current approach (except with manual correction, of course). OCR is close to perfect, but there is still some noise, and the layout is difficult for AI models to fully interpret. With more advanced models and better prompting, the results might improve, but I currently lack both the knowledge and the budget to experiment with other models. I also have not explored deep learning or machine learning approaches yet, which is still something I need to learn.
That said, I believe I am getting close to my goal. Apollo 11 was the easiest mission to process, because very accurate transcripts are available online (Apollo in Real Time and the Apollo Flight Journals).
The results are available here:
They include JSON, Markdown, HTML, and printable PDF versions (a legacy version as close as possible to the original layout, and a modern version with a new layout designed for contemporary reading).
The code project for the reprinted legacy version is available here: https://github.com/JeanLouisParent/Nasa-Transcript-Printer

The code project for the modern layout and HTML/MD versions is available here:
Feel free to contribute, report issues, or suggest improvements.
What drives this project:
- Deep fascination with space exploration and Apollo mission history
- Hands-on learning with AI-powered development tools (LLMs, vision models, coding agents)
- Making historic space communications more accessible and analyzable
- Building something concrete that combines classic engineering with cutting-edge AI
- Creating a foundation for reprinting these historical documents in a modern physical format
The system uses a two-stage architecture:
- Image Pipeline (Parallel): Extracts pages, processes images, and generates assets.
- Intelligence Pipeline (Sequential): Runs VLM OCR, parses text, and applies corrections.
It is designed to handle the idiosyncrasies of typewriter-era documents, including faint text, margin notes, and timestamp inconsistencies.
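The two-stage split can be sketched as follows. This is a hypothetical illustration, not the project's actual API: the function and stage names are invented, but the structure mirrors the description above, with the image stage fanning out across a thread pool and the intelligence stage walking pages in order.

```python
# Sketch of the two-stage architecture (hypothetical names):
# stage 1 runs per-page image work in parallel, stage 2 runs OCR
# and parsing sequentially so page order and shared state are preserved.
from concurrent.futures import ThreadPoolExecutor


def process_image(page: int) -> str:
    # Placeholder for page extraction / image processing / asset generation.
    return f"assets/page_{page:04d}.png"


def run_ocr(asset: str) -> str:
    # Placeholder for the VLM OCR call, parsing, and corrections.
    return f"text from {asset}"


def run_pipeline(pages: list[int]) -> list[str]:
    # Stage 1 (parallel): image work is independent per page.
    with ThreadPoolExecutor(max_workers=4) as pool:
        assets = list(pool.map(process_image, pages))
    # Stage 2 (sequential): OCR and parsing proceed page by page.
    return [run_ocr(a) for a in assets]
```

Keeping the intelligence stage sequential also avoids saturating a single local LM Studio endpoint with concurrent requests.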
- Python 3.10+
- LM Studio running a vision model (e.g., `qwen/qwen3-vl-4b`)
  - The server must be accessible (default: `http://localhost:1234`)
1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd ocr_transcript_v2
   ```

2. Create a virtual environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
Place your PDF in the `input/` directory (e.g., `input/AS11_TEC.PDF`).

Run the full pipeline:

```bash
python main.py process AS11_TEC.PDF
```

This extracts pages, processes images, runs OCR, and generates structured output.
If you change configuration (e.g., speaker names, text replacements) or code logic, you do not need to re-run the slow OCR step. You can reparse from the cached OCR text in minutes.
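The caching idea can be sketched as follows. This is a hypothetical illustration (the function and file layout are invented, not the project's actual code): because the raw OCR text is already on disk, a config change only requires re-running the cheap text pass, not the vision model.

```python
# Hypothetical sketch of reparsing from cached OCR text: config-driven
# replacements (e.g. speaker-name fixes) are applied to text already on
# disk, so no new OCR call is needed.
from pathlib import Path


def reparse_page(ocr_file: Path, replacements: dict[str, str]) -> str:
    text = ocr_file.read_text(encoding="utf-8")  # cached OCR output
    for wrong, right in replacements.items():
        text = text.replace(wrong, right)  # correction from mission config
    return text
```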
```bash
python main.py reparse AS11_TEC.PDF
```

If you only changed block-level merging logic (no parsing changes), run the fastest update:

```bash
python main.py postprocess AS11_TEC.PDF
```

To regenerate the merged JSON and text files:
```bash
python main.py export AS11_TEC.PDF
```

| Command | Description |
|---|---|
| `process <PDF>` | Run the full pipeline (image + OCR + export) |
| `reparse <PDF>` | Reparse pages from stored OCR text (skip OCR) |
| `postprocess <PDF>` | Post-process existing per-page JSON |
| `export <PDF>` | Regenerate merged JSON |
| `info <PDF>` | Display PDF metadata and page count |
- `-p, --pages TEXT`: Page range (e.g., `1-50`, `10,12-14`).
- `--no-ocr`: Skip the OCR stage (image processing only).
- `--ocr-url TEXT`: Override the LM Studio URL.
- `--clean`: Delete previous output before running.
Outputs are generated in the output/ directory:
```
output/AS11_TEC/
├── AS11_TEC_merged.json             # Complete structured transcript
└── pages/
    └── Page_001/
        ├── AS11_TEC_page_0001.json  # Per-page structured data
        ├── assets/                  # Processed images
        └── ocr/                     # Raw OCR text files
```
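Loading the merged output for analysis needs nothing beyond the standard library. This sketch assumes the default `output/` layout; the actual JSON schema is documented separately (see the Schemas reference below), so the structure of the returned object is not shown here.

```python
# Load the merged transcript JSON for a mission (e.g. "AS11_TEC"),
# assuming the default output/<mission>/<mission>_merged.json layout.
import json
from pathlib import Path


def load_transcript(mission: str) -> dict:
    path = Path("output") / mission / f"{mission}_merged.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```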
Configuration is applied in layers: Defaults -> Mission Overrides -> CLI Arguments.
- Global Defaults: `config/defaults.toml` (I/O paths, OCR model, threading)
- Mission Config: `config/missions.toml` (speaker lists, text replacements, page offsets)
- Prompts: `config/prompts.toml` (OCR instructions)
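The layering rule above can be sketched as a simple merge. This is a minimal illustration with invented names and flat dictionaries (the real TOML files may nest): later layers win, and unset CLI arguments fall through to the layer below.

```python
# Defaults -> Mission Overrides -> CLI Arguments, as plain dict merging
# (hypothetical sketch; the layers stand in for parsed TOML tables).
def merge_config(defaults: dict, mission: dict, cli_args: dict) -> dict:
    config = dict(defaults)
    config.update(mission)  # mission values override defaults
    # CLI flags override everything, but only when actually provided.
    config.update({k: v for k, v in cli_args.items() if v is not None})
    return config
```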
See `docs/CONFIGURATION.md` for details.
- Architecture: System design and module responsibilities.
- Pipeline: Deep dive into image processing and OCR algorithms.
- Post-Processing: Details on parsing, correction, and validation.
- Configuration: Complete configuration reference.
- Schemas: JSON Schema definitions.
MIT License
