# "A Study on Archaeological Informatization Using Large Language Models (LLMs) - Proof of Concept for an Automated Metadata Extraction Pipeline from Archaeological Excavation Reports -" PoC (Snapshot)
English | 한국어
This repository is a snapshot of the PoC code implemented in the paper "A Study on Archaeological Informatization Using Large Language Models (LLMs) - Proof of Concept for an Automated Metadata Extraction Pipeline from Archaeological Excavation Reports -", published in *Heritage: History and Science* Vol. 58, No. 3 (September 30, 2025). The source code is released for academic and technical transparency. However, the repository is preserved as a record of the research at its point in time; no further development or third-party contributions are accepted.
- Download the Paper
- Digital Appendix
- Paper Summary Video
- PoC Demo Video
- heripo engine - Production-Grade Open Source Engine
- heripo engine Online Demo
- Subscribe to Research Radar (heripo)
Following the PoC research in this paper, on January 28, 2026, we released the heripo engine open source project, advancing the core technology further.
- GitHub: https://github.com/heripo-lab/heripo-engine
- Online Demo: https://engine-demo.heripo.com (try it without local installation)
Compared to this PoC, the following advancements have been made:
- OCR Support: High-quality OCR via Docling SDK enables processing of scanned documents
- Apple Silicon Optimization: GPU acceleration on M1/M2/M3/M4/M5 chips
- Production-Grade Design: TypeScript-based monorepo with 100% test coverage
- Extensible Pipeline: Source data extraction -> data ledger -> standardization -> ontology
- Multi-LLM Support: Compatible with OpenAI, Anthropic, Google, and other LLM providers
heripo engine takes the concepts validated in this PoC beyond proof-of-concept into a production-ready system. See the heripo engine repository for details.
- This code is a PoC focused on validating the research idea and feasibility. It does not guarantee production-level stability, security, or code quality.
- This repository is not maintained; issues and PRs are not accepted. The purpose is to publicly share the research record.
- Some code (especially the UI/frontend) was rapidly written with the assistance of LLM tools and does not follow refined design patterns. Experimental structures exist, such as logic concentrated in single page files.
- I am a senior software engineer specializing in web frontend development.
- The code quality in this repository does not represent my frontend engineering expertise.
- TypeScript/Next.js was chosen for the researcher's familiarity and experimental convenience. For production use, choose a language/framework better suited to your needs.
- This code was built experimentally around the following three sample reports and is not optimized for other report formats. However, it generally works for most reports that do not require OCR.
- 백제역사문화연구원, 2025, 『부여 화지산 백제과원 및 둘레길 조성사업부지내 유적』.
- 일영문화유산연구원, 2025, 『제주 항파두리 항몽유적 내성지(7차)』.
- 겨레문화유산연구원, 2025, 『공주 석장리 구석기 유적(14차)』.
- OCR is not supported. Note that even recent reports may have text rendered as outlines, requiring OCR.
- Test only in a local environment. Do not deploy to a server or run in a publicly accessible environment.
- Security: Unexpected security issues may arise from dependency updates/vulnerabilities.
- Cost: OpenAI API call costs can be significant (potentially tens of thousands of KRW per run). Run only locally and at a testing level.
- The primary purpose of releasing this repository is research transparency and sharing the academic record. Releasing the code does not constitute a recommendation or requirement to run it.
- However, since anyone can execute the code once it is public, minimal execution instructions are provided.
- All risks and responsibilities arising from running this code — including security, cost, data protection, and legal issues — rest entirely with the user. The repository owner assumes no liability whatsoever.
- Only users with basic software development knowledge and experience should run this, and only at a limited scale.
After the research in this paper, core technologies have been advanced through various open source projects.
As a direct follow-up to this paper, a project to extract metadata from archaeological excavation reports and generate academically meaningful data is under development. The concepts validated in this PoC will be refined to production level to create a tool that can substantively contribute to the digital transformation of archaeological research.
- Status: Under development
- Release Plan: Will be open-sourced upon reaching a production-ready level
- Goal: Automated metadata extraction and structuring system for archaeological excavation reports
These projects extend the core concept of this paper (LLM-based document processing pipeline) into other domains. While not direct follow-ups to the academic paper, they are practical outputs derived from the research process and are continuously maintained.
A general-purpose AI newsletter automation engine. It extends the metadata extraction pipeline concept developed in this paper into a type-safe toolkit that automates the entire process: crawling -> analysis -> content generation -> storage.
- Repository: https://github.com/heripo-lab/llm-newsletter-kit-core
- npm Package: `@llm-newsletter-kit/core`
- Features: TypeScript-based, all components swappable via the Provider pattern, 100% test coverage
- License: Apache-2.0
A cultural heritage AI newsletter service. As a real-world deployment and reference implementation of LLM Newsletter Kit, it collects content from 62 crawling targets to automatically generate weekly newsletters.
- Repository: https://github.com/heripo-lab/heripo-research-radar
- npm Package: `@heripo/research-radar`
- Service: https://heripo.com/research-radar/subscribe
- Metrics: $0.2-1 cost per issue, 24/7 fully automated, 15% CTR
- License: Apache-2.0
We will continue to advance these core technologies and build them into shared assets for both academia and industry.
This section has been moved to a separate file under docs for readability. See the full content at the link below.
- Supported: macOS, Linux
- Not supported: Windows native environment (running directly in Windows PowerShell/Command Prompt)
- Windows users are advised to use WSL (Windows Subsystem for Linux).
- What is WSL? A feature that lets you run a lightweight Linux environment on Windows, without a separate heavyweight virtual machine.
- WSL Installation Guide: https://learn.microsoft.com/en-us/windows/wsl/install
- About WSL: https://learn.microsoft.com/en-us/windows/wsl/about
- Reason for no Windows native support
- This project was originally developed for personal research on macOS.
- The primary purpose of open-sourcing is to share research transparency; running the code is not recommended for users without professional development knowledge. No additional compatibility work (especially for Windows native) was done during the open-source preparation.
- Docker was considered but not provided, as it could be an additional barrier for Windows users.
- Contact: If you have a compelling reason to run this, please email [email protected].
The following instructions are for "local one-time testing" only. Use for deployment or production purposes is prohibited.
- Required Software
- Node.js: LTS version recommended (not heavily version-dependent)
- Installation guide (GUI-focused): Node.js Installation Guide
- Python: Must be installed on your system
- Installation guide (GUI-focused): Python Installation Guide
- Clone the Repository and Install Dependencies
- After cloning the repository, run the following from the project root:
    `npm install`
  - The automatically executed preinstall scripts perform the following:
    - Download data/embedding files: `scripts/download-glossary-embeddings.js` -> saves `data/glossary-embeddings.json` (large file)
    - Create a Python virtual environment and install dependencies: `scripts/setup-python-env.js` -> creates `.venv` and installs packages from `src/modules/pdf-process/requirements.txt`
- Environment Variable Setup (Important)
  - Copy `.env.local.example` from the root to create a `.env.local` file, then enter your OpenAI API key.
    - Setup guide: OpenAI API Key Setup Guide
    - Example: `OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxxxxxxxx`
  - Important: API key management and usage/cost control are the sole responsibility of the user. The repository owner assumes no liability.
  - Never commit or expose sensitive information. `.env.local` is included in `.gitignore`.
- Build and Run (production build recommended over dev server)
  - Build: `npm run build`
  - Run: `npm start`
  - Open in browser: http://localhost:3000
- Runtime Warnings (Reiterated)
- Test only locally in a limited scope. Do not upload to or deploy on a server.
- Be aware of OpenAI API costs. Token usage can spike significantly when processing large PDFs.
The PoC operates as a pipeline: "PDF upload -> PDF text/image extraction (Python) -> image-caption mapping (rule/LLM) -> body analysis and structuring (LLM) -> SQLite storage -> interactive Q&A (LLM + glossary)".
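Since each stage is a sequential transformation of the previous stage's output, the overall shape can be sketched as a typed chain of async stages. The stage names and stub payloads below are illustrative only; the real stages are the modules described in this section.

```typescript
// Illustrative sketch of the pipeline shape: each stage is an async
// transformation, and stages compose left to right. Stage names and
// stub payloads are hypothetical, not the actual module APIs.
type Stage<I, O> = (input: I) => Promise<O>;

function chain<A, B, C>(first: Stage<A, B>, second: Stage<B, C>): Stage<A, C> {
  return async (input: A) => second(await first(input));
}

// Hypothetical stubs standing in for PDF extraction and caption mapping.
const extractPdf: Stage<string, { pages: number }> = async () => ({ pages: 3 });
const mapCaptions: Stage<{ pages: number }, { captions: number }> = async (doc) => ({
  captions: doc.pages,
});

const pipeline: Stage<string, { captions: number }> = chain(extractPdf, mapCaptions);

pipeline("report.pdf").then((result) => console.log(result.captions)); // prints 3
```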
- Frontend (UI)
  - File upload/processing page: `src/app/page.tsx`
    - PDF upload (up to 500 MB), page type (`PageType`: single-sided/double-sided), and actual first page (`realFirstPage`) input
    - After upload, calls the standardization API (`/api/report/standardize`) and polls for processing status
- Upload API
  - `POST /api/report/upload` -> saves uploaded files to `uploads/` in the project root and returns a `reportId`
  - Implementation: `src/app/api/report/upload/route.ts`
- Standardize API
  - `POST /api/report/standardize` -> processes uploaded PDFs for structuring
  - Implementation: `src/app/api/report/standardize/route.ts`
  - Parameters: `reportId`, `pageType` (1/2), `realFirstPage` (offset)
  - Internally calls `standardizeReport()`: `src/app/api/report/standardize/standardize-report.ts`
- PDF Processing (Python)
  - Node invokes Python scripts: `src/modules/pdf-process/index.ts`
    - Calls `src/modules/pdf-process/pdf-extractor.py` via `.venv/bin/python`
    - Note: The Windows native environment is not supported. Windows users should run this in a WSL (Ubuntu) environment.
    - Output (JSON): text/image coordinates, page information, etc.
  - Python dependencies: `src/modules/pdf-process/requirements.txt`
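The Node-to-Python handoff follows a common pattern: spawn the interpreter, capture stdout, and parse it as JSON. A minimal sketch of that pattern is below; it uses the current Node binary as a stand-in interpreter so the example runs anywhere, whereas the real module spawns `.venv/bin/python` with `pdf-extractor.py`. The function name is hypothetical.

```typescript
import { spawnSync } from "node:child_process";

// Run an external interpreter synchronously and parse its JSON stdout.
// The same shape would apply to `.venv/bin/python pdf-extractor.py <file>`.
function runAndParseJson(interpreterPath: string, args: string[]): unknown {
  const result = spawnSync(interpreterPath, args, { encoding: "utf8" });
  if (result.status !== 0) {
    throw new Error(`interpreter exited with ${result.status}: ${result.stderr}`);
  }
  return JSON.parse(result.stdout);
}

// Demo: use the Node binary itself as the "interpreter".
const out = runAndParseJson(process.execPath, [
  "-e",
  'console.log(JSON.stringify({ pages: 3 }))',
]) as { pages: number };
console.log(out.pages); // 3
```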
- Image-Caption Mapping
  - Rule-based (default): `src/modules/caption-extract-mapper/captionExtractMapperWithRule.ts`
  - LLM-based (optional): `src/modules/caption-extract-mapper/captionExtractMapperWithLLM.ts`
  - Common entry: `src/modules/caption-extract-mapper/index.ts`
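To illustrate what a rule-based mapper of this kind can look like, here is a sketch built on one plausible heuristic: on each page, assign every image the nearest caption whose top edge lies below the image. The field names and the rule itself are assumptions for illustration; the actual heuristics in `captionExtractMapperWithRule.ts` may differ.

```typescript
// Hypothetical rule: pair each image with the nearest caption block
// below it on the same page, using the extracted coordinates.
interface Box {
  page: number;
  top: number;
  bottom: number;
}
interface Caption extends Box {
  text: string;
}

function mapCaptions(images: Box[], captions: Caption[]): (string | null)[] {
  return images.map((img) => {
    const below = captions
      .filter((c) => c.page === img.page && c.top >= img.bottom)
      .sort((a, b) => a.top - b.top); // nearest caption first
    return below.length > 0 ? below[0].text : null;
  });
}

const images: Box[] = [{ page: 1, top: 100, bottom: 300 }];
const captions: Caption[] = [
  { page: 1, top: 310, bottom: 330, text: "Figure 1. Trench 1, east wall" },
  { page: 1, top: 600, bottom: 620, text: "Figure 2. Artifact photo" },
];
console.log(mapCaptions(images, captions));
```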
- Body Analysis and Data Construction (LLM)
  - Main entry: `src/modules/make-data/index.ts`
    - Excavated site overview extraction: `makeExcavatedSiteData.ts` -> estimates the start of the "investigation content" section from the table of contents, links images via captions, and includes cumulative merge logic
    - Trench/feature/artifact extraction: `makeTrenchFeatureArtifactData.ts` -> analyzes the report in 2-page segments with cumulative merge and retry logic
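Because a long report is analyzed in 2-page segments, later segments can repeat or extend entities already seen, so partial results must be merged cumulatively. A minimal sketch of that idea, keyed by entity name and with later non-empty fields filling earlier gaps, follows; the field names and merge policy are assumptions, not the exact logic in `makeTrenchFeatureArtifactData.ts`.

```typescript
// Hypothetical cumulative merge: fold each segment's partial results
// into an accumulator keyed by entity name.
interface Feature {
  name: string;
  type?: string;
  description?: string;
}

function stripEmpty(f: Feature): Partial<Feature> {
  return Object.fromEntries(
    Object.entries(f).filter(([, v]) => v !== undefined && v !== ""),
  ) as Partial<Feature>;
}

function mergeSegment(acc: Map<string, Feature>, segment: Feature[]): Map<string, Feature> {
  for (const f of segment) {
    const existing = acc.get(f.name);
    // Later segments fill in fields missing from earlier ones.
    acc.set(f.name, existing ? { ...existing, ...stripEmpty(f) } : f);
  }
  return acc;
}

const acc = new Map<string, Feature>();
mergeSegment(acc, [{ name: "Pit 1", type: "storage pit" }]); // segment 1
mergeSegment(acc, [{ name: "Pit 1", description: "circular, 1.2 m diameter" }]); // segment 2
console.log(acc.get("Pit 1"));
```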
  - Common LLM call utility: `src/libs/open-ai.ts` (requires `OPENAI_API_KEY`)
  - JSON parsing utility: `src/utils/extract-pure-json.ts` (strips code blocks from LLM output; preserves the raw string on parse failure)
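The contract described for the JSON parsing utility, stripping a Markdown code-fence wrapper from LLM output and falling back to the raw string when nothing parses, can be sketched as below. This is an illustration of the contract, not the exact implementation in `src/utils/extract-pure-json.ts`.

```typescript
// Sketch: remove a ```json ... ``` fence wrapper, return the inner JSON
// string if it parses, otherwise return the raw input unchanged.
function extractPureJson(raw: string): string {
  const fenced = raw.match(/```(?:json)?\s*([\s\S]*?)\s*```/);
  const candidate = fenced ? fenced[1] : raw.trim();
  try {
    JSON.parse(candidate);
    return candidate;
  } catch {
    return raw; // preserve the raw output on parse failure
  }
}

console.log(extractPureJson('```json\n{"site": "Seokjang-ri"}\n```'));
```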
- Data Storage (SQLite)
  - DB insertion: `src/modules/insert-database/index.ts`
    - File: `data/excavation.db` (created if absent)
    - Performs deduplication, missing-ID cleanup, and referential-integrity mapping before inserting into `excavated_sites`, `trenches`, `features`, and `artifacts`
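Deduplication with referential-integrity mapping typically means collapsing duplicate parent rows and rewriting child foreign keys to the surviving parent, dropping children whose parent is missing. The sketch below shows that pattern with hypothetical trench/feature shapes; the actual logic lives in `src/modules/insert-database/index.ts` and may differ.

```typescript
// Hypothetical dedup + FK remap: collapse trenches sharing a name,
// rewrite feature.trenchId to the surviving trench, drop orphans.
interface Trench {
  id: number;
  name: string;
}
interface FeatureRow {
  id: number;
  trenchId: number;
  name: string;
}

function dedupeAndRemap(trenches: Trench[], features: FeatureRow[]) {
  const survivorByName = new Map<string, number>();
  const idRemap = new Map<number, number>();
  const kept: Trench[] = [];
  for (const t of trenches) {
    const survivor = survivorByName.get(t.name);
    if (survivor === undefined) {
      survivorByName.set(t.name, t.id);
      idRemap.set(t.id, t.id);
      kept.push(t);
    } else {
      idRemap.set(t.id, survivor); // duplicate: point at first occurrence
    }
  }
  const remapped = features
    .filter((f) => idRemap.has(f.trenchId)) // drop rows with missing parents
    .map((f) => ({ ...f, trenchId: idRemap.get(f.trenchId)! }));
  return { trenches: kept, features: remapped };
}

const cleaned = dedupeAndRemap(
  [{ id: 1, name: "Trench 1" }, { id: 2, name: "Trench 1" }],
  [{ id: 10, trenchId: 2, name: "Pit A" }, { id: 11, trenchId: 9, name: "Orphan" }],
);
console.log(cleaned);
```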
- Result Files (Debugging/Records)
  - Intermediate and final output JSON is saved to `public/pdf-result/`
    - e.g., `${reportId}.json`, `${reportId}-db-*.json`
- Q&A (LLM + Glossary)
  - API: `POST /api/chat` -> answers based on DB content, supplemented by a Korean archaeology glossary (embedding search) when needed
  - Implementation: `src/app/api/chat/route.ts`
  - Glossary search: `src/libs/glossarySearchEngine.ts` (embeddings: `data/glossary-embeddings.json`)
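Embedding search of this kind usually ranks glossary entries by cosine similarity between a query vector and each entry's precomputed vector. A minimal sketch with tiny 3-dimensional vectors follows; the real engine loads OpenAI embeddings from `data/glossary-embeddings.json`, and the types and function names here are assumptions.

```typescript
// Minimal embedding-search sketch: rank glossary entries by cosine
// similarity to a query vector and return the top-k terms.
interface GlossaryEntry {
  term: string;
  embedding: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function topK(query: number[], entries: GlossaryEntry[], k: number): string[] {
  return entries
    .map((e) => ({ term: e.term, score: cosine(query, e.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((e) => e.term);
}

// Toy 3-d vectors standing in for real embedding vectors.
const glossary: GlossaryEntry[] = [
  { term: "trench", embedding: [1, 0, 0] },
  { term: "posthole", embedding: [0, 1, 0] },
  { term: "stratum", embedding: [0.9, 0.1, 0] },
];
console.log(topK([1, 0.05, 0], glossary, 2));
```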
- `OPENAI_API_KEY`: OpenAI API key (required)
  - Setup guide: OpenAI API Key Setup Guide
  - Important: API key management and usage/cost control are the sole responsibility of the user. The repository owner assumes no liability.
- Refer to `.env.local.example` to create your `.env.local` file. Never commit sensitive information to a public repository.
- Uploaded PDFs: `uploads/` (temporary storage)
- Debug/result JSON: `public/pdf-result/`
- SQLite DB: `data/excavation.db`
- Glossary embeddings: `data/glossary-embeddings.json` (auto-downloaded during preinstall)
- Cost: Costs increase rapidly as LLM call volume grows. Run only in a limited, testing capacity.
- Security: Potential vulnerabilities may exist due to dependency issues. Do not deploy externally; run only in a local environment.
- This repository includes dependency snapshots from the time of paper submission: `package.snapshot.json`, `package-lock.snapshot.json`
- The current `package.json` versions may include minimal patch updates. The snapshots are provided as a reproducibility reference; installation and execution results may vary by environment.
- Image-based (scanned) PDFs where text extraction is impossible are not supported.
- Extraction quality may degrade if report formats vary significantly.
- LLM responses are probabilistic; some fields may be missed or incorrectly extracted.
- The frontend structure is experimental and does not follow refined design patterns.
- MIT License (see LICENSE in the repository root)
- Note: The purpose of this repository is academic transparency and research record sharing. Use in production/operational environments is not recommended.