
heripo-lab/archaeological-informatization-poc


"A Study on Archaeological Informatization Using Large Language Models (LLMs) - Proof of Concept for an Automated Metadata Extraction Pipeline from Archaeological Excavation Reports -" PoC (Snapshot)


This repository is a "snapshot" of the PoC code implemented in the paper "A Study on Archaeological Informatization Using Large Language Models (LLMs) - Proof of Concept for an Automated Metadata Extraction Pipeline from Archaeological Excavation Reports -", published in "Heritage: History and Science" Vol. 58 No. 3 (September 30, 2025). The source code is released for academic and technical transparency. However, this repository is preserved as a record of the research at its point in time, and no further development or third-party contributions are accepted.

heripo engine - Production-Grade Engine Release

Following the PoC research in this paper, on January 28, 2026, we released the heripo engine open source project, advancing the core technology further.

Key Improvements

Compared to this PoC, the following advancements have been made:

  • OCR Support: High-quality OCR via Docling SDK enables processing of scanned documents
  • Apple Silicon Optimization: GPU acceleration on M1/M2/M3/M4/M5 chips
  • Production-Grade Design: TypeScript-based monorepo with 100% test coverage
  • Extensible Pipeline: Source data extraction -> data ledger -> standardization -> ontology
  • Multi-LLM Support: Compatible with OpenAI, Anthropic, Google, and other LLM providers

heripo engine takes the concepts validated in this PoC beyond proof-of-concept into a production-ready system. See the heripo engine repository for details.

Snapshot Notice and Important Notes (Please Read)

  • This code is a PoC focused on validating the research idea and feasibility. It does not guarantee production-level stability, security, or code quality.
  • This repository is not maintained; issues and PRs are not accepted. The purpose is to publicly share the research record.
  • Some code (especially the UI/frontend) was rapidly written with the assistance of LLM tools and does not follow refined design patterns. Experimental structures exist, such as logic concentrated in single page files.
    • I am a senior software engineer specializing in web frontend development.
    • The code quality in this repository does not represent my frontend engineering expertise.
  • TypeScript/Next.js was chosen for the researcher's familiarity and experimental convenience. For production use, choose a language/framework better suited to your needs.
  • This code was experimentally built around the following three sample reports and is not optimized for other report formats. That said, it generally works for most reports that do not require OCR.
    • 백제역사문화연구원, 2025, 『부여 화지산 백제과원 및 둘레길 조성사업부지내 유적』.
    • 일영문화유산연구원, 2025, 『제주 항파두리 항몽유적 내성지(7차)』.
    • 겨레문화유산연구원, 2025, 『공주 석장리 구석기 유적(14차)』.
  • OCR is not supported. Note that even recent reports may have text rendered as outlines, requiring OCR.
  • Test only in a local environment. Do not deploy to a server or run in a publicly accessible environment.
    • Security: Unexpected security issues may arise from dependency updates/vulnerabilities.
    • Cost: OpenAI API call costs can be significant (potentially tens of thousands of KRW per run). Run only locally and at a testing level.

Open Source Purpose and Disclaimer

  • The primary purpose of releasing this repository is research transparency and sharing the academic record. Releasing the code does not constitute a recommendation or requirement to run it.
  • However, since anyone can execute the code once it is public, minimal execution instructions are provided.
  • All risks and responsibilities arising from running this code — including security, cost, data protection, and legal issues — rest entirely with the user. The repository owner assumes no liability whatsoever.
  • Running this code is advised only for users with basic software and coding knowledge and experience, and only at a limited, testing scale.

Follow-Up Open Source Projects

After the research in this paper, core technologies have been advanced through various open source projects.

Official Follow-Up Research Project (Planned)

As a direct follow-up to this paper, a project to extract metadata from archaeological excavation reports and generate academically meaningful data is under development. The concepts validated in this PoC will be refined to production level to create a tool that can substantively contribute to the digital transformation of archaeological research.

  • Status: Under development
  • Release Plan: Will be open-sourced upon reaching a production-ready level
  • Goal: Automated metadata extraction and structuring system for archaeological excavation reports

Related Open Source Projects

These projects extend the core concept of this paper (LLM-based document processing pipeline) into other domains. While not direct follow-ups to the academic paper, they are practical outputs derived from the research process and are continuously maintained.

LLM Newsletter Kit

A general-purpose AI newsletter automation engine. It extends the metadata extraction pipeline concept developed in this paper into a type-safe toolkit that automates the entire process: crawling -> analysis -> content generation -> storage.

Heripo Research Radar

A cultural heritage AI newsletter service. As a real-world deployment and reference implementation of LLM Newsletter Kit, it collects content from 62 crawling targets to automatically generate weekly newsletters.

We will continue to advance these core technologies and build them into shared assets for both academia and industry.


Research Background and Paper Summary

This section has been moved to a separate file under docs for readability. See the full content at the link below.


OS Support

  • Supported: macOS, Linux
  • Not supported: Windows native environment (running directly in Windows PowerShell/Command Prompt)
  • Windows users are advised to use WSL (Windows Subsystem for Linux).
  • Reason for no Windows native support
    • This project was originally developed for personal research on macOS.
    • The primary purpose of open-sourcing is research transparency and sharing the research record; running the code is not recommended for users without professional development knowledge. No additional compatibility work (especially for Windows native) was done during the open-source preparation.
    • Docker was considered but not provided, as it could be an additional barrier for Windows users.
  • Contact: If you have a compelling reason to run this, please email [email protected].

Quick Start

The following instructions are for "local one-time testing" only. Use for deployment or production purposes is prohibited.

  1. Required Software
  2. Clone the Repository and Install Dependencies
  • After cloning the repository, run the following from the project root:
    • npm install
    • The automatically executed preinstall scripts perform the following:
      • Download data/embedding files: scripts/download-glossary-embeddings.js -> saves data/glossary-embeddings.json (large file)
      • Create Python virtual environment and install dependencies: scripts/setup-python-env.js -> creates .venv and installs packages from src/modules/pdf-process/requirements.txt
  3. Environment Variable Setup (Important)
  • Copy .env.local.example from the root to create a .env.local file, then enter your OpenAI API key.
    • Setup guide: OpenAI API Key Setup Guide
    • Example:
      • OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxxxxxxxx
    • Important: API key management and usage/cost control are the sole responsibility of the user. The repository owner assumes no liability.
  • Never commit or expose sensitive information. .env.local is included in .gitignore.
  4. Build and Run (production build recommended over dev server)
  5. Runtime Warnings (Reiterated)
  • Test only locally in a limited scope. Do not upload to or deploy on a server.
  • Be aware of OpenAI API costs. Token usage can spike significantly when processing large PDFs.
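The steps above amount to roughly the following commands, assuming the repository URL matches the repository name and that the standard Next.js build/start scripts are defined in package.json (both assumptions; check the repository before running):

```shell
# Clone and install (preinstall hooks download embeddings and set up .venv)
git clone https://github.com/heripo-lab/archaeological-informatization-poc.git
cd archaeological-informatization-poc
npm install

# Configure the OpenAI API key (never commit this file)
cp .env.local.example .env.local
# edit .env.local and set OPENAI_API_KEY=sk-...

# Production build and local run (assumed standard Next.js scripts)
npm run build
npm run start
```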

Source Code Overview and Processing Flow

The PoC operates as a pipeline: "PDF upload -> PDF text/image extraction (Python) -> image-caption mapping (rule/LLM) -> body analysis and structuring (LLM) -> SQLite storage -> interactive Q&A (LLM + glossary)".
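The pipeline above can be sketched as a chain of typed steps. This is an illustrative sketch only; every name below is hypothetical and does not correspond to the repository's actual module APIs:

```typescript
// Hypothetical stage signatures for the PoC pipeline (illustrative sketch only).
type PdfExtract = { pages: number; texts: string[] };
type CaptionMap = { imageId: string; caption: string }[];
type StructuredReport = { sites: unknown[]; trenches: unknown[] };

// Each stage consumes the previous stage's output, mirroring the
// extract -> map captions -> structure -> store flow described above.
async function runPipeline(
  extract: (pdfPath: string) => Promise<PdfExtract>,
  mapCaptions: (x: PdfExtract) => Promise<CaptionMap>,
  structure: (x: PdfExtract, captions: CaptionMap) => Promise<StructuredReport>,
  store: (r: StructuredReport) => Promise<void>,
  pdfPath: string
): Promise<StructuredReport> {
  const extracted = await extract(pdfPath);
  const captions = await mapCaptions(extracted);
  const report = await structure(extracted, captions);
  await store(report);
  return report;
}
```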

  1. Frontend (UI)
  • File upload/processing page: src/app/page.tsx
    • PDF upload (up to 500MB), page type (PageType: single-sided/double-sided), actual first page (realFirstPage) input
    • After upload, calls the standardization API (/api/report/standardize) with processing status polling
  2. Upload API
  • POST /api/report/upload -> Saves uploaded files to uploads/ in the project root and returns a reportId
    • Implementation: src/app/api/report/upload/route.ts
  3. Standardize API
  • POST /api/report/standardize -> Processes uploaded PDFs for structuring
    • Implementation: src/app/api/report/standardize/route.ts
    • Parameters: reportId, pageType (1/2), realFirstPage (offset)
    • Internally calls standardizeReport(): src/app/api/report/standardize/standardize-report.ts
  4. PDF Processing (Python)
  • Node invokes Python scripts: src/modules/pdf-process/index.ts
    • Calls src/modules/pdf-process/pdf-extractor.py via .venv/bin/python
    • Note: Windows native environment is not supported. Windows users should run this in a WSL (Ubuntu) environment.
    • Output (JSON): text/image coordinates, page information, etc.
  • Python dependencies: src/modules/pdf-process/requirements.txt
  5. Image-Caption Mapping
  • Rule-based (default): src/modules/caption-extract-mapper/captionExtractMapperWithRule.ts
  • LLM-based (optional): src/modules/caption-extract-mapper/captionExtractMapperWithLLM.ts
  • Common entry: src/modules/caption-extract-mapper/index.ts
  6. Body Analysis and Data Construction (LLM)
  • Main entry: src/modules/make-data/index.ts
    • Excavated site overview extraction: makeExcavatedSiteData.ts -> estimates the start of "investigation content" based on the table of contents, links images based on captions, includes cumulative merge logic
    • Trench/feature/artifact extraction: makeTrenchFeatureArtifactData.ts -> analyzes in 2-page segments with cumulative merge and retry logic
  • Common LLM call utility: src/libs/open-ai.ts (requires OPENAI_API_KEY)
  • JSON parsing utility: src/utils/extract-pure-json.ts (strips code blocks from LLM output; preserves raw on error)
  7. Data Storage (SQLite)
  • DB insertion: src/modules/insert-database/index.ts
    • File: data/excavation.db (created if absent)
    • Performs deduplication, missing ID cleanup, and referential integrity mapping before inserting into excavated_sites, trenches, features, artifacts
  8. Result Files (Debugging/Records)
  • Intermediate and final output JSON saved to public/pdf-result/
    • e.g., ${reportId}.json, ${reportId}-db-*.json
  9. Q&A (LLM + Glossary)
  • API: POST /api/chat -> Answers based on DB content, supplemented by Korean archaeology glossary (embedding search) when needed
    • Implementation: src/app/api/chat/route.ts
    • Glossary search: src/libs/glossarySearchEngine.ts (embeddings: data/glossary-embeddings.json)
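The behavior described for the JSON parsing utility (extract-pure-json) can be illustrated with a minimal sketch: strip a Markdown code fence from LLM output, then fall back to the raw string if the result is not valid JSON. This is a hypothetical reimplementation, not the code in src/utils/extract-pure-json.ts:

```typescript
// Strip a leading ```json / ``` fence and a trailing ``` fence, then
// validate with JSON.parse; on parse failure, preserve the raw output.
function extractPureJson(raw: string): string {
  const stripped = raw
    .replace(/^\s*```(?:json)?\s*\n?/, "")
    .replace(/\n?```\s*$/, "")
    .trim();
  try {
    JSON.parse(stripped);
    return stripped; // valid JSON after fence removal
  } catch {
    return raw; // preserve the raw LLM output on error
  }
}
```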
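The cumulative-merge idea used when analyzing 2-page segments (makeTrenchFeatureArtifactData.ts) can be sketched as id-based merging of each segment's results into an accumulator. The interface and field names below are illustrative assumptions; the actual merge logic in the repository is more involved:

```typescript
// Merge one segment's extraction results into the accumulated list,
// deduplicating by id; later segments may fill in missing detail.
interface Feature {
  id: string;
  name: string;
  description?: string;
}

function mergeSegment(accumulated: Feature[], segment: Feature[]): Feature[] {
  const byId = new Map(accumulated.map((f) => [f.id, f]));
  for (const item of segment) {
    const existing = byId.get(item.id);
    if (existing) {
      // Keep previously extracted fields when the new mention omits them.
      byId.set(item.id, {
        ...existing,
        ...item,
        description: item.description ?? existing.description,
      });
    } else {
      byId.set(item.id, item);
    }
  }
  return [...byId.values()];
}
```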
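The glossary embedding search (glossarySearchEngine.ts over data/glossary-embeddings.json) can be illustrated as a cosine-similarity top-k lookup over precomputed vectors. This is a minimal sketch under that assumption, not the repository's implementation:

```typescript
// Top-k nearest glossary entries by cosine similarity over precomputed embeddings.
interface GlossaryEntry {
  term: string;
  embedding: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function searchGlossary(query: number[], entries: GlossaryEntry[], k: number): GlossaryEntry[] {
  // Sort a copy by descending similarity and take the top k.
  return [...entries]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, k);
}
```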

Environment Variables (.env.local)

  • OPENAI_API_KEY: OpenAI API key (required)
    • Setup guide: OpenAI API Key Setup Guide
    • Important: API key management and usage/cost control are the sole responsibility of the user. The repository owner assumes no liability.
  • Refer to .env.local.example to create your .env.local file. Never commit sensitive information to a public repository.

Data/File Paths

  • Uploaded PDFs: uploads/ (temporary storage)
  • Debug/result JSON: public/pdf-result/
  • SQLite DB: data/excavation.db
  • Glossary embeddings: data/glossary-embeddings.json (auto-downloaded during preinstall)

Cost and Security Warnings

  • Cost: Costs increase rapidly as LLM call volume grows. Run only in a limited, testing capacity.
  • Security: Potential vulnerabilities may exist due to dependency issues. Do not deploy externally; run only in a local environment.

Dependency Snapshot and Compatibility

  • This repository includes dependency snapshots from the time of paper submission:
    • package.snapshot.json, package-lock.snapshot.json
  • The current package.json versions may reflect minimal patch updates. The snapshots are provided for reproducibility reference. Installation/execution results may vary depending on the environment.

Known Limitations

  • Image-based (scanned) PDFs where text extraction is impossible are not supported.
  • Extraction quality may degrade if report formats vary significantly.
  • LLM responses are probabilistic; some fields may be missed or incorrectly extracted.
  • The frontend structure is experimental and does not follow refined design patterns.

License

  • MIT License (see LICENSE in the repository root)
  • Note: The purpose of this repository is academic transparency and research record sharing. Use in production/operational environments is not recommended.

About

A proof-of-concept (PoC) project for an informatization pipeline that automatically extracts and structures metadata from archaeological excavation report PDFs using large language models (LLMs).
