
heripo-lab/archaeological-informatization-poc


"A Study on Archaeological Informatization Using Large Language Models (LLMs) - Proof of Concept for an Automated Metadata Extraction Pipeline from Archaeological Excavation Reports -" PoC (Snapshot)


This repository is a "snapshot" of the PoC code implemented in the paper "A Study on Archaeological Informatization Using Large Language Models (LLMs) - Proof of Concept for an Automated Metadata Extraction Pipeline from Archaeological Excavation Reports -", published in "Heritage: History and Science" Vol. 58 No. 3 (September 30, 2025). The source code is released for academic and technical transparency. However, this repository is preserved as a record of the research at its point in time, and no further development or third-party contributions are accepted.

heripo engine - Production-Grade Engine Release

Following the PoC research in this paper, on January 28, 2026, we released the heripo engine open source project, advancing the core technology further.

Key Improvements

Compared to this PoC, the following advancements have been made:

  • OCR Support: High-quality OCR via Docling SDK enables processing of scanned documents
  • Apple Silicon Optimization: GPU acceleration on M1/M2/M3/M4/M5 chips
  • Production-Grade Design: TypeScript-based monorepo with 100% test coverage
  • Extensible Pipeline: Source data extraction -> data ledger -> standardization -> ontology
  • Multi-LLM Support: Compatible with OpenAI, Anthropic, Google, and other LLM providers

heripo engine takes the concepts validated in this PoC beyond proof-of-concept into a production-ready system. See the heripo engine repository for details.

Snapshot Notice and Important Notes (Please Read)

  • This code is a PoC focused on validating the research idea and feasibility. It does not guarantee production-level stability, security, or code quality.
  • This repository is not maintained; issues and PRs are not accepted. The purpose is to publicly share the research record.
  • Some code (especially the UI/frontend) was rapidly written with the assistance of LLM tools and does not follow refined design patterns. Experimental structures exist, such as logic concentrated in single page files.
    • I am a senior software engineer specializing in web frontend development.
    • The code quality in this repository does not represent my frontend engineering expertise.
  • TypeScript/Next.js was chosen for the researcher's familiarity and experimental convenience. For production use, choose a language/framework better suited to your needs.
  • This code was experimentally built around the following three sample reports and is not optimized for other report formats. That said, it generally works for most reports that do not require OCR.
    • 백제역사문화연구원, 2025, 『부여 화지산 백제과원 및 둘레길 조성사업부지내 유적』.
    • 일영문화유산연구원, 2025, 『제주 항파두리 항몽유적 내성지(7차)』.
    • 겨레문화유산연구원, 2025, 『공주 석장리 구석기 유적(14차)』.
  • OCR is not supported. Note that even recent reports may have text rendered as outlines, requiring OCR.
  • Test only in a local environment. Do not deploy to a server or run in a publicly accessible environment.
    • Security: Unexpected security issues may arise from dependency updates/vulnerabilities.
    • Cost: OpenAI API call costs can be significant (potentially tens of thousands of KRW per run). Run only locally and at a testing level.

Open Source Purpose and Disclaimer

  • The primary purpose of releasing this repository is research transparency and sharing the academic record. Releasing the code does not constitute a recommendation or requirement to run it.
  • However, since anyone can execute the code once it is public, minimal execution instructions are provided.
  • All risks and responsibilities arising from running this code — including security, cost, data protection, and legal issues — rest entirely with the user. The repository owner assumes no liability whatsoever.
  • Running this code is advised only for users with basic software and coding knowledge and experience, and only at a limited, testing scale.

Follow-Up Open Source Projects

After the research in this paper, core technologies have been advanced through various open source projects.

Official Follow-Up Research Project (Planned)

As a direct follow-up to this paper, a project to extract metadata from archaeological excavation reports and generate academically meaningful data is under development. The concepts validated in this PoC will be refined to production level to create a tool that can substantively contribute to the digital transformation of archaeological research.

  • Status: Under development
  • Release Plan: Will be open-sourced upon reaching a production-ready level
  • Goal: Automated metadata extraction and structuring system for archaeological excavation reports

Related Open Source Projects

These projects extend the core concept of this paper (LLM-based document processing pipeline) into other domains. While not direct follow-ups to the academic paper, they are practical outputs derived from the research process and are continuously maintained.

LLM Newsletter Kit

A general-purpose AI newsletter automation engine. It extends the metadata extraction pipeline concept developed in this paper into a type-safe toolkit that automates the entire process: crawling -> analysis -> content generation -> storage.

Heripo Research Radar

A cultural heritage AI newsletter service. As a real-world deployment and reference implementation of LLM Newsletter Kit, it collects content from 62 crawling targets to automatically generate weekly newsletters.

We will continue to advance these core technologies and build them into shared assets for both academia and industry.


Research Background and Paper Summary

This section has been moved to a separate file under docs for readability. See the full content at the link below.


OS Support

  • Supported: macOS, Linux
  • Not supported: Windows native environment (running directly in Windows PowerShell/Command Prompt)
  • Windows users are advised to use WSL (Windows Subsystem for Linux).
  • Reason for no Windows native support
    • This project was originally developed for personal research on macOS.
    • The primary purpose of open-sourcing is research transparency and sharing the research record; running the code is not recommended for users without professional development knowledge. No additional compatibility work (especially for Windows native) was done during the open-source preparation.
    • Docker was considered but not provided, as it could be an additional barrier for Windows users.
  • Contact: If you have a compelling reason to run this, please email [email protected].

Quick Start

The following instructions are for "local one-time testing" only. Use for deployment or production purposes is prohibited.

  1. Required Software
  2. Clone the Repository and Install Dependencies
  • After cloning the repository, run the following from the project root:
    • npm install
    • The automatically executed preinstall scripts perform the following:
      • Download data/embedding files: scripts/download-glossary-embeddings.js -> saves data/glossary-embeddings.json (large file)
      • Create Python virtual environment and install dependencies: scripts/setup-python-env.js -> creates .venv and installs packages from src/modules/pdf-process/requirements.txt
  3. Environment Variable Setup (Important)
  • Copy .env.local.example from the root to create a .env.local file, then enter your OpenAI API key.
    • Setup guide: OpenAI API Key Setup Guide
    • Example:
      • OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxxxxxxxx
    • Important: API key management and usage/cost control are the sole responsibility of the user. The repository owner assumes no liability.
  • Never commit or expose sensitive information. .env.local is included in .gitignore.
  4. Build and Run (production build recommended over dev server)
  5. Runtime Warnings (Reiterated)
  • Test only locally in a limited scope. Do not upload to or deploy on a server.
  • Be aware of OpenAI API costs. Token usage can spike significantly when processing large PDFs.
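The steps above amount to roughly the following commands, assuming the repository URL matches the repository name and that the standard Next.js build/start scripts are defined in package.json (both assumptions; check the repository before running):

```shell
# Clone and install (preinstall hooks download embeddings and set up .venv)
git clone https://github.com/heripo-lab/archaeological-informatization-poc.git
cd archaeological-informatization-poc
npm install

# Configure the OpenAI API key (never commit this file)
cp .env.local.example .env.local
# edit .env.local and set OPENAI_API_KEY=sk-...

# Production build and local run (assumed standard Next.js scripts)
npm run build
npm run start
```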

Source Code Overview and Processing Flow

The PoC operates as a pipeline: "PDF upload -> PDF text/image extraction (Python) -> image-caption mapping (rule/LLM) -> body analysis and structuring (LLM) -> SQLite storage -> interactive Q&A (LLM + glossary)".
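The pipeline above can be sketched as a chain of typed steps. This is an illustrative sketch only; every name below is hypothetical and does not correspond to the repository's actual module APIs:

```typescript
// Hypothetical stage signatures for the PoC pipeline (illustrative sketch only).
type PdfExtract = { pages: number; texts: string[] };
type CaptionMap = { imageId: string; caption: string }[];
type StructuredReport = { sites: unknown[]; trenches: unknown[] };

// Each stage consumes the previous stage's output, mirroring the
// extract -> map captions -> structure -> store flow described above.
async function runPipeline(
  extract: (pdfPath: string) => Promise<PdfExtract>,
  mapCaptions: (x: PdfExtract) => Promise<CaptionMap>,
  structure: (x: PdfExtract, captions: CaptionMap) => Promise<StructuredReport>,
  store: (r: StructuredReport) => Promise<void>,
  pdfPath: string
): Promise<StructuredReport> {
  const extracted = await extract(pdfPath);
  const captions = await mapCaptions(extracted);
  const report = await structure(extracted, captions);
  await store(report);
  return report;
}
```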

  1. Frontend (UI)
  • File upload/processing page: src/app/page.tsx
    • PDF upload (up to 500MB), page type (PageType: single-sided/double-sided), actual first page (realFirstPage) input
    • After upload, calls the standardization API (/api/report/standardize) with processing status polling
  2. Upload API
  • POST /api/report/upload -> Saves uploaded files to uploads/ in the project root and returns a reportId
    • Implementation: src/app/api/report/upload/route.ts
  3. Standardize API
  • POST /api/report/standardize -> Processes uploaded PDFs for structuring
    • Implementation: src/app/api/report/standardize/route.ts
    • Parameters: reportId, pageType (1/2), realFirstPage (offset)
    • Internally calls standardizeReport(): src/app/api/report/standardize/standardize-report.ts
  4. PDF Processing (Python)
  • Node invokes Python scripts: src/modules/pdf-process/index.ts
    • Calls src/modules/pdf-process/pdf-extractor.py via .venv/bin/python
    • Note: Windows native environment is not supported. Windows users should run this in a WSL (Ubuntu) environment.
    • Output (JSON): text/image coordinates, page information, etc.
  • Python dependencies: src/modules/pdf-process/requirements.txt
  5. Image-Caption Mapping
  • Rule-based (default): src/modules/caption-extract-mapper/captionExtractMapperWithRule.ts
  • LLM-based (optional): src/modules/caption-extract-mapper/captionExtractMapperWithLLM.ts
  • Common entry: src/modules/caption-extract-mapper/index.ts
  6. Body Analysis and Data Construction (LLM)
  • Main entry: src/modules/make-data/index.ts
    • Excavated site overview extraction: makeExcavatedSiteData.ts -> estimates the start of "investigation content" based on the table of contents, links images based on captions, includes cumulative merge logic
    • Trench/feature/artifact extraction: makeTrenchFeatureArtifactData.ts -> analyzes in 2-page segments with cumulative merge and retry logic
  • Common LLM call utility: src/libs/open-ai.ts (requires OPENAI_API_KEY)
  • JSON parsing utility: src/utils/extract-pure-json.ts (strips code blocks from LLM output; preserves raw on error)
  7. Data Storage (SQLite)
  • DB insertion: src/modules/insert-database/index.ts
    • File: data/excavation.db (created if absent)
    • Performs deduplication, missing ID cleanup, and referential integrity mapping before inserting into excavated_sites, trenches, features, artifacts
  8. Result Files (Debugging/Records)
  • Intermediate and final output JSON saved to public/pdf-result/
    • e.g., ${reportId}.json, ${reportId}-db-*.json
  9. Q&A (LLM + Glossary)
  • API: POST /api/chat -> Answers based on DB content, supplemented by Korean archaeology glossary (embedding search) when needed
    • Implementation: src/app/api/chat/route.ts
    • Glossary search: src/libs/glossarySearchEngine.ts (embeddings: data/glossary-embeddings.json)
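The behavior described for the JSON parsing utility (extract-pure-json) can be illustrated with a minimal sketch: strip a Markdown code fence from LLM output, then fall back to the raw string if the result is not valid JSON. This is a hypothetical reimplementation, not the code in src/utils/extract-pure-json.ts:

```typescript
// Strip a leading ```json / ``` fence and a trailing ``` fence, then
// validate with JSON.parse; on parse failure, preserve the raw output.
function extractPureJson(raw: string): string {
  const stripped = raw
    .replace(/^\s*```(?:json)?\s*\n?/, "")
    .replace(/\n?```\s*$/, "")
    .trim();
  try {
    JSON.parse(stripped);
    return stripped; // valid JSON after fence removal
  } catch {
    return raw; // preserve the raw LLM output on error
  }
}
```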
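The cumulative-merge idea used when analyzing 2-page segments (makeTrenchFeatureArtifactData.ts) can be sketched as id-based merging of each segment's results into an accumulator. The interface and field names below are illustrative assumptions; the actual merge logic in the repository is more involved:

```typescript
// Merge one segment's extraction results into the accumulated list,
// deduplicating by id; later segments may fill in missing detail.
interface Feature {
  id: string;
  name: string;
  description?: string;
}

function mergeSegment(accumulated: Feature[], segment: Feature[]): Feature[] {
  const byId = new Map(accumulated.map((f) => [f.id, f]));
  for (const item of segment) {
    const existing = byId.get(item.id);
    if (existing) {
      // Keep previously extracted fields when the new mention omits them.
      byId.set(item.id, {
        ...existing,
        ...item,
        description: item.description ?? existing.description,
      });
    } else {
      byId.set(item.id, item);
    }
  }
  return [...byId.values()];
}
```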
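The glossary embedding search (glossarySearchEngine.ts over data/glossary-embeddings.json) can be illustrated as a cosine-similarity top-k lookup over precomputed vectors. This is a minimal sketch under that assumption, not the repository's implementation:

```typescript
// Top-k nearest glossary entries by cosine similarity over precomputed embeddings.
interface GlossaryEntry {
  term: string;
  embedding: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function searchGlossary(query: number[], entries: GlossaryEntry[], k: number): GlossaryEntry[] {
  // Sort a copy by descending similarity and take the top k.
  return [...entries]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, k);
}
```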

Environment Variables (.env.local)

  • OPENAI_API_KEY: OpenAI API key (required)
    • Setup guide: OpenAI API Key Setup Guide
    • Important: API key management and usage/cost control are the sole responsibility of the user. The repository owner assumes no liability.
  • Refer to .env.local.example to create your .env.local file. Never commit sensitive information to a public repository.

Data/File Paths

  • Uploaded PDFs: uploads/ (temporary storage)
  • Debug/result JSON: public/pdf-result/
  • SQLite DB: data/excavation.db
  • Glossary embeddings: data/glossary-embeddings.json (auto-downloaded during preinstall)

Cost and Security Warnings

  • Cost: Costs increase rapidly as LLM call volume grows. Run only in a limited, testing capacity.
  • Security: Potential vulnerabilities may exist due to dependency issues. Do not deploy externally; run only in a local environment.

Dependency Snapshot and Compatibility

  • This repository includes dependency snapshots from the time of paper submission:
    • package.snapshot.json, package-lock.snapshot.json
  • The current package.json versions may reflect minimal patch updates. The snapshots are provided for reproducibility reference. Installation/execution results may vary depending on the environment.

Known Limitations

  • Image-based (scanned) PDFs where text extraction is impossible are not supported.
  • Extraction quality may degrade if report formats vary significantly.
  • LLM responses are probabilistic; some fields may be missed or incorrectly extracted.
  • The frontend structure is experimental and does not follow refined design patterns.

License

  • MIT License (see LICENSE in the repository root)
  • Note: The purpose of this repository is academic transparency and research record sharing. Use in production/operational environments is not recommended.

About

A proof-of-concept (PoC) project for an informatization pipeline that automatically extracts and structures metadata from archaeological excavation report PDFs using large language models (LLMs).
