Hamilton

Automatic generation of ontology mappings from relational database schemas using Large Language Models.

Hamilton takes a relational database schema (tables, columns, primary/foreign keys) and produces a JSON mapping that defines ontological concepts, attributes, relationships, sub-concepts, and join paths — which can then be converted into a D2RQ mapping.

You can find the paper here.

Video / Showcase of Hamilton: Hamilton Showcase (click thumbnail to open)

Overview

Hamilton automates the process of creating ontology mappings from relational databases. Given a SQL schema like:

CREATE TABLE Users (
    user_id INTEGER PRIMARY KEY,
    username VARCHAR(255),
    email VARCHAR(255)
);

Hamilton produces a structured JSON mapping defining concepts, attributes, object properties, and their relationships to the source database.
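
For the Users table above, the mapping could look roughly like the sketch below; the key names and nesting here are illustrative, not Hamilton's exact output schema:

# Illustrative only: the keys and structure are an assumption, not Hamilton's actual schema.
{
    "concepts": [
        {
            "name": "User",
            "table": "Users",
            "attributes": [
                {"name": "username", "column": "username"},
                {"name": "email", "column": "email"}
            ],
            "relationships": [],
            "sub_concepts": []
        }
    ]
}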

Setup

conda env create -f environment.yml
conda activate burr

Training

The primary training approach uses Accelerate + DeepSpeed ZeRO-3 with LoRA (Low-Rank Adaptation) via PEFT for parameter-efficient fine-tuning of Qwen3 models across multiple GPUs.

accelerate launch \
  --num_processes=8 \
  --use_deepspeed \
  --deepspeed_config_file=ds_config_zero3.json \
  multigpu_training/train.py

All training settings are controlled through multigpu_training/config.py — edit this file before launching:

task = "concepts"                          # Task to train on
input_data_format = "sql"                   # Input format
data_tag = "cot_50max"                      # Dataset tag
MODEL_ID = "Qwen/Qwen3-32B"                # Model to fine-tune
MODEL_TAG = "concepts_cot_50max-qwen3-32B"  # Experiment tag (used in WandB + output dir)
TRAINING_MODE = "full"                      # Training mode: full, reasoning_only, output_only

The multigpu_training/ module is structured as follows:

File          Purpose
config.py     All configuration: model, task, data paths, LoRA hyperparams, training mode
data.py       Data loading, prompt selection, training mode reformatting
model.py      Tokenizer, model, and LoRA config setup
training.py   SFTConfig training arguments, chat template application
utils.py      Torch optimizations, LoRA target module detection, distributed mode checks
train.py      Main entry point that orchestrates all components
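
Putting these pieces together, train.py wires up the stack roughly as in the sketch below. This is a simplified, single-process illustration using the Transformers/PEFT/TRL APIs named in this README, not the repository's actual code:

# Hedged sketch of the overall training flow; model ID, file names, and output
# directory are taken from the examples in this README, not read from config.py.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

MODEL_ID = "Qwen/Qwen3-32B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

# Assumption: one training example per line in the TXT files (see Data Format below)
dataset = load_dataset(
    "text", data_files="data/test/cot_50max_sql_concepts_fine_train_set.txt"
)["train"]

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="outputs/concepts_cot_50max-qwen3-32B"),
    train_dataset=dataset,
    peft_config=LoraConfig(task_type="CAUSAL_LM"),  # LoRA via PEFT; hyperparams below
)
trainer.train()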

Training Configuration

Task modes (set task in config.py):

Task                          Description
concepts                      Extract ontological concepts from tables
attributes                    Extract attributes for concepts
relationships                 Extract object properties between concepts
relationships_with_concepts   Joint concepts + relationships
extensive_relationships       Detailed relationship extraction
all                           Full ontology mapping

Key training hyperparameters (set in training.py):

Parameter                     Default
Epochs                        4
Per-device batch size         1
Gradient accumulation steps   4
Learning rate                 5e-5
Max sequence length           16384
Precision                     bf16 (auto)
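
As a rough illustration, these defaults correspond to an SFTConfig along the lines of the sketch below (assuming the trl SFTConfig API; the exact arguments in training.py may differ):

from trl import SFTConfig

# Hedged sketch: the defaults from the table above expressed as trl SFTConfig
# arguments; training.py may name or derive them differently.
training_args = SFTConfig(
    output_dir="outputs/concepts_cot_50max-qwen3-32B",  # derived from MODEL_TAG (assumption)
    num_train_epochs=4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    max_seq_length=16384,   # called max_length in newer trl releases
    bf16=True,              # "auto" precision resolves to bf16 on supported GPUs
)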

LoRA hyperparameters (configurable via config.py or environment variables):

Parameter   Default   Env Variable
Rank (r)    16        LORA_R
Alpha       32        LORA_ALPHA
Dropout     0.05      LORA_DROPOUT
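
A sketch of how these environment-variable overrides can be read into a PEFT LoraConfig; the variable names match the table above, while the surrounding code is an assumption rather than the repository's exact implementation:

import os
from peft import LoraConfig

# Hedged sketch: read the LoRA hyperparameters from the environment,
# falling back to the defaults listed above.
lora_config = LoraConfig(
    r=int(os.environ.get("LORA_R", 16)),
    lora_alpha=int(os.environ.get("LORA_ALPHA", 32)),
    lora_dropout=float(os.environ.get("LORA_DROPOUT", 0.05)),
    task_type="CAUSAL_LM",
    # target modules are detected automatically (see utils.py in the table above)
)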

Data Format

Training data is loaded from TXT files at paths derived from the config:

data/test/{data_tag}_{input_data_format}_{task}_fine_train_set.txt
data/test/{data_tag}_{input_data_format}_{task}_fine_test_set.txt

For example, with data_tag="cot_50max", input_data_format="sql", task="concepts":

data/test/cot_50max_sql_concepts_fine_train_set.txt
data/test/cot_50max_sql_concepts_fine_test_set.txt
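
The file names follow directly from the config values; a small sketch of how the paths can be assembled (the parameter names mirror config.py, the helper function itself is illustrative):

# Hedged sketch: derive the train/test file paths from the config values.
def data_paths(data_tag: str, input_data_format: str, task: str) -> tuple[str, str]:
    base = f"data/test/{data_tag}_{input_data_format}_{task}_fine"
    return f"{base}_train_set.txt", f"{base}_test_set.txt"

train_path, test_path = data_paths("cot_50max", "sql", "concepts")
# -> data/test/cot_50max_sql_concepts_fine_train_set.txt
#    data/test/cot_50max_sql_concepts_fine_test_set.txt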

Training Data Generation

Semantic Data Generation

The trainingdata_generator/ module creates realistic training data with semantic understanding:

from hamilton.trainingdata_generator.generator import TrainingDataGenerator

generator = TrainingDataGenerator(config)
generator.generate()

The generator produces training examples covering: table-to-concept mapping, attribute extraction, relationship detection, hierarchy/sub-concept identification, and normalization patterns.
