Automatic generation of ontology mappings from relational database schemas using Large Language Models.
Hamilton takes a relational database schema (tables, columns, primary/foreign keys) and produces a JSON mapping that defines ontological concepts, attributes, relationships, sub-concepts, and join paths — which can then be converted into a D2RQ mapping.
You can find the paper here.
Hamilton automates the process of creating ontology mappings from relational databases. Given a SQL schema like:
```sql
CREATE TABLE Users (
    user_id INTEGER PRIMARY KEY,
    username VARCHAR(255),
    email VARCHAR(255)
);
```

Hamilton produces a structured JSON mapping defining concepts, attributes, object properties, and their relationships to the source database.
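As a purely illustrative (hypothetical) sketch of what such a mapping could look like for the schema above; the exact keys and structure of Hamilton's output format may differ:

```json
{
  "concepts": [
    {
      "name": "User",
      "table": "Users",
      "attributes": [
        {"name": "username", "column": "username"},
        {"name": "email", "column": "email"}
      ]
    }
  ],
  "relationships": []
}
```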
```bash
conda env create -f environment.yml
conda activate burr
```

The primary training approach uses Accelerate + DeepSpeed ZeRO-3 with LoRA (Low-Rank Adaptation) via PEFT for parameter-efficient fine-tuning of Qwen3 models across multiple GPUs.
```bash
accelerate launch \
    --num_processes=8 \
    --use_deepspeed \
    --deepspeed_config_file=ds_config_zero3.json \
    multigpu_training/train.py
```

All training settings are controlled through `multigpu_training/config.py`; edit this file before launching:
```python
task = "concepts"                           # Task to train on
input_data_format = "sql"                   # Input format
data_tag = "cot_50max"                      # Dataset tag
MODEL_ID = "Qwen/Qwen3-32B"                 # Model to fine-tune
MODEL_TAG = "concepts_cot_50max-qwen3-32B"  # Experiment tag (used in WandB + output dir)
TRAINING_MODE = "full"                      # Training mode: full, reasoning_only, output_only
```

The `multigpu_training/` module is structured as follows:
| File | Purpose |
|---|---|
| `config.py` | All configuration: model, task, data paths, LoRA hyperparams, training mode |
| `data.py` | Data loading, prompt selection, training mode reformatting |
| `model.py` | Tokenizer, model, and LoRA config setup |
| `training.py` | SFTConfig training arguments, chat template application |
| `utils.py` | Torch optimizations, LoRA target module detection, distributed mode checks |
| `train.py` | Main entry point that orchestrates all components |
Task modes (set `task` in `config.py`):

| Task | Description |
|---|---|
| `concepts` | Extract ontological concepts from tables |
| `attributes` | Extract attributes for concepts |
| `relationships` | Extract object properties between concepts |
| `relationships_with_concepts` | Joint concepts + relationships |
| `extensive_relationships` | Detailed relationship extraction |
| `all` | Full ontology mapping |
Key training hyperparameters (set in `training.py`):
| Parameter | Default |
|---|---|
| Epochs | 4 |
| Per-device batch size | 1 |
| Gradient accumulation steps | 4 |
| Learning rate | 5e-5 |
| Max sequence length | 16384 |
| Precision | bf16 (auto) |
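With these defaults and the 8-process launch command above, the effective global batch size works out as follows (a simple calculation, assuming one process per GPU):

```python
# Effective global batch size = per-device batch * grad accumulation * num GPUs.
per_device_batch = 1
grad_accum_steps = 4
num_gpus = 8  # matches --num_processes=8 in the launch command

effective_batch = per_device_batch * grad_accum_steps * num_gpus
print(effective_batch)  # → 32
```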
LoRA hyperparameters (configurable via `config.py` or environment variables):
| Parameter | Default | Env Variable |
|---|---|---|
| Rank (r) | 16 | LORA_R |
| Alpha | 32 | LORA_ALPHA |
| Dropout | 0.05 | LORA_DROPOUT |
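A minimal sketch of how such environment-variable overrides can be read, falling back to the table's defaults (the actual logic in `config.py` may differ):

```python
import os

# Read LoRA hyperparams from the environment, defaulting to the documented values.
LORA_R = int(os.environ.get("LORA_R", 16))
LORA_ALPHA = int(os.environ.get("LORA_ALPHA", 32))
LORA_DROPOUT = float(os.environ.get("LORA_DROPOUT", 0.05))
```

Setting, e.g., `LORA_R=32` before launching then overrides the rank without editing the config file.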
Training data is loaded from TXT files at paths derived from the config:
data/test/{data_tag}_{input_data_format}_{task}_fine_train_set.txt
data/test/{data_tag}_{input_data_format}_{task}_fine_test_set.txt
For example, with data_tag="cot_50max", input_data_format="sql", task="concepts":
data/test/cot_50max_sql_concepts_fine_train_set.txt
data/test/cot_50max_sql_concepts_fine_test_set.txt
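The path construction follows directly from the config values; a small helper (hypothetical, for illustration) that mirrors the pattern:

```python
# Hypothetical helper mirroring the documented path pattern.
def train_set_path(data_tag: str, input_data_format: str, task: str) -> str:
    return f"data/test/{data_tag}_{input_data_format}_{task}_fine_train_set.txt"

path = train_set_path("cot_50max", "sql", "concepts")
print(path)  # → data/test/cot_50max_sql_concepts_fine_train_set.txt
```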
The `trainingdata_generator/` module creates realistic training data with semantic understanding:
```python
from hamilton.trainingdata_generator.generator import TrainingDataGenerator

generator = TrainingDataGenerator(config)
generator.generate()
```

Generates training examples covering: table-to-concept mapping, attribute extraction, relationship detection, hierarchy/sub-concept identification, and normalization patterns.
