This repository contains the source code, experiment logs, and result analysis for our paper "FedAugment: Table Augmentation Search over Decentralized Data Repositories".
The repository is structured as follows:
fedaugment/
├── analysis        # Jupyter notebooks with result analysis and plotting code
├── experiments     # Python and Bash scripts with experiment configurations
├── logs            # results of our experimental evaluation
├── scripts         # utility scripts for data processing and evaluation
├── src/fedaugment  # main Python package with our implementation
└── tests           # unit tests for parts of the codebase

The following diagram illustrates the overall architecture of the FedAugment workflow:
+----------------------+
| Raw Table Data |
| (CSV/Parquet files) |
+----------+-----------+
|
v
+---------------------------------------------------------------------------------+
| 1. EMBEDDING GENERATION |
| +------------+ +------------+ +------------+ +------------+ |
| | View 1 | | View 2 | | View 3 | ... | View N | |
| | mpnet + | | gtr_t5 + | | gte_base + | | qwen3_8b + | |
| | dj_adpt | | dj_adpt | | dj_adpt | | dj_adpt | |
| +------+-----+ +------+-----+ +------+-----+ +------+-----+ |
| | | | | |
| v v v v |
| [384-dim] [768-dim] [768-dim] [4096-dim] |
| embeddings embeddings embeddings embeddings |
+-------------+----------------+----------------+--------------------+------------+
| | | |
+----------------+---------+------+--------------------+
|
v
+---------------------------------------------------------------------------------+
| 2. PROJECTION MODEL TRAINING |
| |
| Training Data Projection Models: |
| +---------------------+ - CL (Contrastive Learning) --- Neural network |
| | Curated subset | - LA2M (Local Isometry) ------- Clustering-based |
| | (FFT/Grid/Random) | - Vec2Vec --------------------- GAN-based |
| +---------------------+ - Procrustes ------------------ Orthogonal align |
| |
| Output: Learned transformations that map all views to a common vector space |
+----------------------------------------+----------------------------------------+
|
v
+---------------------------------------------------------------------------------+
| 3. ALIGNED EMBEDDING SPACE |
| |
| View 1 View 2 View 3 ... View N |
| | | | | |
| +---------+---------+----------------+ |
| | |
| v |
| +-------------------+ |
| | Common | |
| | Embedding Space | |
| +---------+---------+ |
| | |
| v |
| +-------------------+ |
| | HNSW Index | < Fast approximate nearest neighbor |
| +-------------------+ |
+----------------------------------------+----------------------------------------+
|
v
+---------------------------------------------------------------------------------+
| 4. TABLE AUGMENTATION TASKS |
| |
| +-----------------------------+ +-----------------------------+ |
| | JOIN DISCOVERY | | UNION DISCOVERY | |
| | | | | |
| | Query: Column A | | Query: Table X | |
| | v | | v | |
| | Find columns that can | | Find tables with | |
| | be joined with A | | compatible schemas | |
| | v | | v | |
| | Metrics: P@k, R@k, MAP | | Metrics: P@k, R@k, MAP | |
| +-----------------------------+ +-----------------------------+ |
+---------------------------------------------------------------------------------+
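The join discovery task in step 4 can be sketched as a nearest-neighbor query over the common embedding space. The snippet below is an illustrative stand-in, not the package's API: it uses brute-force cosine similarity in place of the HNSW index, and the `top_k_joinable` / `precision_at_k` names and the toy data are assumptions for the example.

```python
import numpy as np

def top_k_joinable(query_vec, corpus_vecs, column_ids, k=5):
    """Rank corpus columns by cosine similarity to a query column.

    Brute-force stand-in for the HNSW index; assumes all embeddings
    already live in the common (aligned) space.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = c @ q
    order = np.argsort(-sims)[:k]
    return [(column_ids[i], float(sims[i])) for i in order]

def precision_at_k(retrieved, relevant, k):
    """P@k: fraction of the top-k retrieved columns that are relevant."""
    return len(set(retrieved[:k]) & set(relevant)) / k

# Toy example: 4 corpus columns in a 3-dim "common" space.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(4, 3)).astype(np.float32)
ids = ["t1::city", "t2::name", "t3::city", "t4::year"]
# Query is a near-duplicate of the first corpus column.
query = corpus[0] + 0.01 * rng.normal(size=3).astype(np.float32)

hits = top_k_joinable(query, corpus, ids, k=2)
retrieved_ids = [cid for cid, _ in hits]
print(retrieved_ids[0])                               # t1::city
print(precision_at_k(retrieved_ids, {"t1::city"}, 1))  # 1.0
```

The same retrieval loop applies to union discovery, with table-level instead of column-level embeddings.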
We recommend using uv to manage fedaugment and its dependencies.
Installing the project is as simple as:
uv sync

We offer the following optional dependency groups (use uv sync --extra <group>):
- PyTorch variants (mutually exclusive):
  - cpu: Force a CPU installation of PyTorch
  - cu126: Force a CUDA 12.6 installation of PyTorch
  - cu128: Force a CUDA 12.8 installation of PyTorch
- experiments: Installs additional dependencies for running experiments
- flash-attn: For more efficient attention operators in PyTorch
Note that the default PyTorch version depends on your operating system (CPU for Windows and macOS, CUDA 12.x for Linux).
We use the following datasets in our experiments:
The project expects datasets in the following structure (symlinked or stored at data/):
data/
│
├── datasets/                      # Raw tabular data
│   │
│   └── {dataset_name}/            # e.g., webtable, omnimatch_city_test, omnimatch_culture_test, santos_small, freyja
│       │
│       ├── datasets/              # Full original dataset
│       │   ├── pq/                # Parquet files
│       │   │   └── {table_id}.pq
│       │   └── csv/               # CSV files
│       │       └── {table_id}.csv
│       │
│       ├── queries/               # Query tables and ground truth (query tables are optional; if absent, queries use tables from datasets/)
│       │   ├── pq/                # Query tables (parquet)
│       │   ├── csv/               # Query tables (csv)
│       │   ├── join_queries.csv        # Join query list (only for join tasks)
│       │   ├── join_ground_truth.csv   # Join ground truth pairs (only for join tasks)
│       │   ├── union_queries.csv       # Union query list (only for union tasks)
│       │   └── union_ground_truth.csv  # Union ground truth pairs (only for union tasks)
│       │
│       ├── split/                 # Train/test/val splits (webtable only)
│       │   ├── train/
│       │   ├── test/
│       │   └── val/
│       │
│       └── sample-{pct}/          # Sampled subsets from data curation (webtable only)
│           └── {curation_method}/ # e.g., fft_cos-mpnet-dj_adpt-k=63049
│               └── {table_id}.pq
│
└── embeddings/                    # Generated embeddings
    │
    └── {dataset_name}/            # Mirrors datasets/ structure
        │
        ├── datasets/              # Dataset embeddings (full corpus)
        │   │
        │   └── {model}-{strategy}.fa/  # Feature archive per embedding pipeline
        │       ├── embeddings.npy      # (N, D) float32 array
        │       ├── column_ids.npy      # (N,) string array: "{table_id}::{column_name}"
        │       └── metadata.json       # Pipeline metadata
        │
        ├── queries/               # Query embeddings (if there are no dedicated queries, we symlink to datasets/)
        │   │
        │   └── {model}-{strategy}.fa/  # Feature archive per embedding pipeline
        │       ├── embeddings.npy      # (N, D) float32 array
        │       ├── column_ids.npy      # (N,) string array: "{table_id}::{column_name}"
        │       └── metadata.json       # Pipeline metadata
        │
        ├── sample-{pct}/          # Embeddings for sampled subsets (percentages: 001, 005, 050)
        │   │
        │   └── {curation_method}/ # Curation strategy
        │       └── {model}-{strategy}.fa/
        │           ├── embeddings.npy
        │           ├── column_ids.npy
        │           └── metadata.json
        │
        └── split/                 # Embeddings for splits (webtable only)
            ├── train/
            │   └── {model}-{strategy}.fa/
            ├── test/
            │   └── {model}-{strategy}.fa/
            └── val/
                └── {model}-{strategy}.fa/

Query Files:
- join_queries.csv: List of join query columns

  query_table,query_column
  csvData1549285__2.csv,AST%
  csvData1549285__2.csv,BLK%

- join_ground_truth.csv: Ground truth join pairs

  query_table,candidate_table,query_column,candidate_column
  csvData1549285__2.csv,csvData20409520__4.csv,DRtg,DRtg
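A minimal sketch of consuming the join ground-truth format shown above, grouping candidate pairs by query column. The `load_join_ground_truth` helper is illustrative, not part of the package:

```python
import csv
import io
from collections import defaultdict

def load_join_ground_truth(fh):
    """Group ground-truth join pairs by (query_table, query_column).

    Returns a dict mapping each query column to the set of
    (candidate_table, candidate_column) pairs it joins with.
    """
    truth = defaultdict(set)
    for row in csv.DictReader(fh):
        truth[(row["query_table"], row["query_column"])].add(
            (row["candidate_table"], row["candidate_column"])
        )
    return dict(truth)

# The row below mirrors the example from the format description.
sample = io.StringIO(
    "query_table,candidate_table,query_column,candidate_column\n"
    "csvData1549285__2.csv,csvData20409520__4.csv,DRtg,DRtg\n"
)
truth = load_join_ground_truth(sample)
print(truth[("csvData1549285__2.csv", "DRtg")])
# {('csvData20409520__4.csv', 'DRtg')}
```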
Embedding Files (stored in .fa/ directories):
- embeddings.npy: NumPy array of shape (N_columns, embedding_dim), dtype float32
- column_ids.npy: NumPy array of shape (N_columns,), dtype StringDType(), format "{table_id}::{column_name}"
- metadata.json: Pipeline metadata including model, shape, and source information
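Given the layout above, a .fa/ archive can be read with plain NumPy and the standard library. The `load_feature_archive` and `split_column_id` helpers below are illustrative sketches, not the package's API:

```python
import json
from pathlib import Path

import numpy as np

def load_feature_archive(fa_dir):
    """Load one {model}-{strategy}.fa/ directory as described above.

    Returns (embeddings, column_ids, metadata).
    """
    fa_dir = Path(fa_dir)
    embeddings = np.load(fa_dir / "embeddings.npy")   # (N, D) float32
    column_ids = np.load(fa_dir / "column_ids.npy")   # (N,) "{table_id}::{column_name}"
    metadata = json.loads((fa_dir / "metadata.json").read_text())
    assert embeddings.shape[0] == column_ids.shape[0]
    return embeddings, column_ids, metadata

def split_column_id(column_id):
    """Split "{table_id}::{column_name}" into its two parts."""
    table_id, _, column_name = str(column_id).partition("::")
    return table_id, column_name

print(split_column_id("csvData1549285__2.csv::AST%"))
# ('csvData1549285__2.csv', 'AST%')
```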
- Dataset names: webtable, omnimatch_city_test, omnimatch_culture_test, santos_small, freyja
- Embedding pipelines: {model}-{strategy}
  - Models: mpnet, distilroberta, gte_base, gtr_t5, mini_l12, mini_l6, etc.
  - Strategies: dj_orig (DeepJoin original), dj_adpt (DeepJoin adapted), etc.
- Curation methods: {algorithm}[_{metric}]-{model}-{strategy}-k={n_columns}[-pca={variance}]
  - Algorithms: fft, grid, random
  - Metrics: cos (cosine), euc (Euclidean)
  - Example: fft_cos-mpnet-dj_adpt-k=63049-pca=0.9
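The curation-method naming scheme can be parsed mechanically. The regex below is a sketch derived only from the grammar above; `parse_curation_method` is a hypothetical helper, not part of the package:

```python
import re

# {algorithm}[_{metric}]-{model}-{strategy}-k={n_columns}[-pca={variance}]
PATTERN = re.compile(
    r"^(?P<algorithm>fft|grid|random)(?:_(?P<metric>cos|euc))?"
    r"-(?P<model>[^-]+)-(?P<strategy>[^-]+)"
    r"-k=(?P<k>\d+)(?:-pca=(?P<pca>[\d.]+))?$"
)

def parse_curation_method(name):
    """Split a curation-method directory name into its components."""
    m = PATTERN.match(name)
    if m is None:
        raise ValueError(f"not a curation method name: {name!r}")
    parts = m.groupdict()
    parts["k"] = int(parts["k"])
    parts["pca"] = float(parts["pca"]) if parts["pca"] is not None else None
    return parts

print(parse_curation_method("fft_cos-mpnet-dj_adpt-k=63049-pca=0.9"))
```

Note that this assumes model and strategy names never contain a hyphen, which holds for the names listed above (underscores are used instead).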
To reproduce the experiments in our paper, first ensure you have downloaded the required datasets and placed them in the data/datasets/ directory as described above.
If you prefer to store the datasets elsewhere, create a symbolic link named data in the project root pointing to your dataset directory. For example:
ln -s /your/local/storage data

After setting up the datasets, you can run all experiments using:
bash experiments/run_all.sh

Running the full experiment suite requires:
- 500+ GB RAM
- 1.5+ TB disk space
- NVIDIA A100 80GB GPU or better
TBD