lbhm/fedaugment

FedAugment

This repository contains the source code, experiment logs, and result analysis for our paper "FedAugment: Table Augmentation Search over Decentralized Data Repositories".

🏗️ Architecture Overview

The repository is structured as follows:

fedaugment/
├── analysis        # Jupyter notebooks with result analysis and plotting code
├── experiments     # Python and Bash scripts with experiment configurations
├── logs            # results of our experimental evaluation
├── scripts         # utility scripts for data processing and evaluation
├── src/fedaugment  # main Python package with our implementation
└── tests           # unit tests for part of the codebase

The following diagram illustrates the overall architecture of the FedAugment workflow:

+----------------------+
|    Raw Table Data    |
| (CSV/Parquet files)  |
+----------+-----------+
           |
           v
+---------------------------------------------------------------------------------+
|                              1. EMBEDDING GENERATION                            |
|      +------------+   +------------+   +------------+       +------------+      |
|      |   View 1   |   |   View 2   |   |   View 3   |  ...  |   View N   |      |
|      |  mpnet +   |   |  gtr_t5 +  |   | gte_base + |       | qwen3_8b + |      |
|      |  dj_adpt   |   |  dj_adpt   |   |  dj_adpt   |       |  dj_adpt   |      |
|      +------+-----+   +------+-----+   +------+-----+       +------+-----+      |
|             |                |                |                    |            |
|             v                v                v                    v            |
|         [384-dim]        [768-dim]        [768-dim]            [4096-dim]       |
|         embeddings       embeddings       embeddings           embeddings       |
+-------------+----------------+----------------+--------------------+------------+
              |                |                |                    |
              +----------------+---------+------+--------------------+
                                         |
                                         v
+---------------------------------------------------------------------------------+
|                           2. PROJECTION MODEL TRAINING                          |
|                                                                                 |
|  Training Data                Projection Models:                                |
|  +---------------------+      - CL (Contrastive Learning) --- Neural network    |
|  | Curated subset      |      - LA2M (Local Isometry) ------- Clustering-based  |
|  | (FFT/Grid/Random)   |      - Vec2Vec --------------------- GAN-based         |
|  +---------------------+      - Procrustes ------------------ Orthogonal align  |
|                                                                                 |
|  Output: Learned transformations that map all views to a common vector space    |
+----------------------------------------+----------------------------------------+
                                         |
                                         v
+---------------------------------------------------------------------------------+
|                           3. ALIGNED EMBEDDING SPACE                            |
|                                                                                 |
|             View 1    View 2    View 3    ...    View N                         |
|               |         |         |                |                            |
|               +---------+---------+----------------+                            |
|                                 |                                               |
|                                 v                                               |
|                       +-------------------+                                     |
|                       |      Common       |                                     |
|                       |  Embedding Space  |                                     |
|                       +---------+---------+                                     |
|                                 |                                               |
|                                 v                                               |
|                       +-------------------+                                     |
|                       |    HNSW Index     | < Fast approximate nearest neighbor |
|                       +-------------------+                                     |
+----------------------------------------+----------------------------------------+
                                         |
                                         v
+---------------------------------------------------------------------------------+
|                           4. TABLE AUGMENTATION TASKS                           |
|                                                                                 |
|        +-----------------------------+   +-----------------------------+        |
|        |       JOIN DISCOVERY        |   |       UNION DISCOVERY       |        |
|        |                             |   |                             |        |
|        |  Query: Column A            |   |  Query: Table X             |        |
|        |     v                       |   |     v                       |        |
|        |  Find columns that can      |   |  Find tables with           |        |
|        |  be joined with A           |   |  compatible schemas         |        |
|        |     v                       |   |     v                       |        |
|        |  Metrics: P@k, R@k, MAP     |   |  Metrics: P@k, R@k, MAP     |        |
|        +-----------------------------+   +-----------------------------+        |
+---------------------------------------------------------------------------------+
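The diagram names P@k, R@k, and MAP as the evaluation metrics for both tasks. As a rough illustration, the standard textbook definitions can be sketched as below. The function names are hypothetical, and the evaluation code in this repository may use different conventions (for example, when fewer than k results are returned); the sketch also assumes that each ranked list contains no duplicates.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the top-k slots that hold a relevant item (divides by k)."""
    if k <= 0:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def average_precision(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Sum of precision values at each relevant hit, divided by the number of relevant items."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[Hashable], Set[Hashable]]]) -> float:
    """Mean of average_precision over (ranked, relevant) pairs, one pair per query."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)
```

For a join query, `ranked` would be the retrieved candidate columns in rank order and `relevant` the ground truth columns for that query.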

🚀 Getting Started

We recommend using uv to manage fedaugment and its dependencies. Installing the project is as simple as:

uv sync

We offer the following optional dependency groups (use uv sync --extra <group>):

  • PyTorch variants (mutually exclusive):
    • cpu: Force a CPU installation of PyTorch
    • cu126: Force a CUDA 12.6 installation of PyTorch
    • cu128: Force a CUDA 12.8 installation of PyTorch
  • experiments: Installs additional dependencies for running experiments
  • flash-attn: For more efficient attention operators in PyTorch

Note that the default PyTorch version depends on your operating system (CPU for Windows and Mac, CUDA 12.x for Linux).

📂 Datasets

Overview

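We use the following datasets in our experiments (listed by folder name): webtable, omnimatch_city_test, omnimatch_culture_test, santos_small, and freyja.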

Dataset Folder Structure

The project expects datasets in the following structure (symlinked or stored at data/):

data/
│
├── datasets/                              # Raw tabular data
│   │
│   └── {dataset_name}/                    # e.g., webtable, omnimatch_city_test, omnimatch_culture_test, santos_small, freyja
│       │
│       ├── datasets/                      # Full original dataset
│       │   ├── pq/                        # Parquet files
│       │   │   └── {table_id}.pq
│       │   └── csv/                       # CSV files
│       │       └── {table_id}.csv
│       │
│       ├── queries/                       # Query tables and ground truth (query tables are optional, if absent, queries use tables from datasets/)
│       │   ├── pq/                        # Query tables (parquet)
│       │   ├── csv/                       # Query tables (csv)
│       │   ├── join_queries.csv           # Join query list (only for join tasks)
│       │   ├── join_ground_truth.csv      # Join ground truth pairs (only for join tasks)
│       │   ├── union_queries.csv          # Union query list (only for union tasks)
│       │   └── union_ground_truth.csv     # Union ground truth pairs (only for union tasks)
│       │
│       ├── split/                         # Train/test/val splits (webtable only)
│       │   ├── train/
│       │   ├── test/
│       │   └── val/
│       │
│       └── sample-{pct}/                  # Sampled subsets from data curation (webtable only)
│           └── {curation_method}/         # e.g., fft_cos-mpnet-dj_adpt-k=63049
│               └── {table_id}.pq
│
│
└── embeddings/                            # Generated embeddings
    │
    └── {dataset_name}/                    # Mirrors datasets/ structure
        │
        ├── datasets/                      # Dataset embeddings (full corpus)
        │   │
        │   └── {model}-{strategy}.fa/     # Feature archive per embedding pipeline
        │       ├── embeddings.npy         # (N, D) float32 array
        │       ├── column_ids.npy         # (N,) string array: "{table_id}::{column_name}"
        │       └── metadata.json          # Pipeline metadata
        │
        ├── queries/                       # Query embeddings (if there are no dedicated queries, we symlink to datasets/)
        │   │
        │   └── {model}-{strategy}.fa/     # Feature archive per embedding pipeline
        │       ├── embeddings.npy         # (N, D) float32 array
        │       ├── column_ids.npy         # (N,) string array: "{table_id}::{column_name}"
        │       └── metadata.json          # Pipeline metadata
        │
        ├── sample-{pct}/                  # Embeddings for sampled subsets (percentages: 001, 005, 050)
        │   │
        │   └── {curation_method}/         # Curation strategy
        │       └── {model}-{strategy}.fa/
        │           ├── embeddings.npy
        │           ├── column_ids.npy
        │           └── metadata.json
        │
        └── split/                         # Embeddings for splits (webtable only)
            ├── train/
            │   └── {model}-{strategy}.fa/
            ├── test/
            │   └── {model}-{strategy}.fa/
            └── val/
                └── {model}-{strategy}.fa/
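To make the embeddings layout concrete, the following sketch builds the path of a feature archive from its components and reports which of the three expected files are missing. The helper names (`feature_archive_dir`, `missing_files`) are hypothetical and not part of the fedaugment package; only the directory and file names are taken from the tree above.

```python
from pathlib import Path
from typing import List, Union

# File names expected inside every {model}-{strategy}.fa/ directory (from the tree above).
ARCHIVE_FILES = ("embeddings.npy", "column_ids.npy", "metadata.json")


def feature_archive_dir(
    root: Union[str, Path],
    dataset: str,
    model: str,
    strategy: str,
    part: str = "datasets",
) -> Path:
    """Return <root>/embeddings/<dataset>/<part>/<model>-<strategy>.fa.

    `part` is the sub-folder below the dataset, e.g. "datasets", "queries",
    "split/train", or "sample-001/<curation_method>".
    """
    return Path(root) / "embeddings" / dataset / part / f"{model}-{strategy}.fa"


def missing_files(archive_dir: Path) -> List[str]:
    """List the expected archive files that do not exist on disk."""
    return [name for name in ARCHIVE_FILES if not (archive_dir / name).is_file()]


archive = feature_archive_dir("data", "webtable", "mpnet", "dj_adpt")
print(archive.as_posix())
print(missing_files(archive))
```

Running a check like this before an experiment can catch an incomplete embedding directory early, although the repository's own scripts may perform their own validation.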

File Formats

Query Files:

  • join_queries.csv: List of join query columns

    query_table,query_column
    csvData1549285__2.csv,AST%
    csvData1549285__2.csv,BLK%

  • join_ground_truth.csv: Ground truth join pairs

    query_table,candidate_table,query_column,candidate_column
    csvData1549285__2.csv,csvData20409520__4.csv,DRtg,DRtg
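As an illustration of how these two files fit together, the sketch below parses them with Python's standard `csv` module and groups the ground truth pairs by query column. The function names are hypothetical, and the repository's own loaders may differ; the column headers and sample rows are the ones shown above.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

ColumnRef = Tuple[str, str]  # (table file name, column name)


def read_join_queries(lines: Iterable[str]) -> List[ColumnRef]:
    """Read (query_table, query_column) pairs from join_queries.csv content."""
    return [(row["query_table"], row["query_column"]) for row in csv.DictReader(lines)]


def read_join_ground_truth(lines: Iterable[str]) -> Dict[ColumnRef, Set[ColumnRef]]:
    """Map each query column to the set of candidate columns it joins with."""
    truth: Dict[ColumnRef, Set[ColumnRef]] = defaultdict(set)
    for row in csv.DictReader(lines):
        query = (row["query_table"], row["query_column"])
        truth[query].add((row["candidate_table"], row["candidate_column"]))
    return dict(truth)


# In-memory copies of the sample rows shown above.
queries_csv = (
    "query_table,query_column\n"
    "csvData1549285__2.csv,AST%\n"
    "csvData1549285__2.csv,BLK%\n"
)
truth_csv = (
    "query_table,candidate_table,query_column,candidate_column\n"
    "csvData1549285__2.csv,csvData20409520__4.csv,DRtg,DRtg\n"
)

queries = read_join_queries(io.StringIO(queries_csv))
truth = read_join_ground_truth(io.StringIO(truth_csv))
print(queries)
print(truth)
```

When reading from disk, pass a file opened with `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` objects.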

Embedding Files (stored in .fa/ directories):

  • embeddings.npy: NumPy array of shape (N_columns, embedding_dim), dtype float32
  • column_ids.npy: NumPy array of shape (N_columns,), dtype StringDType(), format "{table_id}::{column_name}"
  • metadata.json: Pipeline metadata including model, shape, and source information
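Row i of embeddings.npy corresponds to entry i of column_ids.npy, so the ID format determines how a search result maps back to a table and column. The sketch below composes and splits such IDs and groups row indices by table. The helper names are hypothetical, the table IDs are made up for illustration, and splitting on the first "::" is an assumption that holds only if table IDs never contain that separator.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

SEPARATOR = "::"


def make_column_id(table_id: str, column_name: str) -> str:
    """Compose an ID in the "{table_id}::{column_name}" format."""
    return f"{table_id}{SEPARATOR}{column_name}"


def split_column_id(column_id: str) -> Tuple[str, str]:
    """Split an ID at the first "::" (assumes table IDs never contain "::")."""
    table_id, separator, column_name = column_id.partition(SEPARATOR)
    if not separator:
        raise ValueError(f"not a column ID: {column_id!r}")
    return table_id, column_name


def group_rows_by_table(column_ids: Iterable[str]) -> Dict[str, List[int]]:
    """Map each table ID to the row indices of its columns in the embedding matrix."""
    rows: Dict[str, List[int]] = defaultdict(list)
    for index, column_id in enumerate(column_ids):
        table_id, _ = split_column_id(column_id)
        rows[table_id].append(index)
    return dict(rows)


ids = [make_column_id("table_a", "city"), make_column_id("table_b", "id"), make_column_id("table_a", "zip")]
print(group_rows_by_table(ids))
```

Grouping rows by table in this way is one plausible starting point for aggregating column-level results into table-level results, for instance in union discovery; the aggregation actually used in the paper is not described here.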

File Naming Conventions

  • Dataset names: webtable, omnimatch_city_test, omnimatch_culture_test, santos_small, freyja
  • Embedding pipelines: {model}-{strategy}
    • Models: mpnet, distilroberta, gte_base, gtr_t5, mini_l12, mini_l6, etc.
    • Strategies: dj_orig (DeepJoin original), dj_adpt (DeepJoin adapted), etc.
  • Curation methods: {algorithm}[_{metric}]-{model}-{strategy}-k={n_columns}[-pca={variance}]
    • Algorithms: fft, grid, random
    • Metrics: cos (cosine), euc (Euclidean)
    • Example: fft_cos-mpnet-dj_adpt-k=63049-pca=0.9
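The curation method pattern can be checked mechanically. The following sketch parses a name into its parts with a regular expression. The `CurationMethod` class and `parse_curation_method` function are hypothetical, and the pattern assumes that model and strategy names contain no hyphens, which holds for the names listed above but is not guaranteed in general.

```python
import re
from dataclasses import dataclass
from typing import Optional

# {algorithm}[_{metric}]-{model}-{strategy}-k={n_columns}[-pca={variance}]
_CURATION_PATTERN = re.compile(
    r"^(?P<algorithm>fft|grid|random)"
    r"(?:_(?P<metric>cos|euc))?"
    r"-(?P<model>[^-]+)"
    r"-(?P<strategy>[^-]+)"
    r"-k=(?P<n_columns>\d+)"
    r"(?:-pca=(?P<variance>\d+(?:\.\d+)?))?$"
)


@dataclass(frozen=True)
class CurationMethod:
    algorithm: str
    metric: Optional[str]
    model: str
    strategy: str
    n_columns: int
    pca_variance: Optional[float]


def parse_curation_method(name: str) -> CurationMethod:
    """Parse a curation method directory name; raise ValueError if it does not match."""
    match = _CURATION_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized curation method name: {name!r}")
    variance = match.group("variance")
    return CurationMethod(
        algorithm=match.group("algorithm"),
        metric=match.group("metric"),
        model=match.group("model"),
        strategy=match.group("strategy"),
        n_columns=int(match.group("n_columns")),
        pca_variance=float(variance) if variance is not None else None,
    )


print(parse_curation_method("fft_cos-mpnet-dj_adpt-k=63049-pca=0.9"))
```

The optional groups come back as `None` when the metric or PCA suffix is absent, as in `fft_cos-mpnet-dj_adpt-k=63049`.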

🧪 Experiments

To reproduce the experiments in our paper, first ensure you have downloaded the required datasets and placed them in the data/datasets/ directory as described above. If you prefer to store the datasets elsewhere, create a symbolic link named data in the project root pointing to your dataset directory. For example:

ln -s /your/local/storage data

After setting up the datasets, you can run all experiments using:

bash experiments/run_all.sh

Hardware Requirements

  • 500+ GB RAM
  • 1.5+ TB disk space
  • NVIDIA A100 80GB GPU or better

📖 Citation

TBD
