atlas-profiler

License: MIT · Python 3.10+ · PyPI · Documentation

Atlas Profiler is a comprehensive dataset profiling library that automatically detects and annotates data types, including spatial and temporal features. Given a CSV/TSV file, a file-like object, or a pandas DataFrame, it returns rich JSON-style metadata about the dataset: its columns, detected types, value ranges, optional plots, spatial/temporal coverage, and profiling runtime.

Quick Start

Installation

Install from PyPI:

pip install atlas-profiler

Or install from source for development:

git clone https://github.com/VIDA-NYU/atlas-profiler.git
cd atlas-profiler
pip install -e .

Basic Usage

from atlas_profiler import process_dataset

# Profile a CSV file
metadata = process_dataset("data.csv")

# Or profile a pandas DataFrame
import pandas as pd
df = pd.read_csv("data.csv")
metadata = process_dataset(
    df,
    geo_classifier=True,
    geo_classifier_threshold=0.5,
    coverage=True,
)

Documentation

For comprehensive guides, API reference, examples, and advanced configuration, visit the Complete Documentation.


Features

✨ Automatic Type Detection: Identifies structural types (Integer, Float, Text, Boolean, GeoCoordinates, GeoShape) and semantic types (DateTime, Address, URL, ID, etc.)

🌍 Spatial Intelligence: ML-powered spatial column classifier trained on synthetic data, recognizing coordinates, addresses, geospatial identifiers, and administrative areas

⏰ Temporal Analysis: Detects and analyzes temporal columns with coverage and resolution information

πŸ“Š Rich Metadata: Comprehensive dataset profiling including:

  • Column-level statistics and distinct value counts
  • Dataset-level type summaries
  • Spatial and temporal coverage information
  • Optional histograms and sample data
  • Profiling performance metrics

What It Produces

process_dataset(...) returns a metadata dictionary with:

  • Dataset metrics: row count, column count, profiled row count
  • Per-column analysis: structural type, semantic types, missing value ratios, distinct counts, sample values
  • Dataset summary: numerical, categorical, spatial, and temporal type counts
  • Coverage information: spatial bounding boxes, temporal ranges, geohash coverage
  • Attribute keywords: automatically extracted from column names
  • Performance metrics: per-step profiling timings
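
As an illustration of navigating this output, the sketch below walks a metadata dictionary with the shape described above. The key names used here (nb_rows, columns, structural_type, num_distinct_values) are assumptions modeled on datamart-profiler-style output, not a confirmed schema:

```python
# Hypothetical metadata shaped like the profiler's output; the key names
# are illustrative assumptions, not the library's confirmed schema.
metadata = {
    "nb_rows": 1000,
    "nb_profiled_rows": 1000,
    "columns": [
        {"name": "lat", "structural_type": "Float",
         "semantic_types": ["latitude"], "num_distinct_values": 950},
        {"name": "city", "structural_type": "Text",
         "semantic_types": ["AdministrativeArea"], "num_distinct_values": 12},
    ],
}

def summarize(meta):
    """Return one line per column: name, structural type, distinct count."""
    return [
        f"{c['name']}: {c['structural_type']} ({c['num_distinct_values']} distinct)"
        for c in meta["columns"]
    ]

for line in summarize(metadata):
    print(line)
```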

Type System

Structural Types

The profiler recognizes these broad structural types:

| Type | Meaning |
| --- | --- |
| Integer | Integer-like values |
| Float | Floating point values |
| Text | String/text values |
| Boolean | Boolean-like values (true/false, yes/no, 0/1) |
| GeoCoordinates | Point geometry or coordinate-pair strings |
| GeoShape | Polygon-like geometry |
| MissingData | Empty column |
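
A minimal rule-based detector in this spirit is sketched below. This is an illustration only; the profiler's actual rules (in profiler/profile_types.py) are more thorough and also cover GeoCoordinates and GeoShape:

```python
def detect_structural_type(values):
    """Classify a column's sample values into one broad structural type.
    Illustrative sketch; not the library's real detection logic."""
    non_missing = [v for v in values if v not in ("", None)]
    if not non_missing:
        return "MissingData"
    booleans = {"true", "false", "yes", "no", "0", "1"}
    if all(str(v).strip().lower() in booleans for v in non_missing):
        return "Boolean"
    try:
        if all(float(v) == int(float(v)) and "." not in str(v)
               for v in non_missing):
            return "Integer"
        [float(v) for v in non_missing]  # every value parses as a number
        return "Float"
    except (TypeError, ValueError):
        return "Text"

print(detect_structural_type(["1", "2", "3"]))   # Integer
print(detect_structural_type(["1.5", "2.0"]))    # Float
print(detect_structural_type(["yes", "no"]))     # Boolean
```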

Semantic Types

The profiler also annotates semantic meaning when evidence is available:

| Type | Examples |
| --- | --- |
| DateTime | Dates, timestamps, year columns |
| latitude, longitude | Coordinate columns (paired after profiling) |
| address, AdministrativeArea | Address text or admin areas (optionally resolved via Nominatim or datamart_geo) |
| URL, FileName, identifier, Enumeration | URLs, file paths, IDs, categorical values |

Architecture

Pipeline

process_dataset executes a consistent workflow for every dataset:

  1. Load data from path, file object, or DataFrame
  2. Compute statistics on full data and collect sample values per column
  3. Predict spatial labels (optional) using batch ML inference
  4. Process columns with geo predictions or rule-based type detection
  5. Pair lat/long columns and compute dataset-level type summaries
  6. Compute coverage (optional) for numerical, spatial, and temporal ranges

Spatial ML Classifier

When geo_classifier=True, Atlas Profiler uses a HybridGeoClassifier that:

  • Samples values from each column
  • Predicts spatial labels in a single batch
  • Validates predictions using rule-based checks
  • Maps predictions to the profiler's type system

Supported spatial labels:

| Label Family | Mapped Type |
| --- | --- |
| latitude, longitude | Float + semantic types |
| x_coord, y_coord | Projected coordinates |
| point, polygon, line | Geometry types |
| address, zip5, zip9 | Address/postal codes |
| borough, city, state, country | Administrative areas |
| bbl, bin | NYC spatial identifiers |
| non_spatial | Falls back to standard detection |

Manual annotations take precedence over ML predictions. Low-confidence or rejected predictions fall back to rule-based detection.
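
The precedence rule above (manual annotation first, then a confident ML prediction, then rule-based fallback) can be sketched as follows. The function and parameter names are hypothetical, not the HybridGeoClassifier's actual API:

```python
def resolve_column_type(manual, ml_label, ml_confidence,
                        threshold=0.5, rule_based="Text"):
    """Pick a column's type following the precedence described above:
    manual annotation > confident ML prediction > rule-based fallback.
    Illustrative sketch; not the library's real decision code."""
    if manual is not None:
        return manual                    # manual annotations win outright
    if ml_label != "non_spatial" and ml_confidence >= threshold:
        return ml_label                  # accepted ML prediction
    return rule_based                    # low-confidence/rejected -> rules

print(resolve_column_type(None, "latitude", 0.9))         # latitude
print(resolve_column_type(None, "latitude", 0.3))         # Text
print(resolve_column_type("longitude", "address", 0.99))  # longitude
```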

Advanced Usage

Configuration Parameters

Key parameters for process_dataset():

| Parameter | Default | Description |
| --- | --- | --- |
| data | required | Path, file-like object, or pandas DataFrame |
| geo_classifier | True | Enable spatial ML classifier |
| geo_classifier_threshold | 0.5 | Confidence cutoff for predictions |
| coverage | True | Compute numerical ranges and spatial/temporal coverage |
| plots | False | Include histogram-style plot data |
| include_sample | False | Include sample rows in output |
| indexes | True | Preserve DataFrame indexes as columns |
| load_max_size | 5000000 | Target bytes to profile (larger inputs are sampled) |
| metadata | None | Optional seed metadata with manual annotations |
| nominatim | None | Nominatim endpoint for address resolution |
| datamart_geo_data | None | GeoData instance for admin-area resolution |

Manual Annotations

Supply manual type annotations through the metadata argument. Useful when upstream processes or domain knowledge already identifies column types:

metadata = {
    "columns": [
        {
            "name": "latitude",
            "semantic_types": ["http://schema.org/latitude"]
        },
        {
            "name": "longitude", 
            "semantic_types": ["http://schema.org/longitude"]
        }
    ]
}

result = process_dataset(df, metadata=metadata)

Manually annotated columns skip the spatial ML classifier and are reconciled with observed values during processing.

Model Files

The spatial ML classifier uses these model files (automatically downloaded if missing):

  • model.pt β€” PyTorch model weights
  • config.json β€” Model configuration
  • label_encoder.json β€” Label encoding

The files are cached locally; with auto_download=True, missing files are retrieved automatically.

For model training details, see training/README.md.
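
The cache check can be pictured as below. This sketch covers only the "is anything missing?" step; the actual download logic, cache location, and the auto_download flag are internal to the library, and the function name here is hypothetical:

```python
from pathlib import Path

REQUIRED_FILES = ["model.pt", "config.json", "label_encoder.json"]

def missing_model_files(cache_dir):
    """Return the model files not yet present in the local cache.
    Illustrative sketch of the cache check only."""
    cache = Path(cache_dir)
    return [name for name in REQUIRED_FILES if not (cache / name).exists()]

# With an empty cache directory, all three files would be reported missing.
print(missing_model_files("model-cache"))
```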

Project Structure

atlas-profiler/
├── atlas_profiler/          # Public API: from atlas_profiler import process_dataset
├── profiler/                # Core profiling package
│   ├── core.py              # process_dataset(), data loading, column pipeline
│   ├── profile_types.py     # Rule-based type detection
│   ├── spatial.py           # Spatial coverage & GeoClassifier
│   ├── temporal.py          # Temporal analysis
│   ├── numerical.py         # Numerical profiling
│   └── types.py             # Type constants
├── training/                # Model training & synthetic data generation
├── tests/                   # Unit tests
├── examples/                # Example notebooks
├── docs/                    # Sphinx documentation
└── pyproject.toml           # Project configuration

Related Projects

This project builds upon and extends Datamart Profiler with additional spatial intelligence via ML-assisted column type classification.

License

Atlas Profiler is released under the MIT License.
