Atlas Profiler is a comprehensive dataset-profiling library that automatically detects and annotates data types, including spatial and temporal features. Given a CSV/TSV file, a file-like object, or a pandas DataFrame, it returns rich JSON-style metadata describing the dataset and its columns: detected types, value ranges, optional plots, spatial/temporal coverage, and profiling runtime.
Install from PyPI:
```bash
pip install atlas-profiler
```

Or install from source for development:

```bash
git clone https://github.com/VIDA-NYU/atlas-profiler.git
cd atlas-profiler
pip install -e .
```

Basic usage:

```python
from atlas_profiler import process_dataset

# Profile a CSV file
metadata = process_dataset("data.csv")

# Or profile a pandas DataFrame
import pandas as pd

df = pd.read_csv("data.csv")
metadata = process_dataset(
    df,
    geo_classifier=True,
    geo_classifier_threshold=0.5,
    coverage=True,
)
```

For comprehensive guides, API reference, examples, and advanced configuration, visit the Complete Documentation.
- Features
- What It Produces
- Type System
- Architecture
- Advanced Usage
- Project Structure
- Related Projects
✨ **Automatic Type Detection**: Identifies structural types (Integer, Float, Text, Boolean, GeoCoordinates, GeoShape) and semantic types (DateTime, Address, URL, ID, etc.)

🌍 **Spatial Intelligence**: ML-powered spatial column classifier trained on synthetic data, recognizing coordinates, addresses, geospatial identifiers, and administrative areas

⏰ **Temporal Analysis**: Detects and analyzes temporal columns with coverage and resolution information

📊 **Rich Metadata**: Comprehensive dataset profiling, including:
- Column-level statistics and distinct value counts
- Dataset-level type summaries
- Spatial and temporal coverage information
- Optional histograms and sample data
- Profiling performance metrics
process_dataset(...) returns a metadata dictionary with:
- Dataset metrics: row count, column count, profiled row count
- Per-column analysis: structural type, semantic types, missing value ratios, distinct counts, sample values
- Dataset summary: numerical, categorical, spatial, and temporal type counts
- Coverage information: spatial bounding boxes, temporal ranges, geohash coverage
- Attribute keywords: automatically extracted from column names
- Performance metrics: per-step profiling timings
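As an illustration of consuming this output, the sketch below walks a metadata dictionary of this general shape. The exact key names used here (`nb_rows`, `columns`, `structural_type`, `semantic_types`) follow datamart-profiler conventions and should be treated as assumptions, not the authoritative schema.

```python
# Sketch: summarizing the per-column analysis from a process_dataset()-style
# metadata dictionary. Key names are illustrative (datamart-profiler style).

def summarize_columns(metadata):
    """Map each column name to its structural type plus semantic annotations."""
    summary = {}
    for col in metadata.get("columns", []):
        types = [col["structural_type"]] + col.get("semantic_types", [])
        summary[col["name"]] = types
    return summary

# Hypothetical metadata shaped like the profiler's output:
meta = {
    "nb_rows": 1000,
    "columns": [
        {"name": "lat", "structural_type": "http://schema.org/Float",
         "semantic_types": ["http://schema.org/latitude"]},
        {"name": "city", "structural_type": "http://schema.org/Text",
         "semantic_types": []},
    ],
}
print(summarize_columns(meta))
```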
The profiler recognizes these broad structural types:
| Type | Meaning |
|---|---|
| `Integer` | Integer-like values |
| `Float` | Floating-point values |
| `Text` | String/text values |
| `Boolean` | Boolean-like values (true/false, yes/no, 0/1) |
| `GeoCoordinates` | Point geometry or coordinate-pair strings |
| `GeoShape` | Polygon-like geometry |
| `MissingData` | Empty column |
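A minimal sketch of rule-based structural typing in this spirit is shown below. The thresholds and the treatment of bare 0/1 columns as Boolean are illustrative assumptions, not the library's actual rules.

```python
# Sketch of rule-based structural type detection (illustrative, not the
# profiler's exact logic from profile_types.py).

BOOLEAN_VALUES = {"true", "false", "yes", "no", "0", "1"}

def detect_structural_type(values):
    non_empty = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not non_empty:
        return "MissingData"
    lowered = {v.lower() for v in non_empty}
    if lowered <= BOOLEAN_VALUES:
        # true/false, yes/no, 0/1 style columns
        return "Boolean"
    try:
        if all(float(v) == int(float(v)) for v in non_empty):
            return "Integer"
        [float(v) for v in non_empty]  # raises ValueError on non-numeric
        return "Float"
    except ValueError:
        return "Text"
```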
The profiler also annotates semantic meaning when evidence is available:
| Type | Examples |
|---|---|
| `DateTime` | Dates, timestamps, year columns |
| `latitude`, `longitude` | Coordinate columns (paired after profiling) |
| `address`, `AdministrativeArea` | Address text or admin areas (optionally resolved via Nominatim or datamart_geo) |
| `URL`, `FileName`, `identifier`, `Enumeration` | URLs, file paths, IDs, categorical values |
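As a sketch of how a semantic DateTime annotation might be assigned, the check below tries to parse sampled values as ISO dates. The real profiler recognizes many more formats; the 80% acceptance threshold is an illustrative assumption.

```python
# Sketch: flag a column as DateTime when most sampled values parse as dates.
# Only ISO-format strings are handled here; this is not the library's parser.
from datetime import datetime

def looks_like_datetime(values, threshold=0.8):
    parsed = 0
    for v in values:
        try:
            datetime.fromisoformat(str(v))
            parsed += 1
        except ValueError:
            pass
    return bool(values) and parsed / len(values) >= threshold
```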
`process_dataset` executes a consistent workflow for every dataset:

1. Load data from path, file object, or DataFrame
2. Compute statistics on full data and collect sample values per column
3. Predict spatial labels (optional) using batch ML inference
4. Process columns with geo predictions or rule-based type detection
5. Pair lat/long columns and compute dataset-level type summaries
6. Compute coverage (optional) for numerical, spatial, and temporal ranges
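The lat/long pairing step can be sketched as matching annotated latitude and longitude columns by their shared name prefix. The suffix list and matching heuristic below are assumptions for illustration, not the profiler's actual algorithm.

```python
# Sketch: pair latitude/longitude columns by a shared name prefix.

def _base(name):
    # Strip a trailing lat/long token so "pickup_lat" and "pickup_lon" compare equal
    for suffix in ("latitude", "longitude", "lat", "long", "lon", "lng"):
        if name.lower().endswith(suffix):
            return name.lower()[: -len(suffix)]
    return name.lower()

def pair_lat_long(columns):
    lats = [c for c in columns if "latitude" in c.get("semantic_types", [])]
    longs = [c for c in columns if "longitude" in c.get("semantic_types", [])]
    pairs = []
    for lat in lats:
        match = next((lo for lo in longs if _base(lo["name"]) == _base(lat["name"])), None)
        if match is not None:
            pairs.append((lat["name"], match["name"]))
            longs.remove(match)  # each longitude column is used at most once
    return pairs

cols = [
    {"name": "pickup_lat", "semantic_types": ["latitude"]},
    {"name": "pickup_lon", "semantic_types": ["longitude"]},
]
```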
When geo_classifier=True, Atlas Profiler uses a HybridGeoClassifier that:
- Samples values from each column
- Predicts spatial labels in a single batch
- Validates predictions using rule-based checks
- Maps predictions to the profiler's type system
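The rule-based validation step can be sketched as a range check on sampled values: a predicted `latitude` label is only accepted if most samples are numeric and within bounds. The acceptance ratio below is an illustrative assumption.

```python
# Sketch: validate a predicted "latitude" label against sampled values.

def validate_latitude(sample_values, min_valid=0.9):
    valid = 0
    for v in sample_values:
        try:
            valid += -90.0 <= float(v) <= 90.0  # bool adds as 0 or 1
        except (TypeError, ValueError):
            pass
    return bool(sample_values) and valid / len(sample_values) >= min_valid
```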
Supported spatial labels:
| Label Family | Mapped Type |
|---|---|
latitude, longitude |
Float + semantic types |
x_coord, y_coord |
Projected coordinates |
point, polygon, line |
Geometry types |
address, zip5, zip9 |
Address/postal codes |
borough, city, state, country |
Administrative areas |
bbl, bin |
NYC spatial identifiers |
non_spatial |
Falls back to standard detection |
Manual annotations take precedence over ML predictions. Low-confidence or rejected predictions fall back to rule-based detection.
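That precedence order can be sketched as a simple resolution function: manual annotations win, then confident ML predictions, then rule-based detection. The function name and signature here are illustrative, not the library's API.

```python
# Sketch of the precedence rule: manual > confident ML > rule-based.

def resolve_type(manual, ml_prediction, ml_confidence, rule_based, threshold=0.5):
    if manual is not None:
        return manual
    if (ml_prediction is not None
            and ml_prediction != "non_spatial"
            and ml_confidence >= threshold):
        return ml_prediction
    return rule_based
```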
Key parameters for process_dataset():
| Parameter | Default | Description |
|---|---|---|
| `data` | required | Path, file-like object, or pandas DataFrame |
| `geo_classifier` | `True` | Enable spatial ML classifier |
| `geo_classifier_threshold` | `0.5` | Confidence cutoff for predictions |
| `coverage` | `True` | Compute numerical ranges and spatial/temporal coverage |
| `plots` | `False` | Include histogram-style plot data |
| `include_sample` | `False` | Include sample rows in output |
| `indexes` | `True` | Preserve DataFrame indexes as columns |
| `load_max_size` | `5000000` | Target bytes to profile (larger inputs are sampled) |
| `metadata` | `None` | Optional seed metadata with manual annotations |
| `nominatim` | `None` | Nominatim endpoint for address resolution |
| `datamart_geo_data` | `None` | GeoData instance for admin-area resolution |
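The size cap implied by `load_max_size` can be sketched as follows: when the input's in-memory size exceeds the byte budget, profile a proportional row sample instead of the full table. The sampling strategy shown here is an assumption for illustration, not the profiler's actual loader.

```python
# Sketch: cap the amount of data profiled at a byte budget (illustrative).
import pandas as pd

def sample_for_profiling(df, load_max_size=5_000_000):
    approx_bytes = int(df.memory_usage(deep=True).sum())
    if approx_bytes <= load_max_size:
        return df  # small enough: profile everything
    fraction = load_max_size / approx_bytes
    n = max(1, int(len(df) * fraction))
    return df.sample(n=n, random_state=0)

df = pd.DataFrame({"x": range(1000)})
```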
Supply manual type annotations through the metadata argument. Useful when upstream processes or domain knowledge already identifies column types:
```python
metadata = {
    "columns": [
        {
            "name": "latitude",
            "semantic_types": ["http://schema.org/latitude"]
        },
        {
            "name": "longitude",
            "semantic_types": ["http://schema.org/longitude"]
        }
    ]
}

result = process_dataset(df, metadata=metadata)
```

Manually annotated columns skip the spatial ML classifier and are reconciled with observed values during processing.
The spatial ML classifier uses these model files (automatically downloaded if missing):
- `model.pt`: PyTorch model weights
- `config.json`: Model configuration
- `label_encoder.json`: Label encoding

Files are cached locally, and `auto_download=True` enables automatic retrieval.
For model training details, see training/README.md.
```
atlas-profiler/
├── atlas_profiler/        # Public API: from atlas_profiler import process_dataset
├── profiler/              # Core profiling package
│   ├── core.py            # process_dataset(), data loading, column pipeline
│   ├── profile_types.py   # Rule-based type detection
│   ├── spatial.py         # Spatial coverage & GeoClassifier
│   ├── temporal.py        # Temporal analysis
│   ├── numerical.py       # Numerical profiling
│   └── types.py           # Type constants
├── training/              # Model training & synthetic data generation
├── tests/                 # Unit tests
├── examples/              # Example notebooks
├── docs/                  # Sphinx documentation
└── pyproject.toml         # Project configuration
```
This project builds upon and extends Datamart Profiler with additional spatial intelligence via ML-assisted column type classification.
- Datamart Profiler: https://pypi.org/project/datamart-profiler/
- Research Background: Developed by the NYU Visualization and Data Analytics Lab
Atlas Profiler is released under the MIT License.