pm25__martin__2dataverse

Pipeline to download PM2.5 satellite data published by Washington University in St. Louis from Box and upload it to Harvard Dataverse.

Overview

This repository handles data staging for PM2.5 satellite estimates from the Atmospheric Composition Analysis Group (ACAG) at Washington University. It downloads NetCDF files from Box and uploads them to Harvard Dataverse for downstream processing.

Box (ACAG) → Download → Local Storage → Upload → Harvard Dataverse

Supported Datasets

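| Dataset  | Resolution | Description                               |
|----------|------------|-------------------------------------------|
| V5GL04   | 0.10°      | Hybrid PM2.5 estimates                    |
| V5GL0502 | 0.05°      | Hybrid PM2.5 estimates (higher resolution) |
| V6GL02   | 0.10°      | CNN-based PM2.5 estimates                 |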

Each dataset is available in yearly and monthly temporal frequencies.

Setup

1. Clone and create environment

git clone https://github.com/NSAPH-Data-Processing/pm25__martin__2dataverse.git
cd pm25__martin__2dataverse

conda env create -f environment.yaml
conda activate pm25_2dataverse

2. Configure Dataverse credentials

  1. Get an API token from Harvard Dataverse (Account → API Token)
  2. Set the API token as an environment variable:
    export DATAVERSE_API_TOKEN="your-token-here"
  3. Update conf/datasets/*.yaml with your dataset DOI

Usage

Option 1: Run everything with Snakemake (recommended)

# Run all downloads and uploads
snakemake --cores 1

# Dry run (preview what would run)
snakemake --cores 1 -n

# Only download, no upload
snakemake --cores 1 download_all

Option 2: Run individual scripts

Download from Box

# Download V5GL04 yearly data (default)
python src/download_from_box.py

# Download specific dataset and frequency
python src/download_from_box.py datasets=V6GL02 temporal_freq=monthly

Upload to Dataverse

# Upload downloaded files to Dataverse
python src/upload_to_dataverse.py

# Upload specific dataset
python src/upload_to_dataverse.py datasets=V6GL02 temporal_freq=monthly

Full workflow example

# Download and upload V5GL04 yearly data
python src/download_from_box.py datasets=V5GL04 temporal_freq=yearly
python src/upload_to_dataverse.py datasets=V5GL04 temporal_freq=yearly

# Download and upload V6GL02 monthly data
python src/download_from_box.py datasets=V6GL02 temporal_freq=monthly
python src/upload_to_dataverse.py datasets=V6GL02 temporal_freq=monthly
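
To run the same download-then-upload sequence for every supported dataset and frequency without Snakemake, the command lines can be generated programmatically. The following is a minimal sketch, not part of the repository: `build_commands` is a hypothetical helper, and it assumes the two scripts accept the `datasets=` and `temporal_freq=` overrides exactly as shown in the examples above.

```python
from itertools import product

# Dataset names and frequencies as listed in this README.
DATASETS = ["V5GL04", "V5GL0502", "V6GL02"]
FREQUENCIES = ["yearly", "monthly"]
# Download must run before upload for each combination.
SCRIPTS = ["src/download_from_box.py", "src/upload_to_dataverse.py"]


def build_commands(datasets=DATASETS, frequencies=FREQUENCIES):
    """Return one argv list per script for each dataset/frequency pair."""
    commands = []
    for dataset, freq in product(datasets, frequencies):
        for script in SCRIPTS:
            commands.append(
                ["python", script, f"datasets={dataset}", f"temporal_freq={freq}"]
            )
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them (similar to a dry run).
    for cmd in build_commands():
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to execute it; printing first makes it easy to review what would run.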

Configuration

Configuration uses Hydra. Main parameters:

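| Parameter     | Options                  | Description                       |
|---------------|--------------------------|-----------------------------------|
| datasets      | V5GL04, V5GL0502, V6GL02 | Which PM2.5 dataset config to load |
| temporal_freq | yearly, monthly          | Temporal resolution               |
| download_dir  | path                     | Local storage directory           |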
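
To illustrate how `key=value` overrides on the command line map onto these parameters, here is a small standalone sketch. It is not Hydra and not the repository's code: `apply_overrides` is a hypothetical helper that only mimics the simplest override form. The `datasets` and `temporal_freq` defaults follow the "V5GL04 yearly (default)" note in the usage section; the `download_dir` default of `data` is an assumption based on the directory layout.

```python
# Allowed values as listed in the parameter table.
ALLOWED = {
    "datasets": {"V5GL04", "V5GL0502", "V6GL02"},
    "temporal_freq": {"yearly", "monthly"},
}

# download_dir="data" is an assumed default, not confirmed by the source.
DEFAULTS = {"datasets": "V5GL04", "temporal_freq": "yearly", "download_dir": "data"}


def apply_overrides(overrides, defaults=DEFAULTS):
    """Merge 'key=value' strings into a copy of the defaults, with validation."""
    params = dict(defaults)
    for item in overrides:
        key, sep, value = item.partition("=")
        if not sep or key not in params:
            raise ValueError(f"unknown or malformed override: {item!r}")
        if key in ALLOWED and value not in ALLOWED[key]:
            raise ValueError(f"{key} must be one of {sorted(ALLOWED[key])}, got {value!r}")
        params[key] = value
    return params


if __name__ == "__main__":
    print(apply_overrides(["datasets=V6GL02", "temporal_freq=monthly"]))
```

Validating against the documented options up front catches a mistyped dataset name before any download starts.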

Dataset configs are in conf/datasets/. Each contains:

  • Box URLs for download
  • Dataverse DOI for upload

The API token can be set in either of two ways:

  • In the environment variable DATAVERSE_API_TOKEN (recommended)
  • In conf/datasets/*.yaml under api_token
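
The two token sources can be combined with a simple lookup. The sketch below is an assumption about how such a lookup might work, not the repository's implementation: `resolve_api_token` is a hypothetical helper, `dataset_cfg` stands in for the loaded dataset config as a plain dict, and giving the environment variable precedence over the YAML value is an assumed ordering.

```python
import os


def resolve_api_token(dataset_cfg, env=os.environ):
    """Return the Dataverse API token, preferring the environment variable.

    dataset_cfg is a dict-like view of a dataset config; the precedence
    (environment first, then api_token in the config) is an assumption.
    """
    token = env.get("DATAVERSE_API_TOKEN") or dataset_cfg.get("api_token")
    if not token:
        raise RuntimeError(
            "No Dataverse API token found: set DATAVERSE_API_TOKEN "
            "or add api_token to the dataset config."
        )
    return token


if __name__ == "__main__":
    # Example with explicit inputs so the real environment is not touched.
    print(resolve_api_token({"api_token": "from-config"}, env={}))
```

Failing early with a clear message avoids starting an upload that would be rejected for missing credentials.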

Dataverse Folder Structure

Uploads are organized into folders:

Dataverse Dataset
├── V5GL04/
│   ├── yearly/
│   └── monthly/
├── V5GL0502/
│   ├── yearly/
│   └── monthly/
└── V6GL02/
    ├── yearly/
    └── monthly/
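
One way to obtain this layout is to attach a folder path to each uploaded file. The Dataverse native API accepts a `directoryLabel` field in the `jsonData` metadata sent when adding a file to a dataset. Whether the upload script uses this mechanism is an assumption; `directory_label` and `file_metadata` below are hypothetical helpers that only build the metadata and make no network calls.

```python
import json


def directory_label(dataset, temporal_freq):
    """Folder path inside the Dataverse dataset, e.g. 'V5GL04/yearly'."""
    return f"{dataset}/{temporal_freq}"


def file_metadata(dataset, temporal_freq, description=""):
    """Build the jsonData string for a Dataverse 'add file' request."""
    return json.dumps(
        {
            "directoryLabel": directory_label(dataset, temporal_freq),
            "description": description,
        }
    )


if __name__ == "__main__":
    print(file_metadata("V6GL02", "monthly", "CNN-based PM2.5 estimates"))
```

The resulting string would be sent as the `jsonData` form field alongside the file itself.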

Directory Structure

pm25__martin__2dataverse/
├── src/
│   ├── download_from_box.py    # Download from ACAG Box
│   └── upload_to_dataverse.py  # Upload to Harvard Dataverse
├── conf/
│   ├── config.yaml             # Main configuration
│   └── datasets/               # Dataset-specific configs
│       ├── V5GL04.yaml
│       ├── V5GL0502.yaml
│       └── V6GL02.yaml
├── Snakefile                   # Automated workflow
├── data/                       # Downloaded files (gitignored)
├── environment.yaml
└── README.md

References

van Donkelaar, A., Hammer, M.S., Bindle, L., Brauer, M., Brook, J.R., Garay, M.J., Hsu, N.C., Kalashnikova, O.V., Kahn, R.A., Lee, C., Levy, R.C., Lyapustin, A., Sayer, A.M. and Martin, R.V. (2021). Monthly Global Estimates of Fine Particulate Matter and Their Uncertainty. Environmental Science & Technology. doi:10.1021/acs.est.1c05309
