Pipeline to download Washington University PM2.5 satellite data from Box and upload to Harvard Dataverse.
This repository handles data staging for PM2.5 satellite estimates from the Atmospheric Composition Analysis Group (ACAG) at Washington University. It downloads NetCDF files from Box and uploads them to Harvard Dataverse for downstream processing.
Box (ACAG) → Download → Local Storage → Upload → Harvard Dataverse
| Dataset | Resolution | Description |
|---|---|---|
| V5GL04 | 0.10° | Hybrid PM2.5 estimates |
| V5GL0502 | 0.05° | Hybrid PM2.5 estimates (higher resolution) |
| V6GL02 | 0.10° | CNN-based PM2.5 estimates |
Each dataset is available in yearly and monthly temporal frequencies.
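Three datasets at two temporal frequencies give six dataset/frequency combinations. The sketch below enumerates them as `datasets=... temporal_freq=...` strings, the argument form the scripts in this repository accept. The names `DATASETS`, `FREQUENCIES`, and `override_args` are hypothetical helpers for illustration and are not part of the repository.

```python
from itertools import product

# Dataset names and frequencies as listed in the table above.
DATASETS = ("V5GL04", "V5GL0502", "V6GL02")
FREQUENCIES = ("yearly", "monthly")


def override_args():
    """Return one 'datasets=... temporal_freq=...' string per combination."""
    return [
        f"datasets={dataset} temporal_freq={freq}"
        for dataset, freq in product(DATASETS, FREQUENCIES)
    ]


if __name__ == "__main__":
    for args in override_args():
        print(args)
```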
git clone https://github.com/NSAPH-Data-Processing/pm25__martin__2dataverse.git
cd pm25__martin__2dataverse
conda env create -f environment.yaml
conda activate pm25_2dataverse
- Get an API token from Harvard Dataverse (Account → API Token)
- Set the API token as an environment variable:
export DATAVERSE_API_TOKEN="your-token-here"
- Update conf/datasets/*.yaml with your dataset DOI
# Run all downloads and uploads
snakemake --cores 1
# Dry run (preview what would run)
snakemake --cores 1 -n
# Only download, no upload
snakemake --cores 1 download_all
# Download V5GL04 yearly data (default)
python src/download_from_box.py
# Download specific dataset and frequency
python src/download_from_box.py datasets=V6GL02 temporal_freq=monthly
# Upload downloaded files to Dataverse
python src/upload_to_dataverse.py
# Upload specific dataset
python src/upload_to_dataverse.py datasets=V6GL02 temporal_freq=monthly
# Download and upload V5GL04 yearly data
python src/download_from_box.py datasets=V5GL04 temporal_freq=yearly
python src/upload_to_dataverse.py datasets=V5GL04 temporal_freq=yearly
# Download and upload V6GL02 monthly data
python src/download_from_box.py datasets=V6GL02 temporal_freq=monthly
python src/upload_to_dataverse.py datasets=V6GL02 temporal_freq=monthly
Configuration uses Hydra. Main parameters:
| Parameter | Options | Description |
|---|---|---|
| datasets | V5GL04, V5GL0502, V6GL02 | Which PM2.5 dataset config to load |
| temporal_freq | yearly, monthly | Temporal resolution |
| download_dir | path | Local storage directory |
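To illustrate how `key=value` arguments such as `datasets=V6GL02 temporal_freq=monthly` map onto these parameters, here is a minimal standard-library sketch. It is not how Hydra is implemented; Hydra does this parsing itself and supports a much richer override grammar. The `datasets` and `temporal_freq` defaults follow the documented default run (V5GL04, yearly), while the `download_dir` default of `data` is an assumption, as is the helper name `apply_overrides`.

```python
# Illustrative defaults. download_dir="data" is an assumption, not taken
# from the repository's conf/config.yaml.
DEFAULTS = {
    "datasets": "V5GL04",
    "temporal_freq": "yearly",
    "download_dir": "data",
}


def apply_overrides(overrides, defaults=DEFAULTS):
    """Merge key=value command-line overrides into a copy of the defaults."""
    cfg = dict(defaults)
    for item in overrides:
        key, sep, value = item.partition("=")
        # Reject arguments without '=' and keys that are not known parameters.
        if not sep or key not in cfg:
            raise ValueError(f"unrecognized override: {item!r}")
        cfg[key] = value
    return cfg


if __name__ == "__main__":
    print(apply_overrides(["datasets=V6GL02", "temporal_freq=monthly"]))
```

Arguments that are not passed keep their default values, which is why running a script with no arguments selects V5GL04 yearly data.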
Dataset configs are in conf/datasets/. Each contains:
- Box URLs for download
- Dataverse DOI for upload
API token can be set via:
- Environment variable DATAVERSE_API_TOKEN (recommended)
- Or in conf/datasets/*.yaml under api_token
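A minimal sketch of how the two token sources could be combined. The precedence shown here (environment variable first, then the config value) is an assumption, since the text only says the environment variable is recommended. The function name `resolve_api_token` is hypothetical, and the dataset config is represented as a plain dictionary.

```python
import os


def resolve_api_token(dataset_cfg, environ=os.environ):
    """Return the Dataverse API token, preferring the environment variable.

    dataset_cfg is a plain mapping standing in for a loaded dataset config.
    The env-over-config precedence is an assumption, not confirmed behavior.
    """
    token = environ.get("DATAVERSE_API_TOKEN") or dataset_cfg.get("api_token")
    if not token:
        raise RuntimeError(
            "No Dataverse API token found: set DATAVERSE_API_TOKEN "
            "or api_token in the dataset config."
        )
    return token
```

Keeping the token in the environment avoids committing a secret to a YAML file under version control, which is presumably why that option is the recommended one.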
Uploads are organized into folders:
Dataverse Dataset
├── V5GL04/
│ ├── yearly/
│ └── monthly/
├── V5GL0502/
│ ├── yearly/
│ └── monthly/
└── V6GL02/
├── yearly/
└── monthly/
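The folder layout above can be expressed through the `directoryLabel` field that the Dataverse native API accepts in the `jsonData` part of an add-file request. The sketch below only builds that JSON string and sends nothing. Whether upload_to_dataverse.py constructs its metadata this way is an assumption, and `build_upload_metadata` is a hypothetical helper.

```python
import json


def build_upload_metadata(dataset, temporal_freq, description=""):
    """Build the jsonData string that places a file under <dataset>/<temporal_freq>/."""
    directory_label = f"{dataset}/{temporal_freq}"
    return json.dumps(
        {"directoryLabel": directory_label, "description": description}
    )


if __name__ == "__main__":
    # A V5GL04 yearly file would be filed under V5GL04/yearly/.
    print(build_upload_metadata("V5GL04", "yearly"))
```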
pm25__martin__2dataverse/
├── src/
│ ├── download_from_box.py # Download from ACAG Box
│ └── upload_to_dataverse.py # Upload to Harvard Dataverse
├── conf/
│ ├── config.yaml # Main configuration
│ └── datasets/ # Dataset-specific configs
│ ├── V5GL04.yaml
│ ├── V5GL0502.yaml
│ └── V6GL02.yaml
├── Snakefile # Automated workflow
├── data/ # Downloaded files (gitignored)
├── environment.yaml
└── README.md
van Donkelaar, A., Hammer, M.S., Bindle, L., Brauer, M., Brook, J.R., Garay, M.J., Hsu, N.C., Kalashnikova, O.V., Kahn, R.A., Lee, C., Levy, R.C., Lyapustin, A., Sayer, A.M. and Martin, R.V. (2021). Monthly Global Estimates of Fine Particulate Matter and Their Uncertainty. Environmental Science & Technology. doi:10.1021/acs.est.1c05309