This repo contains a Code the Dream-friendly batch ETL project that:
- Geocodes global cities to lat/lon
- Pulls OpenWeather Air Pollution historical data
- Transforms PostgreSQL-backed raw response records into a gold dataset
- Writes the gold dataset to PostgreSQL and can optionally export Parquet locally or publish the same Parquet artifact to Azure Blob Storage
- Serves a React dashboard backed by a Python API over PostgreSQL data
The pipeline now uses DB-first gold persistence by default, with PostgreSQL as the primary gold-data target and Parquet export as an explicit optional setting. Azure Blob publishing is also optional and can be tested locally through Azurite in Docker Compose. City configuration, geocoding cache, and raw extract persistence are also moving into PostgreSQL as runtime state. The same PostgreSQL runtime path can target either local Docker/Postgres or managed Azure Database for PostgreSQL through environment configuration.
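As a rough illustration of that environment-driven switch, the snippet below uses a hypothetical `DATABASE_URL` variable; the real variable names live in the tracked templates under configs/env/:

```bash
# Illustrative only: check the configs/env/ templates for the actual variable names.

# Local Docker/Postgres target:
DATABASE_URL=postgresql://postgres:postgres@localhost:5432/pipeline

# Managed Azure Database for PostgreSQL target (same code path, different endpoint):
DATABASE_URL=postgresql://adminuser:<password>@<server>.postgres.database.azure.com:5432/pipeline?sslmode=require
```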
Use docs/setup/run_and_debug_guide.md for:
- local Python installation
- Python library installation
- local non-Docker `.env.local` configuration
- local VS Code debugging
- Docker Compose installation
- Docker Compose configuration, launch, and verification
Use docs/setup/local_postgresql_first_workflow.md for the consolidated local PostgreSQL-first workflow:
- schema bootstrap
- city seeding
- pipeline run commands
- PostgreSQL verification queries
- DB-native test commands
Use docs/setup/azure_postgresql_configuration.md for managed Azure Database for PostgreSQL configuration guidance.
Use docs/setup/environment_profiles_guide.md to keep separate local and Azure env files without overwriting your normal Docker settings.
Use `bash scripts/generate_env_profiles.sh` to create `.env.local` and `.env.azure` from the tracked templates in configs/env/.
Use `ENV_FILE=.env.azure ...` when you want to run the app, migrations, or Docker Compose against the Azure profile.
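For example, assuming the app, Alembic, and Compose wiring all honor `ENV_FILE` as the environment profiles guide describes:

```bash
# Run migrations, the pipeline, or Compose against the Azure profile.
ENV_FILE=.env.azure alembic upgrade head
ENV_FILE=.env.azure python -m pipeline.cli --source openweather --history-hours 72
ENV_FILE=.env.azure docker compose up --build
```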
This repo currently supports three practical run modes:
- Local without Docker: uses `.env.local` with local filesystem paths and a local PostgreSQL instance.
- Local with Docker: uses `.env.local` with Docker Compose, container paths, local Postgres, and Azurite.
- Cloud-connected: uses `ENV_FILE=.env.azure` so the app or local containers point at Azure-backed resources such as Azure Database for PostgreSQL and Azure Blob Storage.
Important:
- the cloud-connected mode changes which resources the app talks to
- it does not by itself deploy the app into Azure
If you want a quick local Python setup, use the bootstrap script for your OS.
These scripts:
- create `.venv` in the project root if missing
- upgrade `pip`, `setuptools`, and `wheel`
- install dependencies from `requirements.txt`
- install the local pipeline package in editable mode for module-based execution

WSL, Linux, macOS:
```bash
./scripts/setup_venv.sh
```
If you hit a permission error:
```bash
bash scripts/setup_venv.sh
```
Windows PowerShell:
```powershell
powershell -ExecutionPolicy Bypass -File .\scripts\setup_venv.ps1
```
After script setup, activate the environment:
- WSL, Linux, macOS: `source .venv/bin/activate`
- Windows PowerShell: `.venv\Scripts\Activate.ps1`
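As an optional sanity check of the editable install, the `pipeline` module name below comes from the entrypoint used later in this README; the rest is standard Python:

```bash
# Confirm the venv interpreter and the editable-installed package resolve correctly.
python --version
python -c "import pipeline; print(pipeline.__file__)"
```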
Run the pipeline locally with the package entrypoint:
```bash
python -m pipeline.cli --source openweather --history-hours 72
```
This is the preferred local run path because it matches the packaged production-style entrypoint used by the pipeline service.
For the shortest path to a working local DB-first run, start here:
docs/setup/local_postgresql_first_workflow.md
Quick sequence:
```bash
alembic upgrade head
python -m pipeline.cli --seed-cities
python -m pipeline.cli --source openweather --history-hours 72
```
- verify `pipeline_runs` and `air_pollution_gold` in PostgreSQL
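One way to run that verification from the shell is with `psql`; the table names come from this README, while the connection details are assumptions to adapt to your `.env.local`:

```bash
# Adjust host, user, and database to match your .env.local settings.
psql -h localhost -U postgres -d pipeline \
  -c "SELECT COUNT(*) FROM pipeline_runs;" \
  -c "SELECT COUNT(*) FROM air_pollution_gold;"
```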
Optional Blob flow for local testing:
- set `WRITE_GOLD_AZURE_BLOB=1`
- keep the Azurite connection string from `.env.local`
- run `docker compose up --build`
- open the browser explorer at http://localhost:8081
- confirm the blob exists under container `gold` at `exports/air_pollution_gold.parquet`
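If you prefer a CLI check over the browser explorer, something like this works, assuming the Azure CLI is installed and `AZURE_STORAGE_CONNECTION_STRING` is exported with the Azurite value from `.env.local`:

```bash
# List blobs in the gold container; expect exports/air_pollution_gold.parquet.
az storage blob list \
  --container-name gold \
  --connection-string "$AZURE_STORAGE_CONNECTION_STRING" \
  --output table
```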
To target real Azure Blob Storage instead of Azurite, keep the same `WRITE_GOLD_AZURE_BLOB`, `AZURE_STORAGE_CONNECTION_STRING`, `AZURE_BLOB_CONTAINER`, and `AZURE_BLOB_PATH` settings but replace the local connection string with your production Azure values. See docs/setup/azure_blob_storage_configuration.md for local-vs-production examples.
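As a hedged illustration, a production-leaning profile might look like the following; the container and path values mirror the local flow above, and the connection string is a placeholder rather than a real credential:

```bash
WRITE_GOLD_AZURE_BLOB=1
AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net"
AZURE_BLOB_CONTAINER=gold
AZURE_BLOB_PATH=exports/air_pollution_gold.parquet
```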
The PostgreSQL-first migration path uses Alembic for schema versioning.
Apply the latest schema locally with:
```bash
alembic upgrade head
```
If you are using Docker Compose, run the dedicated migration service:
```bash
docker compose run --rm migrate
```
This creates or upgrades the PostgreSQL schema before pipeline services depend on it.
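To see where the schema currently stands, the standard Alembic inspection commands apply:

```bash
# Show the revision applied to the database and the available migration history.
alembic current
alembic history --verbose
```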
The pipeline now treats PostgreSQL as the runtime source of truth for cities by default.
`CITIES_FILE` remains the seed/import input used to populate the cities table.
Expected CSV columns:
```
city,country_code,state
```
Example:
```
Toronto,CA,
Paris,FR,
Lagos,NG,
Sydney,AU,NSW
```
Notes:
- `country_code` is required for reliable global geocoding
- `state` is optional for most countries
You can point the seed/import workflow to a different city CSV by changing `CITIES_FILE`.
Examples:
```bash
# Local Python
CITIES_FILE=configs/cities_local.csv

# Docker Compose
CITIES_FILE=/app/configs/cities_production.csv
```
For Docker Compose, custom city files must live inside `./configs/` on the host because that directory is mounted into the containers as `/app/configs`.
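For example, a custom list could be wired up as sketched below; the Compose service name `pipeline` is an assumption, so check docker-compose.yml for the real one:

```bash
# Place the file where the containers can see it, then seed from it.
cp /path/to/my_cities.csv configs/cities_custom.csv
docker compose run --rm -e CITIES_FILE=/app/configs/cities_custom.csv \
  pipeline python -m pipeline.cli --seed-cities
```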
- `FileNotFoundError` during city seed/import usually means `CITIES_FILE` is misspelled or points to the wrong place
- If Docker cannot find a custom city file, make sure the file exists under `./configs/`
- If city seed/import creates no rows, confirm the CSV has at least one data row below the header
- If geocoding fails for a city, verify the `country_code` value
To populate the cities table from the configured CSV:
```bash
python -m pipeline.cli --seed-cities
```
Normal pipeline execution reads active cities from PostgreSQL when `CITIES_SOURCE=postgres`.
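A quick hedged check that seeding worked; the `cities` table name comes from this README, and the column names are assumed to mirror the CSV header:

```bash
# Inspect a few seeded rows; adjust connection details to your environment.
psql -h localhost -U postgres -d pipeline \
  -c "SELECT city, country_code, state FROM cities LIMIT 5;"
```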
Browse docs/README.md for the full categorized index.
- docs/setup/local_postgresql_first_workflow.md
- docs/setup/run_and_debug_guide.md
- docs/setup/docker_and_compose_walkthrough.md
- docs/setup/github_quality_gates_setup.md
- docs/collaboration/github_feature_branch_pr_guide.md
- docs/collaboration/pr_review_best_practices.md
- docs/collaboration/what_is_a_data_pipeline.md
- docs/architecture/architecture.md
- docs/architecture/data_flow_diagram.md
- docs/architecture/postgresql_schema_design.md
- docs/reference/data_dictionary.md
- docs/reference/openweather_environmental_api_fields_reference.md