
feat: Securities Trading Data Platform — S3 + Databricks Medallion Architecture#5

Open
devin-ai-integration[bot] wants to merge 6 commits into main from devin/1778092946-securities-data-platform

Conversation


devin-ai-integration[bot] commented May 6, 2026

Summary

Adds a complete end-to-end modern data platform for financial securities trading data, built on AWS S3 and Databricks with a medallion architecture (bronze → silver → gold).

What's Included

| Component | Files | Description |
| --- | --- | --- |
| Data Generation | 6 Python modules | Generates realistic trading data: 500 instruments (equities, ETFs, bonds, options, futures), 100K trades, 150K orders, 500K market data bars (GBM price simulation), 18K position snapshots |
| Terraform IaC | 5 .tf files | S3 bucket with medallion prefixes, KMS encryption, IAM roles, lifecycle policies, Databricks Unity Catalog, clusters, SQL warehouse |
| Databricks Notebooks | 15 notebooks | Bronze (5 Auto Loader ingestion), Silver (5 cleanse/enrich/validate), Gold (5 analytics aggregations) |
| Workflow Orchestration | 1 JSON DAG | 16-task Databricks Workflow with parallel execution and dependency management |
| Documentation | README + architecture doc | Full architecture diagram, data domain descriptions, quick-start guide |

Medallion Tiers

  • Bronze: Raw S3 → Delta Lake via Auto Loader with ingestion metadata
  • Silver: Data quality validation, instrument enrichment, technical indicators (SMA, volatility), P&L recalculation, slippage analysis
  • Gold: 13 analytics tables including daily P&L, portfolio VaR (95%), concentration risk (HHI), sector performance, liquidity scoring, trader risk profiling
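As one illustration of the gold-tier risk metrics, concentration risk via the Herfindahl-Hirschman Index (HHI) is the sum of squared portfolio weights. A minimal plain-Python sketch (the function name is hypothetical, not the notebook's actual API):

```python
def herfindahl_index(position_values):
    """Concentration risk: sum of squared portfolio weights.

    Ranges from 1/N (evenly spread across N positions)
    up to 1.0 (entire portfolio in one position).
    """
    total = sum(position_values)
    if total == 0:
        return 0.0
    return sum((v / total) ** 2 for v in position_values)

# Four equal positions -> 4 * (0.25 ** 2) = 0.25
print(herfindahl_index([100, 100, 100, 100]))  # 0.25
```

A higher HHI flags a book dominated by a few names, which is why it pairs naturally with the portfolio VaR table.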

Review Feedback Addressed

  • Fixed non-deterministic F.first()/F.last() in gold market analytics — now uses F.min_by()/F.max_by() for correct chronological OHLCV
  • Fixed _generate_isin()/_generate_cusip() to use seeded self.rng for reproducible data generation
  • Removed S3 lifecycle rule that would have deleted Auto Loader checkpoint files after 30 days
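The min_by/max_by fix can be illustrated in plain Python: the daily open is the price at the minimum timestamp and the close is the price at the maximum timestamp, which is deterministic regardless of row order (unlike F.first()/F.last(), which depend on partition ordering):

```python
def daily_open_close(bars):
    """bars: list of (timestamp, price) intraday tuples.

    open  = price at the earliest timestamp (what F.min_by gives)
    close = price at the latest timestamp   (what F.max_by gives)
    """
    open_price = min(bars, key=lambda b: b[0])[1]
    close_price = max(bars, key=lambda b: b[0])[1]
    return open_price, close_price

bars = [("09:35", 101.2), ("09:30", 100.0), ("16:00", 103.5)]
print(daily_open_close(bars))  # (100.0, 103.5)
```

Shuffling the input rows leaves the result unchanged, which is exactly the property the gold OHLCV aggregation needs.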

Review & Testing Checklist for Human

  • Review Terraform variables in terraform.tfvars.example — update the Databricks workspace URL and account-specific ARNs before deploying
  • Verify the S3 bucket naming convention matches your organization's standards
  • Review the IAM role trust policy in databricks.tf — the placeholder arn:aws:iam::root needs your Databricks account ID
  • Run python -m data_generation.generate_all locally to verify data generation works with your Python environment
  • Test the S3 upload with python -m data_generation.upload_to_s3 --create-bucket against a dev bucket

Test Plan

  1. Install deps: cd securities-trading-data-platform && pip install pandas pyarrow faker numpy pydantic pydantic-settings click rich python-dotenv boto3
  2. Generate data: python -m data_generation.generate_all --num-instruments 50 --num-trades 1000 --num-orders 1000 --num-market-data 5000
  3. Verify Parquet files in generated_data/
  4. Upload to S3: python -m data_generation.upload_to_s3 --create-bucket
  5. Import notebooks into Databricks workspace and run in order: setup → bronze → silver → gold

Notes

  • The data generation uses Geometric Brownian Motion for realistic price simulation with configurable volatility per asset class
  • All configuration is via environment variables with TRADING_ prefix or .env file
  • The Databricks Workflow is scheduled Mon-Fri at 6 PM ET (post-market close)
  • Terraform uses S3 backend for state — create the state bucket and DynamoDB table before terraform init
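The GBM price path mentioned above follows S(t+1) = S(t) · exp((μ − σ²/2)Δt + σ√Δt · Z). A seeded, dependency-free sketch of that recurrence (parameter values are illustrative, not the generator's defaults):

```python
import math
import random

def gbm_path(s0, mu, sigma, dt, steps, seed=42):
    """Geometric Brownian Motion:
    S_{t+1} = S_t * exp((mu - sigma^2 / 2) * dt + sigma * sqrt(dt) * Z)
    """
    rng = random.Random(seed)  # seeded for reproducible data generation
    prices = [s0]
    for _ in range(steps):
        z = rng.gauss(0.0, 1.0)
        drift = (mu - 0.5 * sigma ** 2) * dt
        shock = sigma * math.sqrt(dt) * z
        prices.append(prices[-1] * math.exp(drift + shock))
    return prices

# one year of daily bars at 8% drift, 25% annualized volatility
path = gbm_path(100.0, mu=0.08, sigma=0.25, dt=1 / 252, steps=252)
print(len(path), path[0])  # 253 100.0
```

Because the exponential keeps prices strictly positive, GBM is a common choice for simulating equity-like series, with σ varied per asset class as noted above.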

Link to Devin session: https://partner-workshops.devinenterprise.com/sessions/76ebe6f551394d83a14ca160b058b025
Requested by: @bsmitches



…lion architecture

- Data generation scripts for instruments, trades, orders, market data, positions
- Terraform IaC for S3 buckets (bronze/silver/gold), KMS, IAM, Databricks workspace
- 15 Databricks notebooks: bronze ingestion, silver transformations, gold aggregations
- Databricks Workflow DAG for scheduled pipeline orchestration
- Comprehensive architecture documentation and README
@devin-ai-integration
Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring


…/CUSIP, remove checkpoint expiry

- market_analytics.py: replace F.first()/F.last() with F.min_by()/F.max_by()
  for deterministic daily open/close from intraday bars
- instruments.py: move _generate_isin/_generate_cusip to instance methods
  using seeded self.rng for reproducible output
- s3.tf: remove checkpoint lifecycle rule that would delete Auto Loader
  state after 30 days, causing duplicate ingestion
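The ISIN/CUSIP fix above boils down to drawing identifier characters from an instance-level, seeded random.Random instead of module-level random state. A simplified sketch (the identifier layout here is abbreviated, and the real generator presumably also computes the proper check digit):

```python
import random
import string

class InstrumentGenerator:
    def __init__(self, seed=42):
        self.rng = random.Random(seed)  # instance-level, seeded RNG

    def _generate_isin(self, country="US"):
        # 12 chars: 2-letter country code + 9 alphanumeric + 1 digit
        # (real ISINs use a Luhn-style check digit; simplified here)
        body = "".join(self.rng.choices(string.ascii_uppercase + string.digits, k=9))
        check = str(self.rng.randrange(10))
        return country + body + check

gen_a = InstrumentGenerator(seed=7)
gen_b = InstrumentGenerator(seed=7)
print(gen_a._generate_isin() == gen_b._generate_isin())  # True: reproducible
```

Two generators built with the same seed emit identical sequences, so regenerating the dataset yields byte-identical identifiers.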

…KMS IAM statement

- risk_metrics.py: multiply 5-min bar volatility by sqrt(78) to convert
  to daily volatility for correct VaR estimation
- s3.tf: use concat() to conditionally include KMS statement only when
  enable_kms_encryption=true, avoiding wildcard Resource grant
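The √78 factor comes from the 78 five-minute bars in a 6.5-hour trading session (6.5 × 60 / 5 = 78): under an i.i.d.-returns assumption, variance scales linearly with time, so volatility scales with the square root of the number of bars. A sketch of the conversion (function names illustrative):

```python
import math

BARS_PER_DAY = int(6.5 * 60 / 5)  # 78 five-minute bars per 6.5h session

def daily_volatility(bar_vol):
    """Scale 5-minute bar volatility to daily (iid returns assumption)."""
    return bar_vol * math.sqrt(BARS_PER_DAY)

def var_95(position_value, bar_vol):
    """One-day parametric VaR at 95% confidence (z = 1.645)."""
    return 1.645 * daily_volatility(bar_vol) * position_value

print(BARS_PER_DAY)  # 78
```

Without this scaling, VaR computed directly from 5-minute bar volatility would understate daily risk by a factor of roughly 8.8 (√78).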

- orders.py: BUY stops now placed above market (trigger on rise),
  SELL stops below market (trigger on fall) — matches standard semantics
- market_data.py: guard against ZeroDivisionError when no EQUITY/ETF
  instruments exist (edge case with --num-instruments 1)
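The corrected stop semantics: a BUY stop is placed above the market and triggers when the price rises to it; a SELL stop is placed below and triggers when the price falls. A sketch (function name and offset are hypothetical):

```python
def place_stop(side, market_price, offset_pct=0.02):
    """BUY stops go above the market (trigger on a rise);
    SELL stops go below the market (trigger on a fall)."""
    if side == "BUY":
        return market_price * (1 + offset_pct)
    if side == "SELL":
        return market_price * (1 - offset_pct)
    raise ValueError(f"unknown side: {side}")

print(place_stop("BUY", 100.0))   # 102.0
print(place_stop("SELL", 100.0))  # 98.0
```

Generating stops on the wrong side would have produced orders that trigger immediately, polluting downstream fill analytics.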

- transform_trades.py: total_cost now uses notional+costs for BUY,
  costs-only for SELL (notional is proceeds, not a cost on sells)
- trading_volume_analytics.py: remove unused df_market read that would
  crash the pipeline on first run due to missing workflow dependency
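The total_cost fix: on a BUY, the notional is cash out the door and adds to transaction costs; on a SELL, the notional is proceeds, so only commissions and fees count as cost. A sketch of that decision (parameter names illustrative):

```python
def total_cost(side, notional, commission, fees):
    """BUY: total cash outlay = notional + transaction costs.
    SELL: notional is proceeds, so only the costs count."""
    costs = commission + fees
    return notional + costs if side == "BUY" else costs

print(total_cost("BUY", 10_000.0, 5.0, 1.5))   # 10006.5
print(total_cost("SELL", 10_000.0, 5.0, 1.5))  # 6.5
```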

- risk_metrics.py: order by timestamp instead of trade_date for
  deterministic row_number across intraday bars
- upload_to_s3.py: distinguish 403 (access denied) from 404 (not found)
  in head_bucket, only create bucket on 404
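The head_bucket fix distinguishes "bucket missing" from "bucket exists but access denied". In the real script the status code comes from the boto3 ClientError response; here the decision logic is sketched over a plain string so it stands alone:

```python
def bucket_action(head_bucket_error_code):
    """Decide what to do after S3 head_bucket fails.

    In upload_to_s3.py the code would come from
    ClientError.response["Error"]["Code"]; modeled as a string here.
    """
    if head_bucket_error_code == "404":
        return "create"  # bucket genuinely does not exist
    if head_bucket_error_code == "403":
        # bucket exists but we lack permission -- creating it would
        # fail or mask a permissions problem, so surface the error
        return "raise_access_denied"
    return "raise_unexpected"

print(bucket_action("404"))  # create
print(bucket_action("403"))  # raise_access_denied
```

Treating 403 the same as 404 (the old behavior) would attempt a CreateBucket call that either errors out or silently hides a misconfigured IAM policy.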
