feat: Securities Trading Data Platform — S3 + Databricks Medallion Architecture #5
Open
devin-ai-integration[bot] wants to merge 6 commits into
…lion architecture
- Data generation scripts for instruments, trades, orders, market data, positions
- Terraform IaC for S3 buckets (bronze/silver/gold), KMS, IAM, Databricks workspace
- 15 Databricks notebooks: bronze ingestion, silver transformations, gold aggregations
- Databricks Workflow DAG for scheduled pipeline orchestration
- Comprehensive architecture documentation and README
…/CUSIP, remove checkpoint expiry
- market_analytics.py: replace F.first()/F.last() with F.min_by()/F.max_by() for deterministic daily open/close from intraday bars
- instruments.py: move _generate_isin/_generate_cusip to instance methods using seeded self.rng for reproducible output
- s3.tf: remove checkpoint lifecycle rule that would delete Auto Loader state after 30 days, causing duplicate ingestion
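The fix above matters because `F.first()`/`F.last()` in Spark depend on row order, which is non-deterministic after a shuffle, while `F.min_by()`/`F.max_by()` pick the value at the extreme timestamp. A minimal sketch of that selection logic in plain Python (bar data is illustrative, not from the repo):

```python
# Intraday 5-minute bars as (timestamp, price); deliberately out of order
# to show the result does not depend on row order.
bars = [
    ("2024-01-02T09:35", 101.2),
    ("2024-01-02T09:30", 100.0),
    ("2024-01-02T16:00", 103.5),
    ("2024-01-02T12:00", 102.1),
]

def daily_open_close(bars):
    """Open = price at the earliest bar, close = price at the latest bar,
    mirroring F.min_by("price", "timestamp") / F.max_by("price", "timestamp").
    F.first()/F.last() would instead take whatever rows happened to arrive
    first/last, which changes run to run after a shuffle."""
    open_price = min(bars, key=lambda b: b[0])[1]
    close_price = max(bars, key=lambda b: b[0])[1]
    return open_price, close_price
```

Shuffling the input rows leaves the result unchanged, which is exactly the determinism the commit is after.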
…KMS IAM statement
- risk_metrics.py: multiply 5-min bar volatility by sqrt(78) to convert to daily volatility for correct VaR estimation
- s3.tf: use concat() to conditionally include KMS statement only when enable_kms_encryption=true, avoiding wildcard Resource grant
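The sqrt(78) factor comes from the 6.5-hour US equity session containing 78 five-minute bars; under the usual i.i.d.-returns assumption, volatility scales with the square root of the number of periods. A hedged sketch of that conversion (function name and inputs are illustrative, not the repo's actual signature):

```python
import math

BARS_PER_DAY = 78  # 6.5 h session / 5-min bars = 78 bars

def daily_volatility(bar_returns):
    """Scale per-bar volatility to daily volatility assuming i.i.d. returns:
    sigma_daily = sigma_bar * sqrt(78). Without this factor, VaR computed
    from 5-minute bars understates daily risk by ~9x (sqrt(78) ≈ 8.83)."""
    n = len(bar_returns)
    mean = sum(bar_returns) / n
    bar_vol = math.sqrt(sum((r - mean) ** 2 for r in bar_returns) / (n - 1))
    return bar_vol * math.sqrt(BARS_PER_DAY)
```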
- orders.py: BUY stops now placed above market (trigger on rise), SELL stops below market (trigger on fall) — matches standard semantics
- market_data.py: guard against ZeroDivisionError when no EQUITY/ETF instruments exist (edge case with --num-instruments 1)
- transform_trades.py: total_cost now uses notional+costs for BUY, costs-only for SELL (notional is proceeds, not a cost on sells)
- trading_volume_analytics.py: remove unused df_market read that would crash the pipeline on first run due to missing workflow dependency
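The total_cost asymmetry is worth spelling out: on a BUY you pay the notional plus commission/fees; on a SELL the notional is cash you receive, so only the commission/fees are a cost. A small sketch under that rule (the signature is illustrative, not the repo's actual transform):

```python
def total_cost(side, quantity, price, commission, fees):
    """BUY: cash outlay = notional + costs.
    SELL: notional is proceeds, so cost = commission + fees only."""
    notional = quantity * price
    costs = commission + fees
    return notional + costs if side == "BUY" else costs
```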
- risk_metrics.py: order by timestamp instead of trade_date for deterministic row_number across intraday bars
- upload_to_s3.py: distinguish 403 (access denied) from 404 (not found) in head_bucket, only create bucket on 404
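The 403/404 distinction matters because boto3's `head_bucket` raises a `ClientError` for both cases; with boto3 the status is available via `err.response["Error"]["Code"]`. A sketch of just the decision logic, with the status code already extracted (function name and return value are hypothetical, not upload_to_s3.py's actual code):

```python
def handle_head_bucket_error(status_code):
    """404: the bucket does not exist, so creating it is safe.
    403: the bucket exists but we lack access — creating it would either
    fail or collide with someone else's bucket, so fail loudly instead."""
    if status_code == 404:
        return "create"
    if status_code == 403:
        raise PermissionError("bucket exists but access is denied")
    raise RuntimeError(f"unexpected head_bucket status: {status_code}")
```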
Summary
Adds a complete end-to-end modern data platform for financial securities trading data, built on AWS S3 and Databricks with a medallion architecture (bronze → silver → gold).
What's Included
- Terraform `.tf` files

Medallion Tiers
Review Feedback Addressed
- Replaced F.first()/F.last() in gold market analytics — now uses F.min_by()/F.max_by() for correct chronological OHLCV
- Changed _generate_isin()/_generate_cusip() to use seeded self.rng for reproducible data generation

Review & Testing Checklist for Human
- terraform.tfvars.example — update the Databricks workspace URL and account-specific ARNs before deploying
- databricks.tf — the placeholder arn:aws:iam::root needs your Databricks account ID
- Run python -m data_generation.generate_all locally to verify data generation works with your Python environment
- Run python -m data_generation.upload_to_s3 --create-bucket against a dev bucket

Test Plan
1. cd securities-trading-data-platform && pip install pandas pyarrow faker numpy pydantic pydantic-settings click rich python-dotenv boto3
2. python -m data_generation.generate_all --num-instruments 50 --num-trades 1000 --num-orders 1000 --num-market-data 5000
3. Inspect the output written to generated_data/
4. python -m data_generation.upload_to_s3 --create-bucket

Notes
- Settings are read from environment variables with the TRADING_ prefix or a .env file
- Run terraform init before applying the Terraform configuration

Link to Devin session: https://partner-workshops.devinenterprise.com/sessions/76ebe6f551394d83a14ca160b058b025
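The TRADING_ prefix convention means, for example, that a bucket-name setting would come from a TRADING_-prefixed environment variable. The repo lists pydantic-settings in its dependencies and likely uses it for this; the following is a simplified stdlib-only sketch of the same convention (variable and function names here are illustrative, not the repo's):

```python
import os

def load_setting(name, default=None):
    """Read a TRADING_-prefixed environment variable, falling back to a
    default. Mirrors the env-prefix convention from the Notes; the real
    project presumably layers a .env file on top via pydantic-settings."""
    return os.environ.get(f"TRADING_{name}", default)
```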
Requested by: @bsmitches