Databricks Code Practice

Get fluent in Databricks by typing, not watching.

104 exercises + 5 production-grade pipeline labs. All on Databricks Free Edition.

Clone once, import into Databricks, pick a folder. Exercises fail loud until your code is right; labs ship with synthetic data so you build production-style pipelines, not toy ones.
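To make "fail loud" concrete, here is a minimal sketch in plain Python of how an exercise check could behave. The names `check_exercise` and `top_customers` are hypothetical and not taken from this repo; the actual notebooks may validate results differently (for example, against Spark DataFrames).

```python
def check_exercise(name, actual, expected):
    """Hypothetical checker: raise a readable error when the answer is wrong."""
    if actual != expected:
        raise AssertionError(f"{name}: expected {expected!r}, got {actual!r}")
    return f"{name}: passed"


def top_customers(orders, n):
    """Hypothetical exercise: return the n customers with the highest total spend."""
    totals = {}
    for customer, amount in orders:
        totals[customer] = totals.get(customer, 0) + amount
    # Sort by spend descending, then by name so ties resolve deterministically.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [customer for customer, _ in ranked[:n]]


orders = [("ann", 30), ("bob", 20), ("ann", 5), ("cy", 40)]
result = check_exercise("top_customers", top_customers(orders, 2), ["cy", "ann"])
print(result)
```

A wrong solution stops the run with an `AssertionError` that names the exercise and shows both values, so there is no silent pass.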

New (18 April 2026): 5 full-scale pipeline labs + 1 benchmark deep-dive just landed. If you starred this repo for the exercises, they're still here - now alongside end-to-end project work.


Author

Jakub Lasak - Databricks Data Engineer. Helping you interview like seniors, execute like seniors, and think like seniors.

Prepping for interviews? Writing code is one half of the battle - knowing the questions that actually come up is the other. I maintain Databricks Interview Cheat Sheets by seniority level (junior / mid / senior / bundle).

What's Inside

Fluency comes from reps, not reading. Three structured paths:

  • exercises/ - focused reps on a single concept. LeetCode-style, 5-30 min each.
  • pipeline-labs/ - end-to-end medallion pipelines on a business scenario. 2-3 hours each.
  • deep-dives/ - measure the impact of a technique with numbers. 1-2 hours each.

| | Exercises | Pipeline Labs | Deep-Dives |
|---|---|---|---|
| Format | Single notebook, one TODO per exercise | Multi-notebook guided project | Single-topic deep investigation |
| Time | 5-30 min per exercise | 2-3 hours per lab | 1-2 hours |
| Scope | One concept (MERGE, window functions, ...) | End-to-end project (ingestion -> bronze -> silver -> gold) | One topic measured in depth |
| Narrative | None. "Given table X, write..." | Business scenario. "You're building a streaming pipeline for..." | Benchmark-driven. "Apply technique, measure the delta." |
| Order | Pick any, skip around | Sequential notebooks that build on each other | Sequential; each step layers on the last |
| Goal | Drill a skill until it's automatic | See how concepts fit in a real project | Prove what a technique actually buys you |

Catalog

Exercises (exercises/)

| Topic | Notebooks | Exercises | Description |
|---|---|---|---|
| Delta Lake | 6 | 51 | MERGE operations, time travel, schema enforcement, OPTIMIZE, liquid clustering, change data feed |
| ELT | 7 | 53 | Spark SQL joins, window functions, PySpark transformations, Auto Loader, batch ingestion, medallion architecture, complex data types |

Total: 13 notebooks, 104 exercises

More exercise topics coming - next up: Streaming, Unity Catalog, Performance, and DLT.
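As a taste of the first Delta Lake topic, the sketch below models the upsert semantics of a MERGE (update matched rows, insert unmatched ones) with plain Python dictionaries. It is an illustration of the logic only, not the Delta Lake API, and `merge_upsert` is a hypothetical helper. Delta fails a MERGE when several source rows try to update the same target row; this sketch is stricter and rejects any duplicate source key.

```python
def merge_upsert(target, source, key):
    """Plain-Python model of: WHEN MATCHED THEN UPDATE, WHEN NOT MATCHED THEN INSERT."""
    merged = {row[key]: dict(row) for row in target}
    seen = set()
    for row in source:
        if row[key] in seen:
            # Assumption of this sketch: any duplicate source key is an error.
            raise ValueError(f"duplicate source key: {row[key]!r}")
        seen.add(row[key])
        # Matched keys are overwritten; unmatched keys are inserted.
        merged[row[key]] = dict(row)
    return sorted(merged.values(), key=lambda r: r[key])


target = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Rome"}]
source = [{"id": 2, "city": "Milan"}, {"id": 3, "city": "Lima"}]
merged = merge_upsert(target, source, "id")
print(merged)
```

The function returns a new list and leaves `target` untouched, which keeps the before/after comparison easy to inspect.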

Pipeline Labs (pipeline-labs/)

Multi-notebook, end-to-end medallion pipelines with a business scenario. Each runs 2-3 hours and ships with a synthetic data generator.

| Lab | What You Build | Focus |
|---|---|---|
| Apparel Retail 360 (DLT) | End-to-end retail analytics pipeline on Delta Live Tables with a full medallion architecture. | DLT, Medallion, SCD Type 2, Streaming, Data Quality Expectations |
| Fintech Transaction Monitoring | Real-time fraud-monitoring pipeline for a payment processor handling 500K+ transactions/day. | Structured Streaming, Rescued Data, Watermarked Dedup, Stream-Static Joins, Liquid Clustering |
| DE Associate Certification Prep | Production-grade pipeline covering every exam domain of the Databricks Data Engineer Associate cert. | Auto Loader, COPY INTO, Medallion, SCD2, Jobs, Unity Catalog |
| PySpark Developer Cert Prep | E-commerce analytics pipeline covering every domain of the Spark Developer Associate cert. | DataFrame API, Structured Streaming, Data Skew, Performance Tuning |
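For readers new to the medallion pattern these labs are built around, here is a rough sketch of the bronze -> silver -> gold flow using plain Python lists instead of Delta tables. The record layout (`order_id,store,amount`) and the function names are assumptions made for illustration; the labs themselves use their own schemas and Spark APIs.

```python
from collections import defaultdict


def to_bronze(raw_lines):
    """Bronze: keep every record as received, adding only an ingestion sequence."""
    return [{"seq": i, "raw": line} for i, line in enumerate(raw_lines)]


def to_silver(bronze):
    """Silver: parse, drop malformed rows, and deduplicate on order_id."""
    seen, clean = set(), []
    for rec in bronze:
        parts = rec["raw"].split(",")
        if len(parts) != 3:
            continue  # wrong number of fields
        order_id, store, amount = parts
        try:
            amount = float(amount)
        except ValueError:
            continue  # amount is not numeric
        if order_id in seen:
            continue  # duplicate delivery of the same order
        seen.add(order_id)
        clean.append({"order_id": order_id, "store": store, "amount": amount})
    return clean


def to_gold(silver):
    """Gold: business-level aggregate, here revenue per store."""
    totals = defaultdict(float)
    for row in silver:
        totals[row["store"]] += row["amount"]
    return dict(totals)


raw = ["o1,berlin,10.0", "o2,paris,5.5", "o1,berlin,10.0",
       "bad-row", "o3,berlin,x", "o4,berlin,2.5"]
bronze = to_bronze(raw)
silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Each layer only reads from the one before it, so a bad silver rule can be fixed and replayed from bronze without re-ingesting the source.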

Deep-Dives (deep-dives/)

Single-topic labs that measure the impact of a technique with numbers, not intuition.

| Lab | What You Build | Focus |
|---|---|---|
| 6 Delta Optimization Techniques | Iteratively apply and measure core Delta performance levers on a synthetic 50M-row dataset. | Partitioning, Z-Order, OPTIMIZE, Auto Optimize, Liquid Clustering, VACUUM |
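The "apply technique, measure the delta" loop can be sketched in a few lines of plain Python. This is a toy analogy, not the deep-dive's actual harness: `measure`, `full_scan`, and `pruned_lookup` are hypothetical names, and a dictionary of pre-grouped rows stands in for partition pruning on a Delta table.

```python
import time


def measure(fn, *args, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def full_scan(rows, day):
    """Baseline: read every row and filter."""
    return [r for r in rows if r[0] == day]


def pruned_lookup(partitions, day):
    """Technique: jump straight to the rows for one day."""
    return partitions.get(day, [])


rows = [(i % 30, i) for i in range(100_000)]
partitions = {}
for r in rows:
    partitions.setdefault(r[0], []).append(r)

baseline = measure(full_scan, rows, 7)
optimized = measure(pruned_lookup, partitions, 7)
# Guard against a zero reading on coarse timers before dividing.
speedup = baseline / max(optimized, 1e-9)
print(f"baseline={baseline:.6f}s optimized={optimized:.6f}s speedup={speedup:.1f}x")
```

Taking the best of several runs reduces noise from warm-up and scheduling, and checking that both variants return the same rows keeps the comparison honest.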

How to Use

  1. Sign up for Databricks Free Edition (free, no credit card)
  2. Clone or import this repo into Databricks (Workspace -> Create -> Git folder)
  3. Navigate to the folder you want, open its README, and follow the instructions

Everything runs on Free Edition: serverless compute, Unity Catalog, Delta Lake. No cloud account, no cluster config.

Which Should I Start With?

Stay in the Loop

New exercises and labs ship regularly. Follow on LinkedIn or subscribe to the Substack newsletter to be notified when new content drops.

Feedback

Found a bug? Have a suggestion? Open an issue.


Disclaimer: This is an independent educational resource created by Jakub Lasak. Not affiliated with, endorsed by, or sponsored by Databricks, Inc. "Databricks" and "Delta Lake" are trademarks of their respective owners.