Skip to content

Latest commit

 

History

History
354 lines (256 loc) · 13.9 KB

File metadata and controls

354 lines (256 loc) · 13.9 KB

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Agent-Based Development Workflow

Every time the user requests a feature or bug fix, you MUST follow the process below:

Development Process

  1. Understanding The Task: Use the developer agent to understand what the user is asking for and to read GitHub issues
  2. Feature Development & Bug Fixes: Use the developer agent for implementing features and fixing bugs. IMPORTANT: Always begin by writing a failing test (or tests) that reflects the expected behavior
  3. Code Review: After development work, invoke the code-reviewer agent to review the implementation
  4. Iteration: Use the developer agent again to address feedback from the code reviewer
  5. Repeat: Continue the developer → code-reviewer cycle until no more feedback remains
  6. Documentation: If the feature or bug fix requires documentation updates, invoke the technical-writer agent

IMPORTANT: Make sure to share the project overview, architecture overview, and other concepts outlined below with the agent when it is invoked.

Agent Responsibilities

Developer Agent:

  • Understands a feature request or a reported issue
  • Implements new features following SQLMesh's architecture patterns
  • Fixes bugs with proper understanding of the codebase
  • Writes comprehensive tests following SQLMesh's testing conventions
  • Follows established code style and conventions

Code-Reviewer Agent:

  • Reviews implementation for quality and architectural compliance
  • Identifies potential issues, edge cases, and improvements
  • Ensures adherence to SQLMesh patterns and best practices
  • Validates test coverage and quality

Technical-Writer Agent:

  • Creates and updates user-facing documentation
  • Writes API documentation for new features
  • Updates existing docs after code changes
  • Creates migration guides and deep-dive technical explanations

Project Overview

SQLMesh is a next-generation data transformation framework that enables:

  • Virtual data environments for isolated development without warehouse costs
  • Plan/apply workflow (like Terraform) for safe deployments
  • Multi-dialect SQL support with automatic transpilation
  • Incremental processing to run only necessary transformations
  • Built-in testing and CI/CD integration

Requirements: Python >= 3.9 (Note: Python 3.13+ is not yet supported)

Essential Commands

Environment setup

# Create and activate a Python virtual environment (Python >= 3.9, < 3.13)
python -m venv .venv
source ./.venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install development dependencies
make install-dev

# Setup pre-commit hooks (important for code quality)
make install-pre-commit

Common Development Tasks

# Run linters and formatters (ALWAYS run before committing)
make style

# Fast tests for quick feedback during development
make fast-test

# Slow tests for comprehensive coverage
make slow-test

# Run specific test file
pytest tests/core/test_context.py -v

# Run tests with specific marker
pytest -m "not slow and not docker" -v

# Build package
make package

# Serve documentation locally
make docs-serve

Engine-Specific Testing

# DuckDB (default, no setup required)
make duckdb-test

# Other engines require credentials/Docker
make snowflake-test  # Needs SNOWFLAKE_* env vars
make bigquery-test   # Needs GOOGLE_APPLICATION_CREDENTIALS
make databricks-test # Needs DATABRICKS_* env vars

UI Development

# In web/client directory
pnpm run dev   # Start development server
pnpm run build # Production build
pnpm run test  # Run tests

# Docker-based UI
make ui-up     # Start UI in Docker
make ui-down   # Stop UI

Architecture Overview

Core Components

sqlmesh/core/context.py: The main Context class orchestrates all SQLMesh operations. This is the entry point for understanding how models are loaded, plans are created, and executions happen.

sqlmesh/core/model/: Model definitions and kinds (FULL, INCREMENTAL_BY_TIME_RANGE, SCD_TYPE_2, etc.). Each model kind has specific behaviors for how data is processed.

sqlmesh/core/snapshot/: The versioning system. Snapshots are immutable versions of models identified by fingerprints. Understanding snapshots is crucial for how SQLMesh tracks changes.

sqlmesh/core/plan/: Plan building and evaluation logic. Plans determine what changes need to be applied and in what order.

sqlmesh/core/engine_adapter/: Database engine adapters provide a unified interface across 16+ SQL engines. Each adapter handles engine-specific SQL generation and execution.

Key Concepts

  1. Virtual Environments: Lightweight branches that share unchanged data between environments, reducing storage costs and deployment time.

  2. Fingerprinting: Models are versioned using content-based fingerprints. Any change to a model's logic creates a new version.

  3. State Sync: Manages metadata across different backends (can be stored in the data warehouse or external databases).

  4. Intervals: Time-based partitioning system for incremental models, tracking what data has been processed.

Important Files

  • sqlmesh/core/context.py: Main orchestration class
  • examples/sushi/: Reference implementation used in tests
  • web/server/main.py: Web UI backend entry point
  • web/client/src/App.tsx: Web UI frontend entry point
  • vscode/extension/src/extension.ts: VSCode extension entry point

GitHub CI/CD Bot Architecture

SQLMesh includes a GitHub CI/CD bot integration that automates data transformation workflows. The implementation is located in sqlmesh/integrations/github/ and follows a clean architectural pattern.

Code Organization

Core Integration Files:

  • sqlmesh/cicd/bot.py: Main CLI entry point (sqlmesh_cicd command)
  • sqlmesh/integrations/github/cicd/controller.py: Core bot orchestration logic
  • sqlmesh/integrations/github/cicd/command.py: Individual command implementations
  • sqlmesh/integrations/github/cicd/config.py: Configuration classes and validation

Architecture Pattern

The bot follows a Command Pattern architecture:

  1. CLI Layer (bot.py): Handles argument parsing and delegates to controllers
  2. Controller Layer (controller.py): Orchestrates workflow execution and manages state
  3. Command Layer (command.py): Implements individual operations (test, deploy, plan, etc.)
  4. Configuration Layer (config.py): Manages bot configuration and validation

Key Components

GitHubCICDController: Main orchestrator that:

  • Manages GitHub API interactions via PyGithub
  • Coordinates workflow execution across different commands
  • Handles error reporting through GitHub Check Runs
  • Manages PR comment interactions and status updates

Command Implementations:

  • run_tests(): Executes unit tests with detailed reporting
  • update_pr_environment(): Creates/updates virtual PR environments
  • gen_prod_plan(): Generates production deployment plans
  • deploy_production(): Handles production deployments
  • check_required_approvers(): Validates approval requirements

Configuration Management:

  • Uses Pydantic models for type-safe configuration
  • Supports both YAML config files and environment variables
  • Validates bot settings and user permissions
  • Handles approval workflows and deployment triggers

Integration with Core SQLMesh

The bot leverages core SQLMesh components:

  • Context: Uses SQLMesh Context for project operations
  • Plan/Apply: Integrates with SQLMesh's plan generation and application
  • Virtual Environments: Creates isolated PR environments using SQLMesh's virtual data environments
  • State Sync: Manages metadata synchronization across environments
  • Testing Framework: Executes SQLMesh unit tests and reports results

Error Handling and Reporting

  • GitHub Check Runs: Creates detailed status reports for each workflow step
  • PR Comments: Provides user-friendly feedback on failures and successes
  • Structured Logging: Uses SQLMesh's logging framework for debugging
  • Exception Handling: Graceful handling of GitHub API failures and SQLMesh errors

Environment Variables for Engine Testing

When running engine-specific tests, these environment variables are required:

  • Snowflake: SNOWFLAKE_ACCOUNT, SNOWFLAKE_WAREHOUSE, SNOWFLAKE_DATABASE, SNOWFLAKE_USER, SNOWFLAKE_PASSWORD
  • BigQuery: BIGQUERY_KEYFILE or GOOGLE_APPLICATION_CREDENTIALS
  • Databricks: DATABRICKS_CATALOG, DATABRICKS_SERVER_HOSTNAME, DATABRICKS_HTTP_PATH, DATABRICKS_ACCESS_TOKEN, DATABRICKS_CONNECT_VERSION
  • Redshift: REDSHIFT_HOST, REDSHIFT_USER, REDSHIFT_PASSWORD, REDSHIFT_DATABASE
  • Athena: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, ATHENA_S3_WAREHOUSE_LOCATION
  • ClickHouse Cloud: CLICKHOUSE_CLOUD_HOST, CLICKHOUSE_CLOUD_USERNAME, CLICKHOUSE_CLOUD_PASSWORD

Migrations System

SQLMesh uses a migration system to evolve its internal state database schema and metadata format. The migrations handle changes to SQLMesh's internal structure, not user data transformations.

Migration Structure

Location: sqlmesh/migrations/ - Contains 80+ migration files from v0001 to v0083+

Naming Convention: v{XXXX}_{descriptive_name}.py (e.g., v0001_init.py, v0083_use_sql_for_scd_time_data_type_data_hash.py)

Core Infrastructure:

  • sqlmesh/core/state_sync/db/migrator.py: Main migration orchestrator
  • sqlmesh/utils/migration.py: Cross-database compatibility utilities
  • sqlmesh/core/state_sync/base.py: Auto-discovery and loading logic

Migration Categories

Schema Evolution:

  • State table creation/modification (snapshots, environments, intervals)
  • Column additions/removals and index management
  • Database engine compatibility fixes (MySQL/MSSQL field size limits)

Data Format Migrations:

  • JSON metadata structure updates (snapshot serialization changes)
  • Path normalization (Windows compatibility)
  • Fingerprint recalculation when SQLGlot parsing changes

Cleanup Operations:

  • Removing obsolete tables and unused data
  • Metadata optimization and attribute cleanup

Key Migration Patterns

# Standard migration function signature
def migrate(state_sync, **kwargs):  # type: ignore
    engine_adapter = state_sync.engine_adapter
    schema = state_sync.schema
    # Migration logic here

# Common operations
engine_adapter.create_state_table(table_name, columns_dict)
engine_adapter.alter_table(alter_expression)
engine_adapter.drop_table(table_name)

State Management Integration

Core State Tables:

  • _snapshots: Model version metadata (most frequently migrated)
  • _environments: Environment definitions
  • _versions: Schema/SQLGlot/SQLMesh version tracking
  • _intervals: Incremental processing metadata

Migration Safety:

  • Automatic backups before migration (unless skip_backup=True)
  • Atomic database transactions for consistency
  • Snapshot count validation before/after migrations
  • Automatic rollback on failures

Migration Execution

Auto-Discovery: Migrations are automatically loaded using pkgutil.iter_modules()

Triggers: Migrations run automatically when:

  • Schema version mismatch detected
  • SQLGlot version changes require fingerprint recalculation
  • Manual sqlmesh migrate command execution

Execution Flow:

  1. Version comparison (local vs remote schema)
  2. Backup creation of state tables
  3. Sequential migration execution (numerical order)
  4. Snapshot fingerprint recalculation if needed
  5. Environment updates with new snapshot references

dbt Integration

SQLMesh provides native support for dbt projects, allowing users to run existing dbt projects while gaining access to SQLMesh's advanced features like virtual environments and plan/apply workflows.

Core dbt Integration

Location: sqlmesh/dbt/ - Complete dbt integration architecture

Key Components:

  • sqlmesh/dbt/loader.py: Main dbt project loader extending SQLMesh's base loader
  • sqlmesh/dbt/manifest.py: dbt manifest parsing and project discovery
  • sqlmesh/dbt/adapter.py: dbt adapter system for SQL execution and schema operations
  • sqlmesh/dbt/model.py: dbt model configurations and materialization mapping
  • sqlmesh/dbt/context.py: dbt project context and environment management

Project Conversion

dbt Converter: sqlmesh/dbt/converter/ - Tools for migrating dbt projects to SQLMesh

Key Features:

  • convert.py: Main conversion orchestration
  • jinja.py & jinja_transforms.py: Jinja template and macro conversion
  • Full support for dbt assets (models, seeds, sources, tests, snapshots, macros)

CLI Commands:

# Initialize SQLMesh in existing dbt project
sqlmesh init -t dbt

# Convert dbt project to SQLMesh format
sqlmesh dbt convert

Supported dbt Features

Project Structure:

  • Full dbt project support (models, seeds, sources, tests, snapshots, macros)
  • dbt package dependencies and version management
  • Profile integration using existing profiles.yml for connections

Materializations:

  • All standard dbt materializations (table, view, incremental, ephemeral)
  • Incremental model strategies (delete+insert, merge, insert_overwrite)
  • SCD Type 2 support and snapshot strategies

Advanced Features:

  • Jinja templating with full macro support
  • Runtime variable passing and configuration
  • dbt test integration and execution
  • Cross-database compatibility with SQLMesh's multi-dialect support

Example Projects

sushi_dbt: examples/sushi_dbt/ - Complete dbt project running with SQLMesh Test Fixtures: tests/fixtures/dbt/sushi_test/ - Comprehensive test dbt project with all asset types

Integration Benefits

When using dbt with SQLMesh, you gain:

  • Virtual Environments: Isolated development without warehouse costs
  • Plan/Apply Workflow: Safe deployments with change previews
  • Multi-Dialect Support: Run the same dbt project across different SQL engines
  • Advanced Testing: Enhanced testing capabilities beyond standard dbt tests
  • State Management: Sophisticated metadata and versioning system