Release 2.0.0

@dimitri-yatsenko dimitri-yatsenko released this 03 Feb 01:55
3a1c2db

DataJoint 2.0 - Computational Foundation for Agentic Data Pipelines

This is a major release and a complete rewrite of the DataJoint Python library. It introduces a modernized architecture with an extensible type system, object-augmented schemas, semantic matching, and an improved developer experience.

Related:

  • PR #1311 — Complete rewrite implementation
  • Discussion #1235 — DataJoint 2.0 design
  • Discussion #1354 — Object-Augmented Schemas (OAS)
  • Discussion #1256 — Extensible type system
  • Discussion #1243 — Semantic matching and lineage

💥 Breaking Changes

Platform Requirements

  • Python 3.10+ required - Dropped support for Python 3.9 and earlier
  • MySQL 8.0+ required - Dropped support for MySQL 5.x and all other pre-8.0 versions

Architecture Changes

  • New package structure - Source code moved to src/datajoint/
  • Extensible Type/Codec System - New <codec> syntax replaces hardcoded blob/attach handling. Custom codecs extend dj.Codec with encode()/decode() methods
  • Object-Augmented Schemas (OAS) - Schema-addressed storage (<object@>, <npy@>) creates browsable paths mirroring database structure
  • Semantic Matching with Lineage - The ~lineage table tracks attribute origins. Joins and restrictions require that homologous namesakes (attributes matched by name) share the same lineage
  • Table-Specific Jobs Tables - Each Computed/Imported table has its own ~~table_name jobs table (replaces shared jobs table)
  • New Configuration System - pydantic-settings based config with datajoint.json, .secrets/ directory, and DJ_* environment variables
  • New Test Infrastructure - Uses testcontainers for automatic MySQL/MinIO management (no manual docker-compose required)

Removed/Deprecated Features

  • dj.conn() interactive prompts - Use environment variables or config file
  • dj.kill() and dj.kill_quick() - Use database administration tools
  • otumat dependency - S3 credential management simplified
  • Positional tuple inserts deprecated - Use dict with explicit field names
  • ~log table deprecated - Schema-level logging table no longer used
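
For code that still inserts positional tuples, one migration path is to pair each tuple with explicit field names before inserting. The following is a minimal sketch in plain Python, not part of DataJoint: the helper `rows_to_dicts`, the field names, and the sample rows are all hypothetical.

```python
# Hypothetical field order taken from a table definition (illustration only).
FIELD_NAMES = ("subject_id", "session_date", "weight")


def rows_to_dicts(rows, field_names=FIELD_NAMES):
    """Convert positional tuples into dicts keyed by explicit field names."""
    converted = []
    for row in rows:
        if len(row) != len(field_names):
            raise ValueError(
                f"expected {len(field_names)} values, got {len(row)}"
            )
        converted.append(dict(zip(field_names, row)))
    return converted


legacy_rows = [(1, "2024-01-15", 23.5), (2, "2024-01-16", 21.0)]
records = rows_to_dicts(legacy_rows)
```

The resulting `records` list holds one dict per row, which is the form the notes recommend passing to inserts.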

🚀 Major Features

Core Type System

Scientist-friendly type names with portable semantics:

  • Numeric: float32, float64, int64, int32, int16, int8, bool
  • Special: uuid (binary(16)), json, bytes (longblob)
  • Temporal: date, datetime
  • String: char(n), varchar(n), enum(...)
  • Fixed-point: decimal(m,n)
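
To make "portable semantics" concrete, each type name fixes a width that does not depend on the backend. The sketch below uses only the Python standard library and does not call DataJoint. The helper `int_range` is hypothetical, and treating the `int*` names as signed integers is an assumption.

```python
import struct
import uuid


def int_range(bits):
    """Inclusive range of a signed two's-complement integer of the given width."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


# Assumption: int8..int64 are signed types of exactly that many bits.
INT_RANGES = {f"int{bits}": int_range(bits) for bits in (8, 16, 32, 64)}

# A UUID is 128 bits, which is why it fits a binary(16) column.
uuid_bytes = uuid.uuid4().bytes

# Round-tripping 0.1 through 4-byte and 8-byte IEEE 754 encodings shows
# the precision difference between float32 and float64.
as_float32 = struct.unpack("<f", struct.pack("<f", 0.1))[0]
as_float64 = struct.unpack("<d", struct.pack("<d", 0.1))[0]
```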

Extensible Codec System

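import datajoint as dj

class GraphCodec(dj.Codec):
    name = "graph"  # referenced in table definitions as <graph>

    def get_dtype(self, is_store):
        return "<blob>"  # underlying storage type for encoded values

    def encode(self, value, *, key=None, store_name=None): ...  # Python object -> storable value
    def decode(self, stored, *, key=None): ...  # stored value -> Python object

# Use in definitions: data : <graph>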

Built-in codecs: <blob>, <blob@>, <attach>, <attach@>, <hash@>, <object@>, <npy@>, <filepath@>
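
The encode/decode contract can be exercised without a database. The sketch below is an illustration only: `CodecStandIn` is a stand-in base class, not `dj.Codec`, and it mirrors only the method signatures shown above. The adjacency-dict input and the node/edge-list stored form are assumptions chosen for the example.

```python
class CodecStandIn:
    """Stand-in for dj.Codec so the sketch runs without DataJoint installed."""

    name = None


class GraphCodec(CodecStandIn):
    name = "graph"

    def get_dtype(self, is_store):
        # Same return value as the example above.
        return "<blob>"

    def encode(self, value, *, key=None, store_name=None):
        # value: adjacency dict {node: iterable of neighbours}.
        nodes = sorted(value)
        edges = sorted((a, b) for a, nbrs in value.items() for b in nbrs)
        return {"nodes": nodes, "edges": edges}

    def decode(self, stored, *, key=None):
        # Rebuild the adjacency dict from the node and edge lists.
        graph = {node: set() for node in stored["nodes"]}
        for a, b in stored["edges"]:
            graph[a].add(b)
        return graph


codec = GraphCodec()
graph = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
stored = codec.encode(graph)
restored = codec.decode(stored)
```

The property worth checking in any custom codec is the round trip: `decode(encode(value))` should reproduce the original value.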

Object-Augmented Schemas (OAS)

  • Hash-addressed (<blob@>, <attach@>, <hash@>): Content-addressed storage with deduplication by MD5 hash (base32-encoded, 26 characters). Paths: _hash/{hash[:2]}/{hash[2:4]}/{hash}
  • Schema-addressed (<object@>, <npy@>): Paths mirror schema structure: {schema}/{table}/{pk}/{attribute}
  • Filepath references (<filepath@>): Reference existing files in stores without copying
  • Lazy references: NpyRef and ObjectRef provide metadata access without I/O

Semantic Matching

  • Lineage tracking identifies attribute origins (schema.table.attribute)
  • Binary operations (join, restrict, union, aggr) enforce lineage compatibility
  • Use schema.rebuild_lineage() for legacy schema migration
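
The lineage rule can be illustrated in a few lines of plain Python. This is a conceptual sketch, not DataJoint's implementation: the function `check_lineage`, the dict representation, the sample schemas, and the use of `ValueError` are all assumptions. Only the `schema.table.attribute` form of a lineage comes from these notes.

```python
def check_lineage(left: dict, right: dict) -> list:
    """Return the shared attribute names, or raise if their lineages differ.

    Both arguments map attribute name -> lineage string
    of the form 'schema.table.attribute'.
    """
    shared = [name for name in left if name in right]
    conflicts = {
        name: (left[name], right[name])
        for name in shared
        if left[name] != right[name]
    }
    if conflicts:
        raise ValueError(f"lineage mismatch: {conflicts}")
    return shared


# Hypothetical headings for three tables.
session = {
    "subject_id": "lab.subject.subject_id",
    "session": "lab.session.session",
}
scan = {
    "subject_id": "lab.subject.subject_id",
    "scan": "imaging.scan.scan",
}
patient = {"subject_id": "clinic.patient.subject_id"}

join_attrs = check_lineage(session, scan)  # same origin: allowed

try:
    check_lineage(session, patient)  # same name, different origin
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Matching on name alone would silently join `session` with `patient`; checking lineage turns that accidental match into an error.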

Jobs 2.0

  • Per-table job queues with ~~table_name naming pattern
  • Composite index (status, priority, scheduled_time) for efficient job fetching
  • Improved error tracking and job status management
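
The value of a composite index on (status, priority, scheduled_time) is easiest to see in the query it serves. The sketch below uses SQLite from the standard library as a stand-in for MySQL. The table layout, the `job_key` column, the status values, and the convention that a lower number means higher priority are assumptions, not DataJoint's actual jobs schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE jobs (
        job_key TEXT PRIMARY KEY,
        status TEXT NOT NULL,
        priority INTEGER NOT NULL,
        scheduled_time TEXT NOT NULL
    )
    """
)
# Column order matters: equality on status first, then the two sort columns.
conn.execute(
    "CREATE INDEX idx_fetch ON jobs (status, priority, scheduled_time)"
)
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?)",
    [
        ("a", "pending", 5, "2025-01-01 00:00:00"),
        ("b", "pending", 1, "2025-01-01 00:05:00"),
        ("c", "error", 0, "2025-01-01 00:00:00"),
        ("d", "pending", 1, "2025-01-01 00:01:00"),
        ("e", "pending", 0, "2099-01-01 00:00:00"),  # not yet due
    ],
)


def next_jobs(connection, now, limit=10):
    """Return keys of due pending jobs, highest priority and oldest first."""
    rows = connection.execute(
        """
        SELECT job_key FROM jobs
        WHERE status = 'pending' AND scheduled_time <= ?
        ORDER BY priority, scheduled_time
        LIMIT ?
        """,
        (now, limit),
    ).fetchall()
    return [row[0] for row in rows]


due = next_jobs(conn, "2025-01-02 00:00:00")
```

Because the filter is an equality on the leading index column, the remaining columns already hold the rows in the requested order, so a worker can fetch its next jobs without sorting the whole table.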

New Query Operator

  • extend(other) - Left-joins a functionally dependent table, preserving primary key and row count
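
The semantics of `extend` can be shown on plain lists of dicts. This is a conceptual sketch, not DataJoint's implementation: the standalone function, its `key` argument, and the sample data are hypothetical, and unmatched rows are filled with `None` to stand in for SQL NULL.

```python
def extend(rows, other, key):
    """Left-join `other` onto `rows` by `key`, keeping every row exactly once."""
    lookup = {tuple(entry[k] for k in key): entry for entry in other}
    if len(lookup) != len(other):
        # More than one entry per key would multiply rows.
        raise ValueError("other is not functionally dependent on key")
    extra_names = [n for n in (other[0] if other else {}) if n not in key]
    result = []
    for row in rows:
        match = lookup.get(tuple(row[k] for k in key), {})
        extra = {name: match.get(name) for name in extra_names}
        result.append({**row, **extra})
    return result


sessions = [
    {"subject_id": 1, "session": 1},
    {"subject_id": 1, "session": 2},
    {"subject_id": 2, "session": 1},
]
subjects = [{"subject_id": 1, "species": "mouse"}]

extended = extend(sessions, subjects, key=("subject_id",))
```

Unlike an inner join, the row for `subject_id` 2 is kept even though it has no matching subject, so the result always has the same number of rows as the left operand.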

Modernized Output Methods

  • keys() - Returns list of primary key dicts
  • to_arrays(*attrs) - Returns tuple of numpy arrays
  • to_dicts() - Returns list of dictionaries
  • to_pandas() - Returns pandas DataFrame
  • to_polars() - Returns Polars DataFrame
  • to_arrow() - Returns PyArrow Table
  • fetch() preserved with deprecation warning for backward compatibility

Configuration Enhancements

  • datajoint.json project config with parent directory search
  • .secrets/ directory for sensitive values (gitignore this)
  • database.database_prefix setting for automatic schema name prefixing
  • database.create_tables setting to control automatic table creation
  • dj.config.override() context manager for temporary config changes
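
Parent directory search means a config file at the project root is picked up from any subdirectory. The sketch below shows the idea with the standard library only. It is not DataJoint's loader: the function `find_config` is hypothetical, the JSON content is a placeholder rather than real settings, and these notes do not say where DataJoint stops searching.

```python
import json
import tempfile
from pathlib import Path

CONFIG_NAME = "datajoint.json"


def find_config(start: Path):
    """Return the nearest datajoint.json at or above `start`, or None."""
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / CONFIG_NAME
        if candidate.is_file():
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    nested = project / "analysis" / "notebooks"
    nested.mkdir(parents=True)
    # Placeholder content; real setting names are not shown here.
    (project / CONFIG_NAME).write_text(json.dumps({"example_key": "example_value"}))

    found = find_config(nested)
    found_in_project = found is not None and found.parent == project.resolve()
    loaded = json.loads(found.read_text()) if found else None
```

Starting the search from the working directory and walking upward lets notebooks and scripts in nested folders share one project-level file.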

📚 Documentation

Documentation has been moved to a dedicated repository and completely rewritten using the Diátaxis framework.

⚖️ License Change

DataJoint 2.0 is released under the Apache 2.0 license (previously LGPLv2.1).