Release 2.0.0
DataJoint 2.0 - Computational Foundation for Agentic Data Pipelines
This is a major release representing a complete rewrite of the DataJoint Python library. It introduces a modernized architecture with an extensible type system, object-augmented schemas, semantic matching, and improved developer experience.
Related:
- PR #1311 — Complete rewrite implementation
- Discussion #1235 — DataJoint 2.0 design
- Discussion #1354 — Object-Augmented Schemas (OAS)
- Discussion #1256 — Extensible type system
- Discussion #1243 — Semantic matching and lineage
💥 Breaking Changes
Platform Requirements
- Python 3.10+ required - Dropped support for Python 3.9 and earlier
- MySQL 8.0+ required - Dropped support for MySQL 5.x and pre-8.0 versions
Architecture Changes
- New package structure - Source code moved to src/datajoint/
- Extensible Type/Codec System - New <codec> syntax replaces hardcoded blob/attach handling. Custom codecs extend dj.Codec with encode()/decode() methods
- Object-Augmented Schemas (OAS) - Schema-addressed storage (<object@>, <npy@>) creates browsable paths mirroring database structure
- Semantic Matching with Lineage - The ~lineage table tracks attribute origins. Joins and restrictions enforce that homologous namesakes share lineage
- Table-Specific Jobs Tables - Each Computed/Imported table has its own ~~table_name jobs table (replaces the shared jobs table)
- New Configuration System - pydantic-settings based config with datajoint.json, a .secrets/ directory, and DJ_* environment variables
- New Test Infrastructure - Uses testcontainers for automatic MySQL/MinIO management (no manual docker-compose required)
Removed/Deprecated Features
- dj.conn() interactive prompts - Use environment variables or config file
- dj.kill() and dj.kill_quick() - Use database administration tools
- otumat dependency - S3 credential management simplified
- Positional tuple inserts deprecated - Use dict with explicit field names
- ~log table deprecated - Schema-level logging table no longer used
🚀 Major Features
Core Type System
Scientist-friendly type names with portable semantics:
- Numeric: float32, float64, int64, int32, int16, int8, bool
- Special: uuid (binary(16)), json, bytes (longblob)
- Temporal: date, datetime
- String: char(n), varchar(n), enum(...)
- Fixed-point: decimal(m,n)
Extensible Codec System
import datajoint as dj

class GraphCodec(dj.Codec):
    name = "graph"  # referenced in definitions as <graph>

    def get_dtype(self, is_store):
        return "<blob>"  # stored via the built-in <blob> codec
    def encode(self, value, *, key=None, store_name=None): ...
    def decode(self, stored, *, key=None): ...

# Use in definitions: data : <graph>

Built-in codecs: <blob>, <blob@>, <attach>, <attach@>, <hash@>, <object@>, <npy@>, <filepath@>
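To make the encode()/decode() contract more concrete, here is a minimal standalone sketch of what the elided method bodies of a graph codec might look like. It deliberately does not subclass dj.Codec, so it runs without DataJoint or a database. The class name GraphCodecSketch, the adjacency-dict input format, and the nodes/edges output layout are all assumptions made for illustration.

```python
class GraphCodecSketch:
    """Standalone stand-in that mirrors the method signatures shown above.

    Assumption: a graph is a dict mapping each node to a sorted list of
    neighbours, and every neighbour also appears as a key.
    """

    name = "graph"

    def get_dtype(self, is_store):
        # Delegate storage to the built-in <blob> codec, as in the example above.
        return "<blob>"

    def encode(self, value, *, key=None, store_name=None):
        # Flatten the adjacency dict into plain lists that a blob can hold.
        nodes = sorted(value)
        edges = sorted((src, dst) for src, targets in value.items() for dst in targets)
        return {"nodes": nodes, "edges": [list(edge) for edge in edges]}

    def decode(self, stored, *, key=None):
        # Rebuild the adjacency dict, keeping nodes without outgoing edges.
        graph = {node: [] for node in stored["nodes"]}
        for src, dst in stored["edges"]:
            graph[src].append(dst)
        return graph


codec = GraphCodecSketch()
graph = {"a": ["b", "c"], "b": ["c"], "c": []}
restored = codec.decode(codec.encode(graph))
```

The property worth checking in any codec is the round trip: decoding an encoded value should return something equal to the original.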
Object-Augmented Schemas (OAS)
- Hash-addressed (<blob@>, <attach@>, <hash@>): Content-addressed with MD5 deduplication (base32-encoded, 26 chars). Paths: _hash/{hash[:2]}/{hash[2:4]}/{hash}
- Schema-addressed (<object@>, <npy@>): Paths mirror schema structure: {schema}/{table}/{pk}/{attribute}
- Filepath references (<filepath@>): Reference existing files in stores without copying
- Lazy references: NpyRef and ObjectRef provide metadata access without I/O
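The two path layouts above can be sketched with the standard library alone. This is an illustration of the patterns, not DataJoint's implementation: the function names are hypothetical, the lowercase unpadded base32 form is an assumption (the notes only state base32 and 26 characters), and the way the primary key is rendered into a path segment is also assumed.

```python
import base64
import hashlib


def hash_addressed_path(content: bytes) -> str:
    """Build a _hash/{hash[:2]}/{hash[2:4]}/{hash} path from raw content."""
    digest = hashlib.md5(content).digest()  # 16 bytes
    # Base32 of 16 bytes is 26 characters plus "=" padding, which is stripped.
    # Assumption: lowercase, unpadded output; the real encoding may differ.
    token = base64.b32encode(digest).decode("ascii").rstrip("=").lower()
    return f"_hash/{token[:2]}/{token[2:4]}/{token}"


def schema_addressed_path(schema: str, table: str, pk: dict, attribute: str) -> str:
    """Build a {schema}/{table}/{pk}/{attribute} path.

    Assumption: the primary key is rendered as name=value pairs joined by "_".
    """
    pk_part = "_".join(f"{name}={value}" for name, value in pk.items())
    return f"{schema}/{table}/{pk_part}/{attribute}"


# Identical content maps to the same path, which is what enables deduplication.
first = hash_addressed_path(b"same bytes")
second = hash_addressed_path(b"same bytes")
object_path = schema_addressed_path(
    "lab", "recording", {"subject_id": 7, "session": 2}, "raw"
)
```

The contrast is the point: a hash-addressed path depends only on the bytes stored, while a schema-addressed path depends only on where the value lives in the schema.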
Semantic Matching
- Lineage tracking identifies attribute origins (schema.table.attribute)
- Binary operations (join, restrict, union, aggr) enforce lineage compatibility
- Use schema.rebuild_lineage() for legacy schema migration
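The lineage rule can be illustrated with a small standalone check. This sketch is not the library's code: check_lineage is a hypothetical helper, and representing each operand as a dict from attribute name to a "schema.table.attribute" string is an assumption based on the format given above.

```python
def check_lineage(left: dict, right: dict) -> list:
    """Return the attribute names two operands can be matched on.

    Each operand maps attribute name -> lineage ("schema.table.attribute").
    Raises ValueError when a shared name has different origins.
    """
    shared = [name for name in left if name in right]
    for name in shared:
        if left[name] != right[name]:
            raise ValueError(
                f"attribute {name!r} has lineage {left[name]} on one side "
                f"and {right[name]} on the other"
            )
    return shared


session = {"subject_id": "lab.subject.subject_id", "session": "lab.session.session"}
scan = {"subject_id": "lab.subject.subject_id", "scan": "imaging.scan.scan"}
# Same name, different origin: a namesake that is not homologous.
order = {"session": "shop.checkout.session", "total": "shop.checkout.total"}

matched = check_lineage(session, scan)
try:
    check_lineage(session, order)
    rejected = False
except ValueError:
    rejected = True
```

Matching on name alone would silently join session with order; checking origins turns that accident into an explicit error.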
Jobs 2.0
- Per-table job queues with ~~table_name naming pattern
- Composite index (status, priority, scheduled_time) for efficient job fetching
- Improved error tracking and job status management
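The following sketch shows why an index on (status, priority, scheduled_time) suits job fetching: the query filters on status and then reads rows in priority and time order. It uses SQLite from the standard library rather than MySQL so that it runs anywhere. The table name, the job_key column, the status values, and the convention that a lower number means higher priority are all assumptions, not the actual jobs table schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE jobs (
        job_key TEXT PRIMARY KEY,
        status TEXT NOT NULL,
        priority INTEGER NOT NULL,
        scheduled_time TEXT NOT NULL
    )
    """
)
# Composite index with the column order named in the release notes.
conn.execute("CREATE INDEX job_fetch ON jobs (status, priority, scheduled_time)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?)",
    [
        ("a", "pending", 5, "2025-01-01 00:00:00"),
        ("b", "pending", 1, "2025-01-02 00:00:00"),
        ("c", "error", 0, "2025-01-01 00:00:00"),
        ("d", "pending", 1, "2025-01-01 12:00:00"),
    ],
)


def next_job(now: str):
    """Return the key of the next pending job due at or before `now`, or None."""
    row = conn.execute(
        "SELECT job_key FROM jobs "
        "WHERE status = 'pending' AND scheduled_time <= ? "
        "ORDER BY priority, scheduled_time LIMIT 1",
        (now,),
    ).fetchone()
    return row[0] if row else None


first_due = next_job("2025-01-03 00:00:00")
early = next_job("2025-01-01 06:00:00")
```

Because status is the leading index column, the equality filter narrows the scan first, and the remaining columns already provide the requested ordering.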
New Query Operator
- extend(other) - Left-joins a functionally dependent table, preserving primary key and row count
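The semantics described for extend can be sketched on plain lists of dicts. This models the behaviour only, under the assumption that the other table has at most one row per key value; extend_rows is a hypothetical helper, not the DataJoint operator.

```python
def extend_rows(rows: list, other: list, key: tuple) -> list:
    """Left-join `other` onto `rows`, matching on the attributes in `key`.

    Assumes `other` has at most one row per key value (functional
    dependence), which is what keeps the row count unchanged.
    """
    lookup = {tuple(r[k] for k in key): r for r in other}
    extra = [name for name in (other[0] if other else {}) if name not in key]
    result = []
    for row in rows:
        match = lookup.get(tuple(row[k] for k in key))
        # Unmatched rows are kept, with the added attributes set to None.
        added = {name: (match[name] if match else None) for name in extra}
        result.append({**row, **added})
    return result


sessions = [
    {"subject_id": 1, "session": 1},
    {"subject_id": 1, "session": 2},
    {"subject_id": 2, "session": 1},
]
subjects = [{"subject_id": 1, "species": "mouse"}]

extended = extend_rows(sessions, subjects, key=("subject_id",))
```

Unlike an inner join, the session for subject 2 survives even though no matching subject row exists; only its added attribute is empty.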
Modernized Output Methods
- keys() - Returns list of primary key dicts
- to_arrays(*attrs) - Returns tuple of numpy arrays
- to_dicts() - Returns list of dictionaries
- to_pandas() - Returns pandas DataFrame
- to_polars() - Returns Polars DataFrame
- to_arrow() - Returns PyArrow Table
- fetch() preserved with deprecation warning for backward compatibility
Configuration Enhancements
- datajoint.json project config with parent directory search
- .secrets/ directory for sensitive values (gitignore this)
- database.database_prefix setting for automatic schema name prefixing
- database.create_tables setting to control automatic table creation
- dj.config.override() context manager for temporary config changes
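Parent directory search means a project config is found from any subdirectory of the project. The sketch below shows one way such a lookup could work, together with a simple collection of DJ_* environment variables. The helper names, the database.host key, the DJ_HOST variable, and the key normalisation are hypothetical; the actual resolution is handled by DataJoint's pydantic-settings based config.

```python
import json
import tempfile
from pathlib import Path


def find_config(start: Path, filename: str = "datajoint.json"):
    """Return the nearest `filename` in `start` or any parent directory, or None."""
    for directory in [start, *start.parents]:
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None


def env_overrides(environ: dict, prefix: str = "DJ_") -> dict:
    """Collect prefixed variables; the key normalisation is an assumption."""
    return {
        name[len(prefix):].lower(): value
        for name, value in environ.items()
        if name.startswith(prefix)
    }


with tempfile.TemporaryDirectory() as root:
    project = Path(root).resolve()
    config_file = project / "datajoint.json"
    config_file.write_text(json.dumps({"database": {"host": "db.example"}}))
    nested = project / "analysis" / "notebooks"
    nested.mkdir(parents=True)

    # Searching from a nested working directory still finds the project config.
    found = find_config(nested)
    found_in_project = found == config_file
    settings = json.loads(found.read_text())

overrides = env_overrides({"DJ_HOST": "localhost", "PATH": "/usr/bin"})
```

Walking upward from the working directory lets notebooks and scripts in nested folders share one project-level file without hardcoding its location.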
📚 Documentation
Documentation has been moved to a dedicated repository and completely rewritten using the Diátaxis framework:
- Live site: https://docs.datajoint.com
- Repository: https://github.com/datajoint/datajoint-docs
Structure:
- Tutorials — Learn by building real pipelines (Jupyter notebooks)
- How-To Guides — Practical task-oriented guides
- Explanation — Understanding concepts and design
- Reference — Specifications and API documentation
- Migration Guide — Upgrade from legacy versions
⚖️ License Change
DataJoint 2.0 is released under the Apache 2.0 license (previously LGPLv2.1).