-
Notifications
You must be signed in to change notification settings - Fork 93
DataJoint 2.0 #1311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
DataJoint 2.0 #1311
Conversation
update test workflow to use src layout
use pytest to manage docker container startup for tests
Chore/dev env fixes
Impr/modernize pre commit
switch settings to use pydantic-settings
…tings-management-adsuG
- Remove ConfigWrapper class and backward compatibility layer - Use direct pydantic BaseSettings with typed nested models - Change context manager from config(...) to config.override(...) - Add validate_assignment=True for runtime type checking - Improve store spec validation with clear error messages - Update all tests to use new API - Preserve dict-style access via __getitem__/__setitem__ for convenience
Config file search: - Search for datajoint.json recursively up from cwd - Stop at .git/.hg boundaries or filesystem root - Warn if no config file found (instead of silently using defaults) Secrets management: - Add .secrets/ directory support (next to datajoint.json) - Support /run/secrets/datajoint/ for Docker/Kubernetes - Use SecretStr for password and aws_secret_access_key - Secrets masked in repr/logs, excluded from save() - Dict access automatically unwraps SecretStr for compatibility Breaking changes: - Config file renamed from dj_local_conf.json to datajoint.json - No more ~/.datajoint_config.json (project-only config) - Secrets should be in env vars or .secrets/ directory
Add convenient type aliases that map to MySQL types: - float32 -> float - float64 -> double - int32 -> int - uint32 -> int unsigned - int16 -> smallint - uint16 -> smallint unsigned - int8 -> tinyint - uint8 -> tinyint unsigned These aliases follow the same pattern as UUID, storing the original type in the column comment for round-tripping.
- int64 -> bigint - uint64 -> bigint unsigned
Add comprehensive tests for the new type aliases feature: - Pattern matching tests for all 10 type aliases - MySQL type mapping verification - Table creation with type aliases - Insert and fetch operations - Primary key usage with type aliases - Nullable column support
Documentation is now consolidated in datajoint-docs repository. Changes: - Delete docs/ folder (legacy MkDocs infrastructure) - Create ARCHITECTURE.md with transpiler design docs - Update README.md links to point to docs.datajoint.com The Developer Guide remains in README.md. Internal architecture documentation for contributors is now in ARCHITECTURE.md. Co-Authored-By: Claude Opus 4.5 <[email protected]>
Transpilation documentation moved to datajoint-docs query-algebra spec. Developer docs now consolidated in README.md. Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Rename CHANGELOG.md to CHANGELOG-archive.md with redirect to GitHub Releases - Add "Writing Release Notes" section to RELEASE_MEMO.md: - Categories (BREAKING, Added, Changed, Deprecated, Fixed, Security) - Format template with examples - Guidelines for good release notes - PR label mapping for release drafter Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Slim README.md to essentials (intro, badges, install, links) - Create CONTRIBUTING.md with: - Development setup (pixi and pip) - Test running instructions - Pre-commit hooks - Environment variables - Condensed docstring style guide - Delete DOCSTRING_STYLE.md (merged into CONTRIBUTING.md) README: 218 → 82 lines All detailed docs now at docs.datajoint.com Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add .github/DISCUSSION_TEMPLATE/rfc.yml for enhancement proposals - Fix table header alignment (center instead of right) - Fix excessive padding in table headers by removing p tag margins Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Raw blobs (no codec) now show "bytes" - Raw json (no codec) shows "json" - Codec fields show "<codec_name>" (e.g., <blob>, <hash>, <object>) - HTML output properly escapes angle brackets for browser display - Improves clarity when viewing table contents Example output: *id raw_blob blob_data json_data 1 bytes <blob> json Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add migrate_external() and migrate_filepath() to datajoint.migrate module for safe migration of 0.x external storage columns to 2.0 JSON format. Migration strategy: 1. Add new <column>_v2 columns with JSON type 2. Copy and convert data from old columns 3. User verifies data accessible via DataJoint 2.0 4. Finalize: rename columns (old → _v1, new → original) This allows 0.x and 2.0 to coexist during migration and provides rollback capability if issues are discovered. Functions: - migrate_external(schema, dry_run=True, finalize=False) - migrate_filepath(schema, dry_run=True, finalize=False) - _find_external_columns(schema) - detect 0.x external columns - _find_filepath_columns(schema) - detect 0.x filepath columns Co-Authored-By: Claude Opus 4.5 <[email protected]>
Implement the `<npy@>` codec for schema-addressed numpy array storage:
- Add SchemaCodec base class for path-addressed storage codecs
- Add NpyRef class for lazy array references with metadata
- Add NpyCodec using .npy format with shape/dtype inspection
- Refactor ObjectCodec to inherit from SchemaCodec
- Rename is_external to is_store throughout codebase
- Export SchemaCodec and NpyRef from public API
- Bump version to 2.0.0a17
Key features:
- Lazy loading: inspect shape/dtype without downloading
- NumPy integration via __array__ protocol
- Safe bulk fetch: returns NpyRef objects, not arrays
- Schema-addressed paths: {schema}/{table}/{pk}/{attr}.npy
Co-Authored-By: Claude Opus 4.5 <[email protected]>
The SchemaCodec (used by NpyCodec and ObjectCodec) needs _schema,
_table, _field, and primary key values to construct schema-addressed
storage paths. Previously, key=None was passed, resulting in
"unknown/unknown" paths.
Now builds proper context dict from table metadata and row values,
enabling navigable paths like:
{schema}/{table}/objects/{pk_path}/{attribute}.npy
Co-Authored-By: Claude Opus 4.5 <[email protected]>
…to feature/npy-codec
Merge PR #1330 (blob preview display) into feature/npy-codec. Bump version from 2.0.0a17 to 2.0.0a18. Co-Authored-By: Claude Opus 4.5 <[email protected]>
Address reviewer feedback from PR #1330: attr should never be None since field_name comes from heading.names. Raising an error surfaces bugs immediately rather than silently returning a misleading placeholder. Co-Authored-By: Claude Opus 4.5 <[email protected]>
Show codec names in table preview instead of =BLOB=
Support memory-mapped loading for large arrays: - Local filesystem stores: mmap directly, no download - Remote stores: download to cache, then mmap Co-Authored-By: Claude Opus 4.5 <[email protected]>
…orage Major changes to hash-addressed storage model: - Rename content_registry.py → hash_registry.py for clarity - Always store full path in metadata (protects against config changes) - Use stored path directly for retrieval (no path regeneration) - Add delete_path() as primary function, deprecate delete_hash() - Add get_size() as primary function, deprecate get_hash_size() - Update gc.py to work with paths instead of hashes - Update builtin_codecs.py HashCodec to use new API This design enables seamless migration from v0.14: - Legacy data keeps old paths in metadata - New data uses new path structure - GC compares stored paths against filesystem Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Remove uuid_from_buffer from hash.py (dead code) - connection.py now uses hashlib.md5().hexdigest() directly - Update test_hash.py to test key_hash instead Co-Authored-By: Claude Opus 4.5 <[email protected]>
Remove dead code that was only tested but never used in production: - hash_exists (gc uses set operations on paths) - delete_hash (gc uses delete_path directly) - get_size (gc collects sizes during walk) - get_hash_size (wrapper for get_size) Remaining API: compute_hash, build_hash_path, get_store_backend, get_store_subfolding, put_hash, get_hash, delete_path Co-Authored-By: Claude Opus 4.5 <[email protected]>
feat: Add NpyCodec for lazy-loading numpy arrays
| "errors", | ||
| "migrate", | ||
| "DataJointError", | ||
| "key", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dimitri-yatsenko The datajoint-docs state that dj.key is removed in 2.0 (see https://github.com/datajoint/datajoint-docs/blob/pre/v2.0/src/reference/specs/fetch-api.md#removed-methods-and-parameters), but' key' is still listed in __all__ in here. Should key be removed from the exports here, or does the documentation need to be updated?
|
The PR description states "Removed
This behavior is correctly documented in datajoint-docs, but the PR description could be updated to: "Renamed method parameter |
|
There's a TODO at |
| ] | ||
| else: # positional | ||
| warnings.warn( | ||
| "Positional inserts (tuples/lists) are deprecated and will be removed in a future version. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This deprecation isn't documented in the migration guide. Should it be added, or is this intentionally a soft deprecation for now?
|
|
||
| See Also | ||
| -------- | ||
| https://docs.datajoint.com/core/datajoint-python/0.14/client/install/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Outdated URL (0.14 docs). Consider updating to https://docs.datajoint.com/how-to/read-diagrams/ or adding Diagram dependencies to the installation guide.
| if original_type.startswith("external"): | ||
| raise DataJointError( | ||
| f"Legacy datatype `{original_type}`. Migrate your external stores to datajoint 0.12: " | ||
| "https://docs.datajoint.io/python/admin/5-blob-config.html#migration-between-datajoint-v0-11-and-v0-12" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Outdated URL and message. This error triggers for legacy :external: format, but points to 0.11→0.12 migration on old datajoint.io domain. Should reference Phase 6 of the 2.0 migration guide: https://docs.datajoint.com/how-to/migrate-from-0x/#phase-6-external-storage-database
|
|
||
| For single-row fetch, use fetch1() which is unchanged. | ||
|
|
||
| See migration guide: https://docs.datajoint.com/migration/fetch-api |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Broken URL - /migration/fetch-api doesn't exist. Should be: https://docs.datajoint.com/reference/specs/fetch-api/
|
diagram.py:26,33 — Lines 26 and 33 use bare |
|
Suggestion: blob.py:162,366 — These if len(blob) != blob_size:
raise DataJointError("Blob size mismatch") |
|
There are a few |
Summary
DataJoint 2.0 is a major release that modernizes the entire codebase while maintaining backward compatibility for core functionality. This release focuses on extensibility, type safety, and developer experience.
Planning: DataJoint 2.0 Plan | Milestone 2.0
Major Features
Codec System (Extensible Types)
Replaces the adapter system with a modern, composable codec architecture:
<blob>,<json>,<attach>,<filepath>,<object>,<hash><blob>wraps<json>for external storage)__init_subclass__validate()method for type checking before insertSemantic Matching
Attribute lineage tracking ensures joins only match semantically compatible attributes:
idornamesemantic_check=Falsefor legacy permissive behaviorPrimary Key Rules
Rigorous primary key propagation through all operators:
dj.U('attr')creates ad-hoc grouping entitiesAutoPopulate 2.0 (Jobs System)
Per-table job management with enhanced tracking:
~~_job_timestampand~~_job_durationcolumns~~table_namejob tabletable.progress()returns (remaining, total)Modern Fetch & Insert API
New fetch methods:
to_dicts()- List of dictionariesto_pandas()- DataFrame with PK as indexto_arrays(*attrs)- NumPy arrays (structured or individual)keys()- Primary keys onlyfetch1()- Single rowInsert improvements:
validate()- Check rows before insertingchunk_size- Batch large insertsinsert_dataframe()- DataFrame with index handlingType Aliases
Core DataJoint types for portability:
int8,int16,int32,int64uint8,uint16,uint32,uint64float32,float64booluuidObject Storage
Content-addressed and object storage types:
<hash>- Content-addressed storage with deduplication<object>- Named object storage (Zarr, folders)<filepath>- Reference to managed files<attach>- File attachments (uploaded on insert)Virtual Schema Infrastructure (#1307)
New schema introspection API for exploring existing databases:
Schema.get_table(name)- Direct table access with auto tier prefix detectionSchema['TableName']- Bracket notation accessfor table in schema- Iterate tables in dependency order'TableName' in schema- Check table existencedj.virtual_schema()- Clean entry point for accessing schemasdj.VirtualModule()- Virtual modules with custom namesCLI Improvements
The
djcommand-line interface for interactive exploration:dj -s schema:alias- Load schemas as virtual modules--host,--user,--password- Connection options-hconflict with--helpSettings Modernization
Pydantic-based configuration with validation:
dj.config.override()context manager.secrets/)DJ_HOST, etc.)License Change
Changed from LGPL to Apache 2.0 license (#1235 (discussion)):
Breaking Changes
Removed Support
fetch()with format parametersafemodeparameter (useprompt)create_virtual_module(usedj.virtual_schema()ordj.VirtualModule())~logtable (IMPR: Deprecate and Remove the~logTable. #1298)API Changes
fetch()→to_dicts(),to_pandas(),to_arrays()fetch(format='frame')→to_pandas()fetch(as_dict=True)→to_dicts()safemode=False→prompt=FalseSemantic Changes
Documentation
Developer Documentation (this repo)
Comprehensive updates in
docs/:User Documentation (datajoint-docs)
Full documentation site following the Diátaxis framework:
Tutorials (learning-oriented, Jupyter notebooks):
How-To Guides (task-oriented):
Reference (specifications):
Project Structure
src/layout for proper packaging (IMPR:srclayout #1267)Test Plan
Closes
Milestone 2.0 Issues
~logTable. #1298 - Deprecate and remove~logtablesuper.deletekwargs toPart.delete#1276 - Part.delete kwargs pass-throughsrclayout #1267 -srclayoutdj.Toporders the preview withorder_by#1242 -dj.Toporders the preview withorder_byBug Fixes
pyarrow(apandasdependency) #1202 - DataJoint import error with missing pyarrowValueErrorin DataJoint-Python 0.14.3 when using numpy 2.2.* #1201 - ValueError with numpy 2.2dj.Diagram()and new release ofpydot==3.0.*#1169 - Error with dj.Diagram() and pydot 3.0Improvements
Related PRs
Migration Guide
See How to Migrate from 1.x for detailed migration instructions.
🤖 Generated with Claude Code