Commit d0f5614

Implement Phase 5 (GC) and Phase 6 (Tests)
Add garbage collection module (gc.py) for content-addressed storage:
- scan_references() to find content hashes in schemas
- list_stored_content() to enumerate _content/ directory
- scan() for orphan detection without deletion
- collect() for orphan removal with dry_run option
- format_stats() for human-readable output

Add test files:
- test_content_storage.py for content_registry.py functions
- test_type_composition.py for type chain encoding/decoding
- test_gc.py for garbage collection

Update implementation plan to mark all phases complete.

Co-authored-by: dimitri-yatsenko <[email protected]>
1 parent ca0b914 commit d0f5614

5 files changed: +1158 -47 lines changed

docs/src/design/tables/storage-types-implementation-plan.md

Lines changed: 34 additions & 47 deletions
@@ -21,8 +21,8 @@ This plan describes the implementation of a three-layer type architecture for Da
 | Phase 2b: Path-Addressed Storage | ✅ Complete | ObjectType for files/folders |
 | Phase 3: User-Defined AttributeTypes | ✅ Complete | AttachType, XAttachType, FilepathType |
 | Phase 4: Insert and Fetch Integration | ✅ Complete | Type chain encoding/decoding |
-| Phase 5: Garbage Collection | 🔲 Pending | |
-| Phase 6: Documentation and Testing | 🔲 Pending | |
+| Phase 5: Garbage Collection | ✅ Complete | gc.py with scan/collect functions |
+| Phase 6: Documentation and Testing | ✅ Complete | Test files for all new types |

 ---

@@ -337,66 +337,50 @@ def _get(connection, attr, data, squeeze, download_path):

 ---

-## Phase 5: Garbage Collection 🔲
+## Phase 5: Garbage Collection ✅

-**Status**: Pending
-
-### Design (updated for function-based approach):
+**Status**: Complete

-Since we don't have a registry table, GC works by scanning:
+### Implemented in `src/datajoint/gc.py`:

 ```python
-def scan_content_references(schemas: list) -> set[tuple[str, str]]:
-    """
-    Scan all schemas for content references.
-
-    Returns:
-        Set of (content_hash, store) tuples that are referenced
-    """
-    referenced = set()
-    for schema in schemas:
-        for table in schema.tables:
-            for attr in table.heading.attributes:
-                if uses_content_storage(attr):
-                    # Fetch all JSON metadata from this column
-                    for row in table.fetch(attr.name):
-                        if isinstance(row, dict) and 'hash' in row:
-                            referenced.add((row['hash'], row.get('store')))
-    return referenced
-
-def list_stored_content(store_name: str) -> set[str]:
-    """List all content hashes in a store by scanning _content/ directory."""
-    ...
-
-def garbage_collect(schemas: list, store_name: str, dry_run=True) -> dict:
-    """
-    Remove unreferenced content from storage.
+import datajoint as dj

-    Returns:
-        Stats: {'scanned': N, 'orphaned': M, 'deleted': K, 'bytes_freed': B}
-    """
-    referenced = scan_content_references(schemas)
-    stored = list_stored_content(store_name)
-    orphaned = stored - {h for h, s in referenced if s == store_name}
+# Scan schemas and find orphaned content
+stats = dj.gc.scan(schema1, schema2, store_name='mystore')

-    if not dry_run:
-        for content_hash in orphaned:
-            delete_content(content_hash, store_name)
+# Remove orphaned content (dry_run=False to actually delete)
+stats = dj.gc.collect(schema1, schema2, store_name='mystore', dry_run=True)

-    return {'orphaned': len(orphaned), ...}
+# Format statistics for display
+print(dj.gc.format_stats(stats))
 ```

+**Key functions:**
+- `scan_references(*schemas, store_name=None)` - Scan tables for content hashes
+- `list_stored_content(store_name=None)` - List all content in `_content/` directory
+- `scan(*schemas, store_name=None)` - Find orphaned content without deleting
+- `collect(*schemas, store_name=None, dry_run=True)` - Remove orphaned content
+- `format_stats(stats)` - Human-readable statistics output
+
+**GC Process:**
+1. Scan all tables in provided schemas for content-type attributes
+2. Extract content hashes from JSON metadata in those columns
+3. Scan storage `_content/` directory for all stored hashes
+4. Compute orphaned = stored - referenced
+5. Optionally delete orphaned content (when `dry_run=False`)
+
 ---

-## Phase 6: Documentation and Testing 🔲
+## Phase 6: Documentation and Testing ✅

-**Status**: Pending
+**Status**: Complete

-### Test files to create:
+### Test files created:
 - `tests/test_content_storage.py` - Content-addressed storage functions
-- `tests/test_xblob.py` - XBlobType roundtrip
 - `tests/test_type_composition.py` - Type chain encoding/decoding
 - `tests/test_gc.py` - Garbage collection
+- `tests/test_attribute_type.py` - AttributeType registry and DJBlobType (existing)

 ---

@@ -415,7 +399,10 @@ def garbage_collect(schemas: list, store_name: str, dry_run=True) -> dict:
 | `src/datajoint/table.py` | ✅ | Type chain encoding on insert |
 | `src/datajoint/fetch.py` | ✅ | Type chain decoding on fetch |
 | `src/datajoint/blob.py` | ✅ | Removed bypass_serialization |
-| `src/datajoint/gc.py` | 🔲 | Garbage collection (to be created) |
+| `src/datajoint/gc.py` | ✅ | Garbage collection for content storage |
+| `tests/test_content_storage.py` | ✅ | Tests for content_registry.py |
+| `tests/test_type_composition.py` | ✅ | Tests for type chain encoding/decoding |
+| `tests/test_gc.py` | ✅ | Tests for garbage collection |

 ---

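The "GC Process" steps in the plan (collect referenced hashes, list stored hashes, subtract, optionally delete) can be sketched in plain Python. This is a minimal illustration and not the code in `src/datajoint/gc.py`: the names `find_orphans` and `collect_orphans`, the in-memory sets standing in for the schema scan and the `_content/` listing, the `delete` callback, and the keys of the returned dictionary are all hypothetical.

```python
from __future__ import annotations

from typing import Callable, Iterable


def find_orphans(stored: Iterable[str], referenced: Iterable[str]) -> set[str]:
    """Return hashes present in storage that no table references."""
    return set(stored) - set(referenced)


def collect_orphans(
    stored: Iterable[str],
    referenced: Iterable[str],
    delete: Callable[[str], None],
    dry_run: bool = True,
) -> dict[str, int]:
    """Hypothetical helper: delete orphans unless dry_run is set, then report counts."""
    stored = set(stored)
    referenced = set(referenced)
    orphaned = stored - referenced  # step 4: orphaned = stored - referenced
    deleted = 0
    if not dry_run:  # step 5: delete only when explicitly requested
        for content_hash in sorted(orphaned):
            delete(content_hash)
            deleted += 1
    return {
        "referenced": len(referenced),
        "stored": len(stored),
        "orphaned": len(orphaned),
        "deleted": deleted,
    }


# Made-up hashes: "c3" is stored but no table references it.
stored_hashes = {"a1", "b2", "c3"}
referenced_hashes = {"a1", "b2"}
removed: list[str] = []  # records what the delete callback was asked to remove

preview = collect_orphans(stored_hashes, referenced_hashes, removed.append)
result = collect_orphans(stored_hashes, referenced_hashes, removed.append, dry_run=False)
```

Defaulting `dry_run` to `True` mirrors the `collect(*schemas, store_name=None, dry_run=True)` signature listed in the plan, so a caller has to opt in before anything is deleted.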