@@ -21,8 +21,8 @@ This plan describes the implementation of a three-layer type architecture for Da
2121| Phase 2b: Path-Addressed Storage | ✅ Complete | ObjectType for files/folders |
2222| Phase 3: User-Defined AttributeTypes | ✅ Complete | AttachType, XAttachType, FilepathType |
2323| Phase 4: Insert and Fetch Integration | ✅ Complete | Type chain encoding/decoding |
24- | Phase 5: Garbage Collection | 🔲 Pending | |
25- | Phase 6: Documentation and Testing | 🔲 Pending | |
24+ | Phase 5: Garbage Collection | ✅ Complete | gc.py with scan/collect functions |
25+ | Phase 6: Documentation and Testing | ✅ Complete | Test files for all new types |
2626
2727---
2828
@@ -337,66 +337,50 @@ def _get(connection, attr, data, squeeze, download_path):
337337
338338---
339339
340- ## Phase 5: Garbage Collection 🔲
340+ ## Phase 5: Garbage Collection ✅
341341
342- **Status**: Pending
343-
344- ### Design (updated for function-based approach):
342+ **Status**: Complete
345343
346- Since we don't have a registry table, GC works by scanning:
344+ ### Implemented in `src/datajoint/gc.py`:
347345
348346``` python
349- def scan_content_references(schemas: list) -> set[tuple[str, str]]:
350-     """
351-     Scan all schemas for content references.
352-
353-     Returns:
354-         Set of (content_hash, store) tuples that are referenced
355-     """
356-     referenced = set()
357-     for schema in schemas:
358-         for table in schema.tables:
359-             for attr in table.heading.attributes:
360-                 if uses_content_storage(attr):
361-                     # Fetch all JSON metadata from this column
362-                     for row in table.fetch(attr.name):
363-                         if isinstance(row, dict) and 'hash' in row:
364-                             referenced.add((row['hash'], row.get('store')))
365-     return referenced
366-
367- def list_stored_content(store_name: str) -> set[str]:
368-     """List all content hashes in a store by scanning _content/ directory."""
369-     ...
370-
371- def garbage_collect(schemas: list, store_name: str, dry_run=True) -> dict:
372-     """
373-     Remove unreferenced content from storage.
347+ import datajoint as dj
374348
375-     Returns:
376-         Stats: {'scanned': N, 'orphaned': M, 'deleted': K, 'bytes_freed': B}
377-     """
378-     referenced = scan_content_references(schemas)
379-     stored = list_stored_content(store_name)
380-     orphaned = stored - {h for h, s in referenced if s == store_name}
349+ # Scan schemas and find orphaned content
350+ stats = dj.gc.scan(schema1, schema2, store_name='mystore')
381351
382-     if not dry_run:
383-         for content_hash in orphaned:
384-             delete_content(content_hash, store_name)
352+ # Remove orphaned content (dry_run=False to actually delete)
353+ stats = dj.gc.collect(schema1, schema2, store_name='mystore', dry_run=True)
385354
386-     return {'orphaned': len(orphaned), ...}
355+ # Format statistics for display
356+ print(dj.gc.format_stats(stats))
387357```
388358
359+ **Key functions:**
360+ - `scan_references(*schemas, store_name=None)` - Scan tables for content hashes
361+ - `list_stored_content(store_name=None)` - List all content in `_content/` directory
362+ - `scan(*schemas, store_name=None)` - Find orphaned content without deleting
363+ - `collect(*schemas, store_name=None, dry_run=True)` - Remove orphaned content
364+ - `format_stats(stats)` - Human-readable statistics output
365+
366+ **GC Process:**
367+ 1. Scan all tables in provided schemas for content-type attributes
368+ 2. Extract content hashes from JSON metadata in those columns
369+ 3. Scan storage `_content/` directory for all stored hashes
370+ 4. Compute orphaned = stored - referenced
371+ 5. Optionally delete orphaned content (when `dry_run=False`)
372+
389373---
390374
391- ## Phase 6: Documentation and Testing 🔲
375+ ## Phase 6: Documentation and Testing ✅
392376
393- **Status**: Pending
377+ **Status**: Complete
394378
395- ### Test files to create:
379+ ### Test files created:
396380- `tests/test_content_storage.py` - Content-addressed storage functions
397- - `tests/test_xblob.py` - XBlobType roundtrip
398381- `tests/test_type_composition.py` - Type chain encoding/decoding
399382- `tests/test_gc.py` - Garbage collection
383+ - `tests/test_attribute_type.py` - AttributeType registry and DJBlobType (existing)
400384
401385---
402386
@@ -415,7 +399,10 @@ def garbage_collect(schemas: list, store_name: str, dry_run=True) -> dict:
415399| `src/datajoint/table.py` | ✅ | Type chain encoding on insert |
416400| `src/datajoint/fetch.py` | ✅ | Type chain decoding on fetch |
417401| `src/datajoint/blob.py` | ✅ | Removed bypass_serialization |
418- | `src/datajoint/gc.py` | 🔲 | Garbage collection (to be created) |
402+ | `src/datajoint/gc.py` | ✅ | Garbage collection for content storage |
403+ | `tests/test_content_storage.py` | ✅ | Tests for content_registry.py |
404+ | `tests/test_type_composition.py` | ✅ | Tests for type chain encoding/decoding |
405+ | `tests/test_gc.py` | ✅ | Tests for garbage collection |
419406
420407---
421408
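
Note on the GC process in the Phase 5 hunk: the core of it is the set difference `orphaned = stored - referenced`, followed by an optional delete pass guarded by `dry_run`. The sketch below illustrates that logic against an in-memory dictionary standing in for the `_content/` directory. It is not the code in `src/datajoint/gc.py`; the names `find_orphans` and `collect_orphans`, the `delete` callback, and the shape of the returned statistics are all assumptions made for illustration.

```python
from typing import Callable, Iterable


def find_orphans(stored: Iterable[str], referenced: Iterable[str]) -> set[str]:
    """Return hashes present in storage but referenced by no table row."""
    return set(stored) - set(referenced)


def collect_orphans(
    stored: Iterable[str],
    referenced: Iterable[str],
    delete: Callable[[str], object],
    dry_run: bool = True,
) -> dict:
    """Count orphans and, only when dry_run is False, delete each one."""
    stored = set(stored)  # snapshot, so deleting cannot disturb iteration
    orphaned = find_orphans(stored, referenced)
    deleted = 0
    if not dry_run:
        for content_hash in sorted(orphaned):
            delete(content_hash)
            deleted += 1
    return {"scanned": len(stored), "orphaned": len(orphaned), "deleted": deleted}


# Hypothetical toy store: keys play the role of content hashes.
store = {"aaa": b"1", "bbb": b"2", "ccc": b"3"}
referenced = {"aaa", "ccc"}

# Dry run: reports the orphan but leaves the store untouched.
preview = collect_orphans(store.keys(), referenced, store.pop, dry_run=True)

# Real run: removes the orphaned entry via the delete callback.
result = collect_orphans(store.keys(), referenced, store.pop, dry_run=False)
```

Defaulting `dry_run` to `True`, as the `collect` signature in the diff does, means a caller has to opt in explicitly before anything is removed.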