Skip to content

Latest commit

 

History

History
77 lines (57 loc) · 2.34 KB

File metadata and controls

77 lines (57 loc) · 2.34 KB

tensor_blob Benchmarks

The tensor_blob crate provides S3-style chunked blob storage with content-addressable chunks, garbage collection, and integrity verification.

Overview

tensor_blob focuses on correctness and durability over raw throughput. Performance characteristics depend heavily on:

  • Chunk size configuration
  • Storage backend (memory vs disk)
  • Network conditions for streaming operations

Expected Performance Characteristics

Operation Complexity Notes
Put (upload) O(size / chunk_size) Linear with data size
Get (download) O(size / chunk_size) Linear with data size
Delete O(chunk_count) Removes metadata + orphan detection
GC O(total_chunks) Full chunk scan
Verify O(size) Re-hash entire blob
Repair O(corrupted_chunks) Only processes damaged chunks

Chunk Deduplication

Identical content shares chunks via SHA-256 content addressing:

  • Duplicate blobs: Store once, reference count tracked
  • Partial overlap: Shared chunks deduplicated at chunk boundaries
  • Storage savings: Depends on data redundancy

Garbage Collection

Operation Behavior
gc() Returns GcStats { deleted, freed_bytes }
Orphan detection Marks unreferenced chunks
Active upload protection GC skips in-progress uploads

Streaming Operations

API Use Case
BlobWriter Streaming upload, bounded memory
BlobReader::next_chunk() Streaming download, chunk-by-chunk
get_full() Small blobs (<10MB), loads to memory

Configuration Impact

Setting Impact
Larger chunk_size Fewer chunks, less overhead, less dedup
Smaller chunk_size More chunks, more overhead, better dedup
Recommended 1-4 MB chunks for most workloads

Integration Notes

  • Blob store persists to TensorStore
  • Metadata includes checksum, size, creation time
  • Links enable blob-to-graph entity relationships
  • Tags support blob categorization and search

Benchmarking Blob Operations

# Run blob-specific benchmarks (if available)
cargo bench --package tensor_blob

# For custom benchmarking, use the streaming API:
# - Measure upload throughput with BlobWriter
# - Measure download throughput with BlobReader
# - Test GC performance with various orphan ratios