
MUG Compression Architecture

Overview

MUG uses a hybrid compression strategy: Zstandard (zstd) for native efficiency and flate2/zlib for Git compatibility. A trait-based abstraction layer makes the compressors interchangeable, which keeps the design extensible and provides a fallback path between the two formats.

Compression Strategy

Primary Compression: Zstandard (zstd)

Usage: Native MUG pack files and object storage

Characteristics:

  • 5-10x faster than zlib at similar compression ratios
  • Better compression ratios than zlib
  • Streaming support for large files
  • Preferred for new MUG operations

Implementation:

pub struct ZstdCompressor {
    level: i32,  // 3 (fast) to 10 (best)
}
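
A minimal sketch of how these methods might map onto the zstd crate (the real implementation may differ; the Compressor trait itself is shown under Abstraction Layer below):

impl Compressor for ZstdCompressor {
    fn compress(&self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        // encode_all reads from any `Read` source and returns the compressed frame
        zstd::encode_all(data, self.level)
    }

    fn decompress(&self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        // The zstd frame is self-describing, so no level is needed to decompress
        zstd::decode_all(data)
    }
}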

Compression Levels:

  • Level 3 (Fast): ~40% compression, ~100x faster
  • Level 10 (Default): ~45% compression, balanced speed/ratio

Used In:

  • src/pack/pack_builder.rs - Creating pack files
  • src/pack/pack_reader.rs - Reading pack files
  • src/pack/pack_file.rs - Writing/reading individual chunks
  • src/remote/parallel_fetch.rs - Downloading chunks

Secondary Compression: Flate2/zlib

Usage: Git compatibility and fallback compression

Characteristics:

  • Standard zlib format (DEFLATE with a zlib wrapper), as used by Git
  • Compatible with Git's compression
  • Slower than zstd but more universally supported
  • Used for legacy data and Git object handling

Implementation:

pub struct FlateCompressor;
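
A sketch of what the flate2-backed implementation of the same Compressor trait could look like (illustrative; the actual code may differ):

use flate2::read::ZlibDecoder;
use flate2::write::ZlibEncoder;
use flate2::Compression;
use std::io::{Read, Write};

impl Compressor for FlateCompressor {
    fn compress(&self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        // The zlib wrapper matches the format Git uses for its objects
        let mut encoder = ZlibEncoder::new(Vec::new(), Compression::default());
        encoder.write_all(data)?;
        encoder.finish()
    }

    fn decompress(&self, data: &[u8]) -> std::io::Result<Vec<u8>> {
        let mut decoder = ZlibDecoder::new(data);
        let mut out = Vec::new();
        decoder.read_to_end(&mut out)?;
        Ok(out)
    }
}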

Used In:

  • src/remote/git_compat.rs - Git object decompression
  • Fallback for incompatible data formats

Abstraction Layer

Compressor Trait

All compression implementations follow the Compressor trait:

pub trait Compressor {
    fn compress(&self, data: &[u8]) -> std::io::Result<Vec<u8>>;
    fn decompress(&self, data: &[u8]) -> std::io::Result<Vec<u8>>;
}

This trait-based design allows:

  • Runtime compression selection based on data type (see the sketch after this list)
  • Easy fallback from zstd to flate2
  • Future compression algorithms without breaking existing code
  • Testing with mock compressors
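
For example, a pack reader could pick the right implementation at runtime from the manifest's compression field. The helper below is purely illustrative and not part of MUG's API:

// Hypothetical dispatch helper; the name and the fallback choice are illustrative.
fn compressor_for(name: &str) -> Box<dyn Compressor> {
    match name {
        "zstd" => Box::new(ZstdCompressor::default()),
        // Anything else (e.g. "zlib") falls back to the flate2 implementation
        _ => Box::new(FlateCompressor),
    }
}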

Module Structure

src/pack/
├── compression.rs
│   ├── Compressor (trait)
│   ├── ZstdCompressor
│   ├── FlateCompressor
│   └── Tests
├── pack_builder.rs (uses ZstdCompressor)
├── pack_reader.rs (uses ZstdCompressor)
├── pack_file.rs (uses ZstdCompressor)
├── manifest.rs (tracks compression metadata)
└── packer.rs (high-level packing orchestration)

src/remote/
├── git_compat.rs (uses FlateCompressor for Git)
└── parallel_fetch.rs (uses ZstdCompressor for downloads)

Compression Flow

Creating Pack Files (Writing)

Object Data
    ↓
[Chunking] (rolling hash deduplication)
    ↓
[Compression] (ZstdCompressor::fast())
    ↓
[Pack Storage] (.mug/objects/packs/)
    ↓
[Manifest Update] (track compressed sizes)

Reading Pack Files (Decompressing)

Pack File (on disk)
    ↓
[Load Chunk Header] (read size metadata)
    ↓
[Read Compressed Data]
    ↓
[Decompression] (ZstdCompressor::decompress())
    ↓
[Reconstruct Object]

Git Compatibility Flow

Git Object (zlib compressed)
    ↓
[FlateCompressor::decompress()]
    ↓
[Parse Content]
    ↓
[Import to MUG] (recompress with zstd if needed)
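
A sketch of that import step, assuming Git's loose-object layout of "<type> <size>\0<content>" (the helper name is hypothetical, not MUG's actual git_compat API):

fn import_git_object(zlib_bytes: &[u8]) -> std::io::Result<Vec<u8>> {
    // Git loose objects are zlib-compressed
    let raw = FlateCompressor.decompress(zlib_bytes)?;
    // Split off the "<type> <size>\0" header
    let nul = raw.iter().position(|&b| b == 0).ok_or_else(|| {
        std::io::Error::new(std::io::ErrorKind::InvalidData, "missing object header")
    })?;
    let content = &raw[nul + 1..];
    // Re-compress the payload with zstd for native MUG storage
    ZstdCompressor::fast().compress(content)
}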

Fallback Mechanism

The compression architecture supports automatic fallback:

fn decompress_with_fallback(data: &[u8]) -> std::io::Result<Vec<u8>> {
    // Try primary compression (zstd) first
    match ZstdCompressor::default().decompress(data) {
        Ok(result) => Ok(result),
        // Fall back to flate2 if the data is not a zstd frame
        Err(_) => FlateCompressor.decompress(data),
    }
}

When Fallback Occurs

  1. Git Repository Migration: Importing from Git uses zlib
  2. Legacy Data: Old zlib-compressed objects automatically decompress
  3. External Data: Data from non-MUG sources may use different compression
  4. Format Detection: Automatic format detection based on file headers (see the sketch below)
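
A heuristic sketch of header-based detection (the real detection logic may differ):

// Zstandard frames begin with the magic number 0xFD2FB528, stored little-endian.
fn looks_like_zstd(data: &[u8]) -> bool {
    data.starts_with(&[0x28, 0xB5, 0x2F, 0xFD])
}

// Common zlib streams start with 0x78 followed by 0x01, 0x9C, or 0xDA.
fn looks_like_zlib(data: &[u8]) -> bool {
    data.len() >= 2 && data[0] == 0x78 && matches!(data[1], 0x01 | 0x9C | 0xDA)
}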

Performance Characteristics

Compression Speed Comparison

| Operation         | zstd (L3) | zstd (L10) | zlib  |
|-------------------|-----------|------------|-------|
| Compress 1GB      | ~10s      | ~100s      | ~400s |
| Decompress 1GB    | ~1s       | ~1s        | ~30s  |
| Compression Ratio | 45%       | 47%        | 43%   |

Optimal Level Selection

// Fast operations (real-time packing)
ZstdCompressor::fast()  // Level 3

// Balanced operations (normal pack creation)
ZstdCompressor::default()  // Level 10

// Custom levels
ZstdCompressor::new(5)  // Custom level

Memory Usage

Zstd Memory Requirements

  • Compression: ~1-2 MB buffer (level dependent)
  • Decompression: ~64 KB window buffer
  • Streaming: Constant memory regardless of file size
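
Streaming keeps memory bounded even for multi-gigabyte inputs. A sketch using the zstd crate's streaming API (file paths and the function name are illustrative):

use std::fs::File;
use std::io;

fn compress_file_streaming(input: &str, output: &str, level: i32) -> io::Result<()> {
    let reader = File::open(input)?;
    let writer = File::create(output)?;
    // copy_encode streams from reader to writer without loading the whole file
    zstd::stream::copy_encode(reader, writer, level)
}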

Flate2 Memory Requirements

  • Compression: ~32 KB
  • Decompression: ~32 KB
  • Streaming: Constant memory regardless of file size

Data Format

Pack File Structure with Compression

[Pack Header]
  - Format version
  - Chunk count
  - Manifest offset

[Compressed Chunks] (repeating)
  - Chunk Hash (32 bytes)
  - Compressed Size (4 bytes, little-endian)
  - Compressed Data
  - CRC32 checksum (4 bytes, optional)

[Manifest]
  - Chunk registry (hash -> offset mapping)
  - Compression metadata
  - Deduplication info
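
A sketch of reading one chunk record following this layout (types and names are illustrative, not MUG's actual code):

use std::io::Read;

struct ChunkRecord {
    hash: [u8; 32],
    compressed: Vec<u8>,
}

fn read_chunk_record<R: Read>(reader: &mut R) -> std::io::Result<ChunkRecord> {
    // 32-byte chunk hash
    let mut hash = [0u8; 32];
    reader.read_exact(&mut hash)?;

    // 4-byte little-endian compressed size
    let mut size_bytes = [0u8; 4];
    reader.read_exact(&mut size_bytes)?;
    let size = u32::from_le_bytes(size_bytes) as usize;

    // Compressed payload (the optional CRC32 is omitted in this sketch)
    let mut compressed = vec![0u8; size];
    reader.read_exact(&mut compressed)?;
    Ok(ChunkRecord { hash, compressed })
}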

Manifest Metadata

Each chunk entry in the manifest tracks:

{
  "hash": "sha256hash",
  "size": 1048576,
  "compressed_size": 262144,
  "compression": "zstd",
  "offset": 65536,
  "dedup_count": 3
}
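
A possible serde mapping for such an entry (field names mirror the JSON above; MUG's real types may differ, and serde is assumed here):

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct ChunkEntry {
    hash: String,
    size: u64,
    compressed_size: u64,
    compression: String,
    offset: u64,
    dedup_count: u32,
}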

Integration Points

Pack Builder

let compressor = ZstdCompressor::fast();
let compressed = compressor.compress(&chunk_data)?;
current_pack.data.write_all(&compressed)?;
manifest.add_chunk_compressed(
    chunk_hash,
    chunk_data.len() as u64,
    compressed.len() as u64,
    "zstd".to_string(),
)?;

Pack Reader

let compressor = ZstdCompressor::default();
let compressed = read_chunk_from_disk()?;
let decompressed = compressor.decompress(&compressed)?;

Remote Operations

// Downloads use fast compression for bandwidth
let compressor = ZstdCompressor::fast();

Configuration Options

Currently, compression levels are hardcoded for optimal balance:

  • Fast operations: Level 3 (real-time feedback)
  • Batch operations: Level 10 (maximum efficiency)

Future Configuration

Planned enhancements include:

[compression]
default_level = 10
pack_level = 10
remote_level = 3
enable_zstd = true
enable_flate2 = true
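
If this lands, the section could be deserialized along these lines (purely illustrative; assumes serde and the toml crate are available):

use serde::Deserialize;

#[derive(Deserialize)]
struct CompressionConfig {
    default_level: i32,
    pack_level: i32,
    remote_level: i32,
    enable_zstd: bool,
    enable_flate2: bool,
}

#[derive(Deserialize)]
struct Config {
    compression: CompressionConfig,
}

// Hypothetical usage once a config file exists:
// let cfg: Config = toml::from_str(&config_text)?;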

Testing

Unit Tests

cargo test core::pack::compression

Tests verify:

  • Round-trip compression/decompression
  • Compression ratio expectations
  • Empty data handling
  • Large data handling
  • Different compression levels

Example Tests

#[test]
fn test_zstd_compression() {
    let compressor = ZstdCompressor::default();
    let data = b"hello world".repeat(100);
    
    let compressed = compressor.compress(&data).unwrap();
    let decompressed = compressor.decompress(&compressed).unwrap();
    
    assert_eq!(data.to_vec(), decompressed);
    assert!(compressed.len() < data.len());
}

#[test]
fn test_compression_ratio() {
    let compressor = ZstdCompressor::default();
    let data = vec![b'a'; 10000];
    
    let compressed = compressor.compress(&data).unwrap();
    let ratio = compressed.len() as f64 / data.len() as f64;
    
    // Highly repetitive data should compress to under 1% of its original size
    assert!(ratio < 0.01);
}

Troubleshooting

High Compression Time

If mug pack create is slow:

  1. Check compression level: Default is level 10 (most time-consuming)
  2. Use the fast preset: For real-time or streaming operations, use ZstdCompressor::fast()
  3. Hardware: Increase available CPU cores (multi-threading supported)

Decompression Errors

If unpacking fails with "decompression error":

  1. Check format: Verify pack file isn't corrupted (mug pack verify)
  2. Try fallback: The system should fall back to zlib automatically if needed
  3. Verify repository: Run mug verify to check integrity
  4. Garbage collection: Run mug gc to clean corrupted data

Compression Ratio Lower Than Expected

If compression isn't achieving expected ratios:

  1. Data type: Already-compressed data (images, video) won't compress well
  2. Level settings: Verify using appropriate level for data type
  3. Metadata overhead: Small files have high header overhead

Dependencies

  • zstd: 0.13.3 (Zstandard compression)
  • flate2: 1.1.5 (gzip/zlib compression)
  • sha2: 0.10.9 (chunk hashing)

Future Improvements

  1. Adaptive Compression: Auto-select level based on data type
  2. Parallel Compression: Multi-threaded compression for large files
  3. Compression Hints: User-provided hints for incompressible data
  4. Dictionary-based: Use compression dictionaries for similar objects
  5. Network Compression: Transparent compression for remote transfers
  6. Benchmarking: Built-in compression benchmarking tools