Performance: Memory-optimized mode for histogram and heatmap (two-pass streaming) #34

@gregeva

Summary

Add an optional memory-optimized mode for histogram and heatmap calculation that uses a two-pass streaming approach instead of storing all raw values in memory.

Background

Both the histogram (Issue #25) and heatmap features keep every raw value in memory:

  • Histogram: Stores all duration/bytes/count values globally in %histogram_values
  • Heatmap: Stores raw values per time bucket in %heatmap_raw{$bucket}

For very large inputs (e.g., 30 days of logs chained together), this can consume more than 20 GB of RAM.

Proposed Enhancement

Implement an alternative two-pass approach for both features:

Pass 1: Stream through log file(s) to determine min/max values for each metric
Pass 2: Stream again, counting values directly into pre-calculated bucket boundaries
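
The two passes above can be sketched as follows. This is a minimal illustration in Python rather than the tool's own code; the names `scan_min_max`, `bucket_index`, `two_pass_histogram`, and the `open_stream` callback are hypothetical, and equal-width buckets are an assumption (the issue does not say how bucket boundaries are chosen).

```python
from typing import Callable, Iterable, List, Tuple


def scan_min_max(values: Iterable[float]) -> Tuple[float, float]:
    """Pass 1: keep only the running minimum and maximum."""
    lo, hi = float("inf"), float("-inf")
    for v in values:
        if v < lo:
            lo = v
        if v > hi:
            hi = v
    if lo > hi:
        raise ValueError("no values found")
    return lo, hi


def bucket_index(v: float, lo: float, hi: float, buckets: int) -> int:
    """Map a value to an equal-width bucket (equal width is an assumption)."""
    if hi == lo:
        return 0
    idx = int((v - lo) / (hi - lo) * buckets)
    # Clamp so that v == hi falls into the last bucket.
    return max(0, min(idx, buckets - 1))


def two_pass_histogram(
    open_stream: Callable[[], Iterable[float]], buckets: int = 50
) -> Tuple[float, float, List[int]]:
    """open_stream (hypothetical) must return a fresh iterator on each call."""
    lo, hi = scan_min_max(open_stream())      # pass 1
    counts = [0] * buckets
    for v in open_stream():                   # pass 2
        counts[bucket_index(v, lo, hi, buckets)] += 1
    return lo, hi, counts


lo, hi, counts = two_pass_histogram(lambda: iter([1, 2, 3, 10]), buckets=3)
```

Only `lo`, `hi`, and the `counts` list are held in memory, so the footprint depends on the bucket count rather than on the number of log entries.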

Trade-offs

Approach                   | Memory                                                  | Time/I/O    | Complexity
Current (store all values) | O(n), where n = number of log entries                   | Single pass | Simple
Two-pass streaming         | O(b), where b = number of buckets (fixed, e.g., 50-100) | Double I/O  | More complex

Automatic Mode Selection

Before reading any log files, the application should:

  1. Check total input file sizes: Sum the size of all input log files
  2. Check available system memory: Query free/available RAM
  3. Apply heuristics: Estimate memory required based on file sizes and typical log density
  4. Auto-select mode:
    • If estimated memory usage < available memory threshold (e.g., 50% of free RAM): use standard mode
    • Otherwise: automatically enable memory-optimized two-pass mode
  5. User override: Command-line flags to force either mode regardless of auto-detection
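
The selection steps above might look roughly like this. It is a sketch in Python under stated assumptions, not the tool's implementation: `EXPANSION_FACTOR` is a placeholder for the "typical log density" heuristic, which the issue leaves open; the function names are hypothetical; and falling back to standard mode when free memory cannot be determined is only one possible policy.

```python
import os
from typing import Iterable, Optional

EXPANSION_FACTOR = 1.0   # hypothetical: bytes of RAM needed per byte of log input
MEMORY_FRACTION = 0.5    # the "50% of free RAM" example threshold from the issue


def total_input_size(paths: Iterable[str]) -> int:
    """Step 1: sum the sizes of all input log files."""
    return sum(os.path.getsize(p) for p in paths)


def available_memory_bytes() -> Optional[int]:
    """Step 2: best-effort query; returns None where the platform lacks it."""
    try:
        pages = os.sysconf("SC_AVPHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (ValueError, OSError, AttributeError):
        return None
    if pages <= 0 or page_size <= 0:
        return None
    return pages * page_size


def choose_mode(
    total_bytes: int,
    available_bytes: Optional[int],
    forced: Optional[str] = None,
    expansion: float = EXPANSION_FACTOR,
    fraction: float = MEMORY_FRACTION,
) -> str:
    """Steps 3-5: return 'standard' or 'two-pass'."""
    if forced is not None:            # step 5: user override wins
        return forced
    if available_bytes is None:       # assumption: keep current behaviour
        return "standard"
    estimated = total_bytes * expansion
    if estimated < available_bytes * fraction:
        return "standard"
    return "two-pass"


GB = 1024 ** 3
mode_small = choose_mode(1 * GB, 16 * GB)
mode_large = choose_mode(10 * GB, 16 * GB)
```

Keeping `choose_mode` free of system calls makes the heuristic easy to test with synthetic sizes.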

Architectural Considerations to Explore

  1. Command-line options:
    • --low-memory to force two-pass mode
    • --no-low-memory to force single-pass (standard) mode
    • Default: auto-detect based on file sizes and available memory
  2. Threshold configuration: Allow users to set memory threshold for auto-detection
  3. Hybrid approach: Store values up to a limit, then switch to approximate methods
  4. Shared infrastructure: Both histogram and heatmap could share the two-pass logic
  5. Percentile calculation: Two-pass mode would need approximate percentiles (e.g., t-digest, quantile sketches) since exact percentiles require all values
  6. Piped input: May need different strategy (buffering, sampling) since stdin cannot be re-read and size is unknown
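
As an illustration of items 1 and 6, the sketch below shows a tri-state flag pair and a check for whether stdin can be read twice. It is written in Python with `argparse` (the `BooleanOptionalAction` used here needs Python 3.9 or later) purely as an example; the program name `logtool`, the function names, and the choice to fall back to standard mode for piped input are assumptions, not decisions recorded in this issue.

```python
import argparse
import os
import stat
import sys
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="logtool")  # hypothetical program name
    # Generates both --low-memory and --no-low-memory; None means auto-detect.
    parser.add_argument(
        "--low-memory",
        action=argparse.BooleanOptionalAction,
        default=None,
        help="force two-pass mode (or single-pass with --no-low-memory)",
    )
    parser.add_argument("files", nargs="*", help="log files; empty means stdin")
    return parser


def stdin_is_rereadable() -> bool:
    """True only if stdin is a regular file (e.g. `< access.log`), not a pipe."""
    try:
        mode = os.fstat(sys.stdin.fileno()).st_mode
    except (OSError, ValueError, AttributeError):
        return False
    return stat.S_ISREG(mode)


def resolve_mode(low_memory: Optional[bool], single_read_only: bool,
                 auto_choice: str) -> str:
    """Combine the flag, the input type, and the auto-detected choice."""
    if single_read_only:
        # Assumption: a second pass is impossible, so use standard mode.
        return "standard"
    if low_memory is True:
        return "two-pass"
    if low_memory is False:
        return "standard"
    return auto_choice


args = build_parser().parse_args(["--low-memory", "a.log"])
mode = resolve_mode(args.low_memory, single_read_only=False, auto_choice="standard")
```

A real implementation would likely also print a warning when `--low-memory` is requested but the input cannot be re-read.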

Affected Features

  • Heatmap (-hm): Currently stores %heatmap_raw{$bucket} arrays
  • Histogram (-hg): Will store %histogram_values{$metric} arrays
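
For the heatmap, the second pass could count into a small grid per time bucket instead of appending to a raw-value array. The sketch below is Python for illustration only; `heatmap_counts`, its parameters, and the equal-width value buckets are assumptions, and `lo`/`hi` are taken to come from a prior min/max pass.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def heatmap_counts(
    records: Iterable[Tuple[float, float]],   # (timestamp_seconds, value)
    lo: float,
    hi: float,
    value_buckets: int,
    time_bucket_seconds: int,
) -> Dict[int, List[int]]:
    """Count values into [time bucket][value bucket] without storing raw values."""
    grid: Dict[int, List[int]] = defaultdict(lambda: [0] * value_buckets)
    for ts, v in records:
        # Start of the time bucket this record falls into.
        t = int(ts // time_bucket_seconds) * time_bucket_seconds
        if hi == lo:
            idx = 0
        else:
            idx = int((v - lo) / (hi - lo) * value_buckets)
            idx = max(0, min(idx, value_buckets - 1))  # clamp edge values
        grid[t][idx] += 1
    return dict(grid)


grid = heatmap_counts(
    [(0, 1.0), (30, 5.0), (61, 10.0)],
    lo=1.0, hi=10.0, value_buckets=3, time_bucket_seconds=60,
)
```

In this form memory grows with the number of time buckets times the number of value buckets, not with the number of log entries.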

Acceptance Criteria

  • Memory usage remains constant regardless of log file size when memory-optimized mode is active
  • Results are identical or statistically equivalent to the standard approach
  • Performance impact documented (expected ~2x I/O time)
  • Auto-detection correctly identifies when memory-optimized mode is needed
  • Works with multiple input files
  • Graceful handling of piped input (fallback to standard mode with warning, or sampling)
  • Shared implementation between histogram and heatmap where possible
  • User can override auto-detection with command-line flags
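
To make "statistically equivalent" concrete for percentiles, one simple option is to interpolate within the bucket counts that two-pass mode already produces. This is not t-digest or a quantile sketch; it is a plain histogram interpolation, shown in Python as a sketch, with `estimate_percentile` a hypothetical name and equal-width buckets over [lo, hi] assumed.

```python
from typing import Sequence


def estimate_percentile(counts: Sequence[int], lo: float, hi: float, p: float) -> float:
    """Estimate the p-th percentile from equal-width bucket counts.

    Assumes values are spread uniformly inside each bucket, so the error
    is bounded by roughly one bucket width.
    """
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    target = p / 100 * total              # rank we are looking for
    width = (hi - lo) / len(counts)
    cumulative = 0
    for i, c in enumerate(counts):
        if c and cumulative + c >= target:
            fraction = (target - cumulative) / c   # position inside bucket i
            return lo + (i + fraction) * width
        cumulative += c
    return hi


# Bucket counts for the values 1, 2, 3, 10 split into 3 equal-width buckets.
median_estimate = estimate_percentile([3, 0, 1], lo=1, hi=10, p=50)
```

Comparing such estimates against the exact percentiles from standard mode on a sample file would be one way to check this criterion; more buckets tighten the estimate at the cost of memory.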

Related

Priority

Low - optimization for edge cases with very large log files. Current implementations work for typical use cases.

Labels: enhancement (New feature or request)