Open
Labels
enhancement (New feature or request)
Description
Summary
Add an optional memory-optimized mode for histogram and heatmap calculation that uses a two-pass streaming approach instead of storing all raw values in memory.
Background
Both histogram (Issue #25) and heatmap features store raw values in memory:
- Histogram: stores all duration/bytes/count values globally in `%histogram_values`
- Heatmap: stores raw values per time bucket in `%heatmap_raw{$bucket}`
For very large inputs (e.g., 30 days of logs chained together), this can consume more than 20 GB of RAM.
Proposed Enhancement
Implement an alternative two-pass approach for both features:
Pass 1: Stream through log file(s) to determine min/max values for each metric
Pass 2: Stream again, counting values directly into pre-calculated bucket boundaries
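The two passes above can be sketched as follows. This is a minimal illustration in Python rather than the tool's own language, and it makes several assumptions: `two_pass_histogram` and `open_stream` are hypothetical names, the buckets are equal-width, and the caller can re-open the input to iterate over it a second time.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def two_pass_histogram(
    open_stream: Callable[[], Iterable[float]],
    num_buckets: int = 50,
) -> Tuple[Optional[float], Optional[float], List[int]]:
    """Count values into equal-width buckets without storing the values.

    open_stream is a hypothetical callable that returns a fresh iterator
    over the metric values each time it is called (i.e. re-reads the logs).
    """
    # Pass 1: find min and max while holding only two numbers in memory.
    lo: Optional[float] = None
    hi: Optional[float] = None
    for value in open_stream():
        if lo is None or value < lo:
            lo = value
        if hi is None or value > hi:
            hi = value

    counts = [0] * num_buckets
    if lo is None or hi is None:
        return None, None, counts  # empty input

    # Guard against a zero-width range when all values are equal.
    width = (hi - lo) / num_buckets or 1.0

    # Pass 2: count each value straight into its bucket.
    for value in open_stream():
        index = min(int((value - lo) / width), num_buckets - 1)
        counts[index] += 1

    return lo, hi, counts
```

Memory use is bounded by `num_buckets`, independent of the number of log entries, at the cost of reading the input twice.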
Trade-offs
| Approach | Memory | Time/I/O | Complexity |
|---|---|---|---|
| Current (store all values) | O(n) where n = number of log entries | Single pass | Simple |
| Two-pass streaming | O(b) where b = number of buckets (fixed, e.g., 50-100) | Double I/O | More complex |
Automatic Mode Selection
Before reading any log files, the application should:
- Check total input file sizes: Sum the size of all input log files
- Check available system memory: Query free/available RAM
- Apply heuristics: Estimate memory required based on file sizes and typical log density
- Auto-select mode:
  - If estimated memory usage < available memory threshold (e.g., 50% of free RAM): use standard mode
  - Otherwise: automatically enable memory-optimized two-pass mode
- User override: Command-line flags to force either mode regardless of auto-detection
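A rough sketch of the selection steps above, in Python for illustration. All names here (`total_input_size`, `available_memory_bytes`, `choose_mode`, `expansion_factor`) are hypothetical, the expansion factor of 1.0 is a placeholder rather than a measured log density, and falling back to standard mode when memory cannot be queried is just one possible policy.

```python
import os
from typing import Iterable, Optional


def total_input_size(paths: Iterable[str]) -> int:
    """Sum the on-disk size of all input log files, in bytes."""
    return sum(os.path.getsize(path) for path in paths)


def available_memory_bytes() -> Optional[int]:
    """Best-effort free-RAM query; returns None where it is unsupported.

    SC_AVPHYS_PAGES counts free physical pages on Linux. It is not
    defined on every platform, hence the broad fallback.
    """
    try:
        return os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None


def choose_mode(
    total_bytes: int,
    available_bytes: Optional[int],
    expansion_factor: float = 1.0,  # assumed in-memory bytes per input byte
    threshold: float = 0.5,         # fraction of free RAM we allow ourselves
    force: Optional[str] = None,    # user override: "standard" or "two-pass"
) -> str:
    """Pick "standard" or "two-pass" before any log file is read."""
    if force is not None:
        return force
    if available_bytes is None:
        # Assumption: keep existing behaviour when memory is unknown.
        return "standard"
    estimated = total_bytes * expansion_factor
    return "standard" if estimated < available_bytes * threshold else "two-pass"
```

Keeping `choose_mode` free of system calls makes the heuristic easy to test with synthetic sizes; the real expansion factor would need to be measured against typical logs.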
Architectural Considerations to Explore
- Command-line options:
  - `--low-memory` to force two-pass mode
  - `--no-low-memory` to force single-pass (standard) mode
  - Default: auto-detect based on file sizes and available memory
- Threshold configuration: Allow users to set memory threshold for auto-detection
- Hybrid approach: Store values up to a limit, then switch to approximate methods
- Shared infrastructure: Both histogram and heatmap could share the two-pass logic
- Percentile calculation: Two-pass mode would need approximate percentiles (e.g., t-digest, quantile sketches) since exact percentiles require all values
- Piped input: May need different strategy (buffering, sampling) since stdin cannot be re-read and size is unknown
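On the percentile point above: one simple option, short of a t-digest or quantile sketch, is to estimate percentiles from the bucket counts that two-pass mode already produces. The sketch below (Python, with the hypothetical name `approx_percentile`) assumes equal-width buckets over `[lo, hi]` and values spread uniformly within each bucket, so the error is bounded by one bucket width.

```python
from typing import Sequence


def approx_percentile(
    lo: float, hi: float, counts: Sequence[int], p: float
) -> float:
    """Estimate the p-th percentile (0-100) from equal-width bucket counts.

    Uses linear interpolation inside the bucket that contains the target
    rank. This is an approximation, not an exact percentile.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot compute a percentile of an empty histogram")

    width = (hi - lo) / len(counts)
    target = p * total / 100  # rank of the requested percentile
    cumulative = 0
    for index, count in enumerate(counts):
        if count and cumulative + count >= target:
            # Fraction of the way through this bucket, assuming a
            # uniform spread of values inside it.
            fraction = (target - cumulative) / count
            return lo + (index + fraction) * width
        cumulative += count
    return hi
```

A finer bucket grid tightens the estimate at a small memory cost; heavily skewed metrics may still need a dedicated sketch structure.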
Affected Features
- Heatmap (`-hm`): Currently stores `%heatmap_raw{$bucket}` arrays
- Histogram (`-hg`): Will store `%histogram_values{$metric}` arrays
Acceptance Criteria
- Memory usage remains constant regardless of log file size when memory-optimized mode is active
- Results are identical or statistically equivalent to the standard approach
- Performance impact documented (expected ~2x I/O time)
- Auto-detection correctly identifies when memory-optimized mode is needed
- Works with multiple input files
- Graceful handling of piped input (fallback to standard mode with warning, or sampling)
- Shared implementation between histogram and heatmap where possible
- User can override auto-detection with command-line flags
Related
- Histogram feature: "Feature: Histogram charts for key metrics (duration, bytes, count)" (#25)
- Heatmap feature: implemented in v0.8.0
Priority
Low - optimization for edge cases with very large log files. Current implementations work for typical use cases.