Conversation

@keith-decker
Contributor

Enhance error reporting for evaluation framework

Description

This PR improves error handling and observability in the evaluation framework by introducing structured error tracking and enhanced logging throughout the evaluation pipeline.

Changes

New Components:

  • Added ErrorEvent dataclass for structured error representation with comprehensive context (timestamp, type, severity, component, message, details, and recovery actions)
  • Added ErrorTracker class for tracking and aggregating errors with rate limiting capabilities and summary statistics
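
The two components above might fit together roughly as follows. This is a hypothetical sketch, not the PR's actual code: the field names beyond those listed in the description (`error_type`, `recovery_action`), the `max_per_type` rate-limit parameter, and the `record()`/`get_summary()` method names are all assumptions.

```python
# Illustrative sketch of the ErrorEvent/ErrorTracker pair described above.
# Names not listed in the PR description are assumptions.
from dataclasses import dataclass, field
from collections import Counter
import time


@dataclass
class ErrorEvent:
    """Structured error record with full context."""
    timestamp: float
    error_type: str
    severity: str
    component: str
    message: str
    details: dict = field(default_factory=dict)
    recovery_action: str = ""  # suggested follow-up, if any


class ErrorTracker:
    """Aggregates ErrorEvents with simple per-type rate limiting."""

    def __init__(self, max_per_type: int = 100):
        self._events: list[ErrorEvent] = []
        self._counts: Counter = Counter()
        self._max_per_type = max_per_type

    def record(self, event: ErrorEvent) -> bool:
        """Store the event unless its type has hit the rate limit."""
        self._counts[event.error_type] += 1
        if self._counts[event.error_type] > self._max_per_type:
            return False  # dropped: rate-limited, but still counted
        self._events.append(event)
        return True

    def get_summary(self) -> dict:
        """Aggregate counts by type and severity for diagnostics."""
        by_severity = Counter(e.severity for e in self._events)
        return {
            "total": sum(self._counts.values()),
            "by_type": dict(self._counts),
            "by_severity": dict(by_severity),
        }
```

Note the design choice in this sketch: dropped events are still counted in the summary, so rate limiting bounds memory without hiding how often an error type fired.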

Enhanced Manager Error Handling:

  • Improved logging in _enqueue_invocation() to capture queue errors with context
  • Enhanced _worker_loop() to track and log processing failures with structured data
  • Added error tracking to evaluator invocations with details about which evaluator failed and why
  • Improved handler callback error handling to catch failures in _publish_results() without losing evaluation data
  • Enhanced skip policy evaluation with better error reporting
  • Improved evaluator configuration parsing with detailed error context and available evaluator listing
  • Better error messages for unknown evaluators and unsupported invocation types
  • Added error tracking to evaluator instantiation failures

New Features:

  • Added get_error_summary() method to Manager for diagnostic access to tracked errors
  • Export of ErrorEvent and ErrorTracker from public API
  • Comprehensive test coverage for error reporting functionality

Benefits

  • Better Diagnostics: Structured error data enables easier troubleshooting and monitoring
  • Operational Resilience: Errors are caught and logged without stopping the evaluation pipeline
  • Enhanced Observability: Rich context in logs helps track evaluation failures in production
  • Test Coverage: New test suite validates error handling behavior

@keith-decker keith-decker requested review from a team as code owners January 13, 2026 17:06
def has_evaluators(self) -> bool:
    return any(self._evaluators.values())

def get_error_summary(self) -> dict[str, Any]:
Contributor

Can you please capture more details on the diagnostic purpose this API is being exposed for?

"Evaluator processing failed",
extra={
"error_type": "processing_error",
"component": "worker",
Contributor

It would be good to have the thread name in the component.
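
One way to address this suggestion, assuming the standard `threading` module and a hypothetical helper name:

```python
# Illustrative helper for tagging the log "component" with the worker
# thread's name; the function name is an assumption, not the PR's code.
import threading


def component_name(base: str = "worker") -> str:
    """Return e.g. 'worker:Thread-3' for per-thread log correlation."""
    return f"{base}:{threading.current_thread().name}"
```

This makes it possible to tell which worker thread produced a given failure when several run concurrently.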

@adityamehra
Contributor

@keith-decker Once the recent changes for queue and concurrency from main are merged, the error handling will need to be updated accordingly.

- Added support for tracking errors by worker name and distinguishing between async and sync errors in ErrorTracker.
- Improved ErrorEvent to include worker name and async context.
- Updated Manager to log detailed error information during queue full and processing failures.
- Added unit tests for concurrent error scenarios and validation of error tracking functionality.
@keith-decker
Contributor Author

@adityamehra This PR has been updated to support concurrency and the queue.

@keith-decker keith-decker merged commit 57be4dc into main Jan 26, 2026
14 checks passed
@keith-decker keith-decker deleted the HYBIM-490_evaluation-error-handling branch January 26, 2026 22:59
@github-actions github-actions bot locked and limited conversation to collaborators Jan 26, 2026