[HYBIM-490] Enhance error reporting for evaluation framework #129
util/opentelemetry-util-genai-evals/src/opentelemetry/util/genai/evals/manager.py
    def has_evaluators(self) -> bool:
        return any(self._evaluators.values())

    def get_error_summary(self) -> dict[str, Any]:
Can you please capture more details on the diagnostic purpose this API is being exposed for?
        "Evaluator processing failed",
        extra={
            "error_type": "processing_error",
            "component": "worker",
It would be good to have the thread name in the component.
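One way the suggestion above might look is sketched below. This is only an illustration, not the change made in the PR: the logger name, the helper `log_processing_failure`, and the `worker:<thread name>` label format are all assumptions. It relies only on the standard library's `threading.current_thread().name` and the `extra` argument of `logging`.

```python
import logging
import threading

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("evals.sketch")


def log_processing_failure(exc: Exception) -> dict:
    """Log a processing failure and return the structured context used."""
    extra = {
        "error_type": "processing_error",
        # Assumption: the thread name is appended to the component label.
        "component": f"worker:{threading.current_thread().name}",
    }
    logger.error("Evaluator processing failed", extra=extra, exc_info=exc)
    return extra
```

With named worker threads, each log record then identifies which worker hit the failure, which helps when several workers drain the same queue.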
util/opentelemetry-util-genai-evals/src/opentelemetry/util/genai/evals/manager.py
@keith-decker Once the recent changes for queue and concurrency from main are merged the error handling needs to be updated accordingly.
- Added support for tracking errors by worker name and distinguishing between async and sync errors in ErrorTracker.
- Improved ErrorEvent to include worker name and async context.
- Updated Manager to log detailed error information during queue full and processing failures.
- Added unit tests for concurrent error scenarios and validation of error tracking functionality.
@adityamehra This PR has been updated to support concurrency and the queue.
Enhance error reporting for evaluation framework
Description
This PR improves error handling and observability in the evaluation framework by introducing structured error tracking and enhanced logging throughout the evaluation pipeline.
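As a rough illustration of what structured error tracking can mean here, the following is a minimal sketch and not the PR's actual code. The class names `ErrorEvent` and `ErrorTracker` come from the description, but every field name, default, and method (`record`, `summary`, `max_events`) is an assumption made for the example, and rate limiting is left out.

```python
import threading
import time
from collections import Counter, deque
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ErrorEvent:
    """Hypothetical shape of a structured error record."""

    error_type: str
    component: str
    message: str
    severity: str = "error"
    timestamp: float = field(default_factory=time.time)
    details: dict[str, Any] = field(default_factory=dict)
    recovery_action: Optional[str] = None


class ErrorTracker:
    """Thread-safe aggregation of error events (sketch only)."""

    def __init__(self, max_events: int = 100) -> None:
        self._lock = threading.Lock()
        # Bounded buffer: the oldest events are dropped once it is full.
        self._events: deque = deque(maxlen=max_events)
        self._counts: Counter = Counter()

    def record(self, event: ErrorEvent) -> None:
        with self._lock:
            self._counts[event.error_type] += 1
            self._events.append(event)

    def summary(self) -> dict[str, Any]:
        with self._lock:
            return {
                "total_errors": sum(self._counts.values()),
                "by_type": dict(self._counts),
                "recent": [e.message for e in list(self._events)[-5:]],
            }
```

A caller would record one event per failure and read `summary()` for diagnostics. The lock matters because worker threads may record errors concurrently.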
Changes
New Components:
- ErrorEvent dataclass for structured error representation with comprehensive context (timestamp, type, severity, component, message, details, and recovery actions)
- ErrorTracker class for tracking and aggregating errors with rate limiting capabilities and summary statistics
Enhanced Manager Error Handling:
- _enqueue_invocation() to capture queue errors with context
- _worker_loop() to track and log processing failures with structured data
- _publish_results() without losing evaluation data
New Features:
- get_error_summary() method to Manager for diagnostic access to tracked errors
- ErrorEvent and ErrorTracker from public API
Benefits