feat(operators): wire metrics field and add collect_metrics() to OperatorManager #203

Open
bootcrowns wants to merge 2 commits into open-jarvis:main from bootcrowns:feat/operator-metrics-collection

@bootcrowns
Summary

OperatorManifest has had a metrics: List[str] field since the initial commit, but OperatorManager never read it. As a result, the monitoring infrastructure described in the roadmap (Workstream 1 – Metrics Collection) remained unimplemented.

This PR wires the metrics field through the manager in three places:

Changes

activate() — passes manifest.metrics into the scheduler task metadata so workers/plugins can introspect which metrics the operator declares.

status() — includes metrics in the per-operator status dict alongside tools, schedule_type, etc., so callers get a complete view of the operator's declared monitoring intent.

collect_metrics(operator_id, *, since, until) [new method] — queries system.telemetry (a TelemetryAggregator instance) and returns only the summary-level stat fields explicitly declared in manifest.metrics. Behaviour:

  • Returns {} gracefully when manifest.metrics is empty or telemetry is not configured.
  • Unknown metric names are skipped with a DEBUG log (forward-compatible with future stats).
  • Supports optional since/until Unix-timestamp filters, passed through to TelemetryAggregator.summary().
  • All telemetry errors are caught and logged, never raised to callers.
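
The behaviour listed above can be sketched roughly as follows. This is an illustrative standalone function, not the code in this PR: `FakeStats` and `FakeAggregator` are hypothetical stand-ins for AggregatedStats and TelemetryAggregator, and the keyword form of `summary(since=..., until=...)` is an assumption.

```python
import logging
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

logger = logging.getLogger(__name__)


@dataclass
class FakeStats:
    """Hypothetical stand-in for AggregatedStats (only two fields mirrored)."""

    total_calls: int = 0
    total_tokens: int = 0


class FakeAggregator:
    """Hypothetical stand-in for TelemetryAggregator; signature is assumed."""

    def summary(
        self, since: Optional[float] = None, until: Optional[float] = None
    ) -> FakeStats:
        return FakeStats(total_calls=12, total_tokens=3400)


def collect_declared_metrics(
    declared: List[str],
    telemetry: Optional[Any],
    *,
    since: Optional[float] = None,
    until: Optional[float] = None,
) -> Dict[str, Any]:
    """Return only the declared summary fields; never raise to the caller."""
    # Nothing declared, or telemetry not configured: empty result.
    if not declared or telemetry is None:
        return {}
    try:
        stats = telemetry.summary(since=since, until=until)
    except Exception:
        # Telemetry errors are logged and swallowed.
        logger.exception("telemetry summary failed")
        return {}
    result: Dict[str, Any] = {}
    for name in declared:
        if not hasattr(stats, name):
            # Unknown names are skipped so newer manifests do not break.
            logger.debug("skipping unknown metric %r", name)
            continue
        result[name] = getattr(stats, name)
    return result
```

With these stand-ins, a call that declares `total_calls` plus an unrecognised name returns only the `total_calls` entry, and passing `telemetry=None` returns an empty dict.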

Supported metric names

Any field on AggregatedStats from openjarvis.telemetry.aggregator:
total_calls, total_tokens, total_cost, total_latency, total_energy_joules, avg_throughput_tok_per_sec, avg_gpu_utilization_pct, avg_energy_per_output_token_joules, avg_throughput_per_watt, total_prefill_energy_joules, total_decode_energy_joules, avg_mean_itl_ms, avg_median_itl_ms, avg_p95_itl_ms.

Example operator TOML

id = "monitor-agent"
name = "Monitor Agent"
schedule_type = "interval"
schedule_value = "300"
metrics = ["total_calls", "avg_gpu_utilization_pct", "total_energy_joules"]

Testing

  • No existing tests were changed or broken; the change is purely additive.
  • collect_metrics is safe to call when system.telemetry is None (returns {}).

Relates to Workstream 1 (Metrics Collection) in the development roadmap.

…atorManager

OperatorManifest has had a `metrics: List[str]` field since the initial
commit but OperatorManager never read it. This change wires that field
through the manager in three ways:

1. activate(): passes `metrics` into the scheduler task metadata so
   workers can introspect which metrics an operator cares about.

2. status(): includes `metrics` in the per-operator status dict returned
   to callers, making it visible alongside tools and schedule info.

3. collect_metrics(operator_id, *, since, until) [new method]: queries
   the system's TelemetryAggregator (system.telemetry) and returns only
   the summary fields explicitly declared in manifest.metrics. Unknown
   metric names are skipped with a DEBUG log so old manifests remain
   forward-compatible. Returns an empty dict gracefully when telemetry
   is not configured.
@robbym-dev
Collaborator

Hi @bootcrowns,

Thank you for your contribution! Would you be able to repair the lint errors? Once fixed, I can merge the PR. Thank you so much!

@bootcrowns
Author

Hi @robbym-dev, thanks for the review! I've wrapped the long lines in collect_metrics to satisfy the E501 lint check. Ready for merge now. I've also pushed the requested fixes for #202. Thanks!
