Skip to content

Conversation

@fregataa
Copy link
Member

@fregataa fregataa commented Jan 21, 2026

resolves #NNN (BA-MMM)

Summary

  • Fixed unhandled GPFSNoMetricError exception in GPFS volume performance metric collection
  • When GPFS performance metric data is unavailable, return an empty FSPerfMetric with zero values instead of raising an exception

Problem

The GPFSNoMetricError exception was raised when GPFS API returned no performance metric data (due to KeyError or IndexError), but this exception was not caught or handled anywhere in the codebase. This caused unexpected crashes when metrics were temporarily unavailable.

Solution

Instead of raising an unhandled exception, return a valid FSPerfMetric object with all values set to zero. This allows the system to gracefully handle cases where metric data is unavailable without disrupting normal operations.

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • Update of end-to-end CLI integration tests in ai.backend.test
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

@fregataa fregataa added this to the 25.15 milestone Jan 21, 2026
@fregataa fregataa requested a review from HyeockJinKim January 21, 2026 16:21
@fregataa fregataa self-assigned this Jan 21, 2026
Copilot AI review requested due to automatic review settings January 21, 2026 16:21
@github-actions github-actions bot added size:S 10~30 LoC comp:storage-proxy Related to Storage proxy component labels Jan 21, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a bug where GPFS performance metrics would cause an unhandled exception when unavailable. Instead of raising GPFSNoMetricError (which was not properly handled as an HTTP exception), the code now returns a zero-valued FSPerfMetric when metrics cannot be retrieved.

Changes:

  • Return empty performance metrics (all zeros) instead of raising GPFSNoMetricError when GPFS metrics are unavailable
  • Remove unused import of GPFSNoMetricError from the GPFS volume implementation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +258 to +265
return FSPerfMetric(
iops_read=0,
iops_write=0,
io_bytes_read=0,
io_bytes_write=0,
io_usec_read=0,
io_usec_write=0,
)
Copy link

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using -1 instead of 0 for unavailable metrics to be consistent with the VAST volume implementation. The VAST implementation uses -1 values to explicitly indicate that metrics are unavailable, which is more semantically correct than 0 (which could be confused with actual zero activity). This would make the error handling pattern more consistent across different storage backend implementations.

Copilot uses AI. Check for mistakes.
@fregataa fregataa requested a review from jopemachine January 22, 2026 02:14
metrics = await self.api_client.get_metric(query)
latest_metric = metrics["performanceData"]["rows"][-1]["values"]
except (KeyError, IndexError):
raise GPFSNoMetricError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GPFSNoMetricError is not used anywhere, so we can remove it

@jopemachine
Copy link
Member

Is there a reason for changing the behavior that previously raised an exception?
I think it would be helpful if the rationale were explained in the PR or the related issue.

@fregataa fregataa force-pushed the fix/gpfs-return-empty-performance-metric branch from 1ff4fc3 to e14010f Compare January 22, 2026 03:58
@fregataa fregataa requested a review from jopemachine January 22, 2026 03:59
Copy link
Collaborator

@HyeockJinKim HyeockJinKim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like return an empty value instead of raising an error might be riskier.
What's the reason for this work?

@fregataa fregataa marked this pull request as draft January 22, 2026 05:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp:storage-proxy Related to Storage proxy component size:S 10~30 LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants