Add Micrometer metrics to track xDS resource snapshot versions#1248
Add Micrometer metrics to track xDS resource snapshot versions#1248
Conversation
📝 WalkthroughWalkthroughAdds Micrometer-based control plane metrics: a new ControlPlaneMetrics class, MeterRegistry wiring into ControlPlaneService, per-resource version accessors on CentralDogmaSnapshot, widened visibility for CentralDogmaSnapshotResources.allResourceVersion(), and a unit test validating per-resource snapshot metrics. Startup/diff/stop flows now emit metrics. Changes
Sequence DiagramsequenceDiagram
participant CPS as ControlPlaneService
participant CDS as CentralDogmaSnapshot
participant CPM as ControlPlaneMetrics
participant MR as MeterRegistry
CPS->>CDS: capture snapshot (startup / on diff)
CDS-->>CPS: return snapshot with per-resource versions
CPS->>CPM: onSnapshotUpdate(snapshot)
CPM->>MR: update per-resource counters and gauges
CPS->>CPM: onStopped()
CPM->>MR: set init gauges to -1
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
xds/src/main/java/com/linecorp/centraldogma/xds/internal/ControlPlaneService.java (1)
211-215: MissingupdateVersionMetricscall when a group is removed.When a group is removed,
cache.setSnapshotis called butupdateVersionMetricsis not. This could leave stale version metrics until the next diff is processed.🐛 Proposed fix
`@Override` protected void onGroupRemoved(String groupName) { centralDogmaXdsResources.removeGroup(groupName); - cache.setSnapshot(DEFAULT_GROUP, centralDogmaXdsResources.snapshot()); + final CentralDogmaSnapshot snapshot = centralDogmaXdsResources.snapshot(); + cache.setSnapshot(DEFAULT_GROUP, snapshot); + updateVersionMetrics(snapshot); }
🧹 Nitpick comments (1)
xds/src/main/java/com/linecorp/centraldogma/xds/internal/ControlPlaneService.java (1)
238-288: Consider extracting a helper to reduce duplication.The four code blocks for updating version gauges are nearly identical. This could be simplified by using a
Map<String, Gauge>for the gauges and extracting a helper method.♻️ Suggested refactor using a Map
- `@Nullable` - private Gauge clustersVersionGauge; - `@Nullable` - private Gauge endpointsVersionGauge; - `@Nullable` - private Gauge listenersVersionGauge; - `@Nullable` - private Gauge routesVersionGauge; + private final Map<String, Gauge> versionGauges = new HashMap<>();Then refactor
updateVersionMetrics:private void updateVersionMetrics(CentralDogmaSnapshot snapshot) { updateVersionGauge("cluster", snapshot.clustersVersion()); updateVersionGauge("endpoint", snapshot.endpointsVersion()); updateVersionGauge("listener", snapshot.listenersVersion()); updateVersionGauge("route", snapshot.routesVersion()); } private void updateVersionGauge(String resourceType, String version) { final Gauge existingGauge = versionGauges.get(resourceType); if (existingGauge != null && version.equals(existingGauge.getId().getTag("version"))) { return; } if (existingGauge != null) { existingGauge.close(); meterRegistry.remove(existingGauge); } final Gauge newGauge = Gauge.builder(VERSION_METRIC_NAME, () -> 1) .tags(Tags.of("resource_type", resourceType, "version", version)) .register(meterRegistry); versionGauges.put(resourceType, newGauge); }
Motivation:
To monitor and track xDS resource changes, it's useful to expose the
version of each resource type as Micrometer metrics. This allows
operators to observe version changes in monitoring systems like
Prometheus.
Modifications:
- Added info gauges in `ControlPlaneService` for each resource type
with version as a tag
- Metric name: `xds.control.plane.snapshot.version`
- Tags: `resource_type` (cluster/endpoint/listener/route/secret),
`version` (SHA-256 hash)
- Added version getter methods to `CentralDogmaSnapshot`:
`clustersVersion()`, `endpointsVersion()`, `listenersVersion()`,
`routesVersion()`, `secretsVersion()`
- Made `allResourceVersion()` package-private in
`CentralDogmaSnapshotResources`
Result:
The version of xDS resources can now be tracked via Micrometer metrics.
0a77bd1 to
85127f0
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
xds/src/main/java/com/linecorp/centraldogma/xds/internal/ControlPlaneService.java (1)
309-334: Consider cleaning up gauges on stop.The version gauges remain registered in the
MeterRegistryafter the service stops. While this may be acceptable ifstop()is only called at application shutdown, consider removing the gauges for cleaner resource management.🔧 Suggested cleanup in stop()
void stop() { stop = true; + // Clean up version gauges + if (clustersVersionGauge != null) { + meterRegistry.remove(clustersVersionGauge); + } + if (endpointsVersionGauge != null) { + meterRegistry.remove(endpointsVersionGauge); + } + if (listenersVersionGauge != null) { + meterRegistry.remove(listenersVersionGauge); + } + if (routesVersionGauge != null) { + meterRegistry.remove(routesVersionGauge); + } + if (secretsVersionGauge != null) { + meterRegistry.remove(secretsVersionGauge); + } + final XdsEndpointService xdsEndpointService = this.xdsEndpointService;
🧹 Nitpick comments (2)
xds/src/main/java/com/linecorp/centraldogma/xds/internal/ControlPlaneService.java (2)
95-104: Consider documenting thread-safety assumption for gauge fields.Line 89-90 documents that
centralDogmaXdsResourcesis "Accessed only from controlPlaneExecutor". These gauge fields follow the same access pattern and would benefit from similar documentation for maintainability.
240-302: Reduce duplication by extracting a helper method.The same gauge update pattern is repeated five times. Consider extracting a helper to improve maintainability:
♻️ Proposed refactor
private Gauge updateVersionGauge(`@Nullable` Gauge existingGauge, String resourceType, String newVersion) { if (existingGauge != null && newVersion.equals(existingGauge.getId().getTag("version"))) { return existingGauge; } if (existingGauge != null) { meterRegistry.remove(existingGauge); } return Gauge.builder(VERSION_METRIC_NAME, () -> 1) .tags(Tags.of("resource", resourceType, "version", newVersion)) .register(meterRegistry); } private void updateVersionMetrics(CentralDogmaSnapshot snapshot) { clustersVersionGauge = updateVersionGauge( clustersVersionGauge, "cluster", snapshot.clusters().version()); endpointsVersionGauge = updateVersionGauge( endpointsVersionGauge, "endpoint", snapshot.endpoints().version()); listenersVersionGauge = updateVersionGauge( listenersVersionGauge, "listener", snapshot.listeners().version()); routesVersionGauge = updateVersionGauge( routesVersionGauge, "route", snapshot.routes().version()); secretsVersionGauge = updateVersionGauge( secretsVersionGauge, "secret", snapshot.secrets().version()); }Note:
gauge.close()is typically a no-op in Micrometer—meterRegistry.remove()handles the deregistration. Theclose()call can be omitted.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In
`@xds/src/main/java/com/linecorp/centraldogma/xds/internal/ControlPlaneService.java`:
- Around line 240-301: In updateVersionMetrics, the Gauge tags use "resource"
but should use "resource_type"; update all Gauge.builder(...) calls that set
Tags.of("resource", ...) to Tags.of("resource_type", ...) for
clustersVersionGauge, endpointsVersionGauge, listenersVersionGauge,
routesVersionGauge, and secretsVersionGauge so the tag key matches the
documented metric contract and dashboards/alerts.
xds/src/main/java/com/linecorp/centraldogma/xds/internal/ControlPlaneService.java
Outdated
Show resolved
Hide resolved
| .tags(Tags.of("resource", "cluster", "version", clustersVersion)) | ||
| .register(meterRegistry); |
There was a problem hiding this comment.
- It may be worth recording 0 to the gauge to differentiate between 1) a version not loaded at all 2) a server down (e.g. due to maintenance)
- FWIU version strings are a hash, which means the number of
MeterIdmay be unbounded with the current implementation.
There was a problem hiding this comment.
It may be worth recording 0 to the gauge to differentiate between 1) a version not loaded at all 2) a server down (e.g. due to maintenance)
Good idea.
The number of MeterId may be unbounded with the current implementation.
When a resource version is changed, the previous meters are closed. Therefore, the number of meters is fixed at 5 regardless of version strings.
There was a problem hiding this comment.
The number of MeterId may be unbounded with the current implementation.
For the record, I was concerned more about downstream aggregators rather than the application
There was a problem hiding this comment.
I agree with you. Since the hash value is not important, how about using counters instead? We can probably just increase the counter when it's changed, and I think it will be enough.
There was a problem hiding this comment.
I'm not sure how to correctly set up an alarm using a counter when the xDS resources is not initialized properly. If the counter does not increase for a certain period time, should that be regarded as an abnormal situation?
If you are concerned the cardinality of the metrics, we may add two metrics.
- A guage to return 0 if it a resource is not initialized.
- A counter to increase when a resource is updated.
There was a problem hiding this comment.
If the counter does not increase for a certain period time, should that be regarded as an abnormal situation?
No, I don't think we should rely on the counter to spot that issue. The same applies to using a gauge. If we want to detect that situation, we would need a different approach, such as checking the counter after a change is committed.
There was a problem hiding this comment.
There are only five resource types, so I don't think frequent changes to resource versions would overload the downstream aggregators. It would be better to explore other approaches, but integrating change timestamps with the monitoring systems doesn't seem easy. For now, I will set up two metrics suggested above.
There was a problem hiding this comment.
For now, I will set up two metrics suggested above.
Yeah, that will work. 👍
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In
`@xds/src/main/java/com/linecorp/centraldogma/xds/internal/ControlPlaneMetrics.java`:
- Around line 30-34: The integer fields used by Gauge suppliers in
ControlPlaneMetrics (clustersInit, endpointsInit, listenersInit, routesInit,
secretsInit) should be declared volatile to guarantee visibility across threads;
update their declarations to volatile and ensure onSnapshotUpdate continues to
write to these fields and the Gauge suppliers (e.g., () -> clustersInit) read
them directly so the meter registry sees consistent updates from other threads.
In
`@xds/src/test/java/com/linecorp/centraldogma/xds/internal/XdsSnapshotVersionMetricsTest.java`:
- Around line 107-119: The test helper methods getRevisionFromCounter and
isResourceInitialized can NPE when meterRegistry.find(...).counter()/gauge()
returns null; add null checks using test assertions (e.g., assertNotNull) for
the Counter returned in getRevisionFromCounter and the Gauge returned in
isResourceInitialized before calling .count()/.value(), and include the
resourceType in the assertion message so the test fails with a clear message if
the meter is missing.
🧹 Nitpick comments (3)
.gitignore (1)
143-145: Unrelated change to PR objectives.This addition to ignore
AGENTS.mdappears unrelated to the xDS metrics feature described in the PR objectives. Consider moving this to a separate commit or PR to maintain a clean commit history focused on the metrics feature.xds/src/test/java/com/linecorp/centraldogma/xds/internal/XdsSnapshotVersionMetricsTest.java (1)
53-105: Consider expanding test coverage.The test only verifies metrics for the
clusterresource type. Consider adding test cases for:
- Other resource types (endpoint, listener, route) to ensure metrics work across all types.
- The
onStopped()behavior to verify the initialized gauge transitions to-1.xds/src/main/java/com/linecorp/centraldogma/xds/internal/ControlPlaneMetrics.java (1)
87-142: Consider reducing code duplication inonSnapshotUpdate.The logic for each resource type is identical: compare version → update stored version → set init flag → increment counter. This pattern is repeated five times. Consider extracting a helper method to reduce duplication and improve maintainability.
♻️ Proposed refactor using a helper method
+ private interface VersionSupplier { + String get(); + } + + private String updateResourceMetrics(String currentVersion, `@Nullable` String storedVersion, + Counter counter, int[] initFlag) { + if (!currentVersion.equals(storedVersion)) { + initFlag[0] = "empty_resources".equals(currentVersion) ? 0 : 1; + counter.increment(); + return currentVersion; + } + return storedVersion; + }Alternatively, consider using an enum or a small data class to hold per-resource state (version, init flag, counter), then iterate over the resource types in a loop.
xds/src/main/java/com/linecorp/centraldogma/xds/internal/ControlPlaneMetrics.java
Outdated
Show resolved
Hide resolved
xds/src/test/java/com/linecorp/centraldogma/xds/internal/XdsSnapshotVersionMetricsTest.java
Show resolved
Hide resolved
…1248) Motivation: To monitor and track xDS resource changes, it's useful to expose the version of each resource type as Micrometer metrics. This allows operators to observe version changes in monitoring systems like Prometheus. Modifications: - Added info gauges in `ControlPlaneService` for each resource type with version as a tag - Metric name: `xds.control.plane.snapshot.version` - Tags: `resource` (cluster/endpoint/listener/route), `version` (SHA-256 hash) - Added version getter methods to `CentralDogmaSnapshot`: `clustersVersion()`, `endpointsVersion()`, `listenersVersion()`, `routesVersion()` - Made `allResourceVersion()` package-private in `CentralDogmaSnapshotResources` Result: The version of xDS resources can now be tracked via Micrometer metrics. Signed-off-by: Seonghyeon Cho <[email protected]>
Motivation:
To monitor and track xDS resource changes, it's useful to expose the
version of each resource type as Micrometer metrics. This allows
operators to observe version changes in monitoring systems like
Prometheus.
Modifications:
ControlPlaneServicefor each resource typewith version as a tag
xds.control.plane.snapshot.versionresource(cluster/endpoint/listener/route),version(SHA-256 hash)CentralDogmaSnapshot:clustersVersion(),endpointsVersion(),listenersVersion(),routesVersion()allResourceVersion()package-private inCentralDogmaSnapshotResourcesResult:
The version of xDS resources can now be tracked via Micrometer metrics.