Skip to content

Add Micrometer metrics to track xDS resource snapshot versions#1248

Merged
ikhoon merged 7 commits intoline:mainfrom
ikhoon:xds-resource-version-metric
Feb 5, 2026
Merged

Add Micrometer metrics to track xDS resource snapshot versions#1248
ikhoon merged 7 commits intoline:mainfrom
ikhoon:xds-resource-version-metric

Conversation

@ikhoon
Copy link
Contributor

@ikhoon ikhoon commented Jan 23, 2026

Motivation:

To monitor and track xDS resource changes, it's useful to expose the
version of each resource type as Micrometer metrics. This allows
operators to observe version changes in monitoring systems like
Prometheus.

Modifications:

  • Added info gauges in ControlPlaneService for each resource type
    with version as a tag
    • Metric name: xds.control.plane.snapshot.version
    • Tags: resource (cluster/endpoint/listener/route),
      version (SHA-256 hash)
  • Added version getter methods to CentralDogmaSnapshot:
    clustersVersion(), endpointsVersion(), listenersVersion(),
    routesVersion()
  • Made allResourceVersion() package-private in
    CentralDogmaSnapshotResources

Result:

The version of xDS resources can now be tracked via Micrometer metrics.

@coderabbitai
Copy link

coderabbitai bot commented Jan 23, 2026

📝 Walkthrough

Walkthrough

Adds Micrometer-based control plane metrics: a new ControlPlaneMetrics class, MeterRegistry wiring into ControlPlaneService, per-resource version accessors on CentralDogmaSnapshot, widened visibility for CentralDogmaSnapshotResources.allResourceVersion(), and a unit test validating per-resource snapshot metrics. Startup/diff/stop flows now emit metrics.

Changes

Cohort / File(s) Summary
Control Plane Service
xds/src/main/java/com/linecorp/centraldogma/xds/internal/ControlPlaneService.java
Initializes ControlPlaneMetrics with a MeterRegistry; captures snapshots on startup/on-diff, updates cache via snapshot handling and calls metrics.onSnapshotUpdate(snapshot); calls metrics.onStopped() on stop.
Metrics Implementation
xds/src/main/java/com/linecorp/centraldogma/xds/internal/ControlPlaneMetrics.java
New package-private class registering per-resource initialized Gauges and revision Counters; tracks per-resource versions, updates gauges/counters on snapshot changes, and sets init flags to -1 on stop.
Snapshot Accessors
xds/src/main/java/com/linecorp/centraldogma/xds/internal/CentralDogmaSnapshot.java
Adds package-private accessors clustersVersion(), endpointsVersion(), listenersVersion(), routesVersion(), secretsVersion() and a helper resolver to obtain per-resource versions.
Resources Visibility
xds/src/main/java/com/linecorp/centraldogma/xds/internal/CentralDogmaSnapshotResources.java
Changes allResourceVersion() visibility from private to package-private to allow use by snapshot accessors/metrics.
Tests
xds/src/test/java/com/linecorp/centraldogma/xds/internal/XdsSnapshotVersionMetricsTest.java
Adds unit test verifying per-resource revision counters and initialized gauges through resource creation and updates using a test MeterRegistry.
Ignore List
.gitignore
Adds AGENTS.md under a new "Codex" section to be ignored.

Sequence Diagram

sequenceDiagram
    participant CPS as ControlPlaneService
    participant CDS as CentralDogmaSnapshot
    participant CPM as ControlPlaneMetrics
    participant MR as MeterRegistry

    CPS->>CDS: capture snapshot (startup / on diff)
    CDS-->>CPS: return snapshot with per-resource versions
    CPS->>CPM: onSnapshotUpdate(snapshot)
    CPM->>MR: update per-resource counters and gauges
    CPS->>CPM: onStopped()
    CPM->>MR: set init gauges to -1
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I hopped through snapshots, counting each name,
Gauges set to ready, counters rose in frame.
Clusters, routes, secrets — I noted each beat,
Metrics now hum where control planes meet.
A little rabbit clap — observability neat!

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: adding Micrometer metrics to track xDS resource snapshot versions, which aligns with the primary objective of this PR.
Description check ✅ Passed The description is well-related to the changeset, clearly explaining motivation, modifications, and results of the metrics tracking implementation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
xds/src/main/java/com/linecorp/centraldogma/xds/internal/ControlPlaneService.java (1)

211-215: Missing updateVersionMetrics call when a group is removed.

When a group is removed, cache.setSnapshot is called but updateVersionMetrics is not. This could leave stale version metrics until the next diff is processed.

🐛 Proposed fix
     `@Override`
     protected void onGroupRemoved(String groupName) {
         centralDogmaXdsResources.removeGroup(groupName);
-        cache.setSnapshot(DEFAULT_GROUP, centralDogmaXdsResources.snapshot());
+        final CentralDogmaSnapshot snapshot = centralDogmaXdsResources.snapshot();
+        cache.setSnapshot(DEFAULT_GROUP, snapshot);
+        updateVersionMetrics(snapshot);
     }
🧹 Nitpick comments (1)
xds/src/main/java/com/linecorp/centraldogma/xds/internal/ControlPlaneService.java (1)

238-288: Consider extracting a helper to reduce duplication.

The four code blocks for updating version gauges are nearly identical. This could be simplified by using a Map<String, Gauge> for the gauges and extracting a helper method.

♻️ Suggested refactor using a Map
-    `@Nullable`
-    private Gauge clustersVersionGauge;
-    `@Nullable`
-    private Gauge endpointsVersionGauge;
-    `@Nullable`
-    private Gauge listenersVersionGauge;
-    `@Nullable`
-    private Gauge routesVersionGauge;
+    private final Map<String, Gauge> versionGauges = new HashMap<>();

Then refactor updateVersionMetrics:

private void updateVersionMetrics(CentralDogmaSnapshot snapshot) {
    updateVersionGauge("cluster", snapshot.clustersVersion());
    updateVersionGauge("endpoint", snapshot.endpointsVersion());
    updateVersionGauge("listener", snapshot.listenersVersion());
    updateVersionGauge("route", snapshot.routesVersion());
}

private void updateVersionGauge(String resourceType, String version) {
    final Gauge existingGauge = versionGauges.get(resourceType);
    if (existingGauge != null && version.equals(existingGauge.getId().getTag("version"))) {
        return;
    }
    if (existingGauge != null) {
        existingGauge.close();
        meterRegistry.remove(existingGauge);
    }
    final Gauge newGauge = Gauge.builder(VERSION_METRIC_NAME, () -> 1)
                                .tags(Tags.of("resource_type", resourceType, "version", version))
                                .register(meterRegistry);
    versionGauges.put(resourceType, newGauge);
}

Motivation:

To monitor and track xDS resource changes, it's useful to expose the
version of each resource type as Micrometer metrics. This allows
operators to observe version changes in monitoring systems like
Prometheus.

Modifications:

- Added info gauges in `ControlPlaneService` for each resource type
  with version as a tag
  - Metric name: `xds.control.plane.snapshot.version`
  - Tags: `resource_type` (cluster/endpoint/listener/route/secret),
    `version` (SHA-256 hash)
- Added version getter methods to `CentralDogmaSnapshot`:
  `clustersVersion()`, `endpointsVersion()`, `listenersVersion()`,
  `routesVersion()`, `secretsVersion()`
- Made `allResourceVersion()` package-private in
  `CentralDogmaSnapshotResources`

Result:

The version of xDS resources can now be tracked via Micrometer metrics.
@ikhoon ikhoon force-pushed the xds-resource-version-metric branch from 0a77bd1 to 85127f0 Compare January 23, 2026 07:49
@ikhoon ikhoon added this to the 0.80.0 milestone Jan 23, 2026
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
xds/src/main/java/com/linecorp/centraldogma/xds/internal/ControlPlaneService.java (1)

309-334: Consider cleaning up gauges on stop.

The version gauges remain registered in the MeterRegistry after the service stops. While this may be acceptable if stop() is only called at application shutdown, consider removing the gauges for cleaner resource management.

🔧 Suggested cleanup in stop()
 void stop() {
     stop = true;
+    // Clean up version gauges
+    if (clustersVersionGauge != null) {
+        meterRegistry.remove(clustersVersionGauge);
+    }
+    if (endpointsVersionGauge != null) {
+        meterRegistry.remove(endpointsVersionGauge);
+    }
+    if (listenersVersionGauge != null) {
+        meterRegistry.remove(listenersVersionGauge);
+    }
+    if (routesVersionGauge != null) {
+        meterRegistry.remove(routesVersionGauge);
+    }
+    if (secretsVersionGauge != null) {
+        meterRegistry.remove(secretsVersionGauge);
+    }
+
     final XdsEndpointService xdsEndpointService = this.xdsEndpointService;
🧹 Nitpick comments (2)
xds/src/main/java/com/linecorp/centraldogma/xds/internal/ControlPlaneService.java (2)

95-104: Consider documenting thread-safety assumption for gauge fields.

Line 89-90 documents that centralDogmaXdsResources is "Accessed only from controlPlaneExecutor". These gauge fields follow the same access pattern and would benefit from similar documentation for maintainability.


240-302: Reduce duplication by extracting a helper method.

The same gauge update pattern is repeated five times. Consider extracting a helper to improve maintainability:

♻️ Proposed refactor
private Gauge updateVersionGauge(`@Nullable` Gauge existingGauge,
                                  String resourceType,
                                  String newVersion) {
    if (existingGauge != null &&
        newVersion.equals(existingGauge.getId().getTag("version"))) {
        return existingGauge;
    }
    if (existingGauge != null) {
        meterRegistry.remove(existingGauge);
    }
    return Gauge.builder(VERSION_METRIC_NAME, () -> 1)
                .tags(Tags.of("resource", resourceType, "version", newVersion))
                .register(meterRegistry);
}

private void updateVersionMetrics(CentralDogmaSnapshot snapshot) {
    clustersVersionGauge = updateVersionGauge(
            clustersVersionGauge, "cluster", snapshot.clusters().version());
    endpointsVersionGauge = updateVersionGauge(
            endpointsVersionGauge, "endpoint", snapshot.endpoints().version());
    listenersVersionGauge = updateVersionGauge(
            listenersVersionGauge, "listener", snapshot.listeners().version());
    routesVersionGauge = updateVersionGauge(
            routesVersionGauge, "route", snapshot.routes().version());
    secretsVersionGauge = updateVersionGauge(
            secretsVersionGauge, "secret", snapshot.secrets().version());
}

Note: gauge.close() is typically a no-op in Micrometer—meterRegistry.remove() handles the deregistration. The close() call can be omitted.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@xds/src/main/java/com/linecorp/centraldogma/xds/internal/ControlPlaneService.java`:
- Around line 240-301: In updateVersionMetrics, the Gauge tags use "resource"
but should use "resource_type"; update all Gauge.builder(...) calls that set
Tags.of("resource", ...) to Tags.of("resource_type", ...) for
clustersVersionGauge, endpointsVersionGauge, listenersVersionGauge,
routesVersionGauge, and secretsVersionGauge so the tag key matches the
documented metric contract and dashboards/alerts.

@ikhoon ikhoon marked this pull request as ready for review February 3, 2026 03:43
Copy link
Contributor

@minwoox minwoox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! 👍 👍 👍

Copy link
Contributor

@jrhee17 jrhee17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 👍

Comment on lines 249 to 250
.tags(Tags.of("resource", "cluster", "version", clustersVersion))
.register(meterRegistry);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • It may be worth recording 0 to the gauge to differentiate between 1) a version not loaded at all 2) a server down (e.g. due to maintenance)
  • FWIU version strings are a hash, which means the number of MeterId may be unbounded with the current implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be worth recording 0 to the gauge to differentiate between 1) a version not loaded at all 2) a server down (e.g. due to maintenance)

Good idea.

The number of MeterId may be unbounded with the current implementation.

When a resource version is changed, the previous meters are closed. Therefore, the number of meters is fixed at 5 regardless of version strings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The number of MeterId may be unbounded with the current implementation.

For the record, I was concerned more about downstream aggregators rather than the application

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you. Since the hash value is not important, how about using counters instead? We can probably just increase the counter when it's changed, and I think it will be enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how to correctly set up an alarm using a counter when the xDS resources is not initialized properly. If the counter does not increase for a certain period time, should that be regarded as an abnormal situation?

If you are concerned the cardinality of the metrics, we may add two metrics.

  • A guage to return 0 if it a resource is not initialized.
  • A counter to increase when a resource is updated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the counter does not increase for a certain period time, should that be regarded as an abnormal situation?

No, I don't think we should rely on the counter to spot that issue. The same applies to using a gauge. If we want to detect that situation, we would need a different approach, such as checking the counter after a change is committed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are only five resource types, so I don't think frequent changes to resource versions would overload the downstream aggregators. It would be better to explore other approaches, but integrating change timestamps with the monitoring systems doesn't seem easy. For now, I will set up two metrics suggested above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I will set up two metrics suggested above.

Yeah, that will work. 👍

@minwoox minwoox self-requested a review February 4, 2026 05:40
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In
`@xds/src/main/java/com/linecorp/centraldogma/xds/internal/ControlPlaneMetrics.java`:
- Around line 30-34: The integer fields used by Gauge suppliers in
ControlPlaneMetrics (clustersInit, endpointsInit, listenersInit, routesInit,
secretsInit) should be declared volatile to guarantee visibility across threads;
update their declarations to volatile and ensure onSnapshotUpdate continues to
write to these fields and the Gauge suppliers (e.g., () -> clustersInit) read
them directly so the meter registry sees consistent updates from other threads.

In
`@xds/src/test/java/com/linecorp/centraldogma/xds/internal/XdsSnapshotVersionMetricsTest.java`:
- Around line 107-119: The test helper methods getRevisionFromCounter and
isResourceInitialized can NPE when meterRegistry.find(...).counter()/gauge()
returns null; add null checks using test assertions (e.g., assertNotNull) for
the Counter returned in getRevisionFromCounter and the Gauge returned in
isResourceInitialized before calling .count()/.value(), and include the
resourceType in the assertion message so the test fails with a clear message if
the meter is missing.
🧹 Nitpick comments (3)
.gitignore (1)

143-145: Unrelated change to PR objectives.

This addition to ignore AGENTS.md appears unrelated to the xDS metrics feature described in the PR objectives. Consider moving this to a separate commit or PR to maintain a clean commit history focused on the metrics feature.

xds/src/test/java/com/linecorp/centraldogma/xds/internal/XdsSnapshotVersionMetricsTest.java (1)

53-105: Consider expanding test coverage.

The test only verifies metrics for the cluster resource type. Consider adding test cases for:

  1. Other resource types (endpoint, listener, route) to ensure metrics work across all types.
  2. The onStopped() behavior to verify the initialized gauge transitions to -1.
xds/src/main/java/com/linecorp/centraldogma/xds/internal/ControlPlaneMetrics.java (1)

87-142: Consider reducing code duplication in onSnapshotUpdate.

The logic for each resource type is identical: compare version → update stored version → set init flag → increment counter. This pattern is repeated five times. Consider extracting a helper method to reduce duplication and improve maintainability.

♻️ Proposed refactor using a helper method
+    private interface VersionSupplier {
+        String get();
+    }
+
+    private String updateResourceMetrics(String currentVersion, `@Nullable` String storedVersion,
+                                         Counter counter, int[] initFlag) {
+        if (!currentVersion.equals(storedVersion)) {
+            initFlag[0] = "empty_resources".equals(currentVersion) ? 0 : 1;
+            counter.increment();
+            return currentVersion;
+        }
+        return storedVersion;
+    }

Alternatively, consider using an enum or a small data class to hold per-resource state (version, init flag, counter), then iterate over the resource types in a loop.

@ikhoon ikhoon merged commit 1759c1b into line:main Feb 5, 2026
12 checks passed
sh-cho pushed a commit to sh-cho/centraldogma that referenced this pull request Feb 6, 2026
…1248)

Motivation:

To monitor and track xDS resource changes, it's useful to expose the
version of each resource type as Micrometer metrics. This allows
operators to observe version changes in monitoring systems like
Prometheus.

Modifications:

- Added info gauges in `ControlPlaneService` for each resource type
  with version as a tag
  - Metric name: `xds.control.plane.snapshot.version`
  - Tags: `resource` (cluster/endpoint/listener/route),
    `version` (SHA-256 hash)
- Added version getter methods to `CentralDogmaSnapshot`:
  `clustersVersion()`, `endpointsVersion()`, `listenersVersion()`,
  `routesVersion()`
- Made `allResourceVersion()` package-private in
  `CentralDogmaSnapshotResources`

Result:

The version of xDS resources can now be tracked via Micrometer metrics.

Signed-off-by: Seonghyeon Cho <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants