[auth] Support token expirations from provider#716
Conversation
Cache credentials per registry glob using provider-reported TTLs (CacheDuration / DefaultCacheDuration). GetCredentials serves from cache unless req.ValidUntil exceeds the entry expiry, letting the renewal loop trigger re-execution selectively without bypassing the cache for all concurrent refs sharing the same registry. Add AuthRequest.ValidUntil so providers that cache internally know the minimum credential lifetime the caller requires.
Set AuthRequest.ValidUntil to the next renewal tick when fetching credentials so providers know the minimum lifetime required. Replace credentialStore.lifetime with renewInterval: lifetime-based expiry is harmful when renewal fails (evicting the last-known-good credential makes things worse). Eviction is driven solely by RAFS reconciliation. Storing renewInterval on the struct removes it from all function signatures.
Reflect the KubeletProvider per-registry TTL cache, ValidUntil-driven re-execution, and the removal of lifetime-based store expiry.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #716 +/- ##
==========================================
+ Coverage 22.02% 22.87% +0.85%
==========================================
Files 130 131 +1
Lines 11931 12063 +132
==========================================
+ Hits 2628 2760 +132
+ Misses 8960 8958 -2
- Partials 343 345 +2
🚀 New features to boost your workflow:
|
| // A zero duration means no caching (plugin will be re-executed on the next request). | ||
| if !expiresAt.IsZero() { | ||
| p.mu.Lock() | ||
| p.registries[registry] = c |
There was a problem hiding this comment.
Thanks for the PR, we seem to cache CredentialProvider's auth by registry, could there be different Image-level Auth distinctions for the same Registry (e.g., images under different namespaces within one Registry)?
There was a problem hiding this comment.
yes there is!
I'm going to try to find a better naming but what I called registry here is actually what the kubelet call a matchImage which is a glob potentially including different namespaces.
So basically, the p.registries map can look something like this:
{
"docker.io": cred1,
"*.io": cred2,
"docker.io/foo": cred3,
"docker.io/foo/bar": cred4,
"docker.io/baz": cred5,
}
and depending on the image ref, it will match the best match in the map.
would it be clearer if I call this registryPattern instead of registry?
There was a problem hiding this comment.
would it be clearer if I call this registryPattern instead of registry?
LGTM, Thanks for the clarification!
* [auth] Add TTL-aware registry cache to KubeletProvider Cache credentials per registry glob using provider-reported TTLs (CacheDuration / DefaultCacheDuration). GetCredentials serves from cache unless req.ValidUntil exceeds the entry expiry, letting the renewal loop trigger re-execution selectively without bypassing the cache for all concurrent refs sharing the same registry. Add AuthRequest.ValidUntil so providers that cache internally know the minimum credential lifetime the caller requires. * [auth] Simplify renewal store and wire ValidUntil Set AuthRequest.ValidUntil to the next renewal tick when fetching credentials so providers know the minimum lifetime required. Replace credentialStore.lifetime with renewInterval: lifetime-based expiry is harmful when renewal fails (evicting the last-known-good credential makes things worse). Eviction is driven solely by RAFS reconciliation. Storing renewInterval on the struct removes it from all function signatures. * [docs] Update credential renewal documentation Reflect the KubeletProvider per-registry TTL cache, ValidUntil-driven re-execution, and the removal of lifetime-based store expiry.
* Merge pull request containerd#707 from DataDog/fricounet/refactor-registry-auth [auth] Refactor providers to use AuthProvider interface * Merge pull request containerd#710 from DataDog/fricounet/kubelet-helpers [auth] Add kubelet credential provider support * [auth] Introduce token renewals system (containerd#714) * [auth] Refactor providers and introduce RenewableProvider Add a RenewableProvider interface with a CanRenew() bool method so that callers can determine at runtime whether a provider's credentials can be refreshed without user interaction. Implement it on DockerProvider, KubeletProvider and KubeSecretProvider (all return true); CRI and Labels providers are excluded. Add String() to all providers for structured log fields. Refactor GetRegistryKeyChain to iterate over a provider list built by a replaceable buildProviders var, enabling unit-test injection without global state manipulation. * [metrics] Add credential renewal counters and store gauge Add CredentialRenewals (counter, by image_ref + result) and CredentialStoreEntries (TTL gauge, by image_ref) to track renewal health and store occupancy. Register both in the global Prometheus registry. * [auth] Add credential store and renewal goroutine Add credentialStore: an in-memory map of image ref -> PassKeyChain with per-entry TTL-based expiration, protected by a sync.RWMutex. Add InitCredentialRenewal, which creates the global store, seeds it from a list of existing refs (used on snapshotter restart), and starts a background goroutine that renews all live entries at a configurable interval. GetRegistryKeyChain now serves from the store on a cache hit and writes back credentials obtained from a RenewableProvider when the store is active. * [config] Add credential_renewal_interval option Add CredentialRenewalInterval (time.Duration, default 0 = disabled) to AuthConfig. When set to a positive value the snapshotter activates the credential renewal subsystem. * [auth] Wire credential renewal at startup and evict on umount Call auth.InitCredentialRenewal after service recovery when CredentialRenewalInterval > 0. Seed the store from unique ImageIDs present in the recovered RAFS cache so credentials for running containers are renewed immediately after a snapshotter restart. Call auth.RemoveCredentials in Filesystem.Umount so the renewal goroutine stops refreshing tokens for images that are no longer mounted, preventing unbounded store growth. * [docs] Extract registry authentication to dedicated page Move the authentication section out of configure_nydus.md into a new docs/registry_authentication.md. Add documentation for the new credential_renewal_interval option and update the cross-reference in configure_nydus.md. * fix lint * [auth] Add renewable-only providers and extract fetchFromProviders Add renewableProviders alongside buildProviders. The renewal goroutine must not use CRI (caches credentials globally after the pull) or Labels (only available at pull time via snapshot labels). Restricting renewal to Docker, Kubelet, and KubeSecret ensures the correct providers are reached at renewal time regardless of what served credentials at pull time. Refactor buildProviders to compose renewableProviders, eliminating duplication. Extract fetchFromProviders from getRegistryKeyChainFromProviders so the renewal goroutine can call providers directly without going through the store cache check. getRegistryKeyChainFromProviders checks the store first, then delegates to fetchFromProviders; the renewal path calls fetchFromProviders directly to obtain fresh credentials. * [auth] Rework renewal for RAFS-driven reconciliation Replace the store-driven eviction model (explicit RemoveCredentials on umount) with a reconciliation loop that derives the active set from the live RAFS instance list on every tick. This fixes two edge cases: - Multiple RAFS instances sharing the same image ref: the first umount no longer evicts the entry while other instances are still running. - Non-renewable providers (CRI, Labels) blocking renewable ones at renewal time: the renewal goroutine now uses renewableProviders() exclusively, so CRI's cached global state never interferes. On each tick, reconcile compares store entries against the live RAFS set. Entries in RAFS are renewed; entries absent from RAFS are evicted once their renewedAt age exceeds interval/2. The interval/2 grace period prevents eviction of entries added during an in-progress pull (the RAFS entry is created after the mount completes, so a tick arriving mid-pull would otherwise evict a valid entry). renewEntry is now a plain function that delegates to fetchFromProviders. The store write on success happens inside fetchFromProviders, so renewEntry only needs to record metrics. InitCredentialRenewal drops the existingRefs parameter; the initial reconcile pass seeds the store from the live RAFS cache directly. * [auth] Remove credential eviction on umount, update docs Remove auth.RemoveCredentials from Filesystem.Umount; eviction is now handled by the reconciliation loop in the renewal goroutine. Remove the auth import from fs.go and the rafs import from snapshotter.go which are no longer needed. Simplify InitCredentialRenewal call site: no existingRefs slice to build, just pass the interval. Update the credential renewal section in docs/registry_authentication.md to reflect that renewal re-queries providers in priority order each tick rather than re-using the provider that originally issued the credentials. * [renewal] Use map[string]struct{} as idiomatic go set * [metrics] Explicitely deregister metrics when ref is removed from store This will prevent the metrics from growing out of hand over time. This also switch the CredentialStoreEntries gauge to using a non-TTL version because the default TTL of 5mins is too small for most cases because the metric is only updated once every renew interval which might be counted in hours. And since passing this interval value to the TTL isn't practical, let's just rely on the same logic to remove deleted ref as for the CountVec. --------- Signed-off-by: Baptiste Girard-Carrabin <baptiste.girardcarrabin@datadoghq.com> * [auth] Support token expirations from provider (containerd#716) * [auth] Add TTL-aware registry cache to KubeletProvider Cache credentials per registry glob using provider-reported TTLs (CacheDuration / DefaultCacheDuration). GetCredentials serves from cache unless req.ValidUntil exceeds the entry expiry, letting the renewal loop trigger re-execution selectively without bypassing the cache for all concurrent refs sharing the same registry. Add AuthRequest.ValidUntil so providers that cache internally know the minimum credential lifetime the caller requires. * [auth] Simplify renewal store and wire ValidUntil Set AuthRequest.ValidUntil to the next renewal tick when fetching credentials so providers know the minimum lifetime required. Replace credentialStore.lifetime with renewInterval: lifetime-based expiry is harmful when renewal fails (evicting the last-known-good credential makes things worse). Eviction is driven solely by RAFS reconciliation. Storing renewInterval on the struct removes it from all function signatures. * [docs] Update credential renewal documentation Reflect the KubeletProvider per-registry TTL cache, ValidUntil-driven re-execution, and the removal of lifetime-based store expiry. * [auth] Hot-reload registry auth via nydusd config API Hook together the token renewal subsystem with the nydusd config API. Move renewal orchestration from pkg/auth to snapshot to avoid circular dependencies as the renewal logic now needs access to the dameons client while the daemon package already indirectly imports the auth package. pkg/auth/renewal.go becomes a pure credential cache with exported InitCredentialStore, RenewCredential, and EvictStaleCredentials. The reconciliation loop formerly in auth/renewal.go and now in snapshot/renewal.go walks all managers->daemons->rags and renews credentials, and hot-reloads daemons directly. It also updates the daemon config file so that they can use fresh creds when they restart. --------- Signed-off-by: Baptiste Girard-Carrabin <baptiste.girardcarrabin@datadoghq.com> Co-authored-by: imeoer <imeoer@linux.alibaba.com>
Overview
Please briefly describe the changes your pull request makes.
This PR improves the prod-readiness of the kubelet credentals provider by properly using the build-in features of the provider which are wildcards and cacheDuration to reduce api calls to the registry.
Related Issues
Please link to the relevant issue. For example:
Fix #123orRelated #456.One more change related to #690.
Next PR will finally hook the renewal subsystem to nydusd
Change Details
Please describe your changes in detail:
lifetimelogic from the renew loop. This interval was potentially harmful in case where creds would have been failing renewal while still being perfectly valid. We should keep attempt to renew creds as long as there is a RAFS for itTest Results
If you have any relevant screenshots or videos that can help illustrate your changes, please add them here.
Tested the new feature locally. Using the aws credential helper, I spun up 2 containers using an image from a private registry and had the expected results:
Change Type
Please select the type of change your pull request relates to:
Self-Checklist
Before submitting a pull request, please ensure you have completed the following: