HNT-1890: shared curated recommendation cache #1306
Conversation
Introduce a shared Redis cache layer between the in-memory SWR cache (L1) and the Corpus GraphQL API to reduce API request volume by ~300x across Merino pods.

Key design:
- Distributed stale-while-revalidate: one pod revalidates while others serve stale data, using SET NX EX for distributed locking
- Embedded soft TTL envelope (2min soft / 10min hard) stored as orjson
- All Redis errors gracefully fall through to the wrapped backend
- Gated behind config flag (corpus_cache.backend = "redis" / "none")
- Adds set_nx and delete to CacheAdapter protocol

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Fix lock release on serialize_fn failure: wrap both fetch and serialize in the try/except that releases the distributed lock
- Add Redis adapter cleanup on shutdown: store adapter in module-level var, close it in new shutdown() function called from main.py lifespan
- Add config validation: CorpusCacheConfig.__post_init__ ensures hard_ttl > soft_ttl and hard_ttl > lock_ttl
- Add tests: config validation, serialize error lock release, empty list caching behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Make cache write best-effort in _revalidate: serialize/write errors no longer propagate when items were successfully fetched from the backend
- Catch deserialize_fn errors (e.g. Pydantic ValidationError) in get_or_fetch and treat as cache miss instead of crashing the request
- Validate that all TTL values are positive in CorpusCacheConfig
- Add unit tests for RedisAdapter.set_nx and delete methods
- Add tests for deserialize_fn errors and malformed envelope (KeyError)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Document unconditional lock DELETE tradeoff: in the rare case that revalidation exceeds lock_ttl_sec, at most one extra API call occurs
- Add test for stale + lock loser + deserialization failure path

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Rename config key from 'backend' to 'cache' for consistency with existing cache = "redis"/"none" convention used by other providers
- Simplify get_or_fetch API: accept (backend_type, surface_id, *extra) and derive data/lock keys internally, eliminating duplicated key building in callers
- Remove redundant _config field from wrapper classes
- Fix mypy type annotation in test helpers (**overrides: Any)
- Strengthen test assertions: verify lock release on write error, verify lock loser doesn't write/delete, verify set_nx call args

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
The _corpus_cache_adapter module-level global was the only resource in the curated_recommendations subsystem needing explicit async cleanup, managed via a separate shutdown() path. This broke the subsystem's self-contained pattern where backends manage their own lifecycle. Now CuratedRecommendationsProvider owns the cache adapter via an optional cache_adapter parameter and a shutdown() method. The module's shutdown() delegates to the provider. This matches how suggest providers handle resource cleanup (provider.shutdown -> backend.shutdown -> cache.close). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
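The resulting ownership shape can be sketched as follows. The class and method names mirror the commit's description, but the constructor signature and body are assumed, not the actual implementation.

```python
import asyncio
from typing import Any, Optional


class CuratedRecommendationsProvider:
    """Sketch of the ownership change: the provider, not a module-level
    global, owns the optional cache adapter and closes it on shutdown.
    Illustrative only; the real constructor takes more arguments."""

    def __init__(self, backend: Any = None,
                 cache_adapter: Optional[Any] = None) -> None:
        self._backend = backend
        self._cache_adapter = cache_adapter

    async def shutdown(self) -> None:
        """Close resources owned by this provider, mirroring the suggest
        provider chain (provider.shutdown -> backend.shutdown -> cache.close)."""
        if self._cache_adapter is not None:
            await self._cache_adapter.close()
```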
- Fix TypeError crash on corrupted envelope with non-numeric expires_at: now caught and treated as cache miss instead of 500ing the request
- Fix lock leak on asyncio.CancelledError: use try/finally in _revalidate so lock is released even for BaseException subclasses
- Broaden _redis_set exception catch to all Exception (not just CacheAdapterError) so serialization errors are also suppressed
- Add graceful fallback in _init_corpus_cache: Redis init failure no longer aborts provider initialization
- Guard shutdown() against uninitialized provider (NameError on startup failure)
- Add defensive get_provider()/get_legacy_provider() with RuntimeError instead of silent NameError
- Add pragma: no cover to new protocol methods for consistency
- Add tests for non-numeric expires_at and CancelledError lock release

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Replace _make_corpus_item helper with generate_corpus_item from test_sections.py. Keep _make_corpus_section as a thin wrapper since no shared CorpusSection fixture exists yet. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- docs/operations/curated-recommendations/corpus-cache.md: operational doc with inline mermaid flow diagram, design decisions table, config reference, and rollout steps
- docs/SUMMARY.md: add curated recommendations section to nav
- PR_DESCRIPTION.md: PR description following repo template

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
On cache miss when the lock is held by another pod, instead of calling the Corpus API directly (which defeats cross-pod coalescing), wait 0.1s and retry Redis. If the lock winner has written data, return it. If not, raise BackendError — the L1 SWR layer will handle the error and retry on the next request. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
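This lock-loser path can be sketched as below. The function and constant names, and the stand-in `BackendError`, are assumptions for illustration; only the 0.1s wait, the single Redis retry, and the error fallback come from the commit message.

```python
import asyncio

LOCK_RETRY_DELAY_SEC = 0.1  # assumed name for the 0.1s wait


class BackendError(Exception):
    """Stand-in for Merino's BackendError."""


async def await_lock_winner(cache, data_key: str) -> bytes:
    """Cold-start miss while another pod holds the lock: wait briefly and
    retry Redis instead of calling the Corpus API directly, which would
    defeat cross-pod request coalescing."""
    await asyncio.sleep(LOCK_RETRY_DELAY_SEC)
    raw = await cache.get(data_key)
    if raw is not None:
        return raw  # the lock winner populated the cache in the meantime
    # Still nothing: fail fast; the L1 SWR layer retries on the next request.
    raise BackendError(f"corpus cache still empty for {data_key!r}")
```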
Force-pushed from 5894b87 to 8e632b5
Returning a 201 code in the scenario where we do not have any (stale) data would improve reliability by avoiding connections piling up. TODO: Check how Firefox Desktop & Mobile would handle a 201 status code. Does it retry?
Q: Are we consistent, or do weather and other providers hit Redis on every request? Are the use-cases different?

Q: What's the existing pattern for testing with Redis?

Q: What are the existing Redis patterns?
```markdown
Key settings:

- `cache` — `"redis"` to enable, `"none"` to disable (default: disabled)
- `soft_ttl_sec` — when a cached entry is considered stale and triggers revalidation
- `hard_ttl_sec` — when Redis evicts the key entirely (safety net)
```
We probably want this to be high: more like 1-24 hours? Days? To match our response time.
```python
    await self.primary.delete(key)
except RedisError as exc:
    raise CacheAdapterError(f"Failed to DELETE `{repr(key)}` with error: `{exc}`") from exc
```
Again, check if this is a new pattern, and whether it should be one.
```toml
# MERINO__CURATED_RECOMMENDATIONS__CORPUS_CACHE__HARD_TTL_SEC
# Hard TTL in seconds. Redis evicts the key after this. Should be much longer than soft TTL.
hard_ttl_sec = 600
```
Suggested change:

```diff
-hard_ttl_sec = 600
+hard_ttl_sec = 3600
```
Long hard ttl?
```toml
# MERINO__CURATED_RECOMMENDATIONS__CORPUS_CACHE__LOCK_TTL_SEC
# Distributed lock TTL in seconds. Auto-releases if the lock holder crashes.
lock_ttl_sec = 30
```
TODO: Check if there's a client network timeout of 30s?
```python
raise ValueError(
    f"hard_ttl_sec ({self.hard_ttl_sec}) must be greater than "
    f"lock_ttl_sec ({self.lock_ttl_sec})"
)
```
Is there a good reason to do this differently from the usual config validation?
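For context, a minimal sketch of the invariants `__post_init__` enforces, per the earlier commit messages (field names and defaults taken from this PR's config; the dataclass structure is assumed):

```python
from dataclasses import dataclass


@dataclass
class CorpusCacheConfig:
    """Illustrative sketch of the TTL invariants; real field set may differ."""

    soft_ttl_sec: int = 120
    hard_ttl_sec: int = 600
    lock_ttl_sec: int = 30

    def __post_init__(self) -> None:
        # All TTLs must be positive.
        for name in ("soft_ttl_sec", "hard_ttl_sec", "lock_ttl_sec"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")
        # Hard TTL must outlive both the soft TTL and the lock.
        if self.hard_ttl_sec <= self.soft_ttl_sec:
            raise ValueError("hard_ttl_sec must be greater than soft_ttl_sec")
        if self.hard_ttl_sec <= self.lock_ttl_sec:
            raise ValueError("hard_ttl_sec must be greater than lock_ttl_sec")
```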
```python
def _build_data_key(
    config: CorpusCacheConfig, backend_type: str, surface_id: str, *extra: str
```
Should backend_type and surface_id be an enum? I believe we already have an enum for the latter.
```python
    str(days_offset),
    fetch_fn=lambda: self._backend.fetch(surface_id, days_offset),
    serialize_fn=lambda items: [item.model_dump(mode="json") for item in items],
    deserialize_fn=lambda data: [CorpusItem.model_validate(d) for d in data],
```
Use named arguments.
```python
return await self._redis_cache.get_or_fetch(
    "scheduled",
    surface_id.value,
    str(days_offset),
```
Let's verify whether we still need this? Do we need to fetch yesterday's data for the schedule?
```python
    config: CorpusCacheConfig, backend_type: str, surface_id: str, *extra: str
) -> str:
    """Build the Redis key for cached corpus data."""
    parts = [config.key_prefix, backend_type, surface_id, *extra]
```
Ideally extra would be structured data.
```python
    "expires_at": time.time() + soft_ttl_sec,
    "data": data,
}
return orjson.dumps(envelope)
```
Let's check if there's a pattern to create an envelope or have a separate expiration key?
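One way to frame the embedded-envelope tradeoff: the soft TTL travels inside the value, so staleness is decided client-side in a single round trip, while the Redis `EX` hard TTL handles eviction separately. A separate expiration key would cost a second round trip (or a pipeline) per read. A minimal sketch, with hypothetical helper names and stdlib `json` standing in for orjson:

```python
import json  # stdlib stand-in; the PR uses orjson
import time


def wrap(data, soft_ttl_sec: int) -> bytes:
    """Embed the soft TTL inside the value: one key, one round trip."""
    return json.dumps({"expires_at": time.time() + soft_ttl_sec, "data": data}).encode()


def unwrap(raw: bytes):
    """Return (data, is_stale); callers treat malformed envelopes as a miss."""
    envelope = json.loads(raw)
    return envelope["data"], envelope["expires_at"] <= time.time()
```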
```python
if await self._try_acquire_lock(lock_key):
    return await self._revalidate(data_key, lock_key, fetch_fn, serialize_fn)
# Another pod is populating; wait briefly then retry Redis
await asyncio.sleep(0.1)
```
Return 201 if this happens?
```python
async def shutdown(self) -> None:
    """Close resources owned by this provider."""
    if self._cache_adapter is not None:
        await self._cache_adapter.close()
```
Why is this the first time we need to do a shutdown in the curated-recommendations provider?
References
JIRA: HNT-1890
Description
Adds a shared Redis cache between the per-pod in-memory cache and the Corpus GraphQL API. ~300 pods currently hit the API independently every ~60s. With this, one pod fetches and the rest read from Redis.
Disabled by default. Enable with `corpus_cache.cache = "redis"`.

How it works
Requests are always served from the in-memory cache (L1) — either fresh or stale. Stale data is served whenever possible so that the Redis and API calls never block the request. Revalidation happens in a background task: it checks Redis (L2) first, and only the pod that acquires the distributed lock fetches from the API.
```mermaid
flowchart TB
    req["Firefox NewTab Request"]
    subgraph L1 ["L1 — Per-Pod In-Memory SWR"]
        check_l1{{"Check in-memory cache"}}
    end
    respond_fresh["Respond with fresh data"]
    respond_stale["Respond with stale data"]
    subgraph bg ["Background Revalidation Task"]
        direction TB
        subgraph L2 ["L2 — Shared Redis"]
            check_l2{{"Check Redis cache"}}
            acquire_lock{{"Try distributed lock"}}
        end
        api["Fetch from Corpus GraphQL API"]
        write["Write to Redis + release lock + update L1"]
    end
    req --> check_l1
    check_l1 -- "FRESH HIT" --> respond_fresh
    check_l1 -- "STALE" --> respond_stale
    check_l1 -. "MISS (cold start, blocks)" .-> check_l2
    respond_stale -. "spawns task" .-> check_l2
    check_l2 -- "FRESH HIT" --> done_l2["Update L1 cache"]
    check_l2 -. "STALE" .-> acquire_lock
    check_l2 -. "MISS" .-> acquire_lock
    acquire_lock -- "LOCK ACQUIRED" --> api
    acquire_lock -. "LOCK HELD + stale exists" .-> serve_stale["Return stale data"]
    acquire_lock -. "LOCK HELD + no data" .-> retry["Wait → retry Redis → fetch from API"]
    api --> write --> done_api["Update L1 cache"]
    style req fill:#2c3e50,stroke:#1a252f,color:#ecf0f1,stroke-width:2px
    style check_l1 fill:#2980b9,stroke:#1f6da0,color:#fff,stroke-width:2px
    style check_l2 fill:#d35400,stroke:#a04000,color:#fff,stroke-width:2px
    style acquire_lock fill:#e67e22,stroke:#bf6516,color:#fff,stroke-width:2px
    style api fill:#1e8449,stroke:#145a32,color:#fff,stroke-width:2px
    style write fill:#1e8449,stroke:#145a32,color:#fff,stroke-width:2px
    style respond_fresh fill:#27ae60,stroke:#1e8449,color:#fff,stroke-width:2px
    style respond_stale fill:#27ae60,stroke:#1e8449,color:#fff,stroke-width:2px
    style serve_stale fill:#f4d03f,stroke:#d4ac0f,color:#333
    style retry fill:#f4d03f,stroke:#d4ac0f,color:#333
    style done_l2 fill:#27ae60,stroke:#1e8449,color:#fff
    style done_api fill:#27ae60,stroke:#1e8449,color:#fff
    style L1 fill:#eaf2f8,stroke:#2980b9,stroke-width:2px,color:#2c3e50
    style L2 fill:#fef5e7,stroke:#d35400,stroke-width:2px,color:#2c3e50
    style bg fill:#f4f6f7,stroke:#95a5a6,stroke-width:2px,stroke-dasharray: 8 4,color:#2c3e50
```

Design decisions

`SET NX EX 30`

Follow-up
PR Review Checklist

Put an `x` in the boxes that apply

- [ ] `[DISCO-####]`, and has the same title (if applicable)
- [ ] `[load test: (abort|skip|warn)]` keywords are applied to the last commit message (if applicable)

┆Issue is synchronized with this Jira Task