HNT-1890: shared curated recommendation cache#1306

Draft
mmiermans wants to merge 11 commits into main from hnt-1890-shared-curated-recommendation-cache

Conversation


@mmiermans mmiermans commented Mar 12, 2026

References

JIRA: HNT-1890

Description

Adds a shared Redis cache between the per-pod in-memory cache and the Corpus GraphQL API. ~300 pods currently hit the API independently every ~60s. With this, one pod fetches and the rest read from Redis.

Disabled by default. Enable with corpus_cache.cache = "redis".

How it works

Requests are always served from the in-memory cache (L1) — either fresh or stale. Stale data is served whenever possible so that the Redis and API calls never block the request. Revalidation happens in a background task: it checks Redis (L2) first, and only the pod that acquires the distributed lock fetches from the API.

```mermaid
flowchart TB
    req["Firefox NewTab Request"]

    subgraph L1 ["L1 — Per-Pod In-Memory SWR"]
        check_l1{{"Check in-memory cache"}}
    end

    respond_fresh["Respond with fresh data"]
    respond_stale["Respond with stale data"]

    subgraph bg ["Background Revalidation Task"]
        direction TB

        subgraph L2 ["L2 — Shared Redis"]
            check_l2{{"Check Redis cache"}}
            acquire_lock{{"Try distributed lock"}}
        end

        api["Fetch from Corpus GraphQL API"]
        write["Write to Redis + release lock + update L1"]
    end

    req --> check_l1

    check_l1 -- "FRESH HIT" --> respond_fresh
    check_l1 -- "STALE" --> respond_stale
    check_l1 -. "MISS (cold start, blocks)" .-> check_l2

    respond_stale -. "spawns task" .-> check_l2

    check_l2 -- "FRESH HIT" --> done_l2["Update L1 cache"]
    check_l2 -. "STALE" .-> acquire_lock
    check_l2 -. "MISS" .-> acquire_lock

    acquire_lock -- "LOCK ACQUIRED" --> api
    acquire_lock -. "LOCK HELD + stale exists" .-> serve_stale["Return stale data"]
    acquire_lock -. "LOCK HELD + no data" .-> retry["Wait → retry Redis → fetch from API"]

    api --> write --> done_api["Update L1 cache"]

    style req fill:#2c3e50,stroke:#1a252f,color:#ecf0f1,stroke-width:2px
    style check_l1 fill:#2980b9,stroke:#1f6da0,color:#fff,stroke-width:2px
    style check_l2 fill:#d35400,stroke:#a04000,color:#fff,stroke-width:2px
    style acquire_lock fill:#e67e22,stroke:#bf6516,color:#fff,stroke-width:2px
    style api fill:#1e8449,stroke:#145a32,color:#fff,stroke-width:2px
    style write fill:#1e8449,stroke:#145a32,color:#fff,stroke-width:2px
    style respond_fresh fill:#27ae60,stroke:#1e8449,color:#fff,stroke-width:2px
    style respond_stale fill:#27ae60,stroke:#1e8449,color:#fff,stroke-width:2px
    style serve_stale fill:#f4d03f,stroke:#d4ac0f,color:#333
    style retry fill:#f4d03f,stroke:#d4ac0f,color:#333
    style done_l2 fill:#27ae60,stroke:#1e8449,color:#fff
    style done_api fill:#27ae60,stroke:#1e8449,color:#fff
    style L1 fill:#eaf2f8,stroke:#2980b9,stroke-width:2px,color:#2c3e50
    style L2 fill:#fef5e7,stroke:#d35400,stroke-width:2px,color:#2c3e50
    style bg fill:#f4f6f7,stroke:#95a5a6,stroke-width:2px,stroke-dasharray: 8 4,color:#2c3e50
```

Design decisions

| Decision | Choice | Why |
| --- | --- | --- |
| Cache layer | Redis L2 behind existing in-memory L1 | Keeps per-pod latency low; Redis is only consulted on an L1 miss |
| Write pattern | Distributed stale-while-revalidate | One pod revalidates while the others serve stale data, avoiding a thundering herd |
| Lock mechanism | `SET NX EX 30` | Simple and self-expiring. Unconditional `DELETE` on release; worst case is one extra API call |
| Soft/hard TTL | 2 min soft, 10 min hard | Soft TTL triggers revalidation; hard TTL is a Redis eviction safety net |
| Cache format | Pydantic model dicts via orjson | Processed models are cached, not raw GraphQL responses, saving CPU across pods |
| Redis cluster | Same shared cluster as other providers | ~36 keys, ~4 MB total; not worth a separate instance |
| Failure mode | All Redis errors fall through to the API | Redis is an optimization, never a requirement; no new failure modes |
| Rollout | Global config flag, off by default | Enable in staging first, then production |
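The `SET NX EX` lock pattern above might look roughly like this. This is a sketch, not the PR's adapter code; `redis` here stands for any client exposing redis-py's asyncio `set`/`delete` signatures:

```python
from typing import Any

LOCK_TTL = 30  # seconds; the lock self-expires if its holder crashes

async def try_acquire_lock(redis: Any, lock_key: str) -> bool:
    # SET key value NX EX 30: only the first caller gets a truthy reply,
    # so exactly one pod wins the right to revalidate.
    return bool(await redis.set(lock_key, "1", nx=True, ex=LOCK_TTL))

async def release_lock(redis: Any, lock_key: str) -> None:
    # Unconditional DELETE: if revalidation ever outlives LOCK_TTL, this
    # may delete another pod's lock; worst case is one extra API call.
    await redis.delete(lock_key)
```

The `EX` expiry is what makes the design crash-safe: a pod that dies mid-revalidation never wedges the cache for longer than 30 seconds.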

Follow-up

  • Add StatsD metrics (hit/miss/stale/error) before enabling in production
  • Enable in staging, validate API call reduction
  • Enable in production

PR Review Checklist

Put an x in the boxes that apply

  • This PR conforms to the Contribution Guidelines
  • The PR title starts with the JIRA issue reference, format example [DISCO-####], and has the same title (if applicable)
  • [load test: (abort|skip|warn)] keywords are applied to the last commit message (if applicable)
  • Documentation has been updated (if applicable)
  • Functional and performance test coverage has been expanded and maintained (if applicable)

┆Issue is synchronized with this Jira Task

@mmiermans mmiermans requested a review from jpetto March 12, 2026 01:19
@mmiermans mmiermans changed the title from "Hnt 1890 shared curated recommendation cache" to "HNT-1890: shared curated recommendation cache" Mar 12, 2026
mmiermans and others added 11 commits March 12, 2026 14:07
Introduce a shared Redis cache layer between the in-memory SWR cache (L1)
and the Corpus GraphQL API to reduce API request volume by ~300x across
Merino pods.

Key design:
- Distributed stale-while-revalidate: one pod revalidates while others
  serve stale data, using SET NX EX for distributed locking
- Embedded soft TTL envelope (2min soft / 10min hard) stored as orjson
- All Redis errors gracefully fall through to the wrapped backend
- Gated behind config flag (corpus_cache.backend = "redis" / "none")
- Adds set_nx and delete to CacheAdapter protocol

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Fix lock release on serialize_fn failure: wrap both fetch and serialize
  in the try/except that releases the distributed lock
- Add Redis adapter cleanup on shutdown: store adapter in module-level var,
  close it in new shutdown() function called from main.py lifespan
- Add config validation: CorpusCacheConfig.__post_init__ ensures
  hard_ttl > soft_ttl and hard_ttl > lock_ttl
- Add tests: config validation, serialize error lock release, empty list
  caching behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Make cache write best-effort in _revalidate: serialize/write errors no
  longer propagate when items were successfully fetched from the backend
- Catch deserialize_fn errors (e.g. Pydantic ValidationError) in
  get_or_fetch and treat as cache miss instead of crashing the request
- Validate that all TTL values are positive in CorpusCacheConfig
- Add unit tests for RedisAdapter.set_nx and delete methods
- Add tests for deserialize_fn errors and malformed envelope (KeyError)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Document unconditional lock DELETE tradeoff: in the rare case that
  revalidation exceeds lock_ttl_sec, at most one extra API call occurs
- Add test for stale + lock loser + deserialization failure path

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Rename config key from 'backend' to 'cache' for consistency with
  existing cache = "redis"/"none" convention used by other providers
- Simplify get_or_fetch API: accept (backend_type, surface_id, *extra)
  and derive data/lock keys internally, eliminating duplicated key
  building in callers
- Remove redundant _config field from wrapper classes
- Fix mypy type annotation in test helpers (**overrides: Any)
- Strengthen test assertions: verify lock release on write error,
  verify lock loser doesn't write/delete, verify set_nx call args

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
The _corpus_cache_adapter module-level global was the only resource in
the curated_recommendations subsystem needing explicit async cleanup,
managed via a separate shutdown() path. This broke the subsystem's
self-contained pattern where backends manage their own lifecycle.

Now CuratedRecommendationsProvider owns the cache adapter via an
optional cache_adapter parameter and a shutdown() method. The module's
shutdown() delegates to the provider. This matches how suggest providers
handle resource cleanup (provider.shutdown -> backend.shutdown ->
cache.close).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Fix TypeError crash on corrupted envelope with non-numeric expires_at:
  now caught and treated as cache miss instead of 500ing the request
- Fix lock leak on asyncio.CancelledError: use try/finally in _revalidate
  so lock is released even for BaseException subclasses
- Broaden _redis_set exception catch to all Exception (not just
  CacheAdapterError) so serialization errors are also suppressed
- Add graceful fallback in _init_corpus_cache: Redis init failure no
  longer aborts provider initialization
- Guard shutdown() against uninitialized provider (NameError on startup
  failure)
- Add defensive get_provider()/get_legacy_provider() with RuntimeError
  instead of silent NameError
- Add pragma: no cover to new protocol methods for consistency
- Add tests for non-numeric expires_at and CancelledError lock release

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Replace _make_corpus_item helper with generate_corpus_item from
test_sections.py. Keep _make_corpus_section as a thin wrapper since
no shared CorpusSection fixture exists yet.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- docs/operations/curated-recommendations/corpus-cache.md: operational
  doc with inline mermaid flow diagram, design decisions table, config
  reference, and rollout steps
- docs/SUMMARY.md: add curated recommendations section to nav
- PR_DESCRIPTION.md: PR description following repo template

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
On cache miss when the lock is held by another pod, instead of calling
the Corpus API directly (which defeats cross-pod coalescing), wait 0.1s
and retry Redis. If the lock winner has written data, return it. If not,
raise BackendError — the L1 SWR layer will handle the error and retry
on the next request.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@mmiermans mmiermans force-pushed the hnt-1890-shared-curated-recommendation-cache branch from 5894b87 to 8e632b5 on March 12, 2026 21:07
@mmiermans (Collaborator, Author) commented:

Returning a 201 code in the scenario where we do not have any (stale) data would improve reliability by avoiding connections piling up.

TODO: Check how Firefox Desktop & Mobile would handle 201 status code? Does it retry?

@mmiermans (Collaborator, Author) commented:

> Redis L2 behind existing in-memory L1

Q: Are we consistent, or do the weather and other providers hit Redis on every request? Are the use cases different?

Q: What's the existing pattern for testing with Redis?

Q: What are the existing Redis patterns?

Key settings:
- `cache` — `"redis"` to enable, `"none"` to disable (default: disabled)
- `soft_ttl_sec` — when a cached entry is considered stale and triggers revalidation
- `hard_ttl_sec` — when Redis evicts the key entirely (safety net)
mmiermans (Collaborator, Author) commented:

We probably want this to be high: more like 1-24 hours? Days? To match our response time.
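For reference, the full set of knobs might look like this in config. This is a hypothetical sketch assembled from the setting names mentioned in this PR; the actual defaults and section path may differ:

```toml
# Hypothetical corpus_cache config section; key names taken from this PR.
[curated_recommendations.corpus_cache]
cache = "redis"      # "none" (default) disables the shared cache
soft_ttl_sec = 120   # entry is stale after this; triggers revalidation
hard_ttl_sec = 600   # Redis evicts the key entirely (safety net)
lock_ttl_sec = 30    # distributed lock auto-expires
```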

```python
    await self.primary.delete(key)
except RedisError as exc:
    raise CacheAdapterError(f"Failed to DELETE `{repr(key)}` with error: `{exc}`") from exc
```

mmiermans (Collaborator, Author) commented:

Again, check if this is a new pattern, and whether it should be one.


```toml
# MERINO__CURATED_RECOMMENDATIONS__CORPUS_CACHE__HARD_TTL_SEC
# Hard TTL in seconds. Redis evicts the key after this. Should be much longer than soft TTL.
hard_ttl_sec = 600
```
mmiermans (Collaborator, Author) commented:

Suggested change:

```diff
-hard_ttl_sec = 600
+hard_ttl_sec = 3600
```

Long hard TTL?


```toml
# MERINO__CURATED_RECOMMENDATIONS__CORPUS_CACHE__LOCK_TTL_SEC
# Distributed lock TTL in seconds. Auto-releases if the lock holder crashes.
lock_ttl_sec = 30
```
mmiermans (Collaborator, Author) commented:

TODO: Check if there's a client network timeout of 30s?

```python
raise ValueError(
    f"hard_ttl_sec ({self.hard_ttl_sec}) must be greater than "
    f"lock_ttl_sec ({self.lock_ttl_sec})"
)
```
mmiermans (Collaborator, Author) commented:

Is there a good reason to do this differently from the usual config validation?



```python
def _build_data_key(
    config: CorpusCacheConfig, backend_type: str, surface_id: str, *extra: str
```
mmiermans (Collaborator, Author) commented:

Should backend_type and surface_id be an enum? I believe we already have an enum for the latter.

```python
    str(days_offset),
    fetch_fn=lambda: self._backend.fetch(surface_id, days_offset),
    serialize_fn=lambda items: [item.model_dump(mode="json") for item in items],
    deserialize_fn=lambda data: [CorpusItem.model_validate(d) for d in data],
```
mmiermans (Collaborator, Author) commented:

Use named arguments.

```python
return await self._redis_cache.get_or_fetch(
    "scheduled",
    surface_id.value,
    str(days_offset),
```
mmiermans (Collaborator, Author) commented:

Let's verify whether we still need this? Do we need to fetch yesterday's data for the schedule?

```python
    config: CorpusCacheConfig, backend_type: str, surface_id: str, *extra: str
) -> str:
    """Build the Redis key for cached corpus data."""
    parts = [config.key_prefix, backend_type, surface_id, *extra]
```
mmiermans (Collaborator, Author) commented:

Ideally extra would be structured data.

```python
    "expires_at": time.time() + soft_ttl_sec,
    "data": data,
}
return orjson.dumps(envelope)
```
mmiermans (Collaborator, Author) commented:

Let's check if there's a pattern to create an envelope or have a separate expiration key?
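A self-contained sketch of the embedded soft-TTL envelope discussed here. The PR uses orjson; stdlib `json` is used below so the sketch stands alone, and `wrap_envelope`/`unwrap_envelope` are illustrative names, not the PR's:

```python
import json
import time

def wrap_envelope(data: list, soft_ttl_sec: float) -> bytes:
    """Embed a soft-expiry timestamp alongside the payload."""
    envelope = {"expires_at": time.time() + soft_ttl_sec, "data": data}
    return json.dumps(envelope).encode()

def unwrap_envelope(raw: bytes) -> tuple[list, bool]:
    """Return (data, is_stale).

    A corrupted envelope (missing key, non-numeric expires_at) raises
    here; callers would treat that as a cache miss.
    """
    envelope = json.loads(raw)
    is_stale = time.time() >= float(envelope["expires_at"])
    return envelope["data"], is_stale
```

Embedding the soft TTL in the value lets the hard TTL be handled separately by Redis key expiry, which is what makes the soft/hard split possible with a single key.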

```python
if await self._try_acquire_lock(lock_key):
    return await self._revalidate(data_key, lock_key, fetch_fn, serialize_fn)
# Another pod is populating; wait briefly then retry Redis
await asyncio.sleep(0.1)
```
mmiermans (Collaborator, Author) commented:

Return 201 if this happens?
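The miss-while-locked path in the snippet above, taken end to end, might look like the following. Helper signatures are hypothetical, and the PR raises `BackendError` where this sketch uses `RuntimeError`:

```python
import asyncio

RETRY_DELAY = 0.1  # seconds to wait for the lock winner to publish

async def fetch_on_miss(redis_get, try_lock, fetch_api, data_key, lock_key):
    """Cache-miss path: either win the lock and fetch, or wait and re-read."""
    if await try_lock(lock_key):
        return await fetch_api()
    # Another pod is revalidating: give it a moment, then retry Redis
    # instead of hitting the API (which would defeat cross-pod coalescing).
    await asyncio.sleep(RETRY_DELAY)
    data = await redis_get(data_key)
    if data is not None:
        return data
    # Still nothing: surface an error; the L1 SWR layer retries on the
    # next request rather than crashing the response path.
    raise RuntimeError("corpus cache miss while lock held by another pod")
```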

```python
async def shutdown(self) -> None:
    """Close resources owned by this provider."""
    if self._cache_adapter is not None:
        await self._cache_adapter.close()
```
mmiermans (Collaborator, Author) commented:

Why is this the first time we need to do a shutdown in the curated-recommendations provider?
