HNT-1890: shared curated recommendation cache#1306

Draft
mmiermans wants to merge 11 commits into main from hnt-1890-shared-curated-recommendation-cache

Conversation


@mmiermans mmiermans commented Mar 12, 2026

References

JIRA: HNT-1890

Description

Adds a shared Redis cache between the per-pod in-memory cache and the Corpus GraphQL API. ~300 pods currently hit the API independently every ~60s. With this, one pod fetches and the rest read from Redis.

Disabled by default. Enable with corpus_cache.cache = "redis".

How it works

Requests are always served from the in-memory cache (L1) — either fresh or stale. Stale data is served whenever possible so that the Redis and API calls never block the request. Revalidation happens in a background task: it checks Redis (L2) first, and only the pod that acquires the distributed lock fetches from the API.

```mermaid
flowchart TB
    req["Firefox NewTab Request"]

    subgraph L1 ["L1 — Per-Pod In-Memory SWR"]
        check_l1{{"Check in-memory cache"}}
    end

    respond_fresh["Respond with fresh data"]
    respond_stale["Respond with stale data"]

    subgraph bg ["Background Revalidation Task"]
        direction TB

        subgraph L2 ["L2 — Shared Redis"]
            check_l2{{"Check Redis cache"}}
            acquire_lock{{"Try distributed lock"}}
        end

        api["Fetch from Corpus GraphQL API"]
        write["Write to Redis + release lock + update L1"]
    end

    req --> check_l1

    check_l1 -- "FRESH HIT" --> respond_fresh
    check_l1 -- "STALE" --> respond_stale
    check_l1 -. "MISS (cold start, blocks)" .-> check_l2

    respond_stale -. "spawns task" .-> check_l2

    check_l2 -- "FRESH HIT" --> done_l2["Update L1 cache"]
    check_l2 -. "STALE" .-> acquire_lock
    check_l2 -. "MISS" .-> acquire_lock

    acquire_lock -- "LOCK ACQUIRED" --> api
    acquire_lock -. "LOCK HELD + stale exists" .-> serve_stale["Return stale data"]
    acquire_lock -. "LOCK HELD + no data" .-> retry["Wait → retry Redis → fetch from API"]

    api --> write --> done_api["Update L1 cache"]

    style req fill:#2c3e50,stroke:#1a252f,color:#ecf0f1,stroke-width:2px
    style check_l1 fill:#2980b9,stroke:#1f6da0,color:#fff,stroke-width:2px
    style check_l2 fill:#d35400,stroke:#a04000,color:#fff,stroke-width:2px
    style acquire_lock fill:#e67e22,stroke:#bf6516,color:#fff,stroke-width:2px
    style api fill:#1e8449,stroke:#145a32,color:#fff,stroke-width:2px
    style write fill:#1e8449,stroke:#145a32,color:#fff,stroke-width:2px
    style respond_fresh fill:#27ae60,stroke:#1e8449,color:#fff,stroke-width:2px
    style respond_stale fill:#27ae60,stroke:#1e8449,color:#fff,stroke-width:2px
    style serve_stale fill:#f4d03f,stroke:#d4ac0f,color:#333
    style retry fill:#f4d03f,stroke:#d4ac0f,color:#333
    style done_l2 fill:#27ae60,stroke:#1e8449,color:#fff
    style done_api fill:#27ae60,stroke:#1e8449,color:#fff
    style L1 fill:#eaf2f8,stroke:#2980b9,stroke-width:2px,color:#2c3e50
    style L2 fill:#fef5e7,stroke:#d35400,stroke-width:2px,color:#2c3e50
    style bg fill:#f4f6f7,stroke:#95a5a6,stroke-width:2px,stroke-dasharray: 8 4,color:#2c3e50
```

Design decisions

| Decision | Choice | Why |
| --- | --- | --- |
| Cache layer | Redis L2 behind existing in-memory L1 | Keeps per-pod latency low; Redis is only consulted on an L1 miss |
| Write pattern | Distributed stale-while-revalidate | One pod revalidates while the others serve stale data, avoiding a thundering herd |
| Lock mechanism | `SET NX EX 30` | Simple and self-expiring. Unconditional `DELETE` on release; worst case is one extra API call |
| Soft/hard TTL | 2 min soft, 10 min hard | Soft TTL triggers revalidation; hard TTL is a Redis eviction safety net |
| Cache format | Pydantic model dicts via orjson | Processed models are cached, not raw GraphQL responses, saving CPU across pods |
| Redis cluster | Same shared cluster as other providers | ~36 keys, ~4 MB total; not worth a separate instance |
| Failure mode | All Redis errors fall through to the API | Redis is an optimization, never a requirement; no new failure modes |
| Rollout | Global config flag, off by default | Enable in staging first, then production |
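The `SET NX EX` lock pattern above might look roughly like this. This is a sketch, not the PR's adapter code; `redis` here stands for any client exposing redis-py's asyncio `set`/`delete` signatures:

```python
from typing import Any

LOCK_TTL = 30  # seconds; the lock self-expires if its holder crashes

async def try_acquire_lock(redis: Any, lock_key: str) -> bool:
    # SET key value NX EX 30: only the first caller gets a truthy reply,
    # so exactly one pod wins the right to revalidate.
    return bool(await redis.set(lock_key, "1", nx=True, ex=LOCK_TTL))

async def release_lock(redis: Any, lock_key: str) -> None:
    # Unconditional DELETE: if revalidation ever outlives LOCK_TTL, this
    # may delete another pod's lock; worst case is one extra API call.
    await redis.delete(lock_key)
```

The `EX` expiry is what makes the design crash-safe: a pod that dies mid-revalidation never wedges the cache for longer than 30 seconds.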

Follow-up

  • Add StatsD metrics (hit/miss/stale/error) before enabling in production
  • Enable in staging, validate API call reduction
  • Enable in production

PR Review Checklist

Put an x in the boxes that apply

  • This PR conforms to the Contribution Guidelines
  • The PR title starts with the JIRA issue reference, format example [DISCO-####], and has the same title (if applicable)
  • [load test: (abort|skip|warn)] keywords are applied to the last commit message (if applicable)
  • Documentation has been updated (if applicable)
  • Functional and performance test coverage has been expanded and maintained (if applicable)

┆Issue is synchronized with this Jira Task

@mmiermans mmiermans requested a review from jpetto March 12, 2026 01:19
@mmiermans mmiermans changed the title from "Hnt 1890 shared curated recommendation cache" to "HNT-1890: shared curated recommendation cache" Mar 12, 2026
mmiermans and others added 11 commits March 12, 2026 14:07
Introduce a shared Redis cache layer between the in-memory SWR cache (L1)
and the Corpus GraphQL API to reduce API request volume by ~300x across
Merino pods.

Key design:
- Distributed stale-while-revalidate: one pod revalidates while others
  serve stale data, using SET NX EX for distributed locking
- Embedded soft TTL envelope (2min soft / 10min hard) stored as orjson
- All Redis errors gracefully fall through to the wrapped backend
- Gated behind config flag (corpus_cache.backend = "redis" / "none")
- Adds set_nx and delete to CacheAdapter protocol

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Fix lock release on serialize_fn failure: wrap both fetch and serialize
  in the try/except that releases the distributed lock
- Add Redis adapter cleanup on shutdown: store adapter in module-level var,
  close it in new shutdown() function called from main.py lifespan
- Add config validation: CorpusCacheConfig.__post_init__ ensures
  hard_ttl > soft_ttl and hard_ttl > lock_ttl
- Add tests: config validation, serialize error lock release, empty list
  caching behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Make cache write best-effort in _revalidate: serialize/write errors no
  longer propagate when items were successfully fetched from the backend
- Catch deserialize_fn errors (e.g. Pydantic ValidationError) in
  get_or_fetch and treat as cache miss instead of crashing the request
- Validate that all TTL values are positive in CorpusCacheConfig
- Add unit tests for RedisAdapter.set_nx and delete methods
- Add tests for deserialize_fn errors and malformed envelope (KeyError)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Document unconditional lock DELETE tradeoff: in the rare case that
  revalidation exceeds lock_ttl_sec, at most one extra API call occurs
- Add test for stale + lock loser + deserialization failure path

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Rename config key from 'backend' to 'cache' for consistency with
  existing cache = "redis"/"none" convention used by other providers
- Simplify get_or_fetch API: accept (backend_type, surface_id, *extra)
  and derive data/lock keys internally, eliminating duplicated key
  building in callers
- Remove redundant _config field from wrapper classes
- Fix mypy type annotation in test helpers (**overrides: Any)
- Strengthen test assertions: verify lock release on write error,
  verify lock loser doesn't write/delete, verify set_nx call args

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
The _corpus_cache_adapter module-level global was the only resource in
the curated_recommendations subsystem needing explicit async cleanup,
managed via a separate shutdown() path. This broke the subsystem's
self-contained pattern where backends manage their own lifecycle.

Now CuratedRecommendationsProvider owns the cache adapter via an
optional cache_adapter parameter and a shutdown() method. The module's
shutdown() delegates to the provider. This matches how suggest providers
handle resource cleanup (provider.shutdown -> backend.shutdown ->
cache.close).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Fix TypeError crash on corrupted envelope with non-numeric expires_at:
  now caught and treated as cache miss instead of 500ing the request
- Fix lock leak on asyncio.CancelledError: use try/finally in _revalidate
  so lock is released even for BaseException subclasses
- Broaden _redis_set exception catch to all Exception (not just
  CacheAdapterError) so serialization errors are also suppressed
- Add graceful fallback in _init_corpus_cache: Redis init failure no
  longer aborts provider initialization
- Guard shutdown() against uninitialized provider (NameError on startup
  failure)
- Add defensive get_provider()/get_legacy_provider() with RuntimeError
  instead of silent NameError
- Add pragma: no cover to new protocol methods for consistency
- Add tests for non-numeric expires_at and CancelledError lock release

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Replace _make_corpus_item helper with generate_corpus_item from
test_sections.py. Keep _make_corpus_section as a thin wrapper since
no shared CorpusSection fixture exists yet.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- docs/operations/curated-recommendations/corpus-cache.md: operational
  doc with inline mermaid flow diagram, design decisions table, config
  reference, and rollout steps
- docs/SUMMARY.md: add curated recommendations section to nav
- PR_DESCRIPTION.md: PR description following repo template

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
On cache miss when the lock is held by another pod, instead of calling
the Corpus API directly (which defeats cross-pod coalescing), wait 0.1s
and retry Redis. If the lock winner has written data, return it. If not,
raise BackendError — the L1 SWR layer will handle the error and retry
on the next request.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@mmiermans mmiermans force-pushed the hnt-1890-shared-curated-recommendation-cache branch from 5894b87 to 8e632b5 on March 12, 2026 21:07
@mmiermans (Collaborator, Author) commented:

Returning a 201 code in the scenario where we do not have any (stale) data would improve reliability by avoiding connections piling up.

TODO: Check how Firefox Desktop & Mobile would handle 201 status code? Does it retry?

@mmiermans (Collaborator, Author) commented:

> Redis L2 behind existing in-memory L1

Q: Are we consistent, or do the weather and other providers hit Redis on every request? Are the use cases different?

Q: What's the existing pattern for testing with Redis?

Q: What are the existing Redis patterns?

Key settings:
- `cache` — `"redis"` to enable, `"none"` to disable (default: disabled)
- `soft_ttl_sec` — when a cached entry is considered stale and triggers revalidation
- `hard_ttl_sec` — when Redis evicts the key entirely (safety net)
mmiermans (Collaborator, Author) commented:

We probably want this to be high: more like 1-24 hours? Days? To match our response time.
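For reference, the full set of knobs might look like this in config. This is a hypothetical sketch assembled from the setting names mentioned in this PR; the actual defaults and section path may differ:

```toml
# Hypothetical corpus_cache config section; key names taken from this PR.
[curated_recommendations.corpus_cache]
cache = "redis"      # "none" (default) disables the shared cache
soft_ttl_sec = 120   # entry is stale after this; triggers revalidation
hard_ttl_sec = 600   # Redis evicts the key entirely (safety net)
lock_ttl_sec = 30    # distributed lock auto-expires
```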

```python
    await self.primary.delete(key)
except RedisError as exc:
    raise CacheAdapterError(f"Failed to DELETE `{repr(key)}` with error: `{exc}`") from exc
```

mmiermans (Collaborator, Author) commented:

Again, check if this is a new pattern, and whether it should be one.


```toml
# MERINO__CURATED_RECOMMENDATIONS__CORPUS_CACHE__HARD_TTL_SEC
# Hard TTL in seconds. Redis evicts the key after this. Should be much longer than soft TTL.
hard_ttl_sec = 600
```
mmiermans (Collaborator, Author) commented:

Suggested change:

```diff
-hard_ttl_sec = 600
+hard_ttl_sec = 3600
```

Long hard TTL?


```toml
# MERINO__CURATED_RECOMMENDATIONS__CORPUS_CACHE__LOCK_TTL_SEC
# Distributed lock TTL in seconds. Auto-releases if the lock holder crashes.
lock_ttl_sec = 30
```
mmiermans (Collaborator, Author) commented:

TODO: Check if there's a client network timeout of 30s?

```python
raise ValueError(
    f"hard_ttl_sec ({self.hard_ttl_sec}) must be greater than "
    f"lock_ttl_sec ({self.lock_ttl_sec})"
)
```
mmiermans (Collaborator, Author) commented:

Is there a good reason to do this differently from the usual config validation?



```python
def _build_data_key(
    config: CorpusCacheConfig, backend_type: str, surface_id: str, *extra: str
```
mmiermans (Collaborator, Author) commented:

Should backend_type and surface_id be an enum? I believe we already have an enum for the latter.

```python
    str(days_offset),
    fetch_fn=lambda: self._backend.fetch(surface_id, days_offset),
    serialize_fn=lambda items: [item.model_dump(mode="json") for item in items],
    deserialize_fn=lambda data: [CorpusItem.model_validate(d) for d in data],
```
mmiermans (Collaborator, Author) commented:

Use named arguments.

```python
return await self._redis_cache.get_or_fetch(
    "scheduled",
    surface_id.value,
    str(days_offset),
```
mmiermans (Collaborator, Author) commented:

Let's verify whether we still need this? Do we need to fetch yesterday's data for the schedule?

```python
    config: CorpusCacheConfig, backend_type: str, surface_id: str, *extra: str
) -> str:
    """Build the Redis key for cached corpus data."""
    parts = [config.key_prefix, backend_type, surface_id, *extra]
```
mmiermans (Collaborator, Author) commented:

Ideally extra would be structured data.

```python
    "expires_at": time.time() + soft_ttl_sec,
    "data": data,
}
return orjson.dumps(envelope)
```
mmiermans (Collaborator, Author) commented:

Let's check if there's a pattern to create an envelope or have a separate expiration key?
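A self-contained sketch of the embedded soft-TTL envelope discussed here. The PR uses orjson; stdlib `json` is used below so the sketch stands alone, and `wrap_envelope`/`unwrap_envelope` are illustrative names, not the PR's:

```python
import json
import time

def wrap_envelope(data: list, soft_ttl_sec: float) -> bytes:
    """Embed a soft-expiry timestamp alongside the payload."""
    envelope = {"expires_at": time.time() + soft_ttl_sec, "data": data}
    return json.dumps(envelope).encode()

def unwrap_envelope(raw: bytes) -> tuple[list, bool]:
    """Return (data, is_stale).

    A corrupted envelope (missing key, non-numeric expires_at) raises
    here; callers would treat that as a cache miss.
    """
    envelope = json.loads(raw)
    is_stale = time.time() >= float(envelope["expires_at"])
    return envelope["data"], is_stale
```

Embedding the soft TTL in the value lets the hard TTL be handled separately by Redis key expiry, which is what makes the soft/hard split possible with a single key.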

```python
if await self._try_acquire_lock(lock_key):
    return await self._revalidate(data_key, lock_key, fetch_fn, serialize_fn)
# Another pod is populating; wait briefly then retry Redis
await asyncio.sleep(0.1)
```
mmiermans (Collaborator, Author) commented:

Return 201 if this happens?
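The miss-while-locked path in the snippet above, taken end to end, might look like the following. Helper signatures are hypothetical, and the PR raises `BackendError` where this sketch uses `RuntimeError`:

```python
import asyncio

RETRY_DELAY = 0.1  # seconds to wait for the lock winner to publish

async def fetch_on_miss(redis_get, try_lock, fetch_api, data_key, lock_key):
    """Cache-miss path: either win the lock and fetch, or wait and re-read."""
    if await try_lock(lock_key):
        return await fetch_api()
    # Another pod is revalidating: give it a moment, then retry Redis
    # instead of hitting the API (which would defeat cross-pod coalescing).
    await asyncio.sleep(RETRY_DELAY)
    data = await redis_get(data_key)
    if data is not None:
        return data
    # Still nothing: surface an error; the L1 SWR layer retries on the
    # next request rather than crashing the response path.
    raise RuntimeError("corpus cache miss while lock held by another pod")
```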

```python
async def shutdown(self) -> None:
    """Close resources owned by this provider."""
    if self._cache_adapter is not None:
        await self._cache_adapter.close()
```
mmiermans (Collaborator, Author) commented:

Why is this the first time we need to do a shutdown in the curated-recommendations provider?
