UCT/CUDA_IPC: Keep lru invariant - region in cache <-> region in lru by nbellalou · Pull Request #11363 · openucx/ucx

nbellalou · 2026-04-20T06:33:54Z

Keep lru invariant - region in cache if and only if region in lru

What?

Change the initial design, which kept only eviction candidates in the lru and therefore removed regions with refcount > 0 from it. That design caused a failure when regions with refcount > 0 were removed from the cache on a cache miss and were assumed to also be in the lru.
The initial design was based on the assumption that regions with refcount > 0 cannot be removed from the cache because they are in use, and can therefore safely be removed from the lru and pushed back later.
This assumption appears to be false: on a cache miss, regions with refcount > 0 can be removed.
It is unclear for now whether this is a bug or by design.
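
The invariant described above can be sketched with a small stand-alone model. This is only an illustration under stated assumptions: `region_t`, `cache_t`, `cache_insert`, `cache_remove` and `cache_evict_lru` are hypothetical names, the `in_cache` flag stands in for page-table membership, and none of this is the actual UCX code. It shows that when every cached region stays on the lru, removing an in-use region (as the cache-miss path may do) remains safe, while eviction simply skips regions with refcount > 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model; not the UCX data structures. */
typedef struct region {
    struct region *prev, *next; /* lru links */
    int            refcount;    /* > 0 while the region is in use */
    int            in_cache;    /* stands in for page-table membership */
    int            in_lru;      /* debug flag, must always mirror in_cache */
} region_t;

typedef struct {
    region_t head;  /* sentinel: head.next is the least recently used */
    size_t   count; /* number of cached regions */
} cache_t;

static void cache_init(cache_t *c)
{
    c->head.prev = c->head.next = &c->head;
    c->count     = 0;
}

/* Insert into the cache and the lru together, at the most-recent end. */
static void cache_insert(cache_t *c, region_t *r)
{
    r->in_cache        = 1;
    r->in_lru          = 1;
    r->prev            = c->head.prev;
    r->next            = &c->head;
    c->head.prev->next = r;
    c->head.prev       = r;
    c->count++;
}

/* Remove from the cache and the lru together, whatever the refcount is. */
static void cache_remove(cache_t *c, region_t *r)
{
    assert(r->in_cache && r->in_lru); /* invariant: in cache <-> in lru */
    r->prev->next = r->next;
    r->next->prev = r->prev;
    r->in_cache   = 0;
    r->in_lru     = 0;
    c->count--;
}

/* Evict up to `want` idle regions, oldest first. In-use regions are
 * skipped but stay on the lru. Returns the number of regions evicted. */
static size_t cache_evict_lru(cache_t *c, size_t want)
{
    size_t    evicted = 0;
    region_t *r       = c->head.next;

    while ((r != &c->head) && (evicted < want)) {
        region_t *next = r->next; /* saved because r may be unlinked */
        if (r->refcount == 0) {
            cache_remove(c, r);
            evicted++;
        }
        r = next;
    }
    return evicted;
}
```

In this model a region with refcount > 0 survives an eviction pass but can still be passed to `cache_remove` without tripping the assert, which is the situation the original design did not handle.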

Why?

https://redmine.mellanox.com/issues/4983516

How?


Change the initial design of keeping only eviction candidates in the lru, which removed regions with refcount > 0 from it. That design caused a failure when regions with refcount > 0 were removed from the cache on a cache miss and were assumed to also be in the lru.

gleon99 · 2026-04-20T07:03:46Z

@nbellalou
If cache is at its limit and every region has refcount > 0, uct_cuda_ipc_cache_evict_lru will skip all -> return w/o freeing anything. Then caller inserts a new region, exceeding the configured limits.

gleon99 · 2026-04-20T07:12:52Z

@nbellalou If cache is at its limit and every region has refcount > 0, uct_cuda_ipc_cache_evict_lru will skip all -> return w/o freeing anything. Then caller inserts a new region, exceeding the configured limits.

That was also the case before the PR, but we give the caller no signal of it. Maybe at least add some logging.

gleon99 · 2026-04-20T07:04:07Z

-            /* In-use region -- pull off LRU, it will be re-added on release */
-            ucs_list_del(&region->lru_list);
-            region->in_lru = 0;
+            /* In-use region -- keep on LRU, revisit on next eviction pass.*/


gleon99 · 2026-04-20T07:04:53Z

        if (region->refcount > 0) {
-            /* In-use region -- pull off LRU, it will be re-added on release */
-            ucs_list_del(&region->lru_list);
-            region->in_lru = 0;


Isn't it always 1 now?

Yes, in_lru should always be 1 now because every region in the cache should be in the lru. However, I think it is worth keeping the field and the assert ucs_assertv(region->in_lru) for debuggability and to detect potential future issues in CI, given the small memory cost of 1 byte (uint8_t).

gleon99 · 2026-04-20T07:14:02Z

 }
+
+UCS_TEST_F(test_cuda_ipc_cache_lru, stale_destroy_while_in_use) {
+    /* Regression test for Bug A: evict_lru must not pull in-use regions off


Looks like too many comments. AI? :)

Completely AI. I'll reduce them.

gleon99 · 2026-04-20T07:18:27Z

    }
 }
+
+UCS_TEST_F(test_cuda_ipc_cache_lru, stale_destroy_while_in_use) {


Did you verify the test fails before the PR?

gleon99 · 2026-04-20T07:20:15Z


+    void destroy_region(uct_cuda_ipc_cache_region_t *region) {
+        ucs_status_t status = ucs_pgtable_remove(&m_cache->pgtable,
+                                                  &region->super);


minor: alignment

nbellalou · 2026-04-20T08:33:41Z

@nbellalou If cache is at its limit and every region has refcount > 0, uct_cuda_ipc_cache_evict_lru will skip all -> return w/o freeing anything. Then caller inserts a new region, exceeding the configured limits.

This is by design and similar to ucs rcache. The parameters are soft limits, not hard ones. The goal of the lru is to evict unused regions, but the cache size can grow past the configured parameters if all regions are in use, so as not to hurt performance.
Regarding logging, there are already prints indicating the actual cache size and the configured parameters.
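
The soft-limit behavior discussed in this exchange can be illustrated with a toy model. It is a sketch under assumptions, not the UCX implementation: `toy_cache_t`, `toy_init`, `toy_evict`, `toy_add` and the log message are hypothetical, and the fixed `TOY_CAPACITY` array only bounds the toy's storage. When the limit is reached the cache first tries to evict idle regions; if every region is in use it reports that and inserts anyway, so the count exceeds the configured limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define TOY_CAPACITY 8 /* storage bound of this toy only, not the soft limit */

/* Hypothetical toy cache: one refcount per slot, -1 marks an empty slot. */
typedef struct {
    int    refcount[TOY_CAPACITY];
    size_t count;      /* number of cached regions */
    size_t soft_limit; /* configured limit, may be exceeded */
} toy_cache_t;

static void toy_init(toy_cache_t *c, size_t soft_limit)
{
    for (size_t i = 0; i < TOY_CAPACITY; i++) {
        c->refcount[i] = -1;
    }
    c->count      = 0;
    c->soft_limit = soft_limit;
}

/* Drop idle regions until the count is below the soft limit.
 * Returns the number of regions dropped (0 if all are in use). */
static size_t toy_evict(toy_cache_t *c)
{
    size_t freed = 0;

    for (size_t i = 0; (i < TOY_CAPACITY) && (c->count >= c->soft_limit);
         i++) {
        if (c->refcount[i] == 0) {
            c->refcount[i] = -1;
            c->count--;
            freed++;
        }
    }
    return freed;
}

/* Add a region. Returns 0 on success and -1 only when the toy storage
 * itself is full; reaching the soft limit never makes the add fail. */
static int toy_add(toy_cache_t *c, int refcount)
{
    assert(refcount >= 0);

    if ((c->count >= c->soft_limit) && (toy_evict(c) == 0)) {
        /* Nothing could be evicted: report it and grow past the limit */
        fprintf(stderr, "cache holds %zu regions (soft limit %zu), "
                "all in use\n", c->count, c->soft_limit);
    }

    for (size_t i = 0; i < TOY_CAPACITY; i++) {
        if (c->refcount[i] < 0) {
            c->refcount[i] = refcount;
            c->count++;
            return 0;
        }
    }
    return -1;
}
```

With a soft limit of 2, adding three regions that all have refcount 1 leaves three regions cached, and the diagnostic line is the only signal that the limit was exceeded.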

nbellalou requested a review from gleon99 April 20, 2026 06:34

nbellalou mentioned this pull request Apr 20, 2026

UCT/CUDA_IPC: Keep lru invariant - region in cache <-> region in lru #11364

Open

gleon99 reviewed Apr 20, 2026


UCT/CUDA_IPC: PR fixes

5dea954

nbellalou wants to merge 2 commits into openucx:v1.21.x from nbellalou:cudaIpcLru_bugFIx
