Skip to content

UCT/CUDA_IPC: Keep lru invariant - region in cache <-> region in lru#11363

Open
nbellalou wants to merge 2 commits intoopenucx:v1.21.xfrom
nbellalou:cudaIpcLru_bugFIx
Open

UCT/CUDA_IPC: Keep lru invariant - region in cache <-> region in lru#11363
nbellalou wants to merge 2 commits intoopenucx:v1.21.xfrom
nbellalou:cudaIpcLru_bugFIx

Conversation

@nbellalou
Copy link
Copy Markdown
Collaborator

Keep lru invariant - region in cache if and only if region in lru

What?

Modify the initial design to keep only regions candidates for eviction in lru, and thus remove from lru regions with refcount > 0. This caused a failure when regions with refcount > 0 were removed from cache on cache miss and were assumed to also be in lru.
Initial design was based on the assumption that regions with refcount > 0 cannot be removed from cache because are in use - and therefore can safely be removed from lru and pushed back later.
This assumption appears to be false - on cache miss regions with refcount > 0 can be removed.
Unclear for now if this is a bug or by design

Why?

https://redmine.mellanox.com/issues/4983516

How?

It is optional, but for complex PRs, please provide information about the design,
architecture, approach, etc.

Modify initial design of keeping in lru only regions candidates to
eviction - and thus removing from lru regions with refcount > 0.
This caused a failure when regions with refcount > 0 were removed from
cache on cache miss and were assumed to also be in lru
@gleon99
Copy link
Copy Markdown
Contributor

gleon99 commented Apr 20, 2026

@nbellalou
If cache is at its limit and every region has refcount > 0, uct_cuda_ipc_cache_evict_lru will skip all -> return w/o freeing anything. Then caller inserts a new region, exceeding the configured limits.

@gleon99
Copy link
Copy Markdown
Contributor

gleon99 commented Apr 20, 2026

@nbellalou If cache is at its limit and every region has refcount > 0, uct_cuda_ipc_cache_evict_lru will skip all -> return w/o freeing anything. Then caller inserts a new region, exceeding the configured limits.

That was also the case before the PR, but we have no signal of that to the caller. maybe at least add some logging

Comment thread src/uct/cuda/cuda_ipc/cuda_ipc_cache.c Outdated
/* In-use region -- pull off LRU, it will be re-added on release */
ucs_list_del(&region->lru_list);
region->in_lru = 0;
/* In-use region -- keep on LRU, revisit on next eviction pass.*/
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor */

if (region->refcount > 0) {
/* In-use region -- pull off LRU, it will be re-added on release */
ucs_list_del(&region->lru_list);
region->in_lru = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it always 1 now?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in_lru should always be 1 now because every region in cache should be in lru. However, I think it worth keeping the field and the assert ucs_assertv(region->in_lru) for debuggability and to detect potential future issues in CI, given the small memory cost of 1 byte (uint8_t)

Comment thread test/gtest/uct/cuda/test_cuda_ipc_md.cc Outdated
}

UCS_TEST_F(test_cuda_ipc_cache_lru, stale_destroy_while_in_use) {
/* Regression test for Bug A: evict_lru must not pull in-use regions off
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like too much comments. AI? :)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely AI. I'll reduce

}
}

UCS_TEST_F(test_cuda_ipc_cache_lru, stale_destroy_while_in_use) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you verify the test fails before the PR?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did now

Comment thread test/gtest/uct/cuda/test_cuda_ipc_md.cc Outdated

void destroy_region(uct_cuda_ipc_cache_region_t *region) {
ucs_status_t status = ucs_pgtable_remove(&m_cache->pgtable,
&region->super);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: alignment

@nbellalou
Copy link
Copy Markdown
Collaborator Author

nbellalou commented Apr 20, 2026

@nbellalou If cache is at its limit and every region has refcount > 0, uct_cuda_ipc_cache_evict_lru will skip all -> return w/o freeing anything. Then caller inserts a new region, exceeding the configured limits.

This is by design and similar to ucs rcache. Parameters are soft limits, not hard ones. The goal of the lru is to evict unused regions, but cache size can grow past the configured params if all regions are in use, to not hurt performances
Regarding logging, there is already prints indicating actual cache size and configured params

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants