UCT/CUDA_IPC: Keep lru invariant - region in cache <-> region in lru#11363
UCT/CUDA_IPC: Keep lru invariant - region in cache <-> region in lru#11363nbellalou wants to merge 2 commits intoopenucx:v1.21.xfrom
Conversation
Modify initial design of keeping in lru only regions candidates to eviction - and thus removing from lru regions with refcount > 0. This caused a failure when regions with refcount > 0 were removed from cache on cache miss and were assumed to also be in lru
|
@nbellalou |
That was also the case before the PR, but we have no signal of that to the caller. maybe at least add some logging |
| /* In-use region -- pull off LRU, it will be re-added on release */ | ||
| ucs_list_del(®ion->lru_list); | ||
| region->in_lru = 0; | ||
| /* In-use region -- keep on LRU, revisit on next eviction pass.*/ |
| if (region->refcount > 0) { | ||
| /* In-use region -- pull off LRU, it will be re-added on release */ | ||
| ucs_list_del(®ion->lru_list); | ||
| region->in_lru = 0; |
There was a problem hiding this comment.
Yes, in_lru should always be 1 now because every region in cache should be in lru. However, I think it worth keeping the field and the assert ucs_assertv(region->in_lru) for debuggability and to detect potential future issues in CI, given the small memory cost of 1 byte (uint8_t)
| } | ||
|
|
||
| UCS_TEST_F(test_cuda_ipc_cache_lru, stale_destroy_while_in_use) { | ||
| /* Regression test for Bug A: evict_lru must not pull in-use regions off |
There was a problem hiding this comment.
Looks like too much comments. AI? :)
There was a problem hiding this comment.
Completely AI. I'll reduce
| } | ||
| } | ||
|
|
||
| UCS_TEST_F(test_cuda_ipc_cache_lru, stale_destroy_while_in_use) { |
There was a problem hiding this comment.
Did you verify the test fails before the PR?
|
|
||
| void destroy_region(uct_cuda_ipc_cache_region_t *region) { | ||
| ucs_status_t status = ucs_pgtable_remove(&m_cache->pgtable, | ||
| ®ion->super); |
This is by design and similar to ucs rcache. Parameters are soft limits, not hard ones. The goal of the lru is to evict unused regions, but cache size can grow past the configured params if all regions are in use, to not hurt performances |
Keep lru invariant - region in cache if and only if region in lru
What?
Modify the initial design to keep only regions candidates for eviction in lru, and thus remove from lru regions with refcount > 0. This caused a failure when regions with refcount > 0 were removed from cache on cache miss and were assumed to also be in lru.
Initial design was based on the assumption that regions with refcount > 0 cannot be removed from cache because are in use - and therefore can safely be removed from lru and pushed back later.
This assumption appears to be false - on cache miss regions with refcount > 0 can be removed.
Unclear for now if this is a bug or by design
Why?
https://redmine.mellanox.com/issues/4983516
How?
It is optional, but for complex PRs, please provide information about the design,
architecture, approach, etc.