SchedulerCache uses global mutex across cache.go and event_handlers.go causing contention/race-risk in read-heavy paths #5036

@hzxuzhonghu

Problem

SchedulerCache embeds a single global sync.Mutex that both pkg/scheduler/cache/cache.go and pkg/scheduler/cache/event_handlers.go lock for every operation, including read-heavy code paths (e.g., Snapshot(), String(), IsJobTerminated(), parseErrTaskKey()). Informer callbacks in the event handlers take the same lock. This coarse locking increases contention and, as the cache grows, makes it easy to introduce unsafe access by accident.

Why it matters

  • Read-only operations block each other and block writes, increasing latency.
  • Snapshot() and similar methods hold the lock while doing heavy work (clone/iterate), which amplifies contention.
  • A single global lock shared by the handlers and the cache logic makes correctness harder to reason about and increases the risk of subtle race bugs.

Suggested improvements (incremental)

  1. Switch to sync.RWMutex and use RLock for clearly read-only paths (String(), IsJobTerminated(), parseErrTaskKey(), etc.).
  2. Audit Snapshot() to confirm it does not mutate cache data while holding only the shared (read) lock; if it does, keep the write lock there or refactor to remove the mutation.
  3. (Optional) Consider per-domain locks for Jobs, Nodes, Queues, CSINodesStatus, etc., or serialize mutations via a worker queue.
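
A minimal sketch of suggestions 1 and 2, to make the read/write split concrete. It is not the real SchedulerCache: `Cache`, `JobInfo`, `NewCache`, and `AddOrUpdateJob` are hypothetical stand-ins, and the bodies of `IsJobTerminated()`, `String()`, and `Snapshot()` are simplified assumptions rather than the actual implementations. Only `sync.RWMutex` and its `Lock`/`RLock` methods are taken as given.

```go
package main

import (
	"fmt"
	"sync"
)

// JobInfo is a hypothetical, simplified stand-in for the scheduler's job record.
type JobInfo struct {
	Name       string
	Terminated bool
}

// Cache is an illustrative stand-in for SchedulerCache, not the real type.
type Cache struct {
	mu   sync.RWMutex // replaces the single sync.Mutex
	jobs map[string]*JobInfo
}

// NewCache returns an empty cache (hypothetical constructor).
func NewCache() *Cache {
	return &Cache{jobs: make(map[string]*JobInfo)}
}

// AddOrUpdateJob mutates the cache, so it takes the exclusive write lock.
// Event handlers that modify state would do the same.
func (c *Cache) AddOrUpdateJob(name string, terminated bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.jobs[name] = &JobInfo{Name: name, Terminated: terminated}
}

// IsJobTerminated only reads, so concurrent callers can share the read lock.
// Assumption: an unknown job is reported as not terminated.
func (c *Cache) IsJobTerminated(name string) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	job, ok := c.jobs[name]
	return ok && job.Terminated
}

// String is read-only as well and uses the read lock.
func (c *Cache) String() string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return fmt.Sprintf("Cache{jobs: %d}", len(c.jobs))
}

// Snapshot copies the jobs by value while holding the read lock.
// It must not write to c.jobs or to any *JobInfo; if it needed to,
// it would have to take c.mu.Lock() instead.
func (c *Cache) Snapshot() map[string]JobInfo {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make(map[string]JobInfo, len(c.jobs))
	for name, job := range c.jobs {
		out[name] = *job // value copy; callers cannot alter the cache through it
	}
	return out
}
```

With this shape, any number of `IsJobTerminated()`, `String()`, and `Snapshot()` calls can proceed in parallel, and only writers such as `AddOrUpdateJob` exclude everyone else. Note that `sync.RWMutex` read locks should not be acquired recursively, so a read path that calls another locking method would need an unlocked internal helper.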

References

  • pkg/scheduler/cache/cache.go
  • pkg/scheduler/cache/event_handlers.go

Expected outcome

Reduced contention and clearer separation of read vs write paths, lowering race-condition risk and improving performance under load.