Miscellaneous data race fixes detected by test runs with -race option #460

Draft

booxter wants to merge 4 commits into ovn-kubernetes:main from booxter:data-races-fixes

Conversation


booxter commented Feb 6, 2026

  • client: fix send on closed channel panic during disconnect
  • client: don't close trafficSeen channel on disconnect
  • client: start handleDisconnectNotification after all WaitGroup Add() calls
  • cache: add read locks to Mapper() and DatabaseModel()

When handleDisconnectNotification() closes the trafficSeen channel,
there is a race window in which transact() may attempt to send on the
already-closed channel, causing a panic:

  goroutine 1 (transact):          goroutine 2 (handleDisconnectNotification):
    if trafficSeen != nil {
                                     close(trafficSeen)
      trafficSeen <- struct{}{}    // PANIC: send on closed channel
    }

This was observed in CI as a coredump during e2e tests when the
services controller was syncing a service while the OVSDB connection
was being closed.

Fix this by adding a disconnected flag (protected by a mutex) that is
set before closing the trafficSeen channel. The transact() function
checks this flag before attempting to send on the channel.

This follows the same pattern as the isShutdown() mechanism proposed in:
ovn-kubernetes#457

A test using a TCP proxy to simulate abrupt network disconnections is
included to verify the fix.
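The pattern described above can be sketched as follows. This is a minimal standalone illustration, not the real ovsdbClient: the type, field, and method names (client, markTrafficSeen, disconnect) are placeholders chosen for the example. The key point is that the flag and the send are checked under the same mutex that disconnect takes before teardown.

```go
package main

import (
	"fmt"
	"sync"
)

// client is a minimal sketch (not the real ovsdbClient) of the
// disconnected-flag pattern: the flag is set under a mutex during
// disconnect, and any would-be sender checks it first.
type client struct {
	mu           sync.Mutex
	disconnected bool
	trafficSeen  chan struct{}
}

// markTrafficSeen stands in for the send transact() performs after a
// successful RPC. It reports whether the notification was attempted.
func (c *client) markTrafficSeen() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.disconnected || c.trafficSeen == nil {
		return false // connection is gone: do not touch the channel
	}
	select {
	case c.trafficSeen <- struct{}{}:
	default: // monitor goroutine is busy; dropping the signal is fine
	}
	return true
}

// disconnect sets the flag under the same mutex, so no send can race
// with the connection teardown.
func (c *client) disconnect() {
	c.mu.Lock()
	c.disconnected = true
	c.mu.Unlock()
}

func main() {
	c := &client{trafficSeen: make(chan struct{}, 1)}
	fmt.Println(c.markTrafficSeen()) // true: still connected
	c.disconnect()
	fmt.Println(c.markTrafficSeen()) // false: send suppressed
}
```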

Signed-off-by: Ihar Hrachyshka <ihrachyshka@nvidia.com>
Assisted-by: opus (claude-opus-4-5-20251101)

Instead of closing the trafficSeen channel during disconnect, rely on
the disconnected flag to stop sends and let handleInactivityProbes()
exit via stopCh (which is closed first).

Closing the channel races with concurrent sends in transact(), even
with the disconnected flag check, because the check and send are not
atomic:

  WARNING: DATA RACE
  Write at 0x00c0000a6160 by goroutine 7312:
    runtime.recvDirect()
    client.(*ovsdbClient).handleDisconnectNotification()
        client/client.go:1255   <-- close(o.trafficSeen)

  Previous read at 0x00c0000a6160 by goroutine 7315:
    runtime.chansend1()
    client.(*ovsdbClient).transact()
        client/client.go:861    <-- o.trafficSeen <- struct{}{}

The channel doesn't need to be explicitly closed:
- handleInactivityProbes() watches stopCh which is closed first
- The channel will be garbage collected when no longer referenced
- A new channel is created on reconnection anyway
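The shutdown ordering in the bullets above can be sketched like this. The watcher goroutine stands in for handleInactivityProbes(); the real client's loop body is elided. The point is that closing stopCh is enough to stop the consumer, and trafficSeen is never closed, so a concurrent send cannot panic.

```go
package main

import (
	"fmt"
	"sync"
)

// Sketch of the shutdown ordering: the watcher goroutine (standing in
// for handleInactivityProbes) exits when stopCh is closed. trafficSeen
// itself is never closed, so concurrent senders cannot panic, and the
// channel is simply garbage collected once unreferenced.
func main() {
	stopCh := make(chan struct{})
	trafficSeen := make(chan struct{}, 1)

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for {
			select {
			case <-stopCh: // closed first during disconnect
				return
			case <-trafficSeen:
				// the real client resets its inactivity timer here
			}
		}
	}()

	trafficSeen <- struct{}{} // a normal traffic notification
	close(stopCh)             // disconnect: stop the watcher...
	wg.Wait()

	// ...but leave trafficSeen open; a send can still be attempted
	// safely (non-blocking here, since nobody is draining anymore).
	select {
	case trafficSeen <- struct{}{}:
	default:
	}
	fmt.Println("clean shutdown, channel left open")
}
```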

Signed-off-by: Ihar Hrachyshka <ihrachyshka@nvidia.com>
Assisted-by: opus (claude-opus-4-5-20251101)

client: start handleDisconnectNotification after all WaitGroup Add() calls

Move the goroutine spawn for handleDisconnectNotification() to after
all handlerShutdown.Add() calls complete. This fixes a race between
Add() and Wait() on the same WaitGroup during reconnection.

The race occurs because handleDisconnectNotification() calls Wait()
on handlerShutdown, but it was spawned before all Add() calls:

  WARNING: DATA RACE
  Write at 0x00c00058e158 by goroutine 14049:
    client.(*ovsdbClient).handleDisconnectNotification()
        client/client.go:1256   <-- o.handlerShutdown.Wait()

  Previous read at 0x00c00058e158 by goroutine 13633:
    client.(*ovsdbClient).connect()
        client/client.go:317    <-- o.handlerShutdown.Add(1)

Per sync.WaitGroup documentation: "calls with a positive delta that
occur when the counter is zero must happen before a Wait."

Signed-off-by: Ihar Hrachyshka <ihrachyshka@nvidia.com>
Assisted-by: opus (claude-opus-4-5-20251101)

Add RLock protection to Mapper() and DatabaseModel() methods to prevent
races with Purge(), which writes to t.dbModel under the write lock.

Without the lock, concurrent access during reconnection causes a race:

  WARNING: DATA RACE
  Read at 0x00c0006c95d0 by goroutine 1279:
    cache.(*TableCache).DatabaseModel()
        cache/cache.go:886      <-- return t.dbModel (no lock)

  Previous write at 0x00c0006c95d0 by goroutine 1277:
    cache.(*TableCache).Purge()
        cache/cache.go:1017     <-- t.dbModel = dbModel (with lock)

Signed-off-by: Ihar Hrachyshka <ihrachyshka@nvidia.com>
Assisted-by: opus (claude-opus-4-5-20251101)