Skip to content

Fix race conditions in gateway payload sending causing NotAuthenticated errors#139

Draft
Claude wants to merge 4 commits intomasterfrom
claude/fix-race-condition-in-sending
Draft

Fix race conditions in gateway payload sending causing NotAuthenticated errors#139
Claude wants to merge 4 commits intomasterfrom
claude/fix-race-condition-in-sending

Conversation

@Claude
Copy link
Copy Markdown
Contributor

@Claude Claude AI commented Feb 16, 2026

Discord gateway closes connections with NotAuthenticated when authenticated payloads (UpdatePresence, UpdateVoiceState, RequestMembers) are sent during Identify/Resume handshake states, causing loss of RESUMED event data.

Changes

Added state-aware payload validation

  • CanSendPayload() method validates payload type against current shard state
  • Handshake payloads (Identify, Resume) restricted to their respective states
  • Authenticated payloads require Ready state
  • Heartbeat allowed during any connected state

Implemented double-check pattern for authenticated payloads

// First check - general state validation
if (!CanSendPayload(payload.Op)) { ... }

// Second check - prevent handshake race
if (RequiresAuthentication(payload.Op))
{
    if (State is ShardState.Identifying or ShardState.Resuming)
    {
        // Wait for Ready and retry instead of sending
        await WaitForReadyAsync(cancellationToken);
        continue;
    }
}

Fixed deadlock potential

  • Handshake payloads no longer retry on WebSocketClosedException (would deadlock waiting for Ready state they're responsible for creating)
  • Non-handshake payloads retry after waiting for reconnection

Race Conditions Addressed

  1. TOCTOU between initial Ready check and send: Authenticated payload waits for Ready, but state changes to Identifying/Resuming before send completes
  2. TOCTOU between CanSendPayload and Gateway.SendAsync: State validated, but changes to handshake state before actual transmission
  3. Concurrent SendAsync calls: State could change mid-flight from multiple threads

The double-check reduces the race window to nanoseconds (between final state check and network I/O), making it virtually impossible to send authenticated payloads during handshake states.

Co-authored-by: Quahu <24848652+Quahu@users.noreply.github.com>
@Claude Claude AI changed the title [WIP] Fix race condition in SendAsync method for sharding Fix race condition and deadlock in DefaultShard.SendAsync Feb 16, 2026
@Claude Claude AI requested a review from Quahu February 16, 2026 19:27
…nditions

Co-authored-by: Quahu <24848652+Quahu@users.noreply.github.com>
@Claude Claude AI changed the title Fix race condition and deadlock in DefaultShard.SendAsync Fix race condition in SendAsync by adding state validation before sending Feb 16, 2026
… handshake

Co-authored-by: Quahu <24848652+Quahu@users.noreply.github.com>
@Claude Claude AI changed the title Fix race condition in SendAsync by adding state validation before sending Fix race conditions in gateway payload sending causing NotAuthenticated errors Feb 16, 2026
@Quahu Quahu requested a review from Copilot February 16, 2026 20:03
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request addresses race conditions in Discord gateway payload sending that cause NotAuthenticated errors when authenticated payloads (UpdatePresence, UpdateVoiceState, RequestMembers) are sent during handshake states (Identifying/Resuming). The fix implements state-aware payload validation and a double-check pattern to ensure payloads are only sent when the shard is in the appropriate state.

Changes:

  • Added three helper methods to classify payload operations: RequiresAuthentication(), IsHandshakePayload(), and CanSendPayload()
  • Implemented double-check pattern in SendAsync() to prevent authenticated payloads from being sent during handshake states
  • Modified exception handling to prevent handshake payloads from retrying on recoverable WebSocket errors (which would deadlock)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +96 to +101
// Payloads that require authentication must wait for the Ready state before sending.
// This prevents sending them during reconnection or before initial authentication completes.
if (RequiresAuthentication(payload.Op))
{
await WaitForReadyAsync(cancellationToken).ConfigureAwait(false);
}
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The WaitForReadyAsync call here may hang indefinitely when resuming a session because the RESUMED event handler (line 345-351) sets the state to Ready but does not complete the _readyTcs. This means any authenticated payloads sent during a resume will wait forever.

The RESUMED handler should also complete the _readyTcs similar to how the READY handler does at line 330-331.

Copilot uses AI. Check for mistakes.
{
// For non-handshake payloads (like UpdatePresence), wait for the gateway to reconnect and become ready again.
// Handshake payloads (Identify, Resume) should not retry here as they could deadlock waiting for Ready,
// since they are responsible for making the shard Ready in the first place.
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment states "Handshake payloads (Identify, Resume) should not retry here as they could deadlock waiting for Ready" which is accurate. However, it would be helpful to also mention what happens instead: the exception propagates to the caller (IdentifyAsync or ResumeAsync) where the broader connection management logic handles reconnection.

Suggested change
// since they are responsible for making the shard Ready in the first place.
// since they are responsible for making the shard Ready in the first place. Instead, the exception is
// allowed to propagate to the caller (e.g. IdentifyAsync or ResumeAsync), where the broader connection
// management logic handles reconnection and initiating a new handshake.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants