IO.select blocks indefinitely with :persistent plugin when connections become inactive #137

@mdayaram

Summary

When using the :persistent plugin, IO.select in Selector#select_many can be called with a nil timeout, causing it to block indefinitely. This happens because inactive persistent connections return nil from Connection#timeout, and if all selectables have nil timeouts and no timers are pending, Selector#next_timeout returns nil.

We observed a Sidekiq job hang for 11 hours on IO.select before we took a thread dump and killed the process.

Environment

  • httpx version: 1.7.5 (branch issue-377, commit 336b057cd5c2)
  • Ruby version: 3.4.8
  • OS: Linux (Docker container, Debian-based)
  • Plugins: :persistent, :retries (max_retries: 2), :rate_limiter, :auth

Configuration

HTTPX.
  plugin(:retries, max_retries: 2, retry_after: :exponential_backoff).
  plugin(:rate_limiter).
  plugin(:auth).
  with(
    timeout: { connect_timeout: 15, request_timeout: 30, operation_timeout: 30 },
  ).
  plugin(:persistent, close_on_fork: true)

Usage pattern

We issue multi-request GET calls to multiple different origins:

# requests is an array of [path, { origin: ..., params: ... }] tuples
responses = session.get(*requests)

This fans out HTTP requests to ~20-30 different hosts in parallel via Session#send_requests, which calls Session#receive_requests.

Thread dump (stuck thread)

IO.select                                    # selector.rb:206
  HTTPX::Selector#select_many                # selector.rb:206
  HTTPX::Selector#select                     # selector.rb:183
  HTTPX::Selector#next_tick                  # selector.rb:53
  HTTPX::Session#receive_requests            # session.rb:337
  HTTPX::Session#send_requests               # session.rb:307
  HTTPX::Session#request                     # session.rb:102
  HTTPX::Chainable#get                       # chainable.rb:10

Root cause analysis

1. Connection#timeout returns nil for inactive connections

In connection.rb:327-335:

def timeout
  return if @state == :closed || @state == :inactive  # <-- returns nil
  return @timeout if @timeout
  return @options.timeout[:connect_timeout] if @state == :idle
  @options.timeout[:operation_timeout]
end

2. Selector#next_timeout propagates nil to IO.select

In selector.rb:275-291:

def next_timeout
  timer_interval = @timers.wait_interval
  connection_interval = @selectables.filter_map(&:timeout).min
  return connection_interval unless timer_interval  # <-- returns nil if both are nil
  # ...
end

If all registered selectables are :inactive or :closed, filter_map(&:timeout) produces an empty array, .min returns nil, and if there are no active timers, next_timeout returns nil.
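The empty-array case can be checked in isolation. The following is a minimal sketch, not httpx code: FakeSelectable is a hypothetical stand-in whose timeout method copies only the early return quoted in step 1, and the interval values are made up for illustration.

```ruby
# Hypothetical stand-in for a selectable; not part of httpx.
FakeSelectable = Struct.new(:state, :interval) do
  # Mirrors the early return quoted above: nil for :closed / :inactive.
  def timeout
    return if state == :closed || state == :inactive

    interval
  end
end

all_inactive = [
  FakeSelectable.new(:inactive, 30),
  FakeSelectable.new(:closed, 30)
]
mixed = all_inactive + [
  FakeSelectable.new(:open, 30),
  FakeSelectable.new(:idle, 15)
]

# filter_map drops the nil timeouts, so the first array ends up empty
# and Array#min on an empty array returns nil.
inactive_interval = all_inactive.filter_map(&:timeout).min
mixed_interval = mixed.filter_map(&:timeout).min

p inactive_interval # nil
p mixed_interval    # 15
```

As long as one selectable still reports a timeout, the minimum is a number; the nil only appears once every selectable is :inactive or :closed.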

3. IO.select called with nil timeout blocks forever

In selector.rb:204-206:

def select_many(r, w, interval, &block)
  readers, writers = ::IO.select(r, w, nil, interval)  # interval is nil → blocks forever
  # ...
end
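The difference between a bounded and an unbounded IO.select can be shown with plain Ruby, independent of httpx. This sketch assumes an idle pipe is a fair stand-in for a socket on which nothing ever arrives; the 0.1 and 0.5 second values are arbitrary.

```ruby
require "timeout"

# A pipe whose read end never becomes readable, standing in for an idle socket.
# The write end is kept referenced so the read end does not see EOF.
reader, _writer = IO.pipe

# Bounded wait: IO.select returns nil once the interval elapses.
bounded_result = IO.select([reader], nil, nil, 0.1)

# Unbounded wait: with a nil interval, only an external guard ends the call.
unbounded_timed_out =
  begin
    Timeout.timeout(0.5) { IO.select([reader], nil, nil, nil) }
    false
  rescue Timeout::Error
    true
  end

p bounded_result      # nil
p unbounded_timed_out # true
```

In the hung job there was no such external guard, which matches the thread sitting in IO.select until the process was killed.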

4. The :persistent plugin creates the conditions for this

Without :persistent, connections close after completing their requests and are deregistered from the selector. With :persistent, completed connections transition to :inactive and remain as selectables.

The likely sequence:

  1. Multi-request call sends requests to N hosts
  2. Most responses complete successfully, their connections transition to :inactive
  3. One connection enters a problematic state (e.g., remote closes the TCP connection without a proper FIN/RST, or the connection errors out and gets closed)
  4. All remaining selectables now return nil from timeout
  5. No active timers remain (the request timeout timer may have been cleaned up when the connection errored)
  6. next_timeout returns nil
  7. IO.select blocks forever waiting on the inactive connections' file descriptors

Steps to reproduce

This is a race condition that's hard to reproduce deterministically, but the conditions are:

  1. Use the :persistent plugin
  2. Send multi-request calls (session.get(*multiple_requests)) to multiple different origins
  3. Have one of the target hosts drop the connection in a way that causes the connection to close/error without properly failing the associated request's timeout timer
  4. The remaining inactive persistent connections keep the selector non-empty, but all return nil timeouts

A more targeted reproduction might be possible with something like the following (untested):

require "httpx"

session = HTTPX.plugin(:persistent).with(
  timeout: { connect_timeout: 5, request_timeout: 10, operation_timeout: 10 }
)

# Send requests to multiple hosts where one will hang/drop
responses = session.get(
  ["https://httpbin.org/get", {}],
  ["https://host-that-will-drop-connection.example/", {}]
)

Suggested fix

Selector#next_timeout should return a maximum timeout value (or the minimum configured timeout) instead of nil when no selectables report a timeout. This ensures IO.select always has a bounded wait. For example:

def next_timeout
  @is_timer_interval = false

  timer_interval = @timers.wait_interval
  connection_interval = @selectables.filter_map(&:timeout).min

  # Ensure we never return nil when there are selectables registered,
  # to prevent IO.select from blocking indefinitely
  if connection_interval.nil? && timer_interval.nil? && @selectables.any?
    return 0  # or a small default like 1
  end

  return connection_interval unless timer_interval
  # ...
end

Alternatively, Connection#timeout could return operation_timeout for :inactive connections that are still registered in the selector, rather than nil.
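To make the intent of the first suggestion concrete, here is a simplified model of the decision as a pure function. It is a sketch under assumptions: bounded_interval and FALLBACK_INTERVAL are hypothetical names, the 1 second cap is arbitrary, and taking the minimum of the timer and connection intervals is a simplification that may not match what the elided part of Selector#next_timeout does.

```ruby
# Assumption: 1 second is an arbitrary illustrative cap, not an httpx default.
FALLBACK_INTERVAL = 1

# Hypothetical helper modelling the "never return nil" rule.
# timer_interval: Numeric or nil
# connection_intervals: array of Numeric or nil (one per selectable)
def bounded_interval(timer_interval, connection_intervals, fallback: FALLBACK_INTERVAL)
  candidates = [timer_interval, *connection_intervals].compact
  # Fall back to a finite wait when nothing reports a timeout.
  candidates.min || fallback
end

p bounded_interval(nil, [nil, nil]) # 1
p bounded_interval(5, [nil, 30])    # 5
p bounded_interval(nil, [30, 15])   # 15
```

A non-zero fallback is probably preferable to 0 here: with a zero interval IO.select returns immediately, so a loop that keeps re-entering it while only inactive selectables are registered would likely spin.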
