nats input: silent connection loss with no observability into disconnect/reconnect/close lifecycle #4142

@itplayer

Problem

When a NATS connection is lost, the nats input becomes silently non-functional with no log output to indicate what happened or whether recovery is in progress.
The current get() in connection.go only sets nats.ErrorHandler, which covers async subscription-level errors.

It does not set:

  • nats.DisconnectErrHandler — fired when the TCP connection drops
  • nats.ReconnectHandler — fired when reconnection succeeds
  • nats.ClosedHandler — fired when the client exhausts all reconnect attempts and gives up permanently

This means:

  • A TCP disconnect produces zero log output
  • Each reconnect attempt produces zero log output
  • If MaxReconnects (default: 60) is exhausted and the connection is permanently closed, there is zero log output — the input simply stops receiving messages forever, with no indication of why
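To give a sense of how long this silent window might last before the client gives up, here is a back-of-the-envelope sketch. It assumes a single server URL and a reconnect wait of 2 seconds (which I believe is the nats.go default, but is not stated above), and it ignores jitter and dial timeouts, so treat the result as a rough estimate only. giveUpAfter is a hypothetical helper, not part of nats.go.

```go
package main

import (
	"fmt"
	"time"
)

// giveUpAfter is a hypothetical helper (not part of nats.go). It estimates how
// long a client keeps retrying a single server before the connection is
// closed, ignoring jitter and dial timeouts. ok is false when maxReconnects
// is negative, which means "retry forever".
func giveUpAfter(maxReconnects int, reconnectWait time.Duration) (d time.Duration, ok bool) {
	if maxReconnects < 0 {
		return 0, false
	}
	return time.Duration(maxReconnects) * reconnectWait, true
}

func main() {
	// 60 attempts is the default quoted above; the 2s wait is an assumption.
	if d, ok := giveUpAfter(60, 2*time.Second); ok {
		fmt.Println("gives up after roughly", d)
	}
	// A negative limit never gives up, so no estimate is returned.
	if _, ok := giveUpAfter(-1, 2*time.Second); !ok {
		fmt.Println("retries forever")
	}
}
```

Under these assumptions the input would go permanently quiet after about two minutes of outage, which is short enough to be hit by an ordinary server restart or network partition.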

Impact

In production, this makes it nearly impossible to diagnose why an input stopped receiving messages without external tooling.
The symptom is that the input_received metric drops to 0 and never recovers until the pod is restarted manually.
Without logs from these handlers, operators cannot distinguish between:

  • A transient network blip that self-recovered
  • A permanent connection loss requiring intervention
  • A slow consumer that caused the server to forcibly close the subscription
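As an illustration of how the first two cases could be told apart once the lifecycle handlers exist, here is a minimal sketch of a state tracker. The lifecycle type and its methods are hypothetical names, not part of nats.go or of the input; each on* method is meant to be called from the matching handler. The slow-consumer case would still arrive through the existing nats.ErrorHandler and is left out here.

```go
package main

import (
	"fmt"
	"sync"
)

// lifecycle is a hypothetical tracker, not part of nats.go or the input.
// A mutex guards the fields because handlers run on a client goroutine.
type lifecycle struct {
	mu          sync.Mutex
	connected   bool
	closed      bool
	disconnects int
}

func newLifecycle() *lifecycle { return &lifecycle{connected: true} }

// onDisconnect would be called from the disconnect handler.
func (l *lifecycle) onDisconnect() {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.connected = false
	l.disconnects++
}

// onReconnect would be called from the reconnect handler.
func (l *lifecycle) onReconnect() {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.connected = true
}

// onClosed would be called from the closed handler.
func (l *lifecycle) onClosed() {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.connected = false
	l.closed = true
}

// diagnosis summarises the events seen so far in operator-facing terms.
func (l *lifecycle) diagnosis() string {
	l.mu.Lock()
	defer l.mu.Unlock()
	switch {
	case l.closed:
		return "permanent loss: restart required"
	case !l.connected:
		return "disconnected: reconnect in progress"
	case l.disconnects > 0:
		return fmt.Sprintf("recovered after %d disconnect(s)", l.disconnects)
	default:
		return "healthy: no disconnects seen"
	}
}

func main() {
	l := newLifecycle()
	l.onDisconnect()
	l.onReconnect()
	fmt.Println(l.diagnosis()) // transient blip that self-recovered

	l.onDisconnect()
	l.onClosed()
	fmt.Println(l.diagnosis()) // permanent connection loss
}
```

The same state could also back a metric or health check, but logging alone is enough to answer the operator's question.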

Proposed Fix

Add the three missing handlers in get(), immediately after errorHandlerOption:

	// Log when the connection drops. ConnectedUrl() may return an empty
	// string here because the client is no longer in the connected state.
	opts = append(opts, nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
		if err != nil {
			c.logger.Errorf("NATS disconnected from %s: %v", nc.ConnectedUrl(), err)
		} else {
			c.logger.Warnf("NATS disconnected from %s (no error)", nc.ConnectedUrl())
		}
	}))

	// Log when the client re-establishes a connection.
	opts = append(opts, nats.ReconnectHandler(func(nc *nats.Conn) {
		c.logger.Infof("NATS reconnected to %s", nc.ConnectedUrl())
	}))

	// Log when the client gives up and the connection is closed for good.
	opts = append(opts, nats.ClosedHandler(func(_ *nats.Conn) {
		c.logger.Errorf("NATS connection permanently closed (exhausted reconnect attempts), manual restart required")
	}))

The ClosedHandler log in particular is critical: it is the only way an operator can know that the input will never recover without a restart, since the NATS client silently stops once MaxReconnects is exceeded (unless MaxReconnects is set to -1, which makes the client retry forever).

Note: ClosedHandler is also a natural place to surface a fatal error or trigger a component restart in the future, but logging is a minimal and non-breaking first step.
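One possible shape for that future step, sketched under assumptions: the closed callback pushes a sentinel error onto a buffered channel that a supervising goroutine can wait on. newClosedNotifier and errConnClosed are hypothetical names, and the callback takes no arguments here so the example stays free of external dependencies; in real code it would be wrapped in a func(*nats.Conn) passed to nats.ClosedHandler.

```go
package main

import (
	"errors"
	"fmt"
)

// errConnClosed is a hypothetical sentinel error for a permanently closed connection.
var errConnClosed = errors.New("nats connection permanently closed")

// newClosedNotifier returns a callback to invoke from the closed handler and
// a channel that receives errConnClosed at most once. The send is
// non-blocking so the callback never stalls the client's callback goroutine.
func newClosedNotifier() (onClosed func(), fatal <-chan error) {
	ch := make(chan error, 1)
	return func() {
		select {
		case ch <- errConnClosed:
		default: // already signalled; drop the duplicate
		}
	}, ch
}

func main() {
	onClosed, fatal := newClosedNotifier()

	onClosed()
	onClosed() // a second call is a no-op rather than a deadlock

	// A supervisor would block here and then restart or fail the component.
	fmt.Println("fatal:", <-fatal)
}
```

Keeping the signal on a channel leaves the restart policy to the caller, so the logging change proposed above stays non-breaking.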
