ldmsd fails to reconnect to munge #2228

@morrone

Using ldmsd 4.5.1:

Some ldmsd aggregators have begun printing messages like the following at regular intervals (every 5 seconds):

Mar 30 11:11:56 rzhound193 ldmsd[22616]:     ERROR: auth.munge: Failed to encode MUNGE. Socket communication error
Mar 30 11:11:56 rzhound193 ldmsd[22616]:     ERROR: auth.munge: Failed to encode MUNGE. Socket communication error
Mar 30 11:11:56 rzhound193 ldmsd[22616]:     ERROR: auth.munge: Failed to encode MUNGE. Socket communication error

I suspect that ldmsd is failing to reconnect to munge after the socket it is using is closed. The first of these messages in the logs appears to coincide with a restart of munge, when the munge socket was re-created.

ldmsd really needs to handle the dead socket cleanly and reinitiate a connection to munge.
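
To illustrate the behavior being requested, here is a minimal sketch of "drop the stale handle and reconnect on a socket error". It is a simulation only, not ldmsd or libmunge code: `auth_conn`, `conn_open`, `conn_close`, `conn_encode`, `encode_with_reconnect`, and `daemon_generation` are all hypothetical names, and the daemon restart is modeled as a counter instead of a real socket.

```c
/* Simulated daemon state (hypothetical): bumped each time the daemon
 * "restarts" and re-creates its socket. */
int daemon_generation = 1;

enum auth_rc { AUTH_OK = 0, AUTH_ESOCKET = 1 };

/* Hypothetical client handle to the daemon. */
struct auth_conn {
    int open;        /* 1 while the handle is believed usable */
    int generation;  /* daemon generation this handle was opened against */
};

/* Open a fresh handle against the current daemon instance. */
int conn_open(struct auth_conn *c)
{
    c->generation = daemon_generation;
    c->open = 1;
    return 0;
}

/* Discard the handle so the next attempt starts from scratch. */
void conn_close(struct auth_conn *c)
{
    c->open = 0;
}

/* Fails with a socket error if the handle predates a daemon restart. */
enum auth_rc conn_encode(struct auth_conn *c, const char *payload)
{
    (void)payload;
    if (!c->open || c->generation != daemon_generation)
        return AUTH_ESOCKET;
    return AUTH_OK;
}

/* On a socket error, close the stale handle and retry with a new one,
 * up to max_attempts tries in total. */
enum auth_rc encode_with_reconnect(struct auth_conn *c, const char *payload,
                                   int max_attempts)
{
    enum auth_rc rc = AUTH_ESOCKET;

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        if (!c->open && conn_open(c) != 0)
            continue;            /* daemon may still be restarting */
        rc = conn_encode(c, payload);
        if (rc == AUTH_OK)
            return rc;
        conn_close(c);           /* stale handle: force a reconnect */
    }
    return rc;
}
```

In a real fix the retry would presumably also need a delay or backoff between attempts and rate-limited logging, so that a munge outage does not produce the same error line every 5 seconds.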
