Using ldmsd 4.5.1:
Some ldmsd aggregators have started printing messages like the following at regular intervals (every 5 seconds):
Mar 30 11:11:56 rzhound193 ldmsd[22616]: ERROR: auth.munge: Failed to encode MUNGE. Socket communication error
Mar 30 11:11:56 rzhound193 ldmsd[22616]: ERROR: auth.munge: Failed to encode MUNGE. Socket communication error
Mar 30 11:11:56 rzhound193 ldmsd[22616]: ERROR: auth.munge: Failed to encode MUNGE. Socket communication error
I suspect that ldmsd is failing to reconnect to munge after the socket it is using is closed. The messages first appear in the logs at about the time munge was restarted and the munge socket was re-created.
ldmsd really needs to handle the dead socket cleanly and reinitiate a connection to munge.
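
To make the requested behaviour concrete, here is a minimal sketch of the "drop the dead handle and reconnect" pattern. It is not ldmsd or libmunge code. `SocketError`, `ReconnectingEncoder`, `FakeService`, and `FakeConnection` are hypothetical names invented for this illustration, and the fake service only simulates a restart that invalidates old connections. How ldmsd's auth.munge plugin actually holds its munge state is an assumption here, not something confirmed from the source.

```python
class SocketError(Exception):
    """Stand-in for a 'Socket communication error' from the credential service."""


class ReconnectingEncoder:
    """Wraps a connection factory and reconnects when the cached handle is dead."""

    def __init__(self, connect, max_attempts=2):
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self._connect = connect          # hypothetical factory returning an object with .encode()
        self._max_attempts = max_attempts
        self._conn = None                # cached connection, created lazily

    def encode(self, payload):
        last_error = None
        for _ in range(self._max_attempts):
            try:
                if self._conn is None:
                    self._conn = self._connect()
                return self._conn.encode(payload)
            except SocketError as exc:
                # Discard the dead handle so the next attempt opens a fresh one.
                self._conn = None
                last_error = exc
        raise last_error


class FakeService:
    """Simulated credential service; restart() invalidates existing connections."""

    def __init__(self):
        self.generation = 0

    def restart(self):
        self.generation += 1

    def connect(self):
        return FakeConnection(self, self.generation)


class FakeConnection:
    def __init__(self, service, generation):
        self._service = service
        self._generation = generation

    def encode(self, payload):
        # A connection opened before the last restart behaves like a dead socket.
        if self._generation != self._service.generation:
            raise SocketError("Socket communication error")
        return b"cred:" + payload


if __name__ == "__main__":
    service = FakeService()
    encoder = ReconnectingEncoder(service.connect)
    encoder.encode(b"before restart")
    service.restart()                    # simulates munge being restarted
    encoder.encode(b"after restart")     # succeeds after one transparent reconnect
```

With this shape, a restart of the service costs one failed attempt and one reconnect rather than an error logged on every interval. If the service is still down, the error is raised to the caller after `max_attempts` tries.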