Fix bug with leaked fdb connections#2447
Conversation
Upgrades the Go bindings package to the release for 7.3.63. The updated bindings expose functionality to Close() the database connection created from fdb.OpenDatabase, which releases resources created from the underlaying C code.
Updates the lock client implementation to use short lived connections that are closed when no longer needed, instead of a long living connection that leaks the underlaying connection to fdb. This approach is similar to what e.g. the existing fdb or admin client do, which also creates a new short lived connection each time. Bonus: This commit fixes a bug where some methods used the database handle without checking if the connection was established, which could happen if locking was disabled.
Makes sure that Close() is called on the fdb connection created from fdb.OpenDatabase in the admin client.
Upgrades the Go bindings in the data-loader sample app, and make sure that opened database connection is closed (by calling Close()).
Makes sure that Close() is called on the fdb connection created from fdb.OpenDatabase in the fdb client.
5d4ff3f to
f0ae901
Compare
|
I saw the changes from #2444 btw, and perhaps the lock client (in |
johscheuer
left a comment
There was a problem hiding this comment.
First thanks for the PR! Have you tried setting TLS_CERT_REFRESH_DELAY_SECONDS to an appropriate value for your use case in the operator deployment? This should be already possible by setting FDB_NETWORK_OPTION_TLS_CERT_REFRESH_DELAY_SECONDS.
What you are seeing is not a resource leak but rather a design decision to keep the connection open to the FDB clusters, since the operator uses the FoundationDBCluster UID from the Kubernetes resource, the cluster file path is always the same and the go bindings will cache the DB (https://github.com/apple/foundationdb/blob/release-7.1/bindings/go/src/fdb/fdb.go#L283-L290). We decided to keep the connection open, so that the bindings can update the connection string in the cluster file if the coordinators are changed outside of the operators scope, e.g. in cases where you run multiple operators. Closing the database connection directly can have some side effects that we have to validate first, e.g. https://github.com/apple/foundationdb/blob/main/bindings/go/src/fdb/database.go#L118-L120.
Since the operator supports FDB versions from 7.1+, we have to use the bindings for 7.1 and we should only update the bindings to 7.3 once we remove the support for 7.1
I hope all of this makes sense.
|
Thanks for taking the time to review! I was not aware that it was possible to configure the refresh delay using environment variables. I'm aware of the Doesn't hurt to try of course, so I'll try setting Also, I was not aware of the caching of the connections, thanks for pointing that out and showing me! That also makes me wonder why our operator works at all (perhaps it doesn't work as well as it should), considering that cached database connection will be using an expired TLS certificate for a lot of the time... And makes sense to not close the connection after the Yet another thing that surprises me: I thought thta the bindings were backwards compatbile as long as the correct version was configured with e.g. I guess I/we can close this PR, since it does not really make sense given what you've pointed out. Again, thanks for taking the time to review. I would appreciate a lot if you could also help me understand these follow-up questions that I have. |
I just assumed that this knob is part of the network option, I should have verified that 🙈 . In that case |
|
Ah great, I missed that you could also set knobs using environment variables! I'll try that, thanks 🙂 |
|
Can confirm that setting the |
|
Here is the documentation update: #2449 |
Description
This change fixes a resource leak issue caused by using short-lived FoundationDB connections without properly closing them when done. The previously used version of the Go bindings (from the 7.1.67 release) did not expose any way to close and destory the underlaying FoundationDB resources created from the C library. Newer versions of the bindings introduces a
Close()that can be used to properly release resources.The PR upgrades the Go bindings to the version from release 7.3.63, and makes sure that all connections created from
fdb.OpenDatabase(...)are properly closed when no longer needed.This required a small refactor of the lock client, such that it uses short-lived database connections that are created for each method call, instead of keeping a long-lived connection over the lifetime of the lock client. This approach creates a consistent behavior with how the fdb and admin clients are implemented (which also utilize short-lived connections), and makes sure that the connection can be closed safely without pushing this responsibility out to the user of the lock client.
We discovered this bug (i.e. the resource and connection leakage) in our production cluster. We only noticed it beacuase we use quite short lived mTLS certificates (shorter than the default
TLS_CERT_REFRESH_DELAY_SECONDSin FoundationDB of 12 hours). We started seeing TLS issues in the trace logs, originating from the operator, which ultimately lead us to find thefdb.OpenDatabasewithoutClose()in the operator. This was not a problem for operations e.g. directly callingfdbcli, since we had configured the--knob-tls-cert-refresh-delay-secondsknob incustomParameters.Type of change
Discussion
Testing
Did run unit tests, but unit tests are mostly useless for this type of fix. Hopefully e2e tests in CI is better. We also use the 7.3.63 bindings with this pattern (i.e.
Close()of the db connection) heavily, and have substantial tests on our side. Feel confident about this change.Documentation
Follow-up
As described above, we ultimately discovered this bug because of TLS issues. If implementation is ever changed in the future to have long-lived connections opened by
fdb.OpenDatabase, it would be good if the TLS refersh knob (or any other custom parameters) could be configured. Seems like low prio right now though, given that all opened connections are now short-lived.