Skip to content

Fix bug with leaked fdb connections#2447

Closed
hsson wants to merge 5 commits intoFoundationDB:mainfrom
hsson:hsson/close-db-handle
Closed

Fix bug with leaked fdb connections#2447
hsson wants to merge 5 commits intoFoundationDB:mainfrom
hsson:hsson/close-db-handle

Conversation

@hsson
Copy link
Copy Markdown

@hsson hsson commented Mar 24, 2026

Description

This change fixes a resource leak issue caused by using short-lived FoundationDB connections without properly closing them when done. The previously used version of the Go bindings (from the 7.1.67 release) did not expose any way to close and destory the underlaying FoundationDB resources created from the C library. Newer versions of the bindings introduces a Close() that can be used to properly release resources.

The PR upgrades the Go bindings to the version from release 7.3.63, and makes sure that all connections created from fdb.OpenDatabase(...) are properly closed when no longer needed.

This required a small refactor of the lock client, such that it uses short-lived database connections that are created for each method call, instead of keeping a long-lived connection over the lifetime of the lock client. This approach creates a consistent behavior with how the fdb and admin clients are implemented (which also utilize short-lived connections), and makes sure that the connection can be closed safely without pushing this responsibility out to the user of the lock client.

We discovered this bug (i.e. the resource and connection leakage) in our production cluster. We only noticed it beacuase we use quite short lived mTLS certificates (shorter than the default TLS_CERT_REFRESH_DELAY_SECONDS in FoundationDB of 12 hours). We started seeing TLS issues in the trace logs, originating from the operator, which ultimately lead us to find the fdb.OpenDatabase without Close() in the operator. This was not a problem for operations e.g. directly calling fdbcli, since we had configured the --knob-tls-cert-refresh-delay-seconds knob in customParameters.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Discussion

  • N/a

Testing

Did run unit tests, but unit tests are mostly useless for this type of fix. Hopefully e2e tests in CI is better. We also use the 7.3.63 bindings with this pattern (i.e. Close() of the db connection) heavily, and have substantial tests on our side. Feel confident about this change.

Documentation

  • N/a

Follow-up

As described above, we ultimately discovered this bug because of TLS issues. If implementation is ever changed in the future to have long-lived connections opened by fdb.OpenDatabase, it would be good if the TLS refersh knob (or any other custom parameters) could be configured. Seems like low prio right now though, given that all opened connections are now short-lived.

hsson added 5 commits March 24, 2026 22:00
Upgrades the Go bindings package to the release for 7.3.63. The updated
bindings expose functionality to Close() the database connection created
from fdb.OpenDatabase, which releases resources created from the
underlaying C code.
Updates the lock client implementation to use short lived connections
that are closed when no longer needed, instead of a long living connection
that leaks the underlaying connection to fdb. This approach is similar
to what e.g. the existing fdb or admin client do, which also creates a new
short lived connection each time.

Bonus: This commit fixes a bug where some methods used the database handle
without checking if the connection was established, which could happen
if locking was disabled.
Makes sure that Close() is called on the fdb connection created from
fdb.OpenDatabase in the admin client.
Upgrades the Go bindings in the data-loader sample app, and make sure
that opened database connection is closed (by calling Close()).
Makes sure that Close() is called on the fdb connection created from
fdb.OpenDatabase in the fdb client.
@hsson hsson force-pushed the hsson/close-db-handle branch from 5d4ff3f to f0ae901 Compare March 24, 2026 21:02
@hsson hsson changed the title Hsson/close db handle Fix bug with leaked fdb connections Mar 24, 2026
@hsson
Copy link
Copy Markdown
Author

hsson commented Mar 25, 2026

I saw the changes from #2444 btw, and perhaps the lock client (in lock_client.go) also deserves a similar refactor to use an executeTransaction helper? Could include it in this PR if it's of interest.

Copy link
Copy Markdown
Member

@johscheuer johscheuer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First thanks for the PR! Have you tried setting TLS_CERT_REFRESH_DELAY_SECONDS to an appropriate value for your use case in the operator deployment? This should be already possible by setting FDB_NETWORK_OPTION_TLS_CERT_REFRESH_DELAY_SECONDS.
What you are seeing is not a resource leak but rather a design decision to keep the connection open to the FDB clusters, since the operator uses the FoundationDBCluster UID from the Kubernetes resource, the cluster file path is always the same and the go bindings will cache the DB (https://github.com/apple/foundationdb/blob/release-7.1/bindings/go/src/fdb/fdb.go#L283-L290). We decided to keep the connection open, so that the bindings can update the connection string in the cluster file if the coordinators are changed outside of the operators scope, e.g. in cases where you run multiple operators. Closing the database connection directly can have some side effects that we have to validate first, e.g. https://github.com/apple/foundationdb/blob/main/bindings/go/src/fdb/database.go#L118-L120.
Since the operator supports FDB versions from 7.1+, we have to use the bindings for 7.1 and we should only update the bindings to 7.3 once we remove the support for 7.1

I hope all of this makes sense.

@hsson
Copy link
Copy Markdown
Author

hsson commented Mar 26, 2026

Thanks for taking the time to review!

I was not aware that it was possible to configure the refresh delay using environment variables. I'm aware of the FDB_NETWORK_OPTION_<option> method, but from I can tell from the list of options the TLS refresh delay is not in there. My understanding is that the TLS refresh delay is a knob (that can be set with --knob-tls-cert-refresh-delay-seconds), but I'm not sure if it is possible to set with an environment variable? I'm happy to be proven wrong, but it would be nice to understand why that works then 😄

Doesn't hurt to try of course, so I'll try setting FDB_NETWORK_OPTION_TLS_CERT_REFRESH_DELAY_SECONDS in the meantime, despite not knowing how it would work — I trust you!

Also, I was not aware of the caching of the connections, thanks for pointing that out and showing me! That also makes me wonder why our operator works at all (perhaps it doesn't work as well as it should), considering that cached database connection will be using an expired TLS certificate for a lot of the time...

And makes sense to not close the connection after the RebootWorker as you pointed out, missed that. Thanks a lot for the careful review.

Yet another thing that surprises me: I thought thta the bindings were backwards compatbile as long as the correct version was configured with e.g. fdb.APIVersion(730). Is this not the case? 🤔 Where is this documented?

I guess I/we can close this PR, since it does not really make sense given what you've pointed out. Again, thanks for taking the time to review. I would appreciate a lot if you could also help me understand these follow-up questions that I have.

@johscheuer
Copy link
Copy Markdown
Member

I was not aware that it was possible to configure the refresh delay using environment variables. I'm aware of the FDB_NETWORK_OPTION_ method, but from I can tell from the list of options the TLS refresh delay is not in there. My understanding is that the TLS refresh delay is a knob (that can be set with --knob-tls-cert-refresh-delay-seconds), but I'm not sure if it is possible to set with an environment variable? I'm happy to be proven wrong, but it would be nice to understand why that works then 😄

I just assumed that this knob is part of the network option, I should have verified that 🙈 . In that case FDB_NETWORK_OPTION_KNOB="tls_cert_refresh_delay_seconds=300" should work. If things work, feel free to create a PR that documents how to set this knob.

@hsson
Copy link
Copy Markdown
Author

hsson commented Mar 26, 2026

Ah great, I missed that you could also set knobs using environment variables! I'll try that, thanks 🙂

@hsson hsson closed this Mar 26, 2026
@hsson hsson deleted the hsson/close-db-handle branch March 26, 2026 11:34
@hsson
Copy link
Copy Markdown
Author

hsson commented Mar 26, 2026

Can confirm that setting the FDB_NETWORK_OPTION_KNOB="tls_cert_refresh_delay_seconds=<value>" environment variable on our operator has solved our issues! Thanks a lot for this 🙂 I'll see if I can find any appropriate place to put this in the docs.

@hsson
Copy link
Copy Markdown
Author

hsson commented Mar 26, 2026

Here is the documentation update: #2449

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants