Commit dd0c7b9

Fix IdleConnectionEvictor thread leak in long-running services (#1271)
## Summary

Fixes #1221 — `IdleConnectionEvictor` threads accumulate over time in long-running services even when all JDBC connections are properly closed.

**Root cause (two related bugs in `DatabricksDriverFeatureFlagsContextFactory`):**

1. **Ref-count overcounting**: `getInstance()` incremented the ref-count on every call, but `isTelemetryEnabled()` calls it repeatedly for the same connection while `DatabricksConnection.close()` only calls `removeInstance()` once. This meant the feature-flags context (and its 15-minute scheduler) never shut down after the last connection closed.
2. **Stale connection context → orphaned HTTP client**: The feature-flags context is shared per host, but HTTP clients are scoped per connection UUID. The context stored the *first* connection's `connectionContext`. When that connection closed, its HTTP client was removed from the factory. On the next 15-minute refresh, `refreshAllFeatureFlags()` called `DatabricksHttpClientFactory.getClient(staleContext)` — the UUID was gone, so `computeIfAbsent` silently created a brand-new `DatabricksHttpClient` (and a new `IdleConnectionEvictor` thread) that nobody would ever clean up.

**Fix** (mirrors the pattern already used correctly in `TelemetryClientFactory`):

- `FeatureFlagsContextHolder` now tracks connection UUIDs (`ConcurrentHashMap<String, IDatabricksConnectionContext>`) instead of a bare ref-count.
- `getInstance()` registers the UUID idempotently — no overcounting.
- `removeInstance()` removes the UUID; when the set is empty, the context shuts down.
- When the "owner" connection closes but others remain, the stored `connectionContext` is atomically updated to another active connection so future HTTP-client lookups resolve to a live client.
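The overcounting in bug 1 can be reproduced with a standalone sketch (plain Java; none of the driver classes are used, and the counts chosen are illustrative): a bare ref-count drifts whenever `getInstance()` is called more often than `removeInstance()`, while an idempotent UUID map settles back to empty after a single close.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class Main {
    public static void main(String[] args) {
        // Buggy pattern: every getInstance() bumps the count, e.g. three
        // isTelemetryEnabled() calls on the same connection...
        AtomicInteger refCount = new AtomicInteger();
        for (int i = 0; i < 3; i++) refCount.incrementAndGet();
        refCount.decrementAndGet(); // ...but close() decrements only once
        System.out.println("refCount after close: " + refCount.get());

        // Fixed pattern: track connection UUIDs idempotently; repeated
        // registration of the same UUID is a no-op, so one remove suffices.
        ConcurrentHashMap<String, Boolean> activeUuids = new ConcurrentHashMap<>();
        for (int i = 0; i < 3; i++) activeUuids.put("uuid-1", Boolean.TRUE);
        activeUuids.remove("uuid-1");
        System.out.println("active after close: " + activeUuids.size());
    }
}
```

The ref-count ends at 2 (context never shuts down), while the UUID map ends empty (context shuts down), which is exactly the behavioral difference the fix relies on.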
## Test plan

- [x] `DatabricksDriverFeatureFlagsContextFactoryTest` — updated existing tests for the new UUID-based semantics and added:
  - `testMultipleCallsWithSameConnectionAreIdempotent` — single connection, multiple `getInstance` calls; one `removeInstance` is sufficient
  - `testMultipleConnectionsRequireAllToClose` — two connections share a context; closing one keeps it alive, closing both shuts it down
  - `testConnectionContextUpdatedWhenOwnerConnectionCloses` — verifies the stored `connectionContext` switches to another connection when the owner closes
- [x] All existing `DatabricksDriverFeatureFlagsContextTest` tests still pass
- [x] Full build with `mvn test` passes

---------

Signed-off-by: Samikshya Chand <samikshya.chand@databricks.com>
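Beyond the unit tests, one way to sanity-check this class of fix in a long-running service is to count evictor-style threads before and after closing connections. This sketch is not part of the PR; the thread name `"Connection evictor"` is Apache HttpClient's default for `IdleConnectionEvictor` worker threads, but treat it as an assumption for your client version. A plain thread stands in for the evictor so the sketch stays self-contained.

```java
public class Main {
    // Count live threads whose name contains the given fragment.
    static long countThreads(String nameFragment) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.isAlive() && t.getName().contains(nameFragment))
                .count();
    }

    public static void main(String[] args) throws Exception {
        // Simulate a leaked evictor-style daemon thread.
        Thread evictor = new Thread(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) {}
        }, "Connection evictor");
        evictor.setDaemon(true);
        evictor.start();
        Thread.sleep(50); // let the thread register as live

        System.out.println("evictors running: " + countThreads("Connection evictor"));

        evictor.interrupt();
        evictor.join();
        System.out.println("evictors after shutdown: " + countThreads("Connection evictor"));
    }
}
```

In a real service, a count that only grows across connection open/close cycles is the signature of the leak this commit fixes.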
1 parent 03ca3c1 commit dd0c7b9

5 files changed

Lines changed: 163 additions & 53 deletions

File tree

- NEXT_CHANGELOG.md
- src/main/java/com/databricks/jdbc/common/safe/DatabricksDriverFeatureFlagsContext.java
- src/main/java/com/databricks/jdbc/common/safe/DatabricksDriverFeatureFlagsContextFactory.java
- src/main/java/com/databricks/jdbc/common/safe/FeatureFlagsContextHolder.java
- src/test/java/com/databricks/jdbc/common/safe/DatabricksDriverFeatureFlagsContextFactoryTest.java

NEXT_CHANGELOG.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -19,6 +19,7 @@
 - Fixed `rollback()` to throw `SQLException` when called in auto-commit mode (no active transaction), aligning with JDBC spec. Previously it silently sent a ROLLBACK command to the server.
 - Fixed `fetchAutoCommitStateFromServer()` to accept both `"1"`/`"0"` and `"true"`/`"false"` responses from `SET AUTOCOMMIT` query, since different server implementations return different formats.
 - Fixed socket leak in SDK HTTP client that prevented CRaC checkpointing. The SDK's connection pool was not shut down on `connection.close()`, leaving TCP sockets open.
+- Fixed `IdleConnectionEvictor` thread leak in long-running services. The feature-flags context shared per host was ref-counted incorrectly and held a stale connection UUID after the owning connection closed; on the next 15-minute refresh it silently recreated an HTTP client (and its evictor thread) that was never cleaned up. Connection UUIDs are now tracked idempotently and the stored connection context is updated when the owning connection closes.
 - Fixed Date fields within complex types (ARRAY, STRUCT, MAP) being returned as epoch day integers instead of proper date values.
 - Fixed `DatabaseMetaData.getColumns()` returning the column type name in `COLUMN_DEF` for columns with no default value. `COLUMN_DEF` now correctly returns `null` per the JDBC specification.
 - Coalesce concurrent expired cloud fetch link refreshes into a single batch FetchResults RPC to prevent thread pool exhaustion under high concurrency.
```

src/main/java/com/databricks/jdbc/common/safe/DatabricksDriverFeatureFlagsContext.java

Lines changed: 9 additions & 1 deletion
```diff
@@ -33,7 +33,7 @@ public class DatabricksDriverFeatureFlagsContext {
       DriverUtil.getDriverVersionWithoutOSSSuffix());
   private static final int DEFAULT_TTL_SECONDS = 900; // 15 minutes
   private final String featureFlagEndpoint;
-  private final IDatabricksConnectionContext connectionContext;
+  private volatile IDatabricksConnectionContext connectionContext;
   private final Cache<String, String> featureFlags;
   private final ScheduledExecutorService scheduler =
       Executors.newSingleThreadScheduledExecutor(
@@ -139,6 +139,14 @@ void fetchAndSetFlagsFromServer(IDatabricksHttpClient httpClient, HttpGet reques
     }
   }
 
+  IDatabricksConnectionContext getConnectionContext() {
+    return connectionContext;
+  }
+
+  void updateConnectionContext(IDatabricksConnectionContext newContext) {
+    this.connectionContext = newContext;
+  }
+
   public boolean isFeatureEnabled(String name) {
     String value = featureFlags.getIfPresent(name);
     return Boolean.parseBoolean(value);
```

src/main/java/com/databricks/jdbc/common/safe/DatabricksDriverFeatureFlagsContextFactory.java

Lines changed: 33 additions & 13 deletions
```diff
@@ -16,7 +16,9 @@ private DatabricksDriverFeatureFlagsContextFactory() {
   }
 
   /**
-   * Gets or creates a DatabricksDriverFeatureFlagsContext instance for the given compute
+   * Gets or creates a DatabricksDriverFeatureFlagsContext instance for the given compute. Multiple
+   * calls with the same connection UUID are idempotent — only one reference is tracked per
+   * connection.
    *
    * @param context the connection context
    * @return the DatabricksDriverFeatureFlagsContext instance
@@ -29,35 +31,47 @@ public static DatabricksDriverFeatureFlagsContext getInstance(
             key,
             (k, existing) -> {
               if (existing == null) {
-                // First reference for this compute
+                // First connection for this host
                 return new FeatureFlagsContextHolder(
-                    new DatabricksDriverFeatureFlagsContext(context), 1);
+                    new DatabricksDriverFeatureFlagsContext(context), context);
               }
-              // Additional reference for the same compute
-              existing.refCount.incrementAndGet();
+              // Track this connection UUID (idempotent for the same UUID)
+              existing.addContext(context);
               return existing;
             });
     return holder.context;
   }
 
   /**
-   * Removes the DatabricksDriverFeatureFlagsContext instance for the given compute.
+   * Removes the reference for the given connection context. When the last connection for a host is
+   * removed, the context is shut down. If the removed connection was the one whose context was
+   * stored internally, the stored context is updated to another active connection to prevent HTTP
+   * client leaks.
    *
    * @param connectionContext the connection context
    */
   public static void removeInstance(IDatabricksConnectionContext connectionContext) {
     if (connectionContext != null) {
       String key = TelemetryHelper.keyOf(connectionContext);
+      String uuid = connectionContext.getConnectionUuid();
       contextMap.computeIfPresent(
           key,
           (k, holder) -> {
-            // Last reference being removed: shutdown and remove entry
-            if (holder.refCount.get() <= 1) {
+            if (uuid != null) {
+              holder.activeContexts.remove(uuid);
+            }
+            if (holder.activeContexts.isEmpty()) {
               holder.context.shutdown();
               return null;
             }
-            // Still referenced elsewhere: just decrement
-            holder.refCount.decrementAndGet();
+            // If the removed connection's UUID matches the one stored in the feature flags context,
+            // update to another active connection to prevent stale HTTP client lookups.
+            IDatabricksConnectionContext current = holder.context.getConnectionContext();
+            if (current != null && uuid != null && uuid.equals(current.getConnectionUuid())) {
+              IDatabricksConnectionContext replacement =
+                  holder.activeContexts.values().iterator().next();
+              holder.context.updateConnectionContext(replacement);
+            }
             return holder;
           });
     }
@@ -67,9 +81,15 @@ public static void removeInstance(IDatabricksConnectionContext connectionContext
   public static void setFeatureFlagsContext(
       IDatabricksConnectionContext connectionContext, Map<String, String> featureFlags) {
     String key = TelemetryHelper.keyOf(connectionContext);
-    contextMap.put(
+    contextMap.compute(
         key,
-        new FeatureFlagsContextHolder(
-            new DatabricksDriverFeatureFlagsContext(connectionContext, featureFlags), 1));
+        (k, existing) -> {
+          if (existing != null) {
+            existing.context.shutdown();
+          }
+          return new FeatureFlagsContextHolder(
+              new DatabricksDriverFeatureFlagsContext(connectionContext, featureFlags),
+              connectionContext);
+        });
   }
 }
```
src/main/java/com/databricks/jdbc/common/safe/FeatureFlagsContextHolder.java

Lines changed: 14 additions & 4 deletions

```diff
@@ -1,13 +1,23 @@
 package com.databricks.jdbc.common.safe;
 
-import java.util.concurrent.atomic.AtomicInteger;
+import com.databricks.jdbc.api.internal.IDatabricksConnectionContext;
+import java.util.concurrent.ConcurrentHashMap;
 
 final class FeatureFlagsContextHolder {
   final DatabricksDriverFeatureFlagsContext context;
-  AtomicInteger refCount;
+  final ConcurrentHashMap<String, IDatabricksConnectionContext> activeContexts;
 
-  FeatureFlagsContextHolder(DatabricksDriverFeatureFlagsContext context, int refCount) {
+  FeatureFlagsContextHolder(
+      DatabricksDriverFeatureFlagsContext context, IDatabricksConnectionContext initialContext) {
     this.context = context;
-    this.refCount = new AtomicInteger(refCount);
+    this.activeContexts = new ConcurrentHashMap<>();
+    addContext(initialContext);
+  }
+
+  void addContext(IDatabricksConnectionContext ctx) {
+    String uuid = ctx != null ? ctx.getConnectionUuid() : null;
+    if (uuid != null) {
+      activeContexts.put(uuid, ctx);
+    }
   }
 }
```

src/test/java/com/databricks/jdbc/common/safe/DatabricksDriverFeatureFlagsContextFactoryTest.java

Lines changed: 106 additions & 35 deletions
```diff
@@ -44,6 +44,8 @@ void setUp() {
     when(connectionContext2.getHost()).thenReturn("host2.databricks.com");
     when(connectionContext1.getHostForOAuth()).thenReturn("host1.databricks.com");
     when(connectionContext2.getHostForOAuth()).thenReturn("host2.databricks.com");
+    when(connectionContext1.getConnectionUuid()).thenReturn("uuid-1");
+    when(connectionContext2.getConnectionUuid()).thenReturn("uuid-2");
   }
 
   @AfterEach
@@ -74,32 +76,92 @@ void testGetInstanceCreatesDifferentContextForDifferentWorkspace() {
   }
 
   @Test
-  void testReferenceCountingWorks() {
-    // Get instance three times for same workspace
+  void testMultipleCallsWithSameConnectionAreIdempotent() {
+    // Multiple getInstance calls with the same connection UUID are idempotent
     DatabricksDriverFeatureFlagsContext context1 =
         DatabricksDriverFeatureFlagsContextFactory.getInstance(connectionContext1);
-    DatabricksDriverFeatureFlagsContext context2 =
-        DatabricksDriverFeatureFlagsContextFactory.getInstance(connectionContext1);
-    DatabricksDriverFeatureFlagsContext context3 =
-        DatabricksDriverFeatureFlagsContextFactory.getInstance(connectionContext1);
+    DatabricksDriverFeatureFlagsContextFactory.getInstance(connectionContext1);
+    DatabricksDriverFeatureFlagsContextFactory.getInstance(connectionContext1);
 
-    assertSame(context1, context2);
-    assertSame(context2, context3);
+    assertSame(
+        context1, DatabricksDriverFeatureFlagsContextFactory.getInstance(connectionContext1));
 
-    // Remove twice - context should still exist
-    DatabricksDriverFeatureFlagsContextFactory.removeInstance(connectionContext1);
+    // A single removeInstance is sufficient to clean up one connection
     DatabricksDriverFeatureFlagsContextFactory.removeInstance(connectionContext1);
 
-    // Get again - should still return the same instance
-    DatabricksDriverFeatureFlagsContext context4 =
+    // Context is gone; a new one would be created on next getInstance
+    DatabricksDriverFeatureFlagsContext newContext =
         DatabricksDriverFeatureFlagsContextFactory.getInstance(connectionContext1);
-    assertSame(context1, context4);
-
-    // Remove remaining references
-    DatabricksDriverFeatureFlagsContextFactory.removeInstance(connectionContext1);
+    // New context was created because the old one was shut down
+    // (context1 was already shut down by the single removeInstance above)
+    assertNotNull(newContext);
     DatabricksDriverFeatureFlagsContextFactory.removeInstance(connectionContext1);
   }
 
+  @Test
+  void testMultipleConnectionsRequireAllToClose() {
+    IDatabricksConnectionContext connA =
+        mock(IDatabricksConnectionContext.class, withSettings().lenient());
+    IDatabricksConnectionContext connB =
+        mock(IDatabricksConnectionContext.class, withSettings().lenient());
+
+    lenient().when(connA.getHost()).thenReturn("shared.databricks.com");
+    lenient().when(connB.getHost()).thenReturn("shared.databricks.com");
+    lenient().when(connA.getHostForOAuth()).thenReturn("shared.databricks.com");
+    lenient().when(connB.getHostForOAuth()).thenReturn("shared.databricks.com");
+    lenient().when(connA.getConnectionUuid()).thenReturn("uuid-connA");
+    lenient().when(connB.getConnectionUuid()).thenReturn("uuid-connB");
+
+    // Both connections share the same context
+    DatabricksDriverFeatureFlagsContext ctxA =
+        DatabricksDriverFeatureFlagsContextFactory.getInstance(connA);
+    DatabricksDriverFeatureFlagsContext ctxB =
+        DatabricksDriverFeatureFlagsContextFactory.getInstance(connB);
+    assertSame(ctxA, ctxB);
+
+    // Closing one connection keeps the context alive
+    DatabricksDriverFeatureFlagsContextFactory.removeInstance(connA);
+    DatabricksDriverFeatureFlagsContext ctxStillAlive =
+        DatabricksDriverFeatureFlagsContextFactory.getInstance(connB);
+    assertSame(ctxA, ctxStillAlive);
+
+    // Closing the last connection removes the context
+    DatabricksDriverFeatureFlagsContextFactory.removeInstance(connB);
+    DatabricksDriverFeatureFlagsContextFactory.removeInstance(connB); // second remove is no-op
+  }
+
+  @Test
+  void testConnectionContextUpdatedWhenOwnerConnectionCloses() {
+    // This tests the fix for the HTTP client leak:
+    // when the connection whose context is stored in the feature flags context closes,
+    // the stored context should be updated to another active connection.
+    IDatabricksConnectionContext connA =
+        mock(IDatabricksConnectionContext.class, withSettings().lenient());
+    IDatabricksConnectionContext connB =
+        mock(IDatabricksConnectionContext.class, withSettings().lenient());
+
+    lenient().when(connA.getHost()).thenReturn("shared2.databricks.com");
+    lenient().when(connB.getHost()).thenReturn("shared2.databricks.com");
+    lenient().when(connA.getHostForOAuth()).thenReturn("shared2.databricks.com");
+    lenient().when(connB.getHostForOAuth()).thenReturn("shared2.databricks.com");
+    lenient().when(connA.getConnectionUuid()).thenReturn("uuid-ownerA");
+    lenient().when(connB.getConnectionUuid()).thenReturn("uuid-secondB");
+
+    // connA creates the context (becomes the "owner")
+    DatabricksDriverFeatureFlagsContext ctx =
+        DatabricksDriverFeatureFlagsContextFactory.getInstance(connA);
+    DatabricksDriverFeatureFlagsContextFactory.getInstance(connB);
+
+    assertEquals("uuid-ownerA", ctx.getConnectionContext().getConnectionUuid());
+
+    // When connA closes, the stored context should be updated to connB
+    DatabricksDriverFeatureFlagsContextFactory.removeInstance(connA);
+    assertEquals("uuid-secondB", ctx.getConnectionContext().getConnectionUuid());
+
+    // Clean up
+    DatabricksDriverFeatureFlagsContextFactory.removeInstance(connB);
+  }
+
   @Test
   void testRemoveInstanceWithNullContext() {
     // Should not throw exception
@@ -108,33 +170,40 @@ void testRemoveInstanceWithNullContext() {
 
   @Test
   void testContextPersistsUntilLastRemoval() {
-    // Create multiple references
-    DatabricksDriverFeatureFlagsContext context1 =
-        DatabricksDriverFeatureFlagsContextFactory.getInstance(connectionContext1);
-    DatabricksDriverFeatureFlagsContext context2 =
-        DatabricksDriverFeatureFlagsContextFactory.getInstance(connectionContext1);
+    // Create two distinct connections to the same host
+    IDatabricksConnectionContext connX =
+        mock(IDatabricksConnectionContext.class, withSettings().lenient());
+    IDatabricksConnectionContext connY =
+        mock(IDatabricksConnectionContext.class, withSettings().lenient());
+    lenient().when(connX.getHost()).thenReturn("host1.databricks.com");
+    lenient().when(connY.getHost()).thenReturn("host1.databricks.com");
+    lenient().when(connX.getHostForOAuth()).thenReturn("host1.databricks.com");
+    lenient().when(connY.getHostForOAuth()).thenReturn("host1.databricks.com");
+    lenient().when(connX.getConnectionUuid()).thenReturn("uuid-X");
+    lenient().when(connY.getConnectionUuid()).thenReturn("uuid-Y");
 
     // Set a feature flag
     Map<String, String> flags = new HashMap<>();
     flags.put("test.flag", "true");
-    DatabricksDriverFeatureFlagsContextFactory.setFeatureFlagsContext(connectionContext1, flags);
+    DatabricksDriverFeatureFlagsContextFactory.setFeatureFlagsContext(connX, flags);
 
-    // Get the context again - should have the flag
-    DatabricksDriverFeatureFlagsContext context3 =
-        DatabricksDriverFeatureFlagsContextFactory.getInstance(connectionContext1);
-    assertTrue(context3.isFeatureEnabled("test.flag"));
+    DatabricksDriverFeatureFlagsContextFactory.getInstance(connX);
+    DatabricksDriverFeatureFlagsContextFactory.getInstance(connY);
 
-    // Remove one reference - context should still exist
-    DatabricksDriverFeatureFlagsContextFactory.removeInstance(connectionContext1);
+    DatabricksDriverFeatureFlagsContext context =
+        DatabricksDriverFeatureFlagsContextFactory.getInstance(connX);
+    assertTrue(context.isFeatureEnabled("test.flag"));
 
-    // Get again - should still have the flag
-    DatabricksDriverFeatureFlagsContext context4 =
-        DatabricksDriverFeatureFlagsContextFactory.getInstance(connectionContext1);
-    assertTrue(context4.isFeatureEnabled("test.flag"));
+    // Remove connX — context should still exist because connY is still open
+    DatabricksDriverFeatureFlagsContextFactory.removeInstance(connX);
 
-    // Clean up remaining references
-    DatabricksDriverFeatureFlagsContextFactory.removeInstance(connectionContext1);
-    DatabricksDriverFeatureFlagsContextFactory.removeInstance(connectionContext1);
+    DatabricksDriverFeatureFlagsContext contextAfterX =
+        DatabricksDriverFeatureFlagsContextFactory.getInstance(connY);
+    assertTrue(contextAfterX.isFeatureEnabled("test.flag"));
+
+    // Clean up
+    DatabricksDriverFeatureFlagsContextFactory.removeInstance(connY);
+    DatabricksDriverFeatureFlagsContextFactory.removeInstance(connY);
   }
 
   @Test
@@ -167,6 +236,8 @@ void testMultipleConnectionsToSameWorkspaceShareFlags() {
     lenient().when(conn2.getHost()).thenReturn("host.databricks.com");
     lenient().when(conn1.getHostForOAuth()).thenReturn("host.databricks.com");
     lenient().when(conn2.getHostForOAuth()).thenReturn("host.databricks.com");
+    lenient().when(conn1.getConnectionUuid()).thenReturn("uuid-conn1");
+    lenient().when(conn2.getConnectionUuid()).thenReturn("uuid-conn2");
 
     // Set flags for first connection
     Map<String, String> flags = new HashMap<>();
```
