
Fix flaky test: switch work queue tests to in-process client #21042

Draft
devin-ai-integration[bot] wants to merge 2 commits into main from
devin/1773078422-fix-flaky-work-queue-sqlite-timeout

Conversation


@devin-ai-integration devin-ai-integration bot commented Mar 9, 2026

Fixes flaky test_get_runs_in_queue_concurrency_limit_and_limit[1] which intermittently fails with httpx.ReadError in the Orchestration Tests suite on SQLite (failed run).

Root Cause

The hosted_api_client fixture runs the API server in a subprocess. Under parallel test execution (pytest-xdist), this creates cross-process SQLite contention — the test process and the server subprocess both access the same worker-scoped database file. When the server subprocess hits a lock timeout mid-request, it drops the connection, surfacing as httpx.ReadError in the test client.
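The failure mode described above can be reproduced in miniature with Python's stdlib sqlite3 module. This is a simplified sketch, not Prefect's code: two connections to one database file stand in for the test process and the server subprocess, and a near-zero busy timeout stands in for the server giving up mid-request.

```python
import os
import sqlite3
import tempfile

# Two connections to one SQLite file, standing in for the test process and
# the hosted API subprocess sharing a worker-scoped database (simplified).
path = os.path.join(tempfile.mkdtemp(), "contention.db")

writer = sqlite3.connect(path, isolation_level=None)  # autocommit mode
writer.execute("CREATE TABLE runs (id INTEGER)")
writer.execute("BEGIN IMMEDIATE")  # take the write lock and hold it
writer.execute("INSERT INTO runs VALUES (1)")

# A second connection with a ~0s busy timeout fails instead of waiting,
# analogous to the server subprocess dropping the connection under load.
reader = sqlite3.connect(path, timeout=0)
try:
    reader.execute("INSERT INTO runs VALUES (2)")
    outcome = "ok"
except sqlite3.OperationalError as exc:
    outcome = str(exc)

writer.execute("COMMIT")  # release the lock; a retry now succeeds
reader.execute("INSERT INTO runs VALUES (2)")
reader.commit()
print(outcome)  # "database is locked"
```

With the lock held, the second writer surfaces an OperationalError immediately; in the real suite that error crosses a process boundary and reaches the test as httpx.ReadError.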

Changes

Switches 8 tests in TestGetRunsInWorkQueue from hosted_api_client (subprocess-based) to ephemeral_client_with_lifespan (in-process ASGI transport). This eliminates cross-process SQLite contention entirely by running the API server in the same process as the test.

Tests switched to ephemeral_client_with_lifespan — these perform simple request/response without relying on Docket background tasks:

  • test_get_runs_in_queue
  • test_get_runs_in_queue_limit
  • test_get_runs_in_queue_scheduled_before
  • test_get_runs_in_queue_nonexistant
  • test_get_runs_in_queue_paused
  • test_get_runs_in_queue_concurrency_limit
  • test_get_runs_in_queue_concurrency_limit_and_limit ← the flaky test
  • test_read_work_queue_runs_does_not_update_a_paused_work_queues_status

Tests kept on hosted_api_client — these use retry_asserts to wait for Docket background tasks (e.g. mark_deployments_ready) and need the hosted server:

  • test_read_work_queue_runs_updates_work_queue_last_polled_time
  • test_read_work_queue_runs_associated_deployments_return_status_of_ready
  • test_read_work_queue_runs_updates_work_queue_status

No production code changes.

Reviewer Checklist

  • Verify the 8 switched tests don't rely on subprocess-specific behavior or Docket background task processing
  • Confirm the 3 tests kept on hosted_api_client are the correct ones to retain (they all use retry_asserts)

Checklist

  • This pull request references any related issue by including "closes <link to issue>"
    • If no issue exists and your change is not a small fix, please create an issue first.
  • If this pull request adds new functionality, it includes unit tests that cover the changes
  • If this pull request removes docs files, it includes redirect settings in mint.json.
  • If this pull request adds functions or classes, it includes helpful docstrings.

Link to Devin session: https://app.devin.ai/sessions/fa3a47b1b4ad4e4a84adc5e97904dc4a
Requested by: bot_apk (apk@cognition.ai)

The test_get_runs_in_queue_concurrency_limit_and_limit test was
intermittently failing with httpx.ReadError because the hosted API
server subprocess encountered SQLite lock contention during parallel
test execution (pytest-xdist). The 30s busy_timeout in test mode
was insufficient under heavy parallel load, causing the server to
fail and close the connection.

Remove the test-specific timeout differentiation and use the same
60s timeout for all modes. This gives SQLite enough time to handle
lock contention during parallel test execution without impacting
test performance, since the timeout is only reached under genuine
contention scenarios.

Co-authored-by: bot_apk <apk@cognition.ai>
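For reference, the busy_timeout values discussed in this commit map to a SQLite pragma expressed in milliseconds. A minimal sketch using stdlib sqlite3 (Prefect itself applies the setting through its SQLAlchemy engine configuration, not like this):

```python
import sqlite3

# busy_timeout controls how long SQLite waits for a lock before returning
# SQLITE_BUSY; the value is in milliseconds, so 60s becomes 60000.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA busy_timeout = 60000")  # wait up to 60s on a locked db
value = conn.execute("PRAGMA busy_timeout").fetchone()[0]
print(value)  # 60000
```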
@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@devin-ai-integration devin-ai-integration bot added the development Tech debt, refactors, CI, tests, and other related work. label Mar 9, 2026
@codspeed-hq

codspeed-hq bot commented Mar 9, 2026

Merging this PR will not alter performance

✅ 2 untouched benchmarks


Comparing devin/1773078422-fix-flaky-work-queue-sqlite-timeout (cb318d0) with main (b265679)

Open in CodSpeed

@desertaxle
Member

Is there anything else we can do to reduce contention on SQLite while running tests?

@devin-ai-integration
Contributor Author

Good question. Here are some additional approaches that could further reduce SQLite contention during tests, roughly ordered by impact/feasibility:

1. Use per-worker SQLite databases with pytest-xdist
Each xdist worker could get its own SQLite database file instead of sharing one. This would eliminate cross-process contention entirely. The test_database_connection_url fixture could append the worker ID (e.g., prefect_test_gw0.db, prefect_test_gw1.db). The hosted API server would also need its own dedicated database.

2. Use WAL2 journal mode (SQLite 3.37+)
SQLite's WAL2 mode allows concurrent writers in some scenarios where WAL does not. However, this requires a fairly recent SQLite version and isn't widely supported by Python's sqlite3 module yet.

3. Reduce write transaction scope
The TestGetRunsInWorkQueue tests use hosted_api_client for every test, including simple reads. Tests that only read data (like test_get_runs_in_queue, test_get_runs_in_queue_limit) could use the in-process client fixture instead, reducing the number of cross-process SQLite connections competing for locks. The hosted_api_client could be reserved for tests that specifically need to test the hosted server behavior (e.g., deployment status updates via Docket background tasks).

4. Connection pooling tuning
For test mode, the SQLite engine doesn't explicitly set pool_size or max_overflow (except for in-memory DBs). Limiting the pool size in test mode could reduce the number of concurrent connections competing for the write lock.

That said, the 60s timeout in this PR aligns test mode with production and should handle the current contention levels. The per-worker database approach (option 1) would be the most impactful structural change if contention continues to be an issue.
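The per-worker idea in option 1 can be sketched as a small helper. The function name is hypothetical; the real logic would live in the test_database_connection_url fixture in tests/conftest.py:

```python
from pathlib import Path

# Sketch of per-worker database isolation for pytest-xdist (helper name
# hypothetical; Prefect's actual fixture wiring may differ).
def worker_database_url(prefect_home: str, worker_id: str) -> str:
    """Return an isolated SQLite URL for one pytest-xdist worker.

    worker_id is "gw0", "gw1", ... under xdist, or "master" without it.
    """
    db_path = Path(prefect_home) / f"prefect_{worker_id}.db"
    return f"sqlite+aiosqlite:///{db_path}"

print(worker_database_url("/tmp/prefect", "gw0"))
```

Because each worker writes to its own file, workers never compete for the same SQLite write lock.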

@desertaxle
Member

Can you implement that first option instead of changing the timeout in the tests?

@devin-ai-integration
Contributor Author

After investigating the test infrastructure, I found that per-worker SQLite databases are already implemented. The generate_test_database_connection_url fixture in tests/conftest.py (lines 495-500) already creates isolated database files per xdist worker:

if scheme == "sqlite+aiosqlite":
    test_db_path = Path(PREFECT_HOME.value()) / f"prefect_{worker_id}.db"
    yield f"sqlite+aiosqlite:///{test_db_path}"
    return

And the hosted_api_server fixture depends on test_database_connection_url and passes the per-worker URL to the subprocess via get_current_settings().to_environment_variables().

So the contention isn't cross-worker — it's within a single worker, between two processes that must share the same database:

  1. The test process (writing fixture data like work queues, flow runs)
  2. The hosted API server subprocess (serving HTTP requests + running Docket background tasks)

Both processes need to access the same SQLite file because the tests set up data directly and then query it through the hosted API. This is an inherent limitation of SQLite's single-writer model when used across processes.

Given this, the realistic options to reduce contention are:

  1. Increase the busy_timeout (this PR) — gives SQLite more time to handle within-worker lock contention
  2. Switch affected tests from hosted_api_client to ephemeral_client_with_lifespan — eliminates the subprocess entirely, running everything in-process where SQLite handles concurrency better
  3. Reduce the scope of tests that use hosted_api_client — only use it for tests that specifically need the hosted server behavior (e.g., Docket background task tests)

Would you like me to pursue option 2 or 3 instead of the timeout increase? Or is there a different approach you had in mind?
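Conceptually, option 2 works by driving the ASGI application directly instead of over a socket to a subprocess. The sketch below uses a hypothetical stand-in app and a hand-rolled caller to show the mechanism; the real fixture wires Prefect's FastAPI app through an in-process ASGI transport:

```python
import asyncio

# Hypothetical stand-in for the API app; in the real suite this is Prefect's
# FastAPI application served through an in-process ASGI transport.
async def app(scope, receive, send):
    assert scope["type"] == "http"
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b'{"runs": []}'})

# Drive the app directly: "client" and "server" share one process, so there
# is no second OS process competing for the SQLite write lock.
async def call(app, method, path):
    messages = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        messages.append(message)

    scope = {"type": "http", "method": method, "path": path, "headers": []}
    await app(scope, receive, send)
    return messages[0]["status"], messages[1]["body"]

status, body = asyncio.run(call(app, "POST", "/work_queues/filter"))
print(status, body.decode())
```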

@desertaxle
Member

Yeah, look into option 2 as a possible solution

…Lite contention

Revert the busy_timeout changes and instead switch tests in
TestGetRunsInWorkQueue that don't rely on Docket background tasks from
hosted_api_client to ephemeral_client_with_lifespan. This eliminates
cross-process SQLite contention by running the API server in-process
via ASGI transport instead of a subprocess.

Tests that use retry_asserts (waiting for background tasks) remain on
hosted_api_client since they need the hosted server behavior.

Co-authored-by: bot_apk <apk@cognition.ai>
@devin-ai-integration devin-ai-integration bot changed the title from "Fix flaky test: unify SQLite busy_timeout to 60s for all modes" to "Fix flaky test: switch work queue tests to in-process client" Mar 9, 2026
@devin-ai-integration
Contributor Author

Updated the PR per reviewer feedback — reverted the timeout changes and switched 8 tests in TestGetRunsInWorkQueue from hosted_api_client to ephemeral_client_with_lifespan.

The tests that don't rely on Docket background tasks now run in-process via ASGI transport, eliminating the cross-process SQLite contention that caused the flaky httpx.ReadError. The 3 tests that use retry_asserts (waiting for background tasks like mark_deployments_ready) remain on hosted_api_client.

All 53 CI checks passed, and the flaky test passed 20/20 locally.
