Fix flaky test: switch work queue tests to in-process client #21042
devin-ai-integration[bot] wants to merge 2 commits into main
Conversation
The `test_get_runs_in_queue_concurrency_limit_and_limit` test was intermittently failing with `httpx.ReadError` because the hosted API server subprocess encountered SQLite lock contention during parallel test execution (pytest-xdist). The 30s `busy_timeout` in test mode was insufficient under heavy parallel load, causing the server to fail and close the connection. Remove the test-specific timeout differentiation and use the same 60s timeout for all modes. This gives SQLite enough time to handle lock contention during parallel test execution without impacting test performance, since the timeout is only reached under genuine contention. Co-authored-by: bot_apk <apk@cognition.ai>
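For context, `busy_timeout` controls how long SQLite waits for a competing lock before giving up. A standalone sketch using Python's stdlib `sqlite3` (not Prefect's actual connection setup) shows the failure mode a too-short timeout produces: with the timeout at zero, a second writer fails immediately with "database is locked" instead of waiting for the lock to clear.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

writer = sqlite3.connect(path)
writer.execute("CREATE TABLE t (x)")
writer.execute("BEGIN IMMEDIATE")  # take the write lock and hold it

reader = sqlite3.connect(path)
reader.execute("PRAGMA busy_timeout = 0")  # fail fast instead of waiting
try:
    reader.execute("BEGIN IMMEDIATE")  # competes for the same write lock
    outcome = "acquired"
except sqlite3.OperationalError:
    # with a longer busy_timeout, SQLite would retry until the lock clears
    outcome = "database is locked"

writer.rollback()
print(outcome)  # database is locked
```

A longer timeout doesn't slow the happy path; it only changes how long a blocked connection keeps retrying before surfacing this error.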
Is there anything else we can do to reduce contention on SQLite while running tests?
Good question. Here are some additional approaches that could further reduce SQLite contention during tests, roughly ordered by impact/feasibility:

1. Use per-worker SQLite databases with pytest-xdist
2. Use
3. Reduce write transaction scope
4. Connection pooling tuning

That said, the 60s timeout in this PR aligns test mode with production and should handle the current contention levels. The per-worker database approach (option 1) would be the most impactful structural change if contention continues to be an issue.
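One SQLite-level setting often suggested for this kind of test contention (hedged: the thread doesn't say whether Prefect already enables it) is WAL journal mode, which lets readers proceed concurrently while a single writer holds the lock. A minimal stdlib sketch:

```python
import os
import sqlite3
import tempfile

# WAL is a property of the database file, so use a real file, not :memory:
path = os.path.join(tempfile.mkdtemp(), "wal_demo.db")
conn = sqlite3.connect(path)

# switch the journal to write-ahead logging; the PRAGMA returns the new mode
mode = conn.execute("PRAGMA journal_mode = WAL").fetchone()[0]
print(mode)  # wal

# a generous busy_timeout still helps for writer-vs-writer contention
conn.execute("PRAGMA busy_timeout = 60000")
conn.close()
```

WAL helps reader-vs-writer contention; it does not remove the single-writer limit, so writer-vs-writer collisions still rely on the busy timeout.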
Can you implement that first option instead of changing the timeout in the tests? |
After investigating the test infrastructure, I found that per-worker SQLite databases are already implemented. The worker-scoped database fixture contains:

```python
if scheme == "sqlite+aiosqlite":
    test_db_path = Path(PREFECT_HOME.value()) / f"prefect_{worker_id}.db"
    yield f"sqlite+aiosqlite:///{test_db_path}"
    return
```

So the contention isn't cross-worker; it's within a single worker, between two processes that must share the same database: the test process and the hosted API server subprocess.
Both processes need to access the same SQLite file because the tests set up data directly and then query it through the hosted API. This is an inherent limitation of SQLite's single-writer model when used across processes. Given this, the realistic options to reduce contention are:
Would you like me to pursue option 2 or 3 instead of the timeout increase? Or is there a different approach you had in mind?
Yeah, look into option 2 as a possible solution |
Revert the `busy_timeout` changes and instead switch tests in `TestGetRunsInWorkQueue` that don't rely on Docket background tasks from `hosted_api_client` to `ephemeral_client_with_lifespan`. This eliminates cross-process SQLite contention by running the API server in-process via ASGI transport instead of a subprocess. Tests that use `retry_asserts` (waiting for background tasks) remain on `hosted_api_client` since they need the hosted server behavior. Co-authored-by: bot_apk <apk@cognition.ai>
Updated the PR per reviewer feedback: reverted the timeout changes and switched 8 tests in `TestGetRunsInWorkQueue` from `hosted_api_client` to `ephemeral_client_with_lifespan`. The tests that don't rely on Docket background tasks now run in-process via ASGI transport, eliminating the cross-process SQLite contention that caused the flaky `httpx.ReadError` failures. All 53 CI checks passed, and the flaky test passed 20/20 locally.
Fixes flaky `test_get_runs_in_queue_concurrency_limit_and_limit[1]`, which intermittently fails with `httpx.ReadError` in the Orchestration Tests suite on SQLite (failed run).

Root Cause
The `hosted_api_client` fixture runs the API server in a subprocess. Under parallel test execution (pytest-xdist), this creates cross-process SQLite contention: the test process and the server subprocess both access the same worker-scoped database file. When the server subprocess hits a lock timeout mid-request, it drops the connection, surfacing as `httpx.ReadError` in the test client.

Changes
Switches 8 tests in `TestGetRunsInWorkQueue` from `hosted_api_client` (subprocess-based) to `ephemeral_client_with_lifespan` (in-process ASGI transport). This eliminates cross-process SQLite contention entirely by running the API server in the same process as the test.

Tests switched to `ephemeral_client_with_lifespan` — these perform simple request/response without relying on Docket background tasks:

- `test_get_runs_in_queue`
- `test_get_runs_in_queue_limit`
- `test_get_runs_in_queue_scheduled_before`
- `test_get_runs_in_queue_nonexistant`
- `test_get_runs_in_queue_paused`
- `test_get_runs_in_queue_concurrency_limit`
- `test_get_runs_in_queue_concurrency_limit_and_limit` ← the flaky test
- `test_read_work_queue_runs_does_not_update_a_paused_work_queues_status`

Tests kept on `hosted_api_client` — these use `retry_asserts` to wait for Docket background tasks (e.g. `mark_deployments_ready`) and need the hosted server:

- `test_read_work_queue_runs_updates_work_queue_last_polled_time`
- `test_read_work_queue_runs_associated_deployments_return_status_of_ready`
- `test_read_work_queue_runs_updates_work_queue_status`

No production code changes.
Reviewer Checklist
- Tests kept on `hosted_api_client` are the correct ones to retain (they all use `retry_asserts`)

Link to Devin session: https://app.devin.ai/sessions/fa3a47b1b4ad4e4a84adc5e97904dc4a
Requested by: bot_apk (apk@cognition.ai)