
Fix flaky test: switch work queue tests to in-process client #21042

Draft
devin-ai-integration[bot] wants to merge 2 commits into main from
devin/1773078422-fix-flaky-work-queue-sqlite-timeout

Conversation


@devin-ai-integration devin-ai-integration bot commented Mar 9, 2026

Fixes flaky test_get_runs_in_queue_concurrency_limit_and_limit[1] which intermittently fails with httpx.ReadError in the Orchestration Tests suite on SQLite (failed run).

Root Cause

The hosted_api_client fixture runs the API server in a subprocess. Under parallel test execution (pytest-xdist), this creates cross-process SQLite contention — the test process and the server subprocess both access the same worker-scoped database file. When the server subprocess hits a lock timeout mid-request, it drops the connection, surfacing as httpx.ReadError in the test client.
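The failure mode described above can be reproduced in miniature with Python's stdlib sqlite3 module. This is a simplified sketch, not Prefect's code: two connections to one database file stand in for the test process and the server subprocess, and a near-zero busy timeout stands in for the server giving up mid-request.

```python
import os
import sqlite3
import tempfile

# Two connections to one SQLite file, standing in for the test process and
# the hosted API subprocess sharing a worker-scoped database (simplified).
path = os.path.join(tempfile.mkdtemp(), "contention.db")

writer = sqlite3.connect(path, isolation_level=None)  # autocommit mode
writer.execute("CREATE TABLE runs (id INTEGER)")
writer.execute("BEGIN IMMEDIATE")  # take the write lock and hold it
writer.execute("INSERT INTO runs VALUES (1)")

# A second connection with a ~0s busy timeout fails instead of waiting,
# analogous to the server subprocess dropping the connection under load.
reader = sqlite3.connect(path, timeout=0)
try:
    reader.execute("INSERT INTO runs VALUES (2)")
    outcome = "ok"
except sqlite3.OperationalError as exc:
    outcome = str(exc)

writer.execute("COMMIT")  # release the lock; a retry now succeeds
reader.execute("INSERT INTO runs VALUES (2)")
reader.commit()
print(outcome)  # "database is locked"
```

With the lock held, the second writer surfaces an OperationalError immediately; in the real suite that error crosses a process boundary and reaches the test as httpx.ReadError.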

Changes

Switches 8 tests in TestGetRunsInWorkQueue from hosted_api_client (subprocess-based) to ephemeral_client_with_lifespan (in-process ASGI transport). This eliminates cross-process SQLite contention entirely by running the API server in the same process as the test.

Tests switched to ephemeral_client_with_lifespan — these perform simple request/response without relying on Docket background tasks:

  • test_get_runs_in_queue
  • test_get_runs_in_queue_limit
  • test_get_runs_in_queue_scheduled_before
  • test_get_runs_in_queue_nonexistant
  • test_get_runs_in_queue_paused
  • test_get_runs_in_queue_concurrency_limit
  • test_get_runs_in_queue_concurrency_limit_and_limit ← the flaky test
  • test_read_work_queue_runs_does_not_update_a_paused_work_queues_status

Tests kept on hosted_api_client — these use retry_asserts to wait for Docket background tasks (e.g. mark_deployments_ready) and need the hosted server:

  • test_read_work_queue_runs_updates_work_queue_last_polled_time
  • test_read_work_queue_runs_associated_deployments_return_status_of_ready
  • test_read_work_queue_runs_updates_work_queue_status

No production code changes.

Reviewer Checklist

  • Verify the 8 switched tests don't rely on subprocess-specific behavior or Docket background task processing
  • Confirm the 3 tests kept on hosted_api_client are the correct ones to retain (they all use retry_asserts)

Checklist

  • This pull request references any related issue by including "closes <link to issue>"
    • If no issue exists and your change is not a small fix, please create an issue first.
  • If this pull request adds new functionality, it includes unit tests that cover the changes
  • If this pull request removes docs files, it includes redirect settings in mint.json.
  • If this pull request adds functions or classes, it includes helpful docstrings.

Link to Devin session: https://app.devin.ai/sessions/fa3a47b1b4ad4e4a84adc5e97904dc4a
Requested by: bot_apk (apk@cognition.ai)

The test_get_runs_in_queue_concurrency_limit_and_limit test was
intermittently failing with httpx.ReadError because the hosted API
server subprocess encountered SQLite lock contention during parallel
test execution (pytest-xdist). The 30s busy_timeout in test mode
was insufficient under heavy parallel load, causing the server to
fail and close the connection.

Remove the test-specific timeout differentiation and use the same
60s timeout for all modes. This gives SQLite enough time to handle
lock contention during parallel test execution without impacting
test performance, since the timeout is only reached under genuine
contention scenarios.

Co-authored-by: bot_apk <apk@cognition.ai>
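For reference, the busy_timeout values discussed in this commit map to a SQLite pragma expressed in milliseconds. A minimal sketch using stdlib sqlite3 (Prefect itself applies the setting through its SQLAlchemy engine configuration, not like this):

```python
import sqlite3

# busy_timeout controls how long SQLite waits for a lock before returning
# SQLITE_BUSY; the value is in milliseconds, so 60s becomes 60000.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA busy_timeout = 60000")  # wait up to 60s on a locked db
value = conn.execute("PRAGMA busy_timeout").fetchone()[0]
print(value)  # 60000
```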
@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@devin-ai-integration devin-ai-integration bot added the development Tech debt, refactors, CI, tests, and other related work. label Mar 9, 2026
@codspeed-hq

codspeed-hq bot commented Mar 9, 2026

Merging this PR will not alter performance

✅ 2 untouched benchmarks


Comparing devin/1773078422-fix-flaky-work-queue-sqlite-timeout (cb318d0) with main (b265679)

Open in CodSpeed

@desertaxle
Member

Is there anything else we can do to reduce contention on SQLite while running tests?

@devin-ai-integration
Contributor Author

Good question. Here are some additional approaches that could further reduce SQLite contention during tests, roughly ordered by impact/feasibility:

1. Use per-worker SQLite databases with pytest-xdist
Each xdist worker could get its own SQLite database file instead of sharing one. This would eliminate cross-process contention entirely. The test_database_connection_url fixture could append the worker ID (e.g., prefect_test_gw0.db, prefect_test_gw1.db). The hosted API server would also need its own dedicated database.

2. Use WAL2 journal mode (SQLite 3.37+)
SQLite's WAL2 mode allows concurrent writers in some scenarios where WAL does not. However, this requires a fairly recent SQLite version and isn't widely supported by Python's sqlite3 module yet.

3. Reduce write transaction scope
The TestGetRunsInWorkQueue tests use hosted_api_client for every test, including simple reads. Tests that only read data (like test_get_runs_in_queue, test_get_runs_in_queue_limit) could use the in-process client fixture instead, reducing the number of cross-process SQLite connections competing for locks. The hosted_api_client could be reserved for tests that specifically need to test the hosted server behavior (e.g., deployment status updates via Docket background tasks).

4. Connection pooling tuning
For test mode, the SQLite engine doesn't explicitly set pool_size or max_overflow (except for in-memory DBs). Limiting the pool size in test mode could reduce the number of concurrent connections competing for the write lock.

That said, the 60s timeout in this PR aligns test mode with production and should handle the current contention levels. The per-worker database approach (option 1) would be the most impactful structural change if contention continues to be an issue.
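The per-worker idea in option 1 can be sketched as a small helper. The function name is hypothetical; the real logic would live in the test_database_connection_url fixture in tests/conftest.py:

```python
from pathlib import Path

# Sketch of per-worker database isolation for pytest-xdist (helper name
# hypothetical; Prefect's actual fixture wiring may differ).
def worker_database_url(prefect_home: str, worker_id: str) -> str:
    """Return an isolated SQLite URL for one pytest-xdist worker.

    worker_id is "gw0", "gw1", ... under xdist, or "master" without it.
    """
    db_path = Path(prefect_home) / f"prefect_{worker_id}.db"
    return f"sqlite+aiosqlite:///{db_path}"

print(worker_database_url("/tmp/prefect", "gw0"))
```

Because each worker writes to its own file, workers never compete for the same SQLite write lock.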

@desertaxle
Member

Can you implement that first option instead of changing the timeout in the tests?

@devin-ai-integration
Contributor Author

After investigating the test infrastructure, I found that per-worker SQLite databases are already implemented. The generate_test_database_connection_url fixture in tests/conftest.py (lines 495-500) already creates isolated database files per xdist worker:

if scheme == "sqlite+aiosqlite":
    test_db_path = Path(PREFECT_HOME.value()) / f"prefect_{worker_id}.db"
    yield f"sqlite+aiosqlite:///{test_db_path}"
    return

And the hosted_api_server fixture depends on test_database_connection_url and passes the per-worker URL to the subprocess via get_current_settings().to_environment_variables().

So the contention isn't cross-worker — it's within a single worker, between two processes that must share the same database:

  1. The test process (writing fixture data like work queues, flow runs)
  2. The hosted API server subprocess (serving HTTP requests + running Docket background tasks)

Both processes need to access the same SQLite file because the tests set up data directly and then query it through the hosted API. This is an inherent limitation of SQLite's single-writer model when used across processes.

Given this, the realistic options to reduce contention are:

  1. Increase the busy_timeout (this PR) — gives SQLite more time to handle within-worker lock contention
  2. Switch affected tests from hosted_api_client to ephemeral_client_with_lifespan — eliminates the subprocess entirely, running everything in-process where SQLite handles concurrency better
  3. Reduce the scope of tests that use hosted_api_client — only use it for tests that specifically need the hosted server behavior (e.g., Docket background task tests)

Would you like me to pursue option 2 or 3 instead of the timeout increase? Or is there a different approach you had in mind?
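Conceptually, option 2 works by driving the ASGI application directly instead of over a socket to a subprocess. The sketch below uses a hypothetical stand-in app and a hand-rolled caller to show the mechanism; the real fixture wires Prefect's FastAPI app through an in-process ASGI transport:

```python
import asyncio

# Hypothetical stand-in for the API app; in the real suite this is Prefect's
# FastAPI application served through an in-process ASGI transport.
async def app(scope, receive, send):
    assert scope["type"] == "http"
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b'{"runs": []}'})

# Drive the app directly: "client" and "server" share one process, so there
# is no second OS process competing for the SQLite write lock.
async def call(app, method, path):
    messages = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        messages.append(message)

    scope = {"type": "http", "method": method, "path": path, "headers": []}
    await app(scope, receive, send)
    return messages[0]["status"], messages[1]["body"]

status, body = asyncio.run(call(app, "POST", "/work_queues/filter"))
print(status, body.decode())
```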

@desertaxle
Member

Yeah, look into option 2 as a possible solution

…Lite contention

Revert the busy_timeout changes and instead switch tests in
TestGetRunsInWorkQueue that don't rely on Docket background tasks from
hosted_api_client to ephemeral_client_with_lifespan. This eliminates
cross-process SQLite contention by running the API server in-process
via ASGI transport instead of a subprocess.

Tests that use retry_asserts (waiting for background tasks) remain on
hosted_api_client since they need the hosted server behavior.

Co-authored-by: bot_apk <apk@cognition.ai>
@devin-ai-integration devin-ai-integration bot changed the title from "Fix flaky test: unify SQLite busy_timeout to 60s for all modes" to "Fix flaky test: switch work queue tests to in-process client" Mar 9, 2026
@devin-ai-integration
Contributor Author

Updated the PR per reviewer feedback — reverted the timeout changes and switched 8 tests in TestGetRunsInWorkQueue from hosted_api_client to ephemeral_client_with_lifespan.

The tests that don't rely on Docket background tasks now run in-process via ASGI transport, eliminating the cross-process SQLite contention that caused the flaky httpx.ReadError. The 3 tests that use retry_asserts (waiting for background tasks like mark_deployments_ready) remain on hosted_api_client.

All 53 CI checks passed, and the flaky test passed 20/20 locally.
