Commit 049e6cf
Fix ~75% flake rate in upgrade-mysql and upgrade-postgres pipelines
Both upgrade pipelines have been failing approximately 75% of the time
since at least 2026-03-11. Analysis of 41 failed builds across both
pipelines identified three failure categories, all stemming from BOSH
agent unresponsiveness after a director upgrade.
Examined builds since 2026-03-11:
- upgrade-mysql: 19 failures / 25 builds (76% failure rate)
- upgrade-postgres: 22 failures / 30 builds (73% failure rate)
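The headline figure can be re-derived from the per-pipeline counts above. The following is a minimal sketch; the helper name `failure_rate` is hypothetical and not part of this commit.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Rounded integer percentage: failure_rate <failures> <builds>
# (hypothetical helper, used only to check the numbers quoted above)
failure_rate() {
  local failures="$1" builds="$2"
  # Adding builds/2 before dividing rounds to the nearest whole percent.
  echo $(( (failures * 100 + builds / 2) / builds ))
}

echo "upgrade-mysql:    $(failure_rate 19 25)%"
echo "upgrade-postgres: $(failure_rate 22 30)%"
echo "combined:         $(failure_rate $((19 + 22)) $((25 + 30)))%"
```

The combined line uses 41 failures over 55 builds, which is where the ~75% in the subject line comes from.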
Category 1: agent timeouts. After the director is upgraded
(create-env), agents deployed by the old director become unresponsive
to the new director. The pipeline ran a blind `sleep 300` before
redeploying zookeeper, but agents had still not reconnected when the
deploy started.
- MySQL builds: `run_script` (pre-stop) times out after 45s during
the update of the 3rd-5th zookeeper instance. 2-4 instances update
successfully before one agent fails to respond.
- Postgres builds: `get_state` times out during `Preparing deployment`
before any instance updates begin. This suggests agents take longer
to reconnect when using the internal Postgres DB.
Category 2: disk type mismatch. The `pd-ssd` disk type is incompatible
with the `c4a-standard-1` machine type. This only affected builds
250-253 (MySQL) and 458-460 (Postgres) in mid-March and has already
been resolved externally, so no action is needed.
Category 3: teardown cascade. When agent timeouts cause the deploy to
fail, the ensure teardown's `delete-deployment` also times out on the
same unresponsive agents. The VMs remain attached to the subnetwork,
causing Terraform destroy to fail with `resourceInUseByAnotherResource`.
1. **Replace blind sleep with active agent health check**
(ci/tasks/wait-for-agents.{sh,yml}, ci/pipeline.yml)
New `wait-for-agents` task polls `bosh vms` every 10s for up to
600s, waiting until all zookeeper agents report `process_state:
running`. Replaces the blind `sleep 300` in both the upgrade-postgres
and upgrade-mysql pipeline jobs. Fails with diagnostic output if
agents have not reconnected within the timeout.
2. **Add retry logic to zookeeper deploy**
(ci/tasks/deploy-zookeeper.sh)
Wrap `bosh deploy --recreate` in a retry loop (3 attempts, 60s
delay between retries). On failure, logs current VM state for
diagnostics. Controlled by MAX_DEPLOY_ATTEMPTS and
DEPLOY_RETRY_DELAY env vars. Also adds DEPLOY_EXTRA_ARGS to
allow passing --skip-drain from the pipeline if needed.
3. **Use --force for teardown delete-deployment**
(ci/bats/tasks/destroy-director.sh)
Add `--force` flag to `delete-deployment` in the teardown script.
This skips drain and pre-stop lifecycle hooks during cleanup,
preventing the cascade failure where teardown times out on the
same unresponsive agents that caused the original deploy failure.
4. **Reduce zookeeper instances from 5 to 3**
(ci/tasks/deploy-zookeeper/zookeeper-manifest.yml)
3 instances is the smallest ZooKeeper ensemble that keeps quorum
after losing one node, and is sufficient to validate the upgrade
path. Fewer instances means a lower probability of hitting an
unresponsive agent, and faster update cycles.
Canaries reduced from 2 to 1 accordingly.
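Changes 1 and 2 above can be sketched as two small loops. This is an illustration under stated assumptions, not the contents of ci/tasks/wait-for-agents.sh or ci/tasks/deploy-zookeeper.sh: the function names, the POLL_INTERVAL and AGENT_TIMEOUT variables, the use of `jq`, and the deployment name `zookeeper` are all assumed here. Only MAX_DEPLOY_ATTEMPTS, DEPLOY_RETRY_DELAY, and the 10s/600s/3/60s values come from the commit message.

```shell
#!/usr/bin/env bash
set -euo pipefail

# POLL_INTERVAL and AGENT_TIMEOUT are assumed names; the values are from the message.
POLL_INTERVAL="${POLL_INTERVAL:-10}"
AGENT_TIMEOUT="${AGENT_TIMEOUT:-600}"
MAX_DEPLOY_ATTEMPTS="${MAX_DEPLOY_ATTEMPTS:-3}"
DEPLOY_RETRY_DELAY="${DEPLOY_RETRY_DELAY:-60}"

# Prints one process state per line for every VM in the deployment.
# Assumes jq is installed and the deployment is named "zookeeper".
list_agent_states() {
  bosh -d zookeeper vms --json | jq -r '.Tables[].Rows[].process_state'
}

# Succeeds only if at least one VM is listed and every state is "running".
all_agents_running() {
  local states
  states="$(list_agent_states)" || return 1
  [ -n "$states" ] || return 1
  # grep -vx matches any line that is not exactly "running".
  ! grep -qvx 'running' <<<"$states"
}

# Polls until all agents are running or AGENT_TIMEOUT seconds have elapsed.
wait_for_agents() {
  local waited=0
  until all_agents_running; do
    if [ "$waited" -ge "$AGENT_TIMEOUT" ]; then
      echo "agents not running after ${waited}s" >&2
      return 1
    fi
    sleep "$POLL_INTERVAL"
    waited=$((waited + POLL_INTERVAL))
  done
}

# Runs the given command up to MAX_DEPLOY_ATTEMPTS times.
deploy_with_retry() {
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$MAX_DEPLOY_ATTEMPTS" ]; then
      echo "deploy failed after ${attempt} attempts" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${DEPLOY_RETRY_DELAY}s" >&2
    sleep "$DEPLOY_RETRY_DELAY"
    attempt=$((attempt + 1))
  done
}
```

A pipeline task could then call something like `wait_for_agents && deploy_with_retry bosh -n -d zookeeper deploy --recreate manifest.yml`, where the manifest path is a placeholder. Keeping the state lookup in its own function lets the loops be exercised without a director.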
The root cause — why agents become unresponsive after a director
upgrade — remains uninvestigated. Likely candidates:
- NATS server restarts during create-env, and agents exhaust their
reconnect attempts (max_reconnect_attempts=4, reconnect_time_wait=2s
in nats_rpc.rb) before NATS comes back up
- NATS CA certificate rotation during upgrade invalidates existing
agent TLS certificates (chicken-and-egg: can't push new certs to
agents that can't connect)
- The health monitor's scan-and-fix process runs during the sleep
window and may interfere with agent state
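To make the first candidate concrete: if the two quoted settings govern reconnection the way that hypothesis assumes, the retry window is roughly attempts times wait. The sketch below only does that arithmetic. The function names are hypothetical, and it ignores per-attempt connection timeouts, so treat the result as a rough lower bound rather than a measured value.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Values quoted in the commit message (nats_rpc.rb).
max_reconnect_attempts=4
reconnect_time_wait=2   # seconds

# Approximate seconds of retrying before giving up: attempts * wait.
reconnect_budget() {
  echo $(( $1 * $2 ))
}

# Succeeds if a NATS outage of $1 seconds would outlast the budget.
outage_exceeds_budget() {
  local outage="$1" budget
  budget="$(reconnect_budget "$max_reconnect_attempts" "$reconnect_time_wait")"
  [ "$outage" -gt "$budget" ]
}

echo "reconnect budget: $(reconnect_budget "$max_reconnect_attempts" "$reconnect_time_wait")s"
```

With the quoted values this prints a budget of 8s, so any NATS restart during create-env that lasts longer than that would be enough to strand the agents under this hypothesis.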
Made-with: Cursor
1 parent 3f4620e commit 049e6cf
File tree
6 files changed, +107 -22 lines changed
- ci
  - bats/tasks
  - tasks
    - deploy-zookeeper