
Commit 049e6cf

Fix ~75% flake rate in upgrade-mysql and upgrade-postgres pipelines
Both upgrade pipelines have been failing roughly 75% of the time since at least 2026-03-11. Analysis of 41 failed builds across both pipelines identified three failure categories, all stemming from BOSH agent unresponsiveness after a director upgrade.

Builds examined since 2026-03-11:

- upgrade-mysql: 19 failures / 25 builds (76% failure rate)
- upgrade-postgres: 22 failures / 30 builds (73% failure rate)

**Failure category 1: agents unresponsive after director upgrade**

After the director is upgraded (create-env), agents deployed by the old director become unresponsive to the new director. The pipeline had a blind `sleep 300` before attempting to redeploy zookeeper, but agents were still not reconnected when the deploy started.

- MySQL builds: `run_script` (pre-stop) times out after 45s during the update of the 3rd-5th zookeeper instance; 2-4 instances update successfully before one agent fails to respond.
- Postgres builds: `get_state` times out during `Preparing deployment`, before any instance updates begin. This suggests agents take longer to reconnect when the director uses the internal Postgres DB.

**Failure category 2: CPI disk/machine-type mismatch (already resolved)**

The `pd-ssd` disk type is incompatible with the `c4a-standard-1` machine type. This only affected builds 250-253 (MySQL) and 458-460 (Postgres) in mid-March and was resolved externally; no action needed.

**Failure category 3: teardown cascade**

When agent timeouts cause the deploy to fail, the ensure teardown's `delete-deployment` also times out on the same unresponsive agents. VMs remain attached to the subnetwork, causing the Terraform destroy to fail with `resourceInUseByAnotherResource`.

**Changes**

1. **Replace blind sleep with active agent health check** (ci/tasks/wait-for-agents.{sh,yml}, ci/pipeline.yml)
   A new `wait-for-agents` task polls `bosh vms` every 10s for up to 600s, until all zookeeper agents report `process_state: running`. It replaces the blind `sleep 300` in both the upgrade-postgres and upgrade-mysql jobs, and fails fast with diagnostic output if agents never reconnect.

2. **Add retry logic to the zookeeper deploy** (ci/tasks/deploy-zookeeper.sh)
   Wrap `bosh deploy --recreate` in a retry loop (3 attempts, 60s delay between retries). On failure, the script logs current VM state for diagnostics. The loop is controlled by the MAX_DEPLOY_ATTEMPTS and DEPLOY_RETRY_DELAY env vars, and a new DEPLOY_EXTRA_ARGS var allows passing --skip-drain from the pipeline if needed.

3. **Use --force for the teardown delete-deployment** (ci/bats/tasks/destroy-director.sh)
   Add the `--force` flag to `delete-deployment` in the teardown script. This skips drain and pre-stop lifecycle hooks during cleanup, preventing the cascade failure in which teardown times out on the same unresponsive agents that caused the original deploy failure.

4. **Reduce zookeeper instances from 5 to 3** (ci/tasks/deploy-zookeeper/zookeeper-manifest.yml)
   Three instances is the minimum ZooKeeper quorum and is sufficient to validate the upgrade path. Fewer instances means a lower probability of hitting an unresponsive agent, plus faster update cycles. Canaries are reduced from 2 to 1 accordingly.

**Open question**

The root cause (why agents become unresponsive after a director upgrade) remains uninvestigated. Likely candidates:

- NATS restarts during create-env, and agents exhaust their reconnect attempts (max_reconnect_attempts=4, reconnect_time_wait=2s in nats_rpc.rb) before NATS comes back up.
- NATS CA certificate rotation during the upgrade invalidates existing agent TLS certificates (a chicken-and-egg problem: the director cannot push new certs to agents that cannot connect).
- The health monitor's scan-and-fix process runs during the sleep window and may interfere with agent state.

Made-with: Cursor
1 parent 3f4620e commit 049e6cf
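The first root-cause hypothesis in the commit message can be sanity-checked with quick arithmetic. Assuming the nats_rpc.rb defaults quoted there (max_reconnect_attempts=4, reconnect_time_wait=2s), and assuming the attempts and waits simply multiply, the agent's reconnect window is far shorter than a typical create-env director restart:

```shell
# Rough reconnect window under the quoted defaults (a sketch; the real agent
# may add jitter or an initial-connect timeout on top of this).
max_reconnect_attempts=4
reconnect_time_wait=2
reconnect_window=$((max_reconnect_attempts * reconnect_time_wait))
echo "agents give up after roughly ${reconnect_window}s of NATS downtime"
```

Eight seconds of tolerance against a director restart that takes minutes would be more than enough to explain permanently disconnected agents.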

File tree

6 files changed (+107, -22 lines)


ci/bats/tasks/destroy-director.sh

Lines changed: 1 addition & 1 deletion

@@ -28,6 +28,6 @@ export BOSH_CLIENT_SECRET
 
 set +e
 
-bosh-cli deployments --column name --json | jq -r ".Tables[0].Rows[].name" | xargs -n1 -I % bosh-cli -n -d % delete-deployment
+bosh-cli deployments --column name --json | jq -r ".Tables[0].Rows[].name" | xargs -n1 -I % bosh-cli -n -d % delete-deployment --force
 bosh-cli clean-up -n --all
 bosh-cli delete-env -n director-state/director.yml -l director-state/director-creds.yml
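The `jq`/`xargs` pipeline above deletes every deployment by name. A minimal stand-in (with `echo` in place of `bosh-cli`, and a hard-coded name list in place of the `jq` output) shows how `xargs -n1 -I %` substitutes each name into its own command:

```shell
# Each input line becomes one command invocation, with % replaced by the name.
# In the real script the name list comes from `bosh-cli deployments --json | jq`.
printf 'zookeeper\ncf\n' | xargs -n1 -I % echo "bosh-cli -n -d % delete-deployment --force"
```

With `--force` appended, each generated command ignores drain/pre-stop errors, which is exactly what the teardown needs when agents are unresponsive.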

ci/pipeline.yml

Lines changed: 8 additions & 18 deletions

@@ -804,16 +804,11 @@ jobs:
       GCP_JSON_KEY: ((gcp_json_key))
       DEPLOY_ARGS: |-
         -o bosh-deployment/external-ip-not-recommended.yml
-  - task: sleep-300-seconds
+  - task: wait-for-agents
     image: integration-image
-    config:
-      platform: linux
-      run:
-        path: /bin/sh
-        args:
-        - -exc
-        - |
-          sleep 300
+    file: bosh-ci/ci/tasks/wait-for-agents.yml
+    params:
+      CPI: gcp
   - task: recreate-zookeeper
     image: integration-image
     file: bosh-ci/ci/tasks/deploy-zookeeper.yml
@@ -915,16 +910,11 @@ jobs:
       DEPLOY_ARGS: |-
         -o bosh-deployment/external-ip-not-recommended.yml
         -o bosh-deployment/misc/external-db.yml
-  - task: sleep-300-seconds
+  - task: wait-for-agents
     image: integration-image
-    config:
-      platform: linux
-      run:
-        path: /bin/sh
-        args:
-        - -exc
-        - |
-          sleep 300
+    file: bosh-ci/ci/tasks/wait-for-agents.yml
+    params:
+      CPI: gcp
   - task: recreate-zookeeper
     image: integration-image
     file: bosh-ci/ci/tasks/deploy-zookeeper.yml

ci/tasks/deploy-zookeeper.sh

Lines changed: 27 additions & 1 deletion

@@ -27,5 +27,31 @@ bosh-cli update-cloud-config "bosh-deployment/${CPI}/cloud-config.yml" \
   --vars-file director-state/director-vars.json
 
 bosh-cli upload-stemcell stemcell/*.tgz
-bosh-cli -d zookeeper deploy --recreate "${bosh_repo_dir}/ci/tasks/deploy-zookeeper/zookeeper-manifest.yml"
+
+MAX_DEPLOY_ATTEMPTS=${MAX_DEPLOY_ATTEMPTS:-3}
+DEPLOY_RETRY_DELAY=${DEPLOY_RETRY_DELAY:-60}
+
+for attempt in $(seq 1 "$MAX_DEPLOY_ATTEMPTS"); do
+  echo "Deploy attempt ${attempt}/${MAX_DEPLOY_ATTEMPTS}..."
+  set +e
+  bosh-cli -d zookeeper deploy --recreate ${DEPLOY_EXTRA_ARGS:-} "${bosh_repo_dir}/ci/tasks/deploy-zookeeper/zookeeper-manifest.yml"
+  deploy_exit=$?
+  set -e
+
+  if [ $deploy_exit -eq 0 ]; then
+    echo "Deploy succeeded on attempt ${attempt}."
+    break
+  fi
+
+  if [ "$attempt" -eq "$MAX_DEPLOY_ATTEMPTS" ]; then
+    echo "Deploy failed after ${MAX_DEPLOY_ATTEMPTS} attempts."
+    exit 1
+  fi
+
+  echo "Deploy failed on attempt ${attempt}. Waiting ${DEPLOY_RETRY_DELAY}s before retry..."
+  echo "Current VM state:"
+  bosh-cli -d zookeeper vms || true
+  sleep "$DEPLOY_RETRY_DELAY"
+done
+
 bosh-cli -d zookeeper run-errand smoke-tests
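The retry loop added above follows a standard shell pattern. A generic, hypothetical helper (not part of this commit) captures the same logic in reusable form:

```shell
# retry MAX DELAY CMD...: run CMD up to MAX times, sleeping DELAY seconds
# between failed attempts; succeed as soon as CMD does.
retry() {
  max_attempts=$1; delay=$2; shift 2
  attempt=1
  while true; do
    if "$@"; then
      echo "succeeded on attempt ${attempt}"
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "failed after ${max_attempts} attempts"
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Demo: a command that fails twice, then succeeds on the third try.
n=0
flaky() { n=$((n + 1)); [ "$n" -ge 3 ]; }
retry 3 0 flaky   # prints "succeeded on attempt 3"
```

The script inlines this instead of factoring it out, which keeps the diagnostic `bosh-cli vms` call between attempts simple.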

ci/tasks/deploy-zookeeper/zookeeper-manifest.yml

Lines changed: 2 additions & 2 deletions

@@ -12,15 +12,15 @@ stemcells:
   version: latest
 
 update:
-  canaries: 2
+  canaries: 1
   max_in_flight: 1
   canary_watch_time: 5000-60000
   update_watch_time: 5000-60000
 
 instance_groups:
 - name: zookeeper
   azs: [z1, z2, z3]
-  instances: 5
+  instances: 3
   jobs:
   - name: zookeeper
     release: zookeeper
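The instance reduction is safe because ZooKeeper uses a majority quorum of floor(n/2)+1 servers. A quick check (a sketch, not part of the commit) shows what each ensemble size tolerates:

```shell
# An ensemble of n servers needs floor(n/2)+1 votes for quorum,
# so it tolerates n - (floor(n/2)+1) simultaneous server failures.
for n in 5 3; do
  quorum=$(( n / 2 + 1 ))
  echo "instances=${n} quorum=${quorum} tolerates=$(( n - quorum )) failure(s)"
done
```

Three is the smallest ensemble that still tolerates losing a node, which is why the manifest stops at 3 rather than 1.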

ci/tasks/wait-for-agents.sh

Lines changed: 55 additions & 0 deletions

@@ -0,0 +1,55 @@
+#!/usr/bin/env bash
+set -e
+
+state_path() { bosh-cli int director-state/director.yml --path="$1" ; }
+
+function get_bosh_environment {
+  if [[ -z $(state_path /instance_groups/name=bosh/networks/name=public/static_ips/0 2>/dev/null) ]]; then
+    state_path /instance_groups/name=bosh/networks/name=default/static_ips/0 2>/dev/null
+  else
+    state_path /instance_groups/name=bosh/networks/name=public/static_ips/0 2>/dev/null
+  fi
+}
+
+mv bosh-cli/bosh-cli-* /usr/local/bin/bosh-cli
+chmod +x /usr/local/bin/bosh-cli
+
+export BOSH_ENVIRONMENT=$(get_bosh_environment)
+export BOSH_CLIENT=admin
+export BOSH_CLIENT_SECRET=$(bosh-cli int director-state/director-creds.yml --path /admin_password)
+export BOSH_CA_CERT=$(bosh-cli int director-state/director-creds.yml --path /director_ssl/ca)
+export BOSH_NON_INTERACTIVE=true
+
+MAX_ATTEMPTS=60
+SLEEP_INTERVAL=10
+TOTAL_TIMEOUT=$((MAX_ATTEMPTS * SLEEP_INTERVAL))
+
+echo "Waiting up to ${TOTAL_TIMEOUT}s for all zookeeper agents to become responsive..."
+
+for i in $(seq 1 "$MAX_ATTEMPTS"); do
+  set +e
+  vms_json=$(bosh-cli -d zookeeper vms --json)
+  exit_code=$?
+  set -e
+
+  if [ $exit_code -eq 0 ]; then
+    total=$(echo "$vms_json" | jq -r '.Tables[0].Rows | length')
+    running=$(echo "$vms_json" | jq -r '[.Tables[0].Rows[] | select(.process_state == "running")] | length')
+
+    echo "  Attempt $i/${MAX_ATTEMPTS}: ${running}/${total} agents responsive"
+
+    if [ "$running" -eq "$total" ] && [ "$total" -gt 0 ]; then
+      echo "All ${total} agents are responsive after $((i * SLEEP_INTERVAL)) seconds."
+      exit 0
+    fi
+  else
+    echo "  Attempt $i/${MAX_ATTEMPTS}: bosh vms failed (director may still be starting)"
+  fi
+
+  sleep "$SLEEP_INTERVAL"
+done
+
+echo "ERROR: Not all agents became responsive within ${TOTAL_TIMEOUT}s."
+echo "Final VM state:"
+bosh-cli -d zookeeper vms || true
+exit 1
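The task counts rows in the `bosh vms --json` output with `jq`. The same total-versus-running comparison can be illustrated jq-free on a fake plain-text listing (a hypothetical stand-in, not the task's actual parsing):

```shell
# Count total instances and those whose state column reads "running".
vms='zookeeper/0 running
zookeeper/1 running
zookeeper/2 failing'
total=$(printf '%s\n' "$vms" | grep -c '.')
running=$(printf '%s\n' "$vms" | grep -c ' running$')
echo "${running}/${total} agents responsive"
```

The success condition in the real script is the same comparison plus a `total > 0` guard, so an empty VM list (e.g. director up but deployment missing) cannot count as success.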

ci/tasks/wait-for-agents.yml

Lines changed: 14 additions & 0 deletions

@@ -0,0 +1,14 @@
+---
+platform: linux
+
+inputs:
+- name: director-state
+- name: bosh-ci
+- name: bosh-cli
+- name: bosh-deployment
+
+run:
+  path: bosh-ci/ci/tasks/wait-for-agents.sh
+
+params:
+  CPI:
