
Commit 049e6cf

Fix ~75% flake rate in upgrade-mysql and upgrade-postgres pipelines
Both upgrade pipelines have been failing roughly 75% of the time since at least 2026-03-11. Analysis of 41 failed builds across both pipelines identified three failure categories, all stemming from BOSH agent unresponsiveness after a director upgrade.

Builds examined since 2026-03-11:

- upgrade-mysql: 19 failures / 25 builds (76% failure rate)
- upgrade-postgres: 22 failures / 30 builds (73% failure rate)

**Failure category 1: agents unresponsive after director upgrade**

After the director is upgraded (create-env), agents deployed by the old director become unresponsive to the new director. The pipeline had a blind `sleep 300` before attempting to redeploy zookeeper, but agents were still not reconnected when the deploy started.

- MySQL builds: `run_script` (pre-stop) times out after 45s during the update of the 3rd-5th zookeeper instance; 2-4 instances update successfully before one agent fails to respond.
- Postgres builds: `get_state` times out during `Preparing deployment`, before any instance updates begin. This suggests agents take longer to reconnect when the director uses the internal Postgres DB.

**Failure category 2: CPI disk/machine-type mismatch (already resolved)**

The `pd-ssd` disk type is incompatible with the `c4a-standard-1` machine type. This only affected builds 250-253 (MySQL) and 458-460 (Postgres) in mid-March and was resolved externally; no action needed.

**Failure category 3: teardown cascade**

When agent timeouts cause the deploy to fail, the ensure teardown's `delete-deployment` also times out on the same unresponsive agents. VMs remain attached to the subnetwork, causing the Terraform destroy to fail with `resourceInUseByAnotherResource`.

**Changes**

1. **Replace blind sleep with active agent health check** (ci/tasks/wait-for-agents.{sh,yml}, ci/pipeline.yml)
   A new `wait-for-agents` task polls `bosh vms` every 10s for up to 600s, until all zookeeper agents report `process_state: running`. It replaces the blind `sleep 300` in both the upgrade-postgres and upgrade-mysql jobs, and fails fast with diagnostic output if agents never reconnect.

2. **Add retry logic to the zookeeper deploy** (ci/tasks/deploy-zookeeper.sh)
   Wrap `bosh deploy --recreate` in a retry loop (3 attempts, 60s delay between retries). On failure, the script logs current VM state for diagnostics. The loop is controlled by the MAX_DEPLOY_ATTEMPTS and DEPLOY_RETRY_DELAY env vars, and a new DEPLOY_EXTRA_ARGS var allows passing --skip-drain from the pipeline if needed.

3. **Use --force for the teardown delete-deployment** (ci/bats/tasks/destroy-director.sh)
   Add the `--force` flag to `delete-deployment` in the teardown script. This skips drain and pre-stop lifecycle hooks during cleanup, preventing the cascade failure in which teardown times out on the same unresponsive agents that caused the original deploy failure.

4. **Reduce zookeeper instances from 5 to 3** (ci/tasks/deploy-zookeeper/zookeeper-manifest.yml)
   Three instances is the minimum ZooKeeper quorum and is sufficient to validate the upgrade path. Fewer instances means a lower probability of hitting an unresponsive agent, plus faster update cycles. Canaries are reduced from 2 to 1 accordingly.

**Open question**

The root cause (why agents become unresponsive after a director upgrade) remains uninvestigated. Likely candidates:

- NATS restarts during create-env, and agents exhaust their reconnect attempts (max_reconnect_attempts=4, reconnect_time_wait=2s in nats_rpc.rb) before NATS comes back up.
- NATS CA certificate rotation during the upgrade invalidates existing agent TLS certificates (a chicken-and-egg problem: the director cannot push new certs to agents that cannot connect).
- The health monitor's scan-and-fix process runs during the sleep window and may interfere with agent state.

Made-with: Cursor
1 parent 3f4620e commit 049e6cf
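The first root-cause hypothesis in the commit message can be sanity-checked with quick arithmetic. Assuming the nats_rpc.rb defaults quoted there (max_reconnect_attempts=4, reconnect_time_wait=2s), and assuming the attempts and waits simply multiply, the agent's reconnect window is far shorter than a typical create-env director restart:

```shell
# Rough reconnect window under the quoted defaults (a sketch; the real agent
# may add jitter or an initial-connect timeout on top of this).
max_reconnect_attempts=4
reconnect_time_wait=2
reconnect_window=$((max_reconnect_attempts * reconnect_time_wait))
echo "agents give up after roughly ${reconnect_window}s of NATS downtime"
```

Eight seconds of tolerance against a director restart that takes minutes would be more than enough to explain permanently disconnected agents.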

File tree

6 files changed (+107, -22 lines)


ci/bats/tasks/destroy-director.sh

Lines changed: 1 addition & 1 deletion

@@ -28,6 +28,6 @@ export BOSH_CLIENT_SECRET
 
 set +e
 
-bosh-cli deployments --column name --json | jq -r ".Tables[0].Rows[].name" | xargs -n1 -I % bosh-cli -n -d % delete-deployment
+bosh-cli deployments --column name --json | jq -r ".Tables[0].Rows[].name" | xargs -n1 -I % bosh-cli -n -d % delete-deployment --force
 bosh-cli clean-up -n --all
 bosh-cli delete-env -n director-state/director.yml -l director-state/director-creds.yml
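The `jq`/`xargs` pipeline above deletes every deployment by name. A minimal stand-in (with `echo` in place of `bosh-cli`, and a hard-coded name list in place of the `jq` output) shows how `xargs -n1 -I %` substitutes each name into its own command:

```shell
# Each input line becomes one command invocation, with % replaced by the name.
# In the real script the name list comes from `bosh-cli deployments --json | jq`.
printf 'zookeeper\ncf\n' | xargs -n1 -I % echo "bosh-cli -n -d % delete-deployment --force"
```

With `--force` appended, each generated command ignores drain/pre-stop errors, which is exactly what the teardown needs when agents are unresponsive.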

ci/pipeline.yml

Lines changed: 8 additions & 18 deletions

@@ -804,16 +804,11 @@ jobs:
       GCP_JSON_KEY: ((gcp_json_key))
       DEPLOY_ARGS: |-
         -o bosh-deployment/external-ip-not-recommended.yml
-  - task: sleep-300-seconds
+  - task: wait-for-agents
     image: integration-image
-    config:
-      platform: linux
-      run:
-        path: /bin/sh
-        args:
-        - -exc
-        - |
-          sleep 300
+    file: bosh-ci/ci/tasks/wait-for-agents.yml
+    params:
+      CPI: gcp
   - task: recreate-zookeeper
     image: integration-image
     file: bosh-ci/ci/tasks/deploy-zookeeper.yml
@@ -915,16 +910,11 @@ jobs:
       DEPLOY_ARGS: |-
         -o bosh-deployment/external-ip-not-recommended.yml
         -o bosh-deployment/misc/external-db.yml
-  - task: sleep-300-seconds
+  - task: wait-for-agents
     image: integration-image
-    config:
-      platform: linux
-      run:
-        path: /bin/sh
-        args:
-        - -exc
-        - |
-          sleep 300
+    file: bosh-ci/ci/tasks/wait-for-agents.yml
+    params:
+      CPI: gcp
   - task: recreate-zookeeper
     image: integration-image
     file: bosh-ci/ci/tasks/deploy-zookeeper.yml

ci/tasks/deploy-zookeeper.sh

Lines changed: 27 additions & 1 deletion

@@ -27,5 +27,31 @@ bosh-cli update-cloud-config "bosh-deployment/${CPI}/cloud-config.yml" \
   --vars-file director-state/director-vars.json
 
 bosh-cli upload-stemcell stemcell/*.tgz
-bosh-cli -d zookeeper deploy --recreate "${bosh_repo_dir}/ci/tasks/deploy-zookeeper/zookeeper-manifest.yml"
+
+MAX_DEPLOY_ATTEMPTS=${MAX_DEPLOY_ATTEMPTS:-3}
+DEPLOY_RETRY_DELAY=${DEPLOY_RETRY_DELAY:-60}
+
+for attempt in $(seq 1 "$MAX_DEPLOY_ATTEMPTS"); do
+  echo "Deploy attempt ${attempt}/${MAX_DEPLOY_ATTEMPTS}..."
+  set +e
+  bosh-cli -d zookeeper deploy --recreate ${DEPLOY_EXTRA_ARGS:-} "${bosh_repo_dir}/ci/tasks/deploy-zookeeper/zookeeper-manifest.yml"
+  deploy_exit=$?
+  set -e
+
+  if [ $deploy_exit -eq 0 ]; then
+    echo "Deploy succeeded on attempt ${attempt}."
+    break
+  fi
+
+  if [ "$attempt" -eq "$MAX_DEPLOY_ATTEMPTS" ]; then
+    echo "Deploy failed after ${MAX_DEPLOY_ATTEMPTS} attempts."
+    exit 1
+  fi
+
+  echo "Deploy failed on attempt ${attempt}. Waiting ${DEPLOY_RETRY_DELAY}s before retry..."
+  echo "Current VM state:"
+  bosh-cli -d zookeeper vms || true
+  sleep "$DEPLOY_RETRY_DELAY"
+done
+
 bosh-cli -d zookeeper run-errand smoke-tests
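The retry loop added above follows a standard shell pattern. A generic, hypothetical helper (not part of this commit) captures the same logic in reusable form:

```shell
# retry MAX DELAY CMD...: run CMD up to MAX times, sleeping DELAY seconds
# between failed attempts; succeed as soon as CMD does.
retry() {
  max_attempts=$1; delay=$2; shift 2
  attempt=1
  while true; do
    if "$@"; then
      echo "succeeded on attempt ${attempt}"
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "failed after ${max_attempts} attempts"
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Demo: a command that fails twice, then succeeds on the third try.
n=0
flaky() { n=$((n + 1)); [ "$n" -ge 3 ]; }
retry 3 0 flaky   # prints "succeeded on attempt 3"
```

The script inlines this instead of factoring it out, which keeps the diagnostic `bosh-cli vms` call between attempts simple.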

ci/tasks/deploy-zookeeper/zookeeper-manifest.yml

Lines changed: 2 additions & 2 deletions

@@ -12,15 +12,15 @@ stemcells:
   version: latest
 
 update:
-  canaries: 2
+  canaries: 1
   max_in_flight: 1
   canary_watch_time: 5000-60000
   update_watch_time: 5000-60000
 
 instance_groups:
 - name: zookeeper
   azs: [z1, z2, z3]
-  instances: 5
+  instances: 3
   jobs:
   - name: zookeeper
     release: zookeeper
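The instance reduction is safe because ZooKeeper uses a majority quorum of floor(n/2)+1 servers. A quick check (a sketch, not part of the commit) shows what each ensemble size tolerates:

```shell
# An ensemble of n servers needs floor(n/2)+1 votes for quorum,
# so it tolerates n - (floor(n/2)+1) simultaneous server failures.
for n in 5 3; do
  quorum=$(( n / 2 + 1 ))
  echo "instances=${n} quorum=${quorum} tolerates=$(( n - quorum )) failure(s)"
done
```

Three is the smallest ensemble that still tolerates losing a node, which is why the manifest stops at 3 rather than 1.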

ci/tasks/wait-for-agents.sh

Lines changed: 55 additions & 0 deletions

@@ -0,0 +1,55 @@
+#!/usr/bin/env bash
+set -e
+
+state_path() { bosh-cli int director-state/director.yml --path="$1" ; }
+
+function get_bosh_environment {
+  if [[ -z $(state_path /instance_groups/name=bosh/networks/name=public/static_ips/0 2>/dev/null) ]]; then
+    state_path /instance_groups/name=bosh/networks/name=default/static_ips/0 2>/dev/null
+  else
+    state_path /instance_groups/name=bosh/networks/name=public/static_ips/0 2>/dev/null
+  fi
+}
+
+mv bosh-cli/bosh-cli-* /usr/local/bin/bosh-cli
+chmod +x /usr/local/bin/bosh-cli
+
+export BOSH_ENVIRONMENT=$(get_bosh_environment)
+export BOSH_CLIENT=admin
+export BOSH_CLIENT_SECRET=$(bosh-cli int director-state/director-creds.yml --path /admin_password)
+export BOSH_CA_CERT=$(bosh-cli int director-state/director-creds.yml --path /director_ssl/ca)
+export BOSH_NON_INTERACTIVE=true
+
+MAX_ATTEMPTS=60
+SLEEP_INTERVAL=10
+TOTAL_TIMEOUT=$((MAX_ATTEMPTS * SLEEP_INTERVAL))
+
+echo "Waiting up to ${TOTAL_TIMEOUT}s for all zookeeper agents to become responsive..."
+
+for i in $(seq 1 "$MAX_ATTEMPTS"); do
+  set +e
+  vms_json=$(bosh-cli -d zookeeper vms --json)
+  exit_code=$?
+  set -e
+
+  if [ $exit_code -eq 0 ]; then
+    total=$(echo "$vms_json" | jq -r '.Tables[0].Rows | length')
+    running=$(echo "$vms_json" | jq -r '[.Tables[0].Rows[] | select(.process_state == "running")] | length')
+
+    echo "  Attempt $i/${MAX_ATTEMPTS}: ${running}/${total} agents responsive"
+
+    if [ "$running" -eq "$total" ] && [ "$total" -gt 0 ]; then
+      echo "All ${total} agents are responsive after $((i * SLEEP_INTERVAL)) seconds."
+      exit 0
+    fi
+  else
+    echo "  Attempt $i/${MAX_ATTEMPTS}: bosh vms failed (director may still be starting)"
+  fi
+
+  sleep "$SLEEP_INTERVAL"
+done
+
+echo "ERROR: Not all agents became responsive within ${TOTAL_TIMEOUT}s."
+echo "Final VM state:"
+bosh-cli -d zookeeper vms || true
+exit 1
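The task counts rows in the `bosh vms --json` output with `jq`. The same total-versus-running comparison can be illustrated jq-free on a fake plain-text listing (a hypothetical stand-in, not the task's actual parsing):

```shell
# Count total instances and those whose state column reads "running".
vms='zookeeper/0 running
zookeeper/1 running
zookeeper/2 failing'
total=$(printf '%s\n' "$vms" | grep -c '.')
running=$(printf '%s\n' "$vms" | grep -c ' running$')
echo "${running}/${total} agents responsive"
```

The success condition in the real script is the same comparison plus a `total > 0` guard, so an empty VM list (e.g. director up but deployment missing) cannot count as success.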

ci/tasks/wait-for-agents.yml

Lines changed: 14 additions & 0 deletions

@@ -0,0 +1,14 @@
+---
+platform: linux
+
+inputs:
+- name: director-state
+- name: bosh-ci
+- name: bosh-cli
+- name: bosh-deployment
+
+run:
+  path: bosh-ci/ci/tasks/wait-for-agents.sh
+
+params:
+  CPI:
