FTT2 restart refactor, remote-device reconciliation, collect_logs.py, FTT2 test suite#956

Open
schmidt-scaled wants to merge 37 commits into main from test_FTT2

Conversation

@schmidt-scaled
Contributor

Summary

  • Restart design refactor: disconnect-based logic, leader failover, operation gates
  • FTT2 fixes: shutdown-during-migration, restart leadership/JM ordering, remote-device reconnect ordering
  • Migration retry gated on recovery events; avoid spurious online events on healthy reconnects
  • Add collect_logs.py (Graylog/OpenSearch) with opensearch epoch_millis/index fixes
  • New FTT2 mock/RPC test suite under tests/ftt2/: restart guards, concurrent ops, peer states, scenarios
  • Perf tooling updates: setup_perf_test.py, probe_nvme_queues.py, GCP deployer

Test plan

  • tests/ftt2 suite green
  • Existing dual-FT e2e + failover/failback combinations green
  • Soak on AWS cluster: dual-node outage cycles stable (currently running, iter 11+)
  • collect_logs.py run on mgmt node produces expected tarball

michixs and others added 30 commits April 8, 2026 18:53
…er failover, operation gates

Major refactoring of node restart, LVS recreation, and CRUD operations per the
"Design of Node Restart with primary, secondary, tertiary" document.

Key changes:
- Pre-restart check: FDB transaction (query all nodes, check restart/shutdown, set in_restart)
- Naming: secondary_node_id_2 → tertiary_node_id, lvstore_stack_secondary_2 → lvstore_stack_tertiary
- Disconnect checks: two methods — JM quorum (primary) and hublvol connection (fallback)
- No node status checks in restart flow — only disconnect state and RPC behavior
- Sequential LVS recreation: primary → secondary → tertiary, no recursion
- Leader identification via bdev_lvol_get_lvstores leadership field
- Compression/replication checks only on current leader
- Secondary creates hublvol (non_optimized) for tertiary failover
- Port drop on restarting node in non-leader path
- Tertiary connects to secondary's hublvol after restart
- Demote old leader subsystems to non_optimized after takeover
- Multipathing: enabled when multiple data NICs
- Restart phase tracking (pre_block/blocked/post_unblock) persisted to FDB
- Operation gate: sync deletes and registrations queue during port block, drain after unblock
- Leader failover: detect leader via RPC, failover on timeout if fabric healthy
- CRUD operations: no status checks, use check_non_leader_for_operation
- storage_node_monitor: guarded with if __name__ == "__main__"
- 267/268 tests passing (unit + ftt2 integration)
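The operation gate described above (queue sync deletes and registrations while ports are blocked, drain after unblock) can be sketched roughly as follows. This is a minimal illustration of the pattern, not the actual implementation; `OperationGate` and its method names are assumptions.

```python
import threading
from collections import deque

class OperationGate:
    """Sketch of an operation gate: queues operations submitted while
    ports are blocked and drains them in FIFO order after unblock.
    Hypothetical name/shape, not the real simplyblock code."""

    def __init__(self):
        self._lock = threading.Lock()
        self._blocked = False
        self._pending = deque()

    def block(self):
        with self._lock:
            self._blocked = True

    def submit(self, op):
        """Run op immediately, or queue it if the gate is blocked.
        Returns True if the op ran now, False if it was queued."""
        with self._lock:
            if self._blocked:
                self._pending.append(op)
                return False
        op()
        return True

    def unblock_and_drain(self):
        """Reopen the gate, then run every queued operation in order."""
        with self._lock:
            self._blocked = False
            pending, self._pending = self._pending, deque()
        for op in pending:
            op()
```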

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
scripts/collect_logs.py collects container logs for a specified time
window and packages them into a tarball.

- Retrieves cluster UUID and secret via `sbctl cluster list` /
  `sbctl cluster get-secret`
- Authenticates to Graylog as admin with the cluster secret
- Collects per-storage-node logs: spdk_{rpc_port},
  spdk_proxy_{rpc_port}, SNodeAPI
- Collects all control-plane service logs (WebAppAPI, fdb-server,
  task runners, monitors, etc.)
- Paginates Graylog results (PAGE_SIZE=1000, up to 100k per query);
  splits into 10-minute sub-windows automatically for very large sets
- Alternatively queries OpenSearch directly via scroll API
  (--use-opensearch flag)
- Writes a manifest.json with collection metadata
- Outputs a timestamped .tar.gz bundle
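The automatic split into 10-minute sub-windows can be sketched like this (the function name is illustrative, not the script's actual helper):

```python
from datetime import datetime, timedelta

def split_window(start: datetime, end: datetime,
                 step: timedelta = timedelta(minutes=10)):
    """Yield (sub_start, sub_end) pairs covering [start, end) in fixed
    steps, so each sub-window stays under the per-query result cap."""
    cur = start
    while cur < end:
        nxt = min(cur + step, end)
        yield cur, nxt
        cur = nxt
```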

https://claude.ai/code/session_0128r6vXhbzkKmu3m3kc3b4e
Two issues caused 400 errors with --use-opensearch:

1. Index wildcard in URL path: `graylog_*/_search` is rejected by HAProxy
   as a bad request.  Fix: query `_cat/indices` first to discover the
   actual graylog index names and join them as a comma-separated list
   (e.g. `graylog_0,graylog_1`).  Falls back to `_all` if discovery fails.

2. term queries on string fields: OpenSearch dynamic mapping stores string
   fields as text+keyword pairs.  Plain `term` queries on the text variant
   fail or return wrong results.  Fix: each term clause now tries both
   `field.keyword` and `field` via a should/minimum_should_match:1 wrapper,
   covering both mapping styles.
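A term clause covering both mapping styles might be built like this (a sketch of the described fix, not the script's exact code):

```python
def term_clause(field: str, value: str) -> dict:
    """Match value against both the keyword and text variants of a field,
    since OpenSearch dynamic mapping may have indexed either (or both)."""
    return {
        "bool": {
            "should": [
                {"term": {f"{field}.keyword": value}},  # keyword sub-field
                {"term": {field: value}},               # plain/text mapping
            ],
            "minimum_should_match": 1,
        }
    }
```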

Also improves the error message to include the response body when the
initial scroll request fails, making future debugging easier.

https://claude.ai/code/session_0128r6vXhbzkKmu3m3kc3b4e
…error

Graylog configures its OpenSearch index timestamp field with format
"uuuu-MM-dd HH:mm:ss.SSS" (space separator, no timezone suffix).
Sending the range query bounds in ISO-8601 format ("...T...Z") triggers:

  parse_exception: failed to parse date field [2026-04-08T08:40:00.000Z]
  with format [uuuu-MM-dd HH:mm:ss.SSS]

Fix: convert both bounds to epoch milliseconds and pass
{"format": "epoch_millis"} in the range clause. OpenSearch accepts
epoch_millis regardless of the field's stored date format.

Verified locally: 2026-04-08T08:40:00.000Z -> 1775637600000 ms (correct).
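The conversion and range clause can be sketched as follows (helper names are illustrative):

```python
from datetime import datetime

def to_epoch_millis(iso_ts: str) -> int:
    """Convert an ISO-8601 UTC timestamp ('...T...Z') to epoch millis."""
    dt = datetime.fromisoformat(iso_ts.replace("Z", "+00:00"))
    return int(dt.timestamp() * 1000)

def range_clause(field: str, start_iso: str, end_iso: str) -> dict:
    """Build a range query that works regardless of the index's stored
    date format by sending epoch_millis bounds explicitly."""
    return {
        "range": {
            field: {
                "gte": to_epoch_millis(start_iso),
                "lte": to_epoch_millis(end_iso),
                "format": "epoch_millis",
            }
        }
    }
```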

https://claude.ai/code/session_0128r6vXhbzkKmu3m3kc3b4e
…e mode

Root cause of 0 results: exact term queries on container_name don't match
Docker Swarm naming (e.g. 'simplyblock_WebAppAPI.1.<hash>' != 'WebAppAPI').

Changes:
- _os_probe(): runs once before any scroll queries; discovers the actual
  timestamp field name (@timestamp vs timestamp), the container-name field
  name, and the total document count in the requested time window. Cached
  across all fetch calls to avoid redundant round-trips.
- opensearch_fetch_all(): replaced nested term/bool clauses with
  query_string + wildcard (*WebAppAPI*) so partial names match regardless
  of Docker Swarm name decoration. Uses analyze_wildcard:true. The probe
  result drives ts_field and cname_field so the code works even if the
  index uses non-standard field names.
- --diagnose flag: prints full diagnostic report (indices, field names,
  sample document, distinct container_name values in window) and exits
  without collecting. Run this first when collections return 0 lines.
- probe_cache dict threaded through fetch() -> opensearch_fetch_all() so
  the probe runs exactly once per script invocation.
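A query_string wildcard clause of the shape described above might look like this (a sketch; in the real script the field name comes from the probe result):

```python
def container_clause(cname_field: str, name: str) -> dict:
    """Match Docker Swarm-decorated container names such as
    'simplyblock_WebAppAPI.1.<hash>' by wrapping the base name in
    wildcards instead of requiring an exact term match."""
    return {
        "query_string": {
            "query": f"{cname_field}:*{name}*",
            "analyze_wildcard": True,
        }
    }
```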

https://claude.ai/code/session_0128r6vXhbzkKmu3m3kc3b4e
…k containers

Two issues causing 0 results for all storage node containers:

1. spdk_N / spdk_proxy_N: these were filtered by source IP, but since
   each RPC port is unique across the cluster there is no ambiguity.
   Drop the source filter entirely for these containers.

2. SNodeAPI: the Graylog GELF 'source' field contains the Docker host
   hostname (AWS EC2 default: "ip-X-X-X-X"), not the raw IP address
   we were using.  The fix tries all three plausible formats as a
   should/OR clause so the query succeeds regardless of convention:
     - raw IP         "172.31.33.210"
     - EC2 hostname   "ip-172-31-33-210"  (derived from IP)
     - sbctl hostname "ip-172-31-33-210"  (sbctl stores "ip-X-X-X-X_PORT";
                                           rsplit("_",1)[0] strips the port)
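Deriving the candidate values might be sketched as follows (the helper name is illustrative; duplicates are collapsed, which is why the EC2-derived and sbctl-derived forms often yield one candidate):

```python
def source_candidates(mgmt_ip: str, sbctl_hostname: str) -> list:
    """Plausible values of the Graylog 'source' field for a node: the raw
    IP, the EC2-style hostname derived from it, and the hostname sbctl
    stores (which carries a trailing '_PORT' that must be stripped)."""
    candidates = [
        mgmt_ip,                             # raw IP
        "ip-" + mgmt_ip.replace(".", "-"),   # EC2 default hostname
        sbctl_hostname.rsplit("_", 1)[0],    # sbctl's "host_PORT" form
    ]
    return list(dict.fromkeys(candidates))   # keep order, drop duplicates
```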

Also updates the Graylog query for SNodeAPI to OR the same three values.

https://claude.ai/code/session_0128r6vXhbzkKmu3m3kc3b4e
The 'source' field in Graylog for SNodeAPI containers cannot be reliably
derived from the management IP because the Docker GELF driver uses the
host's system hostname (format varies: "ip-X-X-X-X", FQDN, etc.).

Previous approach of trying IP + EC2-style + sbctl hostname as OR
candidates still returned 0 because none matched the actual value.

Fix: collect ALL SNodeAPI logs in a single query with no source filter
(container_name:"SNodeAPI") into storage_nodes/SNodeAPI_all_nodes.log.
Each log line already contains src=<host> so per-node filtering is
trivial with grep.  spdk_N and spdk_proxy_N remain per-node (unique by
port) as before.

https://claude.ai/code/session_0128r6vXhbzkKmu3m3kc3b4e
Sync scripts/collect_logs.py from claude/log-collection-script-XVXcN.
Includes all prior fixes (timestamp format, index discovery, wildcard
container matching, storage node source handling) plus the new sbctl_info/
section collecting cluster show, lvol list, sn list, sn check per node,
and cluster get-logs --limit 0.

https://claude.ai/code/session_0128r6vXhbzkKmu3m3kc3b4e
send_dev_status_event was called while the restarting node still had
status in_restart, causing peer nodes to receive unavailable events for
the restarting node's devices instead of online. Moving the event send
and cluster map refresh to after set_node_status(ONLINE) ensures peers
see the correct device status.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lvol_migration tasks were not counted in the is_re_balancing check,
causing the cluster rebalancing flag to clear while lvol migration
tasks were still active.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implement correct three-step SPDK sequence for hublvol connection on
secondary/tertiary nodes (attach_controller → set_lvs_opts → connect_hublvol),
fix ANA state exposure, and add full test coverage.

Key fixes:
- storage_node.py: create_hublvol/recreate_hublvol use ana_state=optimized;
  create_secondary_hublvol uses ana_state=non_optimized with primary's NQN;
  connect_to_hublvol attaches 1 path (secondary) or 2 multipath paths (tertiary)
- storage_node_ops.py: tertiary restart uses correct failover_node in
  connect_to_hublvol; step 10 adds multipath path via attach_controller only
- health_controller.py: fix path count check (ctrlrs nested in response),
  guard snode.hublvol null reference

Test suite (80 tests, no FDB required):
- test_hublvol_unit.py: 28 unit tests with mocked RPCClient
- test_hublvol_mock_rpc.py: 52 integration tests against FTT2MockRpcServer
  including RPC error injection (bdev create / attach / connect failures)
- test_hublvol_paths.py: FDB-backed integration tests for full restart paths
- mock_cluster.py: add error injection (fail_method), hublvol_connected/
  hublvol_created state queries, fix lvs_name param handling in handlers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
setup_gcp_perf.py deploys a 3-node simplyblock cluster on GCP:
  - 3 × c3d-standard-8 storage nodes with 3 NVMe local SSDs (375 GB each),
    all in the same zone and subnet
  - 1 × n2-standard-4 management node
  - 1 × n2-standard-8 client node

Uses gcloud CLI (subprocess) instead of boto3. SSH key pair at
C:\ssh\gcp_sbcli (ed25519, no passphrase) injected via instance metadata.
Firewall rules created idempotently via CLUSTER_TAG=sb-cluster.
Cluster configured for FTT=1 with ndcs=2 npcs=1 (3-node minimum).
Branch: lvol-migration-fresh.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…probe script

- Switch SN machine type from c3d-standard-30-lssd (2 SSDs/node, NCQA=2 per
  controller, needs 3) to c3d-standard-8-lssd (1 SSD/node, NCQA=2 exactly fits)
- Increase SN count from 4 to 5 nodes
- Make instance launch idempotent: reuse existing mgmt/client if already running
- Fix interface name (ens4 → eth0), ha-jm-count (2 → 3), add pciutils install
- Add probe_nvme_queues.py: tests GCP machine types for NVMe controller topology
  and queue pair count (NCQA) to identify compatibility with simplyblock SPDK

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stale IN_DELETION/IN_CREATION lvol rows were inflating the per-node
subsys_count and tripping max_lvol earlier than the actual subsystem
load justified. New _count_active_subsystems helper filters those
states; selector, add_lvol_ha guard, and clone guard now use it.

_get_next_3_nodes also tracks per-reason skip counts (offline,
subsys_full, sync_del) and logs the breakdown when no node is eligible,
so the caller's generic "No nodes found with enough resources" can be
correlated with the actual exclusion cause (e.g. a stuck sync_del flag).

Harmonised the post-selection guard in add_lvol_ha from > to >= so all
three sites reject identically at exact max_lvol.
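The filtering and the harmonised `>=` guard can be sketched together like this (a minimal version of the `_count_active_subsystems` idea; the status constant names are assumptions):

```python
# Transient states that should not count toward per-node subsystem load.
# Constant names are assumptions for this sketch.
STATUS_IN_CREATION = "in_creation"
STATUS_IN_DELETION = "in_deletion"

def count_active_subsystems(lvols) -> int:
    """Count lvols that actually hold a subsystem on the node, ignoring
    rows stuck in transient creation/deletion states."""
    skip = {STATUS_IN_CREATION, STATUS_IN_DELETION}
    return sum(1 for lvol in lvols if lvol["status"] not in skip)

def node_is_full(lvols, max_lvol: int) -> bool:
    """Reject at exactly max_lvol (>=), matching the harmonised guard."""
    return count_active_subsystems(lvols) >= max_lvol
```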

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sons"

This reverts commit 22806ac on test_ftt2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New aws_dual_node_outage_soak_mixed.py randomly picks 2 distinct outage
methods per iteration from {graceful, forced, container_kill, host_reboot}:
  - graceful: sbctl sn shutdown + sn restart
  - forced: sbctl sn shutdown --force + sn restart
  - container_kill: docker kill spdk_* on host; node auto-recovers
  - host_reboot: reboot -f on host; node auto-recovers

Adds --methods and --auto-recover-wait CLI flags, lazy per-node RemoteHost
lookup via metadata topology / sbctl sn list, and widened online-wait
timeout when an auto-recovery method is in the pair.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two related bugs surfaced by the mixed-outage soak run on 2026-04-13:

1. tasks_runner_restart.task_runner_node left the node pinned in
   STATUS_IN_SHUTDOWN or STATUS_RESTARTING whenever shutdown_storage_node
   or restart_storage_node returned False / raised. On the next retry the
   intermediate state either (a) short-circuited the task to DONE ("Node
   is restarting, stopping task") without the node ever becoming online,
   or (b) re-entered the restart step on a half-shutdown node, guaranteed
   to fail again. Added _reset_if_transient() and a try/finally wrapper
   so every non-success exit from the shutdown/restart sequence rolls
   the node back to STATUS_OFFLINE, and the task doesn't attempt restart
   on top of a shutdown that itself failed.

2. distr_controller.parse_distr_cluster_map treated the transient CP
   states STATUS_RESTARTING and STATUS_IN_SHUTDOWN as strict mismatches
   against the SPDK cluster map (which reflects the last reachability
   event — typically offline/unreachable while the CP is mid-transition).
   This cascaded: one stuck node flipped every peer's Health=False via
   the lvstore check. Extended the existing STATUS_SCHEDULABLE ->
   STATUS_UNREACHABLE canonicalisation to cover the two transient states.
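The extended canonicalisation amounts to mapping all transient CP states to UNREACHABLE before comparing against the SPDK map (a sketch; status names are assumptions):

```python
# Control-plane statuses (names assumed for this sketch).
STATUS_UNREACHABLE = "unreachable"
STATUS_SCHEDULABLE = "schedulable"
STATUS_RESTARTING = "in_restart"
STATUS_IN_SHUTDOWN = "in_shutdown"

# Transient CP states canonicalise to UNREACHABLE for cluster-map
# comparison, since the SPDK map only reflects the last reachability event.
_TRANSIENT = {STATUS_SCHEDULABLE, STATUS_RESTARTING, STATUS_IN_SHUTDOWN}

def canonical_status(cp_status: str) -> str:
    return STATUS_UNREACHABLE if cp_status in _TRANSIENT else cp_status

def statuses_match(cp_status: str, map_status: str) -> bool:
    """Compare CP status with the SPDK cluster-map status without
    flagging mid-transition nodes as strict mismatches."""
    return canonical_status(cp_status) == map_status
```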

Reproducer: tests/perf/aws_dual_node_outage_soak_mixed.py with a pair
that combines an async outage (host_reboot / container_kill) with a
sync one (forced / graceful) — the async outage races the mutual-
exclusion guard during the sync outage's sbctl sn restart.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The first pass of the restart-hang fix flipped the DB status to OFFLINE
without confirming the SPDK process on the node's host was actually down.
If SPDK was still serving IO (e.g. shutdown killed alceml/bind devices
but spdk_process_kill itself failed), the DB claim of OFFLINE would
conflict with a live data plane, and a subsequent restart_storage_node
would spawn a second SPDK on top of the first.

New _ensure_spdk_killed(node) helper:
  - if the node API is unreachable → SPDK is not serving either (safe),
  - else call spdk_process_kill(rpc_port, cluster_id),
  - on SNodeClientException from a reachable API → return False,
    _reset_if_transient refuses to flip the status and waits for the
    next retry (no split-brain).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a dual-outage iteration pairs container_kill (auto-recover) with
graceful shutdown (manual restart), the restart of the gracefully-shut-down
node can fail if the container-killed peer hasn't finished recovering yet
(still in_shutdown). The per-cluster guard correctly rejects concurrent
restarts, but the test script wasn't retrying.

Wrap manual restart calls in a retry loop (15s backoff, up to
restart_timeout) so the auto-recovering peer has time to come back before
the manual restart is attempted again.
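The retry wrapper might be sketched as follows, with the backoff and timeout parametrised (the soak script's values would be 15 s and restart_timeout):

```python
import time

def retry_until(op, timeout: float, backoff: float = 15.0) -> bool:
    """Call op() until it returns True or the deadline passes, sleeping
    `backoff` seconds between attempts. Returns the final outcome."""
    deadline = time.monotonic() + timeout
    while True:
        if op():
            return True
        if time.monotonic() + backoff > deadline:
            return False  # another attempt would overrun the deadline
        time.sleep(backoff)
```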

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
michixs and others added 7 commits April 14, 2026 13:05
New setup_perf_test_multipath.py deploys an FT=2 cluster where every
storage node and client has 3 ENIs:
  eth0 — management (sbctl, SNodeAPI, SSH)
  eth1 — data-plane path A
  eth2 — data-plane path B

Key differences from setup_perf_test.py:
  - Launches instances with 3 NetworkInterfaces (DeviceIndex 0/1/2)
  - Configures secondary NICs via NetworkManager after boot + after reboot
  - Passes --data-nics eth1 eth2 to sn add-node so all cluster-internal
    connections (devices, JM, hublvol) and client connections are
    duplicated across both data NICs for NVMe multipath
  - Post-activation verification sweep:
    1. Node status/health from sbctl sn list
    2. Hublvol controller paths via sbctl sn check
    3. Test volume connect returns 2× connect commands per node
  - Metadata includes per-node data NIC IPs and multipath=True flag

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two changes:

1. Remove force bypass of the concurrent-restart guard. When a peer node
   is mid-restart/shutdown, restart_storage_node now always returns False
   regardless of the force flag. The force flag was letting auto-recovered
   nodes (container_kill) stomp over a peer's in-flight restart, leaving
   the peer stuck in in_restart with no task to drive it forward.

2. Replace the dummy bdev_distrib_drop_leadership_remote RPC with the
   real bdev_lvol_set_lvs_signal. This fabric-level signal is sent FROM
   the restarting node TO a peer whose management interface is unavailable
   but whose data plane is healthy, telling the peer's SPDK to drop LVS
   leadership. Updated both call sites (_handle_rpc_failure_on_peer and
   find_leader_with_failover) and added the lvs_name parameter threading.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New aws_dual_node_outage_soak_multipath.py extends the mixed-outage soak
for multipath clusters (3 NICs per host: 1 mgmt + 2 data).

Three new outage methods added to the existing four:
  - data_nics_short: take down both data NICs for 25s (mgmt stays up)
  - data_nics_long:  take down both data NICs for 120s
  - mgmt_nic_outage: take down mgmt NIC for 120s (data stays up)

All NIC outages are fire-and-forget: a nohup script on the host downs
the NIC(s), sleeps, then restores them. No sbctl restart needed.

Independent background NIC chaos thread:
  - Runs continuously alongside the outage iterations
  - Picks a random subset (1, some, or all) of online storage nodes
  - Takes down a SINGLE random data NIC per selected node
  - Restores after --nic-chaos-duration seconds (default 20)
  - Interval between events: --nic-chaos-interval (default 45s)
  - A single-NIC-down on a multipath cluster must produce zero IO errors

New CLI flags:
  --data-nics       Comma-separated data NIC names (default: eth1,eth2)
  --mgmt-nic        Management NIC name (default: eth0)
  --nic-chaos-interval   Mean seconds between chaos events (0=disable)
  --nic-chaos-duration   Seconds each single-NIC chaos event lasts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When _create_bdev_stack fails during recreate_lvstore_on_non_leader or
recreate_lvstore, the function returns False but leaves restart_phases
set to 'pre_block' for that LVS. This stale phase causes
check_non_leader_for_operation to permanently return "skip" for the
affected LVS, silently blocking all new volume subsystem creation on
the secondary/tertiary node.

Root cause traced via a real cluster failure: a concurrent-restart stomp
caused _create_bdev_stack to fail on the secondary, leaving
restart_phases['LVS_6616'] = 'pre_block' forever. Every subsequent
volume created on the primary had its secondary subsystem skipped,
causing client nvme connect to fail on the secondary path.

Fix: clear restart_phases in every error-return path after it is set.
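The fix pattern amounts to a finally-style guard that clears the phase on every exit path once it has been set (a sketch with assumed names; the real code persists phases to FDB):

```python
restart_phases = {}  # lvs_name -> phase (persisted to FDB in real code)

def recreate_lvstore(lvs_name: str, create_bdev_stack) -> bool:
    """Set the restart phase, and guarantee it is cleared on every
    non-success exit so check_non_leader_for_operation never sees a
    stale 'pre_block' for the LVS."""
    restart_phases[lvs_name] = "pre_block"
    try:
        if not create_bdev_stack():
            return False  # error return: finally clears the stale phase
        restart_phases[lvs_name] = "post_unblock"
        return True
    finally:
        if restart_phases.get(lvs_name) == "pre_block":
            del restart_phases[lvs_name]
```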

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…l lock leak, restart_phases cleanup

1. Port-allow event replay (tasks_runner_port_allow.py): After a network
   outage, the recovering node's distrib cluster maps are stale for events
   that happened while disconnected. Replay cluster-wide node-status and
   device-status events to the recovering node before the consistency check.

2. FTT-aware snapshot/clone gate (storage_node_ops.py): When a non-leader
   is RPC-unreachable but fabric-healthy, check FTT tolerance before
   rejecting. If FTT allows (e.g., only one non-leader down in FTT2),
   queue the registration and let the leader operation proceed instead of
   blocking the entire snapshot/clone/create.

3. Sync-del lock leak (snapshot_controller.py): The _acquire_lvol_mutation_lock
   / _release_lvol_mutation_lock pair in snapshot create and clone create
   had multiple early-return paths between acquire and release that leaked
   the lock permanently. Wrapped in try/finally. This caused "LVol sync
   deletion found on node" errors blocking all new volume/snapshot creation
   even though no deletions were in progress.

4. Sync-del check downgrade (lvol_controller.py): The sync-del lock check
   in volume creation, explicit-host placement, and resize paths was a
   hard blocker. Downgraded to info-log since sync deletion can coexist
   with new creates — the serialization for snapshot/clone ordering is
   maintained in snapshot_controller.py where it matters.
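The leak-proof pattern for the sync-del lock in point 3 is the standard try/finally pairing, here sketched as a context manager with the acquire/release helpers injected (names are stand-ins for the `_acquire`/`_release_lvol_mutation_lock` pair):

```python
from contextlib import contextmanager

@contextmanager
def lvol_mutation_lock(acquire, release):
    """Pair acquire/release so every exit path — early returns and
    exceptions included — releases the lock."""
    acquire()
    try:
        yield
    finally:
        release()
```

With this wrapper, an early `return` inside a snapshot/clone create body can no longer leak the lock, which is what previously caused the spurious "LVol sync deletion found on node" errors.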

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…status

The auto-fix in _check_node_lvstore only sent device status events when
the device's owner node was ONLINE or DOWN. When the node was OFFLINE
(graceful shutdown), the event was not sent, leaving the distrib cluster
map permanently stale for that device. This blocked port-allow on
recovering nodes that missed the shutdown events during their outage.

Fix: remove the node-status guard — if the distrib map shows a device
as online but the DB says unavailable, resend the event regardless of
why the node is in that state. The health check should repair any
inconsistency it finds.

Also removes the event-replay band-aid from tasks_runner_port_allow
that was added as a workaround — the health check auto-fix now handles
this correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wrap restart_storage_node() with a try/finally that resets the node to
OFFLINE if the inner logic fails after try_set_node_restarting has set
STATUS_RESTARTING. Previously, any return False path (SPDK start
failure, remote device connection error, LVS recreation failure, etc.)
left the node pinned in RESTARTING, which blocked all future restart
attempts from both CLI and TasksRunnerRestart.

The existing _reset_if_transient() in TasksRunnerRestart only covers
the task-runner code path; this fix covers the direct CLI/API path
(sbctl sn restart) which the soak test uses.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>