DO NOT MERGE: RFC: add ownership registry design documents#3434
DO NOT MERGE: RFC: add ownership registry design documents#3434martin-belanger wants to merge 1 commit into
Conversation
RFCs submitted for review only. Not intended to be merged. Signed-off-by: Martin Belanger <martin.belanger@dell.com>
|
|
||
| discoverd would: | ||
|
|
||
| 1. Monitor the directory using inotify. |
There was a problem hiding this comment.
inotify is slow (but still useful!), I wonder if it makes sense to add another simple mechanism on top of it to briefly check if anything has changed in the monitored directory, before deciding on making new connection - something like timestamp comparison of the directory itself (granted that it works on tmpfs)? Trying to minimize race condition surface here...
There was a problem hiding this comment.
The slowness of inotify delivery is actually desirable here. When the first inotify event fires, we start a soak timer (e.g. 1–2 s). If additional inotify events arrive before the timer expires — the administrator is mid-edit, or nvme exclusion add is called several times in a script — the timer resets. Only once changes have stopped arriving for the full soak interval do we perform the reconciliation pass. This means we always act on a settled, consistent view of the exclusion set rather than reacting to each intermediate write.
A few additional properties that keep the race window tight:
Atomic writes. Every nvme exclusion write uses a temp-file + rename() protocol. inotify fires on the rename(), which is the commit point — we never observe a partial file. So the inotify event and the complete file are always in sync.
Check at connect time, not at inotify time. The exclusion check happens at the moment a connection is about to be made, not when inotify fires. The sequence is: inotify → soak → rebuild in-memory exclusion set → reconcile (connect/skip). By the time we act, the in-memory set reflects the fully settled state of the directory.
IN_Q_OVERFLOW safety net. Under extreme load the kernel can overflow the inotify event queue and deliver IN_Q_OVERFLOW. We handle this the same way as startup: trigger a full directory re-scan and rebuild the exclusion set from scratch. This covers the missed-event scenario without needing a separate timestamp poll on every connection attempt.
Regarding the directory mtime approach: the exclusion list lives at /etc/nvme/exclusions/ on persistent storage (not tmpfs), so mtime would be reliable there. That said, we believe the race surface is already small enough with atomic writes + soak timer + overflow handling, without adding a per-connection mtime check. Happy to reconsider if there is a specific scenario you have in mind that the above does not cover.
| - libsystemd's conf-parser is internal to systemd and not part of the public API, so discoverd cannot link against it directly; however, it could serve as the basis for discoverd's own parser — provided the systemd source code is lifted with its copyright statements intact (systemd is LGPL-2.1-or-later, which is compatible with libnvme's license) | ||
| - `sd-json` was only made public in systemd 257 (2025-09-17) and is too new to depend on for broad distro compatibility |
There was a problem hiding this comment.
No need to make excuses, besides general design documents like this should be init system agnostic.
There was a problem hiding this comment.
Actually, @igaw specifically mandated libsystemd as a core dependency — sd_event loop, libsystemd APIs throughout. The reason is concrete: nvme-discoverd creates one systemd transient unit per NVMe controller, and that per-unit model is what makes correct shutdown ordering possible.
Because each unit knows its transport type at creation time, nvme-discoverd can add After=network.target to TCP and RDMA units and omit it for FC. systemd reverses After= at shutdown, so TCP/RDMA controllers are torn down before the network goes away — which is exactly the problem documented in issue #3309. That fix is not achievable with a single nvme connect-all invocation because it mixes all transport types in one shot.
So the libsystemd dependency isn't incidental — it's load-bearing. The parser note was a separate tangential point; happy to drop it from the design doc if it reads as noise.
|
|
||
| ## 6. API Reference | ||
|
|
||
| All registry functions are available when libnvme is built with fabrics support and are declared in `<libnvme.h>`. Registry entry creation is handled automatically by the connect path when `ctx->owner` is non-NULL. The internal function `libnvmf_registry_create()` is not part of the public API and should not be called directly (and therefore left out of the following list). |
There was a problem hiding this comment.
One important implementation TODO: sanitize attr and exclude path delimiters + reserved characters (/, perhaps ~) as these will appear in the created filenames and we don't want the udev remove rule to delete anything outside.
There was a problem hiding this comment.
Good catch. The concern is about the attr parameter being used as a filename component without sanitization. If attr contains /, a caller could inadvertently write outside the device's registry subdirectory:
libnvmf_registry_update("nvme3", "../../tmp/evil", "val");
// resolves to: /run/nvme/registry/nvme3/../../tmp/evil = /run/nvme/tmp/evilThe ~ case is subtler — a file literally named ~ in the registry directory could be misinterpreted by shell tooling that reads registry entries.
Two fixes:
-
Sanitize
attrin the API — validate to[a-zA-Z0-9_-]+and reject anything else (in particular/,~,., NUL). This is the primary fix. -
Tighten the udev rule — add
KERNEL=="nvme[0-9]*"as an explicit match condition. In practice the udev rule is already safe:%kis the kernel-assigned device name and forSUBSYSTEM=="nvme"the kernel always producesnvme[0-9]+, so path traversal via%kis not possible. But making the constraint explicit in the rule self-documents the intent and provides defense-in-depth:
ACTION=="remove", SUBSYSTEM=="nvme", KERNEL=="nvme[0-9]*", \
RUN+="/bin/rm -rf /run/nvme/registry/%k"
I will add both to the implementation TODO list in the registry design doc.
|
|
||
| **Kernel reconnect behavior.** During a reconnect within `ctrl-loss-tmo`, the kernel reuses the same `struct nvme_ctrl` — same instance number, same device name, no `KOBJ_REMOVE` event. The registry entry is untouched throughout. Only when `ctrl-loss-tmo` expires and the kernel calls `nvme_delete_ctrl()` does the `KOBJ_REMOVE` event fire and the entry get deleted. If an orchestrator later establishes a new connection, it creates a fresh registry entry under the new device name. | ||
|
|
||
| **Boot device protection.** Controllers connected during the initramfs stage are automatically protected. The dracut nvmf module invokes `nvme connect-all --nbft`, which goes through libnvme and writes registry entries with `owner=nbft`. `switch_root(8)` explicitly moves `/run` to the new root — *"switch_root moves already mounted /proc, /dev, /sys and /run to newroot"* ([switch_root(8), util-linux](https://man7.org/linux/man-pages/man8/switch_root.8.html)) — so these entries are present in the runtime system with no special handling. No dracut changes are needed beyond using the updated nvme-cli in the initramfs. Boot devices are a first-class case, not a special one. |
There was a problem hiding this comment.
The dracut nvmf module actually handles various transports, not just NBFT - see the top of https://github.com/dracut-ng/dracut/blob/main/modules.d/74nvmf/parse-nvmf-boot-connections.sh and also notice the fc,auto discovery.
We need to update the module to explicitly provide context to any nvme connect-all calls there, incl. the nvme connect-all triggered from the custom udev rule: https://github.com/dracut-ng/dracut/blob/main/modules.d/74nvmf/95-nvmf-initqueue.rules
(that said, since this will be a new argument, we'll likely need to check nvme command version beforehand to be backwards-compatible...)
There was a problem hiding this comment.
You're right that the 74nvmf dracut module handles more than just NBFT — it also triggers FC discovery via nvme_discovery and connects controllers through the udev rule in 95-nvmf-initqueue.rules. Those connections are made via nvme connect-all with no --owner, so they land in the registry as unowned.
I don't think this needs to change. We're already addressing the issue with the new nvme-discoverd daemon that starts after the switch_root takes place.
There is no race window between dracut and nvme-discoverd. The unowned period spans exactly the time between dracut's connect-all and nvme-discoverd's startup scan after switch_root. During switch_root itself, no user-space orchestrator is running. Nothing can disconnect unowned controllers in that window.
After switch_root, nvme-discoverd wins the startup race. nvme-discoverd is written in C and starts early in the boot sequence. At startup it scans sysfs for existing controllers and immediately claims ownership via libnvmf_registry_update for any controller it recognises. By the time nvme-stas (Python, slower to start) has initialised, nvme-discoverd already holds the registry entries. nvme-stas sees owned controllers and leaves them alone.
nvme-stas does not manage FC connections by default. Unless FC controllers are explicitly listed in stafd.conf or stacd.conf, nvme-stas never attempts to connect to or disconnect them. There is no default FC policy in stas that could accidentally disconnect the dracut-created FC connections.
Adding --owner to dracut's connect-all would be counterproductive. The design contract for plain connect-all is that connections are unowned so disconnect-all can always undo them. Making dracut's connections permanently owned (e.g. owner=dracut) would prevent disconnect-all from cleaning them up, and would require version-checking in dracut for backwards compatibility — complexity with no real benefit given the arguments above.
The NBFT case is already handled correctly. nvme connect-all --nbft registers owner=nbft, protecting firmware boot volumes from disconnect-all. That path is unchanged.
In summary: dracut creates unowned connections, nvme-discoverd claims them immediately after switch_root before any other orchestrator can act, and nvme-stas never touches FC connections by default. No dracut changes are needed.
|
|
||
| ### NBFT connect | ||
|
|
||
| `nvme connect-all --nbft` automatically registers all controllers it connects as `owner=nbft`, protecting boot volumes from accidental disconnection. |
There was a problem hiding this comment.
Well, for the dracut context noted above, we'll need to add something like --owner for nvme connect-all. Since the dracut nvmf module only calls nvme connect-all as it creates /etc/nvme/discovery.conf from kernel cmdline arguments, there isn't currently any call to nvme connect (for now).
There was a problem hiding this comment.
Please refer to my previous answer on this topic. To summarise: the unowned window between dracut's connect-all and nvme-discoverd's startup scan is safe in practice — no orchestrator runs during switch_root, nvme-discoverd (C) claims ownership before nvme-stas (Python) is up, and nvme-stas has no default policy for FC connections anyway.
Beyond that, adding --owner to dracut's nvme connect-all would introduce a versioned dependency between dracut and nvme-cli that I'd like to avoid. The --owner option would only be available from nvme-cli 3.0 onwards, so dracut would need to version-check the nvme binary before using it — adding complexity to dracut for a problem that doesn't need solving.
One document is proposed for review, and a second is included for context:
Please use inline comments to discuss specific sections.
Related issue: #3433
Implementation PR: #3425
These documents are in Markdown format; it reads best in Typora or any Markdown-aware editor.