Skip to content

feat: add XPU container run script#7669

Open
ZhengHongming888 wants to merge 1 commit intoai-dynamo:mainfrom
ZhengHongming888:container_run_sh_for_xpu
Open

feat: add XPU container run script#7669
ZhengHongming888 wants to merge 1 commit intoai-dynamo:mainfrom
ZhengHongming888:container_run_sh_for_xpu

Conversation

@ZhengHongming888
Copy link
Copy Markdown
Contributor

@ZhengHongming888 ZhengHongming888 commented Mar 28, 2026

Overview:

This PR is to add run_xpu.sh script for launching Dynamo containers with Intel XPU support. The script handles XPU-specific device access, DRI group management, and provides flexible container configuration options.

Details:

  1. generate dockerfile for xpu
    python container/render.py --framework vllm --target local-dev --device xpu --output-short-filename

  2. docker build -
    docker build --progress=plain --target local-dev -t dynamo:latest-vllm17.1-local-dev6 -f container/rendered.Dockerfile --build-arg USER_UID=$(id -u) --build-arg USER_GID=$(id -g) --build-arg http_proxy=... --build-arg https_proxy=... --build-arg no_proxy=localhost,127.0.0.1,0.0.0.0 .

  3. create container -
    container/run_xpu.sh --mount-workspace --image dynamo:latest-vllm17.1-local-dev -it

for local-dev container you still need install inside container for dynamo -
cargo build --features dynamo-llm/block-manager && cd /workspace/lib/bindings/python && maturin develop --uv && cd /workspace && uv pip install --no-deps -e /workspace

With the run_xpu.sh it will simplify the step for creating the xpu container and also aligned with Dynamo templating style.
image

Tested with run_xpu.sh it works fine.

Thanks.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added Intel XPU container launch utility with support for multiple frameworks (VLLM, TRTLLM, SGLANG)
    • Configurable Docker container settings including volumes, ports, and environment variables
    • Includes dry-run mode to preview commands before execution

Add run_xpu.sh script for launching Dynamo containers with Intel XPU support. The script handles XPU-specific device access, DRI group management, and provides flexible container configuration options.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Signed-off-by: Hongming Zheng <[email protected]>
@ZhengHongming888 ZhengHongming888 requested review from a team as code owners March 28, 2026 02:58
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Mar 28, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Copy link
Copy Markdown
Contributor

👋 Hi ZhengHongming888! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors.Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added feat external-contribution Pull request is from an external contributor container labels Mar 28, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 28, 2026

Walkthrough

A new executable Bash script is added at container/run_xpu.sh that configures and launches Docker containers with Intel XPU support. The script parses command-line options for framework selection, image/container names, volume/port/environment mappings, workspace mounting, and HF_HOME configuration. It detects XPU device nodes, manages group permissions, and supports dry-run mode for command preview.

Changes

Cohort / File(s) Summary
XPU Docker Container Launcher
container/run_xpu.sh
New 374-line Bash script implementing Docker container orchestration with Intel XPU support, including argument parsing, conditional volume/environment mounting, device node detection with fallback warnings, group ID management for custom users, and dry-run execution mode.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: add XPU container run script' accurately summarizes the main change—adding a new executable script for XPU container management.
Description check ✅ Passed PR description covers overview, implementation details, and usage examples with clear workflow steps and testing confirmation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
container/run_xpu.sh (2)

229-239: Minor inconsistency with run.sh mount path.

The /mnt mount on line 232 lacks the trailing slash that run.sh uses (/mnt/:/mnt). While functionally equivalent for directories, consider aligning for consistency across container run scripts.

         VOLUME_MOUNTS+=" -v ${SOURCE_DIR}/..:/workspace "
         VOLUME_MOUNTS+=" -v /tmp:/tmp "
-        VOLUME_MOUNTS+=" -v /mnt:/mnt "
+        VOLUME_MOUNTS+=" -v /mnt/:/mnt "
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@container/run_xpu.sh` around lines 229 - 239, In the MOUNT_WORKSPACE block
where VOLUME_MOUNTS is assembled (conditioned by MOUNT_WORKSPACE and using
SOURCE_DIR), change the mount entry currently added as "-v /mnt:/mnt" to include
the trailing slash so it becomes "-v /mnt/:/mnt" to match the style used in
run.sh; update the VOLUME_MOUNTS concatenation that references /mnt so the
variable and surrounding concatenations (VOLUME_MOUNTS, MOUNT_WORKSPACE,
SOURCE_DIR) remain consistent.

281-288: Consider multi-XPU environments.

The script only checks /dev/dri/renderD128, but systems with multiple Intel XPUs will have additional render nodes (renderD129, renderD130, etc.). While the GID is typically the same for all render devices (the render group), consider:

  1. Verifying /dev/dri exists rather than a specific device
  2. Documenting this assumption in the help text
     # XPU specific: Add DRI device group for GPU access
-    if [ -e /dev/dri/renderD128 ]; then
-        DRI_GROUP=$(stat -c '%g' /dev/dri/renderD128)
+    # Get GID from any render device (all share the same group)
+    RENDER_DEVICE=$(ls /dev/dri/renderD* 2>/dev/null | head -n1)
+    if [ -n "$RENDER_DEVICE" ]; then
+        DRI_GROUP=$(stat -c '%g' "$RENDER_DEVICE")
         GROUP_ADD_STRING="--group-add ${DRI_GROUP}"
     else
         GROUP_ADD_STRING=""
-        echo "Warning: /dev/dri/renderD128 not found. XPU access may not work properly."
+        echo "Warning: No /dev/dri/renderD* devices found. XPU access may not work properly."
     fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@container/run_xpu.sh` around lines 281 - 288, The check for a single device
node /dev/dri/renderD128 is fragile in multi-XPU setups; update the logic around
DRI_GROUP and GROUP_ADD_STRING to first verify /dev/dri exists and then derive
the group ID from any renderD* entry (e.g., find the first /dev/dri/renderD*),
and update the help/warning text to document the assumption that all render
devices share the same GID; keep using DRI_GROUP and GROUP_ADD_STRING but
replace the hardcoded path check with a /dev/dri presence check and a probe for
renderD* to obtain the GID.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@container/run_xpu.sh`:
- Around line 352-372: The docker run command invoked by ${RUN_PREFIX} needs
explicit XPU device passthrough when running non-privileged (i.e. when
${PRIVILEGED_STRING} indicates FALSE); update the script to add device flags
(for example --device /dev/dri --device /dev/dri/renderD128 or a new
${DEVICE_STRING} variable) into the docker run invocation path that is used when
PRIVILEGED_STRING is not set, so the container retains access to /dev/dri
devices; alternatively, if non-privileged XPU is unsupported, remove the
--privileged option or update the help text accordingly.

---

Nitpick comments:
In `@container/run_xpu.sh`:
- Around line 229-239: In the MOUNT_WORKSPACE block where VOLUME_MOUNTS is
assembled (conditioned by MOUNT_WORKSPACE and using SOURCE_DIR), change the
mount entry currently added as "-v /mnt:/mnt" to include the trailing slash so
it becomes "-v /mnt/:/mnt" to match the style used in run.sh; update the
VOLUME_MOUNTS concatenation that references /mnt so the variable and surrounding
concatenations (VOLUME_MOUNTS, MOUNT_WORKSPACE, SOURCE_DIR) remain consistent.
- Around line 281-288: The check for a single device node /dev/dri/renderD128 is
fragile in multi-XPU setups; update the logic around DRI_GROUP and
GROUP_ADD_STRING to first verify /dev/dri exists and then derive the group ID
from any renderD* entry (e.g., find the first /dev/dri/renderD*), and update the
help/warning text to document the assumption that all render devices share the
same GID; keep using DRI_GROUP and GROUP_ADD_STRING but replace the hardcoded
path check with a /dev/dri presence check and a probe for renderD* to obtain the
GID.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bf5a77e0-1ae4-4a79-aeee-ac53125f0bd2

📥 Commits

Reviewing files that changed from the base of the PR and between 7dfbe4f and bdad23e.

📒 Files selected for processing (1)
  • container/run_xpu.sh

Comment on lines +352 to +372
${RUN_PREFIX} docker run \
${INTERACTIVE} \
${RM_STRING} \
${PRIVILEGED_STRING} \
${GROUP_ADD_STRING} \
--network "$NETWORK" \
--shm-size=10G \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--ulimit nofile=65536:65536 \
${ENVIRONMENT_VARIABLES} \
${VOLUME_MOUNTS} \
${PORT_MAPPINGS} \
-w "$WORKDIR" \
--cap-add CAP_SYS_PTRACE \
--ipc host \
${USER_STRING} \
${NAME_STRING} \
${ENTRYPOINT_STRING} \
${IMAGE} \
"${REMAINING_ARGS[@]}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Missing XPU device passthrough for non-privileged mode.

When --privileged FALSE is specified, the container loses access to XPU devices because there's no --device flag to explicitly pass /dev/dri. The --group-add provides permission but not device visibility.

Consider adding device passthrough when not running privileged:

+    # XPU device passthrough (only needed when not privileged)
+    if [[ ${PRIVILEGED^^} == "FALSE" ]] && [ -d /dev/dri ]; then
+        DEVICE_STRING="--device=/dev/dri"
+    else
+        DEVICE_STRING=""
+    fi
+
 ${RUN_PREFIX} docker run \
+    ${DEVICE_STRING} \
     ${INTERACTIVE} \
     ${RM_STRING} \

Alternatively, if non-privileged XPU mode isn't intended to be supported, consider either removing the --privileged option or documenting this limitation in the help text.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@container/run_xpu.sh` around lines 352 - 372, The docker run command invoked
by ${RUN_PREFIX} needs explicit XPU device passthrough when running
non-privileged (i.e. when ${PRIVILEGED_STRING} indicates FALSE); update the
script to add device flags (for example --device /dev/dri --device
/dev/dri/renderD128 or a new ${DEVICE_STRING} variable) into the docker run
invocation path that is used when PRIVILEGED_STRING is not set, so the container
retains access to /dev/dri devices; alternatively, if non-privileged XPU is
unsupported, remove the --privileged option or update the help text accordingly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

container external-contribution Pull request is from an external contributor feat size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant