Add NUMA considerations and configurable base image
- README: add hardware requirements table, NUMA considerations section,
and multi-NUMA warning for EI deployment
- docker-compose: add PYTHON_BASE_IMAGE build arg for both services
Signed-off-by: Rafal Bogdanowicz <rafal.bogdanowicz@intel.com>
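The new `PYTHON_BASE_IMAGE` build arg can be exercised as sketched below (the override value is an assumption; the defaults live in the repo's docker-compose.yml, which is not shown here):

```bash
# Rebuild both services against a different Python base image
# (hypothetical tag; substitute whatever base image you need).
docker compose build --build-arg PYTHON_BASE_IMAGE=python:3.11-slim
docker compose up -d
```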
sample_solutions/AgenticCodeExecution/README.md
35 additions & 1 deletion
@@ -23,6 +23,32 @@ Flowise (or other MCP client)
**sandbox-server (port 5051)** — Exposes `execute_python` and proxies `actions.*` calls to tools-server. Uses session-aware routing (`mcp-session-id`) and stores run hashes in `sandbox-server/session_hashes/`. Starts independently and auto-refreshes tool discovery in the background. Dynamically regenerates `execute_python` description when connected tools change.
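As a rough illustration of that session-aware routing, here is a hedged sketch of a `tools/call` request against sandbox-server over MCP's streamable HTTP transport; the `/mcp` path, session id value, and code argument are assumptions, not taken from this repo:

```bash
# Invoke execute_python on sandbox-server; the mcp-session-id header is
# what drives the session-aware routing described above.
curl -s http://localhost:5051/mcp \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -H 'mcp-session-id: 3f2a9c0e-example' \
  -d '{
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {
          "name": "execute_python",
          "arguments": {"code": "print(2 + 2)"}
        }
      }'
```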
## Minimum Hardware Requirements
Measured on a bare-metal deployment with `Qwen/Qwen3-Coder-30B-A3B-Instruct` (BF16, ~57 GB model weights).

| Resource | Minimum | Recommended | Notes |
|---|---|---|---|
| **CPU cores** | 16 | 32+ | vLLM CPU inference is compute-bound. More cores reduce latency. |
| **Disk** | 120 GB | 200 GB | ~57 GB model weights, ~2 GB container images, ~30 GB K8s/containerd, OS overhead. |
| **NUMA** | — | interleave | Multi-socket systems **must** use `numactl --interleave=all` for vLLM. See [NUMA considerations](#numa-considerations) below. |
### NUMA considerations
On multi-NUMA-node systems (e.g. dual-socket Xeon), vLLM's IPEX backend migrates all memory to a single NUMA node by default. When each node holds only 30–32 GB, loading a 60 GB+ model fails with OOM.
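To see whether this applies on a given box, check the topology first; `numactl --hardware` lists each node's memory size:

```bash
# Show NUMA nodes and per-node total/free memory. If no single node can
# hold the ~57 GB of weights, interleaving (below) is required.
numactl --hardware
lscpu | grep -i numa
```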
**Solution:** Launch vLLM with memory interleaving and disable IPEX's thread binding (a combined invocation is sketched after this list):

- `VLLM_CPU_OMP_THREADS_BIND=nobind` — prevents IPEX from calling `numa_migrate_pages()`.
- `numactl --interleave=all` — distributes allocations evenly across all NUMA nodes.
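A launch might then look like the sketch below (the serve command and `--dtype` flag are assumptions about how this deployment starts vLLM, not the repo's exact invocation):

```bash
# Keep IPEX from binding threads (and migrating pages to one node)...
export VLLM_CPU_OMP_THREADS_BIND=nobind
# ...and spread allocations evenly across every NUMA node.
numactl --interleave=all \
  vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --dtype bfloat16
```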
---
## Quick Start (Docker)
@@ -80,7 +106,7 @@ You need a running vLLM endpoint serving `Qwen/Qwen3-Coder-30B-A3B-Instruct` (or
### Enterprise Inference (Kubernetes)
-Deploy the model using [Enterprise Inference](../../docs/README.md). EI handles model download, CPU pinning, NUMA binding, and proxy configuration automatically.
+Deploy the model using [Enterprise Inference](../../docs/README.md). EI handles model download, proxy configuration, and basic CPU pinning.
> **Multi-NUMA systems:** If the model is larger than a single NUMA node's memory (e.g. a ~60 GB model on a system with 4 × 32 GB NUMA nodes), the default vLLM/IPEX configuration will fail with OOM. You **must** set the following environment variable in the vLLM container:
>
> ```
> VLLM_CPU_OMP_THREADS_BIND=nobind
> ```
>
> and launch the process with `numactl --interleave=all`. See [NUMA considerations](#numa-considerations) for details.
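One way to apply this to an EI-managed deployment, sketched with `kubectl` (the deployment name, container index, and launch command are assumptions; check the actual EI chart):

```bash
# Set the thread-binding override on the vLLM container.
kubectl set env deployment/vllm-qwen3-coder VLLM_CPU_OMP_THREADS_BIND=nobind

# Prepend numactl to the container command (illustrative; the real
# entrypoint depends on how EI launches vLLM).
kubectl patch deployment vllm-qwen3-coder --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/command",
   "value": ["numactl", "--interleave=all",
             "vllm", "serve", "Qwen/Qwen3-Coder-30B-A3B-Instruct"]}
]'
```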
See the [EI deployment guide](../../docs/README.md) for full instructions, proxy setup, and troubleshooting.