Commit a1168ad

Author: sdp

Add NUMA considerations and configurable base image
- README: add hardware requirements table, NUMA considerations section, and multi-NUMA warning for EI deployment
- docker-compose: add PYTHON_BASE_IMAGE build arg for both services

Signed-off-by: Rafal Bogdanowicz <rafal.bogdanowicz@intel.com>
1 parent 00be0d0 commit a1168ad

2 files changed: 39 additions & 1 deletion

sample_solutions/AgenticCodeExecution/README.md

Lines changed: 35 additions & 1 deletion
@@ -23,6 +23,32 @@ Flowise (or other MCP client)

**sandbox-server (port 5051)** — Exposes `execute_python` and proxies `actions.*` calls to tools-server. Uses session-aware routing (`mcp-session-id`) and stores run hashes in `sandbox-server/session_hashes/`. Starts independently and auto-refreshes tool discovery in the background. Dynamically regenerates `execute_python` description when connected tools change.
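
As a rough illustration of that flow, a direct `tools/call` against the sandbox-server might look like the sketch below. The `/mcp` path, the session id, and the `code` argument name are assumptions based on the MCP streamable HTTP transport, not taken from this repo.

```bash
# Hypothetical example: invoke execute_python on the sandbox-server.
curl -s http://localhost:5051/mcp \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -H "mcp-session-id: <your-session-id>" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"execute_python","arguments":{"code":"print(2 + 2)"}}}'
```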

## Minimum Hardware Requirements

Measured on a bare-metal deployment with `Qwen/Qwen3-Coder-30B-A3B-Instruct` (BF16, ~57 GB model weights).

| Resource | Minimum | Recommended | Notes |
|---|---|---|---|
| **RAM** | 80 GB | 128 GB | vLLM alone uses ~81 GB RSS (model weights + KV cache + runtime). K8s + Flowise + MCP add ~8 GB. |
| **CPU cores** | 16 | 32+ | vLLM CPU inference is compute-bound. More cores reduce latency. |
| **Disk** | 120 GB | 200 GB | ~57 GB model weights, ~2 GB container images, ~30 GB K8s/containerd, OS overhead. |
| **NUMA** | n/a | interleave | Multi-socket systems **must** use `numactl --interleave=all` for vLLM. See [NUMA notes](#numa-considerations) below. |
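
A quick way to check a host against these numbers with standard Linux tools:

```bash
# Total RAM in GB and available CPU cores.
free -g
nproc

# Free disk space; adjust the path if models live on another volume.
df -h /

# Number of NUMA nodes; more than one means the interleave notes below apply.
lscpu | grep -i 'numa node'
```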

### NUMA considerations

On multi-NUMA-node systems (e.g. dual-socket Xeon), vLLM's IPEX backend migrates all memory to a single NUMA node by default. Since each node typically has only 30–32 GB, loading a 60 GB+ model fails with OOM.

**Solution:** Launch vLLM with memory interleaving and disable IPEX's thread binding:

```bash
VLLM_CPU_OMP_THREADS_BIND=nobind numactl --interleave=all vllm serve <model> ...
```

- `VLLM_CPU_OMP_THREADS_BIND=nobind` — prevents IPEX from calling `numa_migrate_pages()`.
- `numactl --interleave=all` — distributes allocations evenly across all NUMA nodes.
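
To sanity-check the topology and confirm that interleaving took effect, the standard `numactl` and `numastat` tools can be used; `<pid>` below stands for the vLLM server process:

```bash
# Show NUMA nodes and per-node free memory before launching vLLM.
numactl --hardware

# After startup, verify the process's pages are spread across nodes
# rather than piled onto node 0.
numastat -p <pid>
```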
---

## Quick Start (Docker)

@@ -80,7 +106,7 @@ You need a running vLLM endpoint serving `Qwen/Qwen3-Coder-30B-A3B-Instruct` (or

### Enterprise Inference (Kubernetes)

- Deploy the model using [Enterprise Inference](../../docs/README.md). EI handles model download, CPU pinning, NUMA binding, and proxy configuration automatically.
+ Deploy the model using [Enterprise Inference](../../docs/README.md). EI handles model download, proxy configuration, and basic CPU pinning.

@@ -90,6 +116,14 @@

```bash
cd /path/to/Enterprise-Inference
helm install vllm-qwen3-coder ./core/helm-charts/vllm \
  --set LLM_MODEL_ID="Qwen/Qwen3-Coder-30B-A3B-Instruct"
```

> **Multi-NUMA systems:** If the model is larger than a single NUMA node's memory (e.g. ~60 GB model on a system with 4 × 32 GB NUMA nodes), the default vLLM/IPEX configuration will fail with OOM. You **must** set the following environment variable in the vLLM container:
>
> ```
> VLLM_CPU_OMP_THREADS_BIND=nobind
> ```
>
> and launch the process with `numactl --interleave=all`. See [NUMA considerations](#numa-considerations) for details.
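
To confirm the variable actually reached the running pod, a check along these lines works; the deployment name follows the helm release above but may differ in your cluster:

```bash
kubectl exec deploy/vllm-qwen3-coder -- env | grep VLLM_CPU_OMP_THREADS_BIND
```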
See the [EI deployment guide](../../docs/README.md) for full instructions, proxy setup, and troubleshooting.

---

sample_solutions/AgenticCodeExecution/docker-compose.yml

Lines changed: 4 additions & 0 deletions
@@ -3,6 +3,8 @@ services:
    build:
      context: .
      dockerfile: examples/Dockerfile
      args:
        PYTHON_BASE_IMAGE: ${PYTHON_BASE_IMAGE:-public.ecr.aws/docker/library/python:3.12-slim}
    container_name: mcp-tools-server
    ports:
      - "5050:5050"
@@ -36,6 +38,8 @@ services:
    build:
      context: .
      dockerfile: sandbox-server/Dockerfile
      args:
        PYTHON_BASE_IMAGE: ${PYTHON_BASE_IMAGE:-public.ecr.aws/docker/library/python:3.12-slim}
    container_name: mcp-sandbox-server
    depends_on:
      tools-server:
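
With this build arg in place, both images can be pointed at a different Python base without editing the compose file; for example (the registry URL is illustrative):

```bash
# Override the default public.ecr.aws base image for both services.
PYTHON_BASE_IMAGE=registry.example.com/mirrors/python:3.12-slim docker compose build
```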
