Ullama is a personal infrastructure-as-code repository for local LLM deployment with CUDA acceleration. It combines llama.cpp's router server mode with preset-based configuration to enable dynamic model management on NVIDIA GPU hardware.
- Role: Central inference engine with dynamic model loading
- Port: 8001 (OpenAI-compatible API)
- Key Features:
  - Single model in VRAM at a time (`--models-max 1`)
  - Preset-based configuration via INI files
  - Automatic model switching based on requests
  - CPU affinity binding for optimal performance (`taskset -c 0-7`)
- Role: User interface layer
- Port: 3000
- Connection: Connects to the router server at `http://host.docker.internal:8001/v1`
- Features: ChatGPT-like interface, conversation management
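For reference, a hedged sketch of how the WebUI container might be started so it can reach the router on the host. `OPENAI_API_BASE_URL` is Open WebUI's documented way to point at an OpenAI-compatible backend; the image tag and the `--add-host` mapping are assumptions about this particular setup, not something this repository prescribes:

```shell
#!/bin/sh
# Sketch only: print the launch command rather than running it, since it
# depends on Docker being installed (see scripts/install-docker.sh).
RUN_CMD='docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8001/v1 \
  ghcr.io/open-webui/open-webui:main'
echo "$RUN_CMD"
# eval "$RUN_CMD"   # uncomment to actually start the container
```

The `host-gateway` mapping is what makes `host.docker.internal` resolve to the host on Linux; on Docker Desktop it resolves automatically.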
- Location: `config/presets.ini` (Linux), `config/presets-macos.ini` (macOS)
- Purpose: Define model parameters, quantization, context limits
- Structure: Global defaults + per-model overrides
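A hypothetical sketch of that global-defaults-plus-overrides layout. Only `ctx-size` and the `q8_0` K/V cache types are taken from this document (`cache-type-k`/`cache-type-v` are llama.cpp's flag names for those caches); the section headers and the override value are illustrative assumptions, not the repository's actual presets:

```ini
; Global defaults applied to every model (illustrative sketch)
[*]
ctx-size     = 32768
cache-type-k = q8_0
cache-type-v = q8_0

; Hypothetical per-model override: a larger model gets a smaller
; context so weights + KV cache still fit the 24GB card
[gemma-4-31b]
ctx-size = 16384
```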
- `scripts/run-server.sh` - Launch the router server with OS-specific configuration
- `scripts/update_llama_cpp.sh` - Pull and rebuild llama.cpp from source
- `scripts/install-docker.sh` - Docker installation automation
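A minimal sketch of what the launch script's OS-specific behavior might look like: pick the preset file for the current OS, and pin the server to cores 0-7 on Linux. Beyond `--models-max 1` and `taskset -c 0-7`, which this document cites, the exact flags passed to `llama-server` are assumptions, so the sketch only prints the command it would run:

```shell
#!/bin/sh
# Choose the preset file based on the current OS.
case "$(uname -s)" in
  Darwin) PRESETS=config/presets-macos.ini ;;
  *)      PRESETS=config/presets.ini ;;
esac

CMD="llama-server --port 8001 --models-max 1"
if [ "$(uname -s)" = Linux ]; then
  # Pin to the first CCD (cores 0-7) on the 7950X3D.
  CMD="taskset -c 0-7 $CMD"
fi

echo "would run: $CMD (presets: $PRESETS)"
# exec $CMD   # uncomment to actually launch
```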
User Request (WebUI:3000)
↓
Router Server (llama.cpp:8001)
↓
Preset Configuration (config/presets.ini)
↓
Model Loading (HuggingFace → VRAM)
↓
Inference (CUDA Accelerated)
↓
Response (OpenAI API Format)
↓
User Interface (WebUI)
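The chain above can be exercised end to end with a single OpenAI-style request. A sketch, assuming the router is listening on `localhost:8001`; the model id `qwen3.5-27b` is a hypothetical preset name (`GET /v1/models` lists the real ones). When no server is reachable, the script just prints the request it would send:

```shell
#!/bin/sh
URL=http://localhost:8001/v1/chat/completions
# Hypothetical model id; the router loads it on demand per the flow above.
BODY='{"model":"qwen3.5-27b","messages":[{"role":"user","content":"Hello"}],"max_tokens":32}'

if curl -sf -m 2 http://localhost:8001/v1/models >/dev/null 2>&1; then
  # Router is up: model is loaded into VRAM if needed, then inference runs.
  curl -s "$URL" -H 'Content-Type: application/json' -d "$BODY"
else
  echo "router not reachable; would POST to $URL: $BODY"
fi
```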
- CPU: AMD Ryzen 9 7950X3D (16-core, 32-thread)
- GPU: NVIDIA RTX 4090 (24GB VRAM)
- RAM: 64GB DDR5
- Storage: 1.9TB NVMe
- VRAM: Model weights + KV cache (primary inference memory)
- RAM: KV cache overflow, system operations (64GB total)
- CPU: 8-core affinity (cores 0-7) for MoE expert routing and non-GPU layers
- Swap: Fallback for extreme memory pressure (monitored to prevent OOM)
Decision: Single model in VRAM (`--models-max 1`)
- Rationale: Prevents VRAM exhaustion on 24GB card
- Trade-off: Model switching latency vs. concurrent model support
Decision: `q8_0` for K/V caches by default
- Rationale: Minimal quality loss (~1% perplexity increase)
- Trade-off: ~1MB per token memory usage vs. q4_0 (~0.5MB/token)
- See: ADR 0003 for context limit implications
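The trade-off can be sanity-checked with quick arithmetic using the per-token figures cited above: at ~1MB/token, a full 32768-token context alone needs ~32GB, more than the 24GB card holds, which is why large contexts spill into system RAM.

```shell
#!/bin/sh
# Back-of-envelope KV-cache sizing at the document's cited costs:
# ~1 MB/token for q8_0, ~0.5 MB/token for q4_0.
CTX=32768
Q8_MB=$(( CTX * 1 ))   # q8_0: ~1 MB per token
Q4_MB=$(( CTX / 2 ))   # q4_0: ~0.5 MB per token
echo "q8_0: ${Q8_MB} MB (~$(( Q8_MB / 1024 )) GB)"
echo "q4_0: ${Q4_MB} MB (~$(( Q4_MB / 1024 )) GB)"
```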
Decision: `taskset -c 0-7` in `run-server.sh`
- Rationale: Single CCD optimization for 3D V-Cache architecture
- Trade-off: Portability (in script) vs. centralization (in systemd)
- See: ADR 0002 for systemd integration details
Decision: Layered defense with aligned limits
- `opencode.json`: `limit.context` = 32768 (safety limit)
- `config/presets.ini`: `ctx-size` = 32768 (server rejection threshold)
- Trade-off: Reduced context window vs. system stability
- See: ADR 0003 for detailed rationale
- Specs: `docs/specs/` - Technical blueprints and implementation plans
- ADRs: `docs/adrs/` - Immutable architectural decisions
- Journal: `docs/journal/` - Chronological engineering logs
- Tasks: `TODO.md` - Task inbox with spec/ADR references
- Research: Journal entry or draft spec
- Design: Formalize in spec file
- Implement: Execute spec checklist
- Record: Keep spec as permanent reference
- Service runs under the `zoo` user account (personal workstation)
- No network exposure (localhost only)
- Models downloaded from HuggingFace (trust model required)
- Dedicated service user for production deployment
- Network firewall rules if exposing beyond localhost
- Model weight verification and signing
- First Token: 100-500ms (model loading if not cached)
- Subsequent Tokens: 20-50 tokens/sec (model dependent)
- Model Switch: 2-10 seconds (unload + load)
- Qwen3.5-27B: ~16GB VRAM (Q4_K_XL)
- Gemma-4-31B: ~20GB VRAM (Q3_K_XL)
- KV Cache: ~1MB/token (q8_0 quantization)
- Context Window: Limited by physical RAM for large contexts with q8_0 KV cache
- Model Switching: Latency when switching between models
- Single Instance: No concurrent model support
- Linux-Specific: CPU affinity and systemd features require Linux
- VRAM Bound: 24GB VRAM limits model size and context combinations
- Preset modularization (core/testing/legacy) - see `docs/specs/preset-modularization.md`
- Makefile build system - see `docs/specs/makefile-build-system.md`
- systemd service hardening - see `docs/adrs/0002-systemd-service.md`
- MoE model evaluation for doc agent - see `docs/journal/2026-04-09-moe-model-research.md`
- Open WebUI removal (export conversations first) - see `TODO.md`