Velo-Sentinel: A high-performance inference gateway designed for Tier-1 AI organizations. Built on Java 25 Virtual Threads, it serves as the mission-critical orchestration layer for transitioning from legacy NVIDIA Triton to the next-generation NVIDIA Dynamo 1.x disaggregated inference framework.
The following features are production-ready. Click the links for detailed documentation.
| Phase | Focus |
|---|---|
| Concurrency | 🏗️ Architecture & Disaggregated Serving — Prefill/Decode separation and cache-aware routing. 🏛️ Deep-Dive Architecture — Internal component breakdown and thread model. 📊 Observability & Metrics — Micrometer, Jaeger tracing, and Prometheus. |
| Resilience | 🛡️ Resilience & Chaos Engineering — Circuit breakers, hedging, and fault injection. 🌍 Multi-Cloud Disaster Recovery — Cross-cloud failover strategies. |
| Efficiency | ⚖️ Adaptive Batching Strategies — Optimizing TFLOPS via intelligent request grouping. ⚡ SLA-Aware Priority Queuing — Dynamic deadline resolution and priority-based scheduling. ⚙️ NVIDIA Dynamo-Aware Scaling — Predictive HPA and backend pressure metrics. |
| Governance | 🔐 Enterprise Governance & Privacy — PII scrubbing, authentication, and audit logging. 🔒 Security & Authentication — API Key management and endpoint protection. |
| Hardware | 🚀 Performance & Benchmarks — Latency results and hardware-specific optimizations for Velo-Core. |
Velo-Sentinel's mission is to provide a foundational computational platform for global-scale ML/AI applications, bridging the gap between legacy infrastructure and disaggregated, hardware-aware serving through the Four Pillars of Inference Scale: Concurrency, Resilience, Efficiency, and Governance.
- Research-to-Production Bridge: Streamlines the deployment of large foundation models (LLMs) by abstracting the complexity of disaggregated inference backends.
- Scalable Orchestration Layer: A sophisticated model serving system providing foundational abstractions that ensure consistency across distributed inference nodes.
- High-Throughput Execution: Leveraging Java 25 Virtual Threads (Project Loom) to achieve L5-tier concurrency, enabling high-traffic model serving with minimal overhead.
- Speculative Acceleration: Orchestrating multi-model "Draft & Verify" workflows to achieve superior token-generation speeds.
- Cost & Latency Optimization: Integrated Adaptive Batching, SLA-Aware Priority Queuing, and Semantic Caching to maximize GPU utilization.
- Production-Grade Resilience: Multi-layered fault tolerance (Fail-Open/Closed), Request Hedging, and Multi-Cloud Disaster Recovery.
- Enterprise Governance: Proactive privacy protection via PII Scrubbing and Immutable Audit Logging for regulatory compliance.
- Safety & Observability: Real-time Accuracy Drift Monitoring and Shadow-Mode Validation to ensure model parity during migration.
- Hybrid Hardware Orchestration: Seamlessly routing between Cloud GPUs and Local Metal/AMX acceleration (Velo-Core) for edge-aware inference.
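To make the concurrency model above concrete, here is a minimal sketch of a virtual-thread fan-out in the "Draft & Verify" / shadow-validation style. It is illustrative only: `callPrimary`/`callShadow` are stubs standing in for the Dynamo and Triton backends, the class name is hypothetical, and the real gateway would compare the shadow result asynchronously rather than awaiting it.

```java
import java.util.concurrent.*;

// Hypothetical sketch: fan a request out to a primary and a shadow backend
// on virtual threads, return the primary result, and flag drift if the
// shadow result disagrees. Backend calls are stubbed placeholders.
public class ShadowModeSketch {
    static String callPrimary(String prompt) { return "dynamo:" + prompt; }
    static String callShadow(String prompt)  { return "triton:" + prompt; }

    public static String infer(String prompt) throws Exception {
        try (ExecutorService vthreads = Executors.newVirtualThreadPerTaskExecutor()) {
            Future<String> primary = vthreads.submit(() -> callPrimary(prompt));
            Future<String> shadow  = vthreads.submit(() -> callShadow(prompt));
            String result = primary.get();
            // Simplified: this sketch awaits the shadow too; a production
            // gateway would compare off the request path.
            String mirror = shadow.get();
            if (!mirror.endsWith(result.substring("dynamo:".length()))) {
                System.err.println("accuracy drift detected for prompt: " + prompt);
            }
            return result;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(infer("hello")); // → dynamo:hello
    }
}
```

Because virtual threads are cheap, spawning one per backend call per request is the idiomatic Loom pattern, with no reactive pipeline required.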
Velo-Sentinel follows modern high-performance architectural patterns, prioritizing Structured Concurrency over legacy Reactive patterns.
☀️ View Architecture (Light Mode)
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#fdfdfd', 'primaryTextColor': '#2c3e50', 'primaryBorderColor': '#34495e', 'lineColor': '#2c3e50', 'secondaryColor': '#d4edda', 'tertiaryColor': '#d1ecf1' }}}%%
graph TD
User([User Request]) --> Controller[InferenceController]
Controller --> Bridge[DynamoBridgeService]
subgraph "Orchestration Layer (Virtual Threads)"
Bridge -->|Shadow Mode| TaskScope[StructuredTaskScope]
TaskScope -->|Primary| Dynamo[DynamoBackend]
TaskScope -->|Shadow| Triton[TritonBackend]
Bridge -->|Hybrid Path| Metal[MetalBackend]
Dynamo -->|Session Lookup| Redis[(Redis KV-Registry)]
Redis -.->|Cache Status| Dynamo
Dynamo -->|AOP Proxy| RC[DynamoResilienceComponent]
RC -->|Circuit Breaker| D_GRPC[Dynamo gRPC Client]
RC -.->|Fallback| Triton
Metal -->|Java FFM API| VeloCore[Velo-Core Engine]
end
Triton --> T_GRPC[Triton gRPC Client]
D_GRPC --> DB[(Dynamo Backend)]
T_GRPC --> TB[(Legacy Triton Backend)]
VeloCore -->|Apple Silicon| GPU[Metal / AMX Acceleration]
Bridge --> Metrics[Micrometer Metrics]
Metrics --> Prometheus[Prometheus / Grafana]
classDef primary fill:#e3f2fd,stroke:#2196f3,stroke-width:2px,color:#0d47a1;
classDef secondary fill:#e8f5e9,stroke:#4caf50,stroke-width:2px,color:#1b5e20;
classDef storage fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px,color:#4a148c;
classDef user fill:#fff3e0,stroke:#ff9800,stroke-width:2px,color:#e65100;
classDef hardware fill:#fff9c4,stroke:#fbc02d,stroke-width:2px,color:#f57f17;
class User user;
class Controller,Bridge primary;
class Dynamo,Triton,Metal secondary;
class Redis,DB,TB,Metrics,Prometheus storage;
class VeloCore,GPU hardware;
```
🌙 View Architecture (Dark Mode)
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#1a1a1a', 'primaryTextColor': '#ecf0f1', 'primaryBorderColor': '#34495e', 'lineColor': '#bdc3c7', 'secondaryColor': '#006100', 'tertiaryColor': '#fff' }}}%%
graph TD
User([User Request]) --> Controller[InferenceController]
Controller --> Bridge[DynamoBridgeService]
subgraph "Orchestration Layer (Virtual Threads)"
Bridge -->|Shadow Mode| TaskScope[StructuredTaskScope]
TaskScope -->|Primary| Dynamo[DynamoBackend]
TaskScope -->|Shadow| Triton[TritonBackend]
Bridge -->|Hybrid Path| Metal[MetalBackend]
Dynamo -->|Session Lookup| Redis[(Redis KV-Registry)]
Redis -.->|Cache Status| Dynamo
Dynamo -->|AOP Proxy| RC[DynamoResilienceComponent]
RC -->|Circuit Breaker| D_GRPC[Dynamo gRPC Client]
RC -.->|Fallback| Triton
Metal -->|Java FFM API| VeloCore[Velo-Core Engine]
end
Triton --> T_GRPC[Triton gRPC Client]
D_GRPC --> DB[(Dynamo Backend)]
T_GRPC --> TB[(Legacy Triton Backend)]
VeloCore -->|Apple Silicon| GPU[Metal / AMX Acceleration]
Bridge --> Metrics[Micrometer Metrics]
Metrics --> Prometheus[Prometheus / Grafana]
classDef primary fill:#154360,stroke:#3498db,stroke-width:2px,color:#fff;
classDef secondary fill:#145a32,stroke:#2ecc71,stroke-width:2px,color:#fff;
classDef storage fill:#4a235a,stroke:#9b59b6,stroke-width:2px,color:#fff;
classDef user fill:#6e2c00,stroke:#e67e22,stroke-width:2px,color:#fff;
classDef hardware fill:#7d6608,stroke:#f1c40f,stroke-width:2px,color:#fff;
class User user;
class Controller,Bridge primary;
class Dynamo,Triton,Metal secondary;
class Redis,DB,TB,Metrics,Prometheus storage;
class VeloCore,GPU hardware;
```
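The "Session Lookup" path in the diagram above can be sketched as follows. This is a simplified, hypothetical illustration: a `ConcurrentHashMap` stands in for the Redis KV-registry, the `SessionRouter` class name and the `dynamo-node-0` placement policy are invented for the example, and the real gateway would additionally trigger KV-cache hydration for cold sessions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative stand-in for the Redis-backed session registry: warm sessions
// are routed back to the node that already holds their KV-cache; cold
// sessions get a node assigned and are marked warm for subsequent requests.
public class SessionRouter {
    // sessionId -> GPU node currently holding that session's KV-cache
    private final Map<String, String> kvRegistry = new ConcurrentHashMap<>();

    public String route(String sessionId) {
        String warmNode = kvRegistry.get(sessionId);
        if (warmNode != null) {
            return warmNode;                       // sticky cache routing
        }
        String chosen = "dynamo-node-0";           // placeholder placement policy
        kvRegistry.putIfAbsent(sessionId, chosen); // session is now "warm"
        // Real implementation: pre-emptively hydrate the KV-cache on `chosen`.
        return kvRegistry.get(sessionId);
    }

    public static void main(String[] args) {
        SessionRouter router = new SessionRouter();
        System.out.println(router.route("user-42")); // cold → dynamo-node-0
        System.out.println(router.route("user-42")); // warm → same node (sticky)
    }
}
```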
- Global Session Registry: Leveraging Redis as a disaggregated KV-Cache registry to track session "Warmth" across the cluster.
- Context-Aware Routing: The gateway identifies "Cold" sessions and pre-emptively triggers KV-Cache hydration in the Dynamo backend, eliminating cold-start latency for the user.
- Distributed State: Ensures that stateless Virtual Threads can rapidly re-associate users with their specific model context without expensive lookups.
- Disaggregated Serving: Optimizes throughput by separating Prefill (compute-bound) and Decode (memory-bound) phases into distinct batching streams.
- Sticky Cache Routing: Minimizes data transfer latency by directing requests to GPU nodes that already host the relevant KV-Cache in memory.
- Request Hedging: Eliminates P99 latency outliers by automatically spawning parallel "hedged" requests if the primary backend stutters.
- Semantic Caching: Reduces GPU load by >40% by returning cached results for semantically similar prompts using vector-based lookup.
- Velo-Core Integration: Velo-Sentinel acts as the primary gateway for Velo-Core, a specialized Rust/C++ inference engine. This enables sub-millisecond latency for local workloads.
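As a rough illustration of the request-hedging idea above, the sketch below fires a second copy of a request if the primary has not completed within a hedge delay, and returns whichever finishes first. It is a toy under stated assumptions: `backendCall` is a stub with simulated latency, the delays are arbitrary, and a production hedger would also cancel the losing call to free backend capacity.

```java
import java.util.concurrent.*;

// Hedging sketch: launch the primary immediately, launch an identical hedge
// after a short delay, and take whichever result arrives first.
public class HedgingSketch {
    static String backendCall(String prompt, long latencyMs) {
        try { Thread.sleep(latencyMs); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "ok:" + prompt;
    }

    public static String hedgedInfer(String prompt) throws Exception {
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            // Simulated slow primary (300 ms) to make the hedge win here.
            CompletableFuture<String> primary =
                CompletableFuture.supplyAsync(() -> backendCall(prompt, 300), pool);
            // Hedge fires only after a 50 ms delay.
            Executor afterDelay =
                CompletableFuture.delayedExecutor(50, TimeUnit.MILLISECONDS, pool);
            CompletableFuture<String> hedge =
                CompletableFuture.supplyAsync(() -> backendCall(prompt, 50), afterDelay);
            // First completed result wins; the loser is not cancelled in this toy.
            return (String) CompletableFuture.anyOf(primary, hedge).get();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hedgedInfer("hi")); // → ok:hi
    }
}
```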
- Language: Java 25 (Optimized for Virtual Threads)
- Framework: Spring Boot 4.0.5
- Resilience: Java 25 Structured Concurrency & Scoped Values
- Observability: Micrometer (Registry-based metrics)
- Communication: gRPC (Primary) & HTTP/REST (Legacy/Compatibility)
- Inference Backend: NVIDIA Dynamo-Triton
- Infrastructure: Docker & Docker Compose
```
.
├── gateway/   # Java/Spring Boot Gateway Implementation
├── sdks/      # Multi-language SDKs (Python, Node.js, Java)
├── scripts/   # Automation scripts (SDK gen, deployment)
├── infra/     # Infrastructure configuration (Docker, Kubernetes)
└── core/      # High-performance Metal/AMX kernels (C++/MSL)
```
- JDK 25
- Docker & Docker Compose
- Gradle 8.x (provided via wrapper)
1. Start NVIDIA Dynamo-Triton:

   ```bash
   cd infra
   docker compose up -d
   ```

2. Build and Run the Gateway:

   ```bash
   cd gateway/sentinel
   ./gradlew bootRun
   ```

3. Generate Multi-Language SDKs:

   ```bash
   ./scripts/sdk-gen.sh
   ```

   This generates high-performance clients in `sdks/python`, `sdks/node`, and `sdks/java`.

4. Generate API Documentation (Javadoc):

   ```bash
   # For the Gateway
   cd gateway/sentinel && ./gradlew javadoc
   # For the Java SDK
   cd sdks/java && ./gradlew javadoc
   ```

   Output: `build/docs/javadoc/index.html` in the respective modules.

5. Verify System Integrity (Tests & Coverage):

   ```bash
   cd gateway/sentinel
   ./gradlew clean test build
   ```

   Generate the test coverage report (JaCoCo):

   ```bash
   ./gradlew test jacocoTestReport
   ```

   The report is written to `file:///${PWD}/gateway/sentinel/build/reports/jacoco/test/html/index.html`. Unit test coverage is >84%.
This project was designed and implemented by velo.com as a technical demonstration of high-performance system architecture. All architectural decisions, performance optimizations, and code implementations are original work.