This project implements an LLM inference service with GPU batching, TTL-based caching, and full observability. Autoscaling logic is implemented at the application control-plane level, with infrastructure scaling simulated due to single-GPU runtime constraints.
- Request batching via a single GPU forward pass to improve throughput.
- A lightweight in-memory cache to reduce repeated work and measure cache effects on latency.
- A round-robin load balancer and logical workers to experiment with concurrency/oversubscription.
- A simple queue-driven autoscaler that adjusts logical workers based on queue depth.
- End-to-end Prometheus metrics collection and basic experiment scripts for load testing and visualization.
- FastAPI exposes a single `POST /generate` endpoint that checks the cache first and enqueues cache-miss requests (a minimal cache sketch appears after this list).
- `batch_processor.batch_worker` pulls requests from a global asyncio queue, forms batches (configurable size and max-wait window), runs a single GPU generation pass, and resolves each request's future (sketched below).
- `serving/worker.py` provides logical worker handlers registered with the `RoundRobinLoadBalancer`.
- `serving/autoscaler.py` observes `request_queue` depth and adds/removes logical workers, exposing a Prometheus gauge for active workers.
- `metrics.py` defines Prometheus metrics for requests, latency, batch sizes, queue wait times, cache hits/misses, and worker activity.
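The exact batching loop lives in `batch_processor.batch_worker`; the snippet below is only a minimal sketch of that pattern, assuming a global `asyncio.Queue` of `(prompt, future)` pairs and a stand-in `generate_batch()` model call. It is illustrative, not the project's actual code.

```python
import asyncio
import time

# Global queue of (prompt, future) pairs; the endpoint enqueues cache misses here.
request_queue: asyncio.Queue = asyncio.Queue()


def generate_batch(prompts: list[str]) -> list[str]:
    # Stand-in for the single GPU forward/generate call over the whole batch.
    return [f"generated({p})" for p in prompts]


async def batch_worker(batch_size: int = 8, max_wait_ms: int = 50) -> None:
    """Collect requests into a batch, run one generation pass, resolve futures."""
    while True:
        # Block until the first request arrives, then open the max-wait window.
        prompt, fut = await request_queue.get()
        prompts, futures = [prompt], [fut]
        deadline = time.monotonic() + max_wait_ms / 1000

        # Keep pulling until the batch is full or the wait window expires.
        while len(prompts) < batch_size:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break
            try:
                prompt, fut = await asyncio.wait_for(request_queue.get(), timeout)
            except asyncio.TimeoutError:
                break
            prompts.append(prompt)
            futures.append(fut)

        # Run generation off the event loop and resolve each caller's future.
        outputs = await asyncio.to_thread(generate_batch, prompts)
        for f, out in zip(futures, outputs):
            f.set_result(out)
```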
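The cache itself is also small; here is a minimal sketch of an in-memory TTL cache of the kind described above, keyed by prompt text. The class name and methods are illustrative assumptions, not the project's API.

```python
import time


class TTLCache:
    """Minimal in-memory cache whose entries expire ttl_seconds after being set."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}  # key -> (expiry, value)

    def get(self, key: str) -> str | None:
        entry = self._store.get(key)
        if entry is None:
            return None
        expiry, value = entry
        if time.monotonic() > expiry:
            del self._store[key]  # lazily evict expired entries on access
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._store[key] = (time.monotonic() + self.ttl, value)


# Usage: check before enqueueing, store after generation.
cache = TTLCache(ttl_seconds=60)
cache.set("hello", "generated(hello)")
assert cache.get("hello") == "generated(hello)"
```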
Prerequisites:
- Python 3.10+
- `pip` or `conda`
- NVIDIA GPU with CUDA support
Steps:
- Create a Python environment and install dependencies:
  - `pip install -r requirements.txt`
- Run the app locally (development):
  - `uvicorn serving.app:app --host 0.0.0.0 --port 8000 --reload` (see the example client request below)
- Load tests and visualizations live in `experiments/`:
  - `python experiments/load_cache_test.py`
  - `python experiments/load_test.py`
  - `python experiments/visualize_cache_results.py`
- Metrics: Prometheus metrics are exposed by the app (default port 8002), scraped by a Prometheus server (default port 9090), and can be visualized in Grafana (a sketch of such metric definitions follows).
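The metric names below are illustrative; `metrics.py` defines the project's own set. This is just a hedged sketch of how such metrics can be defined and exposed with `prometheus_client`, using the port 8002 that the app advertises.

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Names are placeholders; the real definitions live in metrics.py.
REQUESTS_TOTAL = Counter("inference_requests_total", "Total /generate requests")
CACHE_HITS = Counter("cache_hits_total", "Requests served from the cache")
QUEUE_DEPTH = Gauge("request_queue_depth", "Requests waiting in the global queue")
ACTIVE_WORKERS = Gauge("active_workers", "Logical workers currently registered")
QUEUE_WAIT_SECONDS = Histogram("queue_wait_seconds", "Time a request spends queued")
BATCH_SIZE = Histogram("batch_size", "Requests per GPU batch",
                       buckets=(1, 2, 4, 8, 16, 32))

# Serve /metrics on port 8002 for Prometheus to scrape.
start_http_server(8002)
```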
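With the server running (see the `uvicorn` step above), a request can be sent to `POST /generate`. The JSON field name used here (`prompt`) is an assumption; check the app's request model for the actual schema.

```python
import requests

# "prompt" is an assumed field name; adjust to the app's actual request schema.
resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Explain GPU batching in one sentence."},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```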
Configuration tips
- Model loading is in `model.load_model()`; by default the model is moved to CUDA. For CPU-only testing, remove `.cuda()` (a hedged device-aware loading sketch follows this list).
- Tune the `batch_worker(...)` parameters (`batch_size`, `max_wait_ms`) to trade throughput for latency.
- Adjust `NUM_WORKERS` and the autoscaler settings in `serving/app.py` to study oversubscription effects.
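The `.cuda()` tip above can also be made automatic with a device check. The snippet below is a hedged sketch of device-aware loading using Hugging Face `transformers`; the model name and function shape are assumptions, not the project's `model.load_model()`.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the project may load a different model


def load_model(device: str | None = None):
    """Load the model and move it to CUDA when available, otherwise stay on CPU."""
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)
    model.eval()
    return tokenizer, model, device
```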
- End-to-end inference under load
  - A FastAPI service handled concurrent client requests while batching, caching, and generating model outputs.
- Real, production-style metrics
  - The system exports Prometheus metrics for:
    - Request queue depth
    - Queue wait time
    - Batch size distribution
    - Per-worker request counts
    - Autoscaler decisions
- Observable autoscaling behavior
  - As request concurrency increased:
    - Queue depth and wait time rose predictably
    - The autoscaler computed higher desired worker counts (a sketch of such a scaling rule appears after this list)
    - Scaling decisions were visible and explainable in Grafana
- Controlled performance tradeoffs
  - By tuning batch size limits and queue wait thresholds, I observed clear tradeoffs between:
    - Latency vs throughput
    - Batch efficiency vs tail latency
    - Cache effectiveness vs compute utilization
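The actual scaling rule lives in `serving/autoscaler.py`; the snippet below is only a minimal sketch of a queue-depth-driven rule of the kind described above. The thresholds, capacity constant, and function name are assumptions for illustration.

```python
def desired_workers(queue_depth: int,
                    per_worker_capacity: int = 4,
                    min_workers: int = 1,
                    max_workers: int = 16) -> int:
    """Target logical-worker count: one worker per `per_worker_capacity` queued
    requests, clamped to [min_workers, max_workers]. The real rule may differ."""
    target = -(-queue_depth // per_worker_capacity)  # ceiling division
    return max(min_workers, min(max_workers, target))


# Example: 10 queued requests at 4 per worker -> 3 workers.
assert desired_workers(queue_depth=10) == 3
```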
This project is released under the MIT License. See the bundled LICENSE file for the full text and permissions.