Comprehensive performance evaluation framework for vLLM on CPU platforms.
This repository provides a complete testing methodology, automation tools, and platform configurations for evaluating vLLM inference performance on CPU-based systems.
See the Quick Start Guide
```text
vllm-cpu-perf-eval/
├── README.md                      # This file
│
├── models/                        # Centralized model definitions
│   ├── models.md                  # Comprehensive model documentation
│   ├── llm-models/                # LLM model configurations
│   │   ├── model-matrix.yaml      # LLM model test mappings
│   │   └── llm-models.md          # Redirects to models.md
│   └── embedding-models/          # Embedding model configurations
│       └── model-matrix.yaml      # Embedding model test mappings
│
├── tests/                         # Test suites and scenarios
│   ├── tests.md                   # Test suite overview
│   ├── concurrent-load/           # Test Suite 1: Concurrent load testing
│   │   ├── concurrent-load.md     # Suite documentation
│   │   └── *.yaml                 # Test scenario definitions
│   ├── scalability/               # Test Suite 2: Scalability testing
│   │   ├── scalability.md         # Suite documentation
│   │   └── *.yaml                 # Test scenario definitions
│   ├── resource-contention/       # Test Suite 3: Resource contention
│   │   ├── resource-contention.md # Suite documentation
│   │   └── *.yaml                 # Test scenario definitions (planned)
│   └── embedding-models/          # Embedding model test scenarios
│       ├── embedding-models.md    # Embedding test documentation
│       ├── baseline-sweep.yaml    # Baseline performance tests
│       └── latency-concurrent.yaml # Latency tests
│
├── automation/                    # Automation framework
│   ├── test-execution/            # Test orchestration
│   │   ├── ansible/               # Ansible playbooks (primary)
│   │   │   ├── ansible.md         # Ansible documentation
│   │   │   ├── inventory/         # Host configurations
│   │   │   ├── filter_plugins/    # Custom Ansible filters
│   │   │   ├── roles/             # Ansible roles
│   │   │   ├── tests/             # Ansible tests
│   │   │   └── *.yml              # Playbook files
│   │   ├── bash/                  # Bash automation scripts
│   │   └── embedding/             # Embedding test scripts
│   ├── platform-setup/            # Platform configuration
│   │   ├── bash/                  # Platform setup scripts
│   │   └── intel/                 # Intel-specific setup
│   └── utilities/                 # Helper utilities
│       ├── health-checks/         # Health check scripts
│       └── log-monitoring/        # Log analysis tools
│
├── docs/                          # Documentation
│   ├── docs.md                    # Documentation index
│   ├── methodology/               # Test methodology
│   │   └── overview.md            # Testing approach and metrics
│   └── platform-setup/            # Platform setup guides
│
├── results/                       # Test results (gitignored)
│   ├── llm/                       # LLM test results
│   └── results.md                 # Results documentation
│
├── utils/                         # Utility scripts and tools
│
├── Configuration Files
├── .pre-commit-config.yaml        # Pre-commit hooks configuration
├── .yamllint.yaml                 # YAML linting rules
├── .markdownlint-cli2.yaml        # Markdown linting rules
└── .gitignore                     # Git ignore patterns
```
Key Directories:
- models/ - Model definitions reused across all test suites
- tests/ - Test suite definitions organized by testing focus
- automation/test-execution/ansible/ - Ansible playbooks for test execution
- docs/ - Comprehensive testing methodology and guides
- results/ - Local test results (gitignored, see results.md)
See individual directory markdown files for detailed information.
- Docker or Podman - Use either runtime
- Auto-detection - Automatically detects available runtime
- Rootless support - Full Podman rootless compatibility
- Define models once, use across all test phases
- Easy to add new models
- Model matrix for flexible test configuration
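The actual schema lives in `models/llm-models/model-matrix.yaml`; as a rough illustration only (the field names below are hypothetical, not the repository's real schema), a matrix entry might map a model to the suites that exercise it:

```yaml
# Hypothetical sketch of a model-matrix entry -- see
# models/llm-models/model-matrix.yaml for the actual schema.
models:
  - name: TinyLlama-1.1B
    hf_id: TinyLlama/TinyLlama-1.1B-Chat-v1.0
    profile: balanced          # prefill-heavy | decode-heavy | balanced
    suites:
      - concurrent-load
      - scalability
```

Defining the model once and referencing it by name from each suite is what lets a new model be added in a single place.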
- Intel Xeon (Ice Lake, Sapphire Rapids)
- AMD EPYC
- ARM64 (planned)
- Ansible playbooks for platform setup and test execution
- Bash scripts for manual operation
- Docker/Podman Compose for containerized testing
- Distributed testing across multiple nodes
- Concurrent Load: Concurrent load testing
- Scalability: Scalability and sweep testing
- Resource Contention: Resource contention testing (planned)
- Time-based testing - Consistent 10-minute tests across CPU types
- Single-user baseline - Concurrency=1 for efficiency calculations
- Variable workloads - Realistic traffic simulation with statistical variance
- Prefix caching control - Baseline vs production comparison
- 3-phase testing - Baseline → Realistic → Production methodology
- Large model support - Added gpt-oss-20b (21B MoE) for scalability testing
See 3-Phase Testing Strategy for details.
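The scenario YAMLs under `tests/` encode these knobs. As a hypothetical sketch (field names here are illustrative, not the repository's actual scenario schema), a Phase 1 baseline scenario might look like:

```yaml
# Illustrative only -- the real scenario definitions live under tests/.
scenario: baseline-single-user
phase: 1                  # Baseline -> Realistic -> Production
duration_seconds: 600     # consistent 10-minute runs across CPU types
concurrency: 1            # single-user baseline for efficiency calculations
prefix_caching: false     # disabled for baseline, enabled for production comparison
workload:
  variable: false         # Phase 2/3 add statistical variance
```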
- Ansible Testing - Complete Ansible usage guide
- Methodology - Testing methodology and metrics
- Platform Setup - Intel platform configuration
- Models - Model definitions and selection
- Tests - Test suite documentation
Full documentation index: docs/docs.md
IMPORTANT: Validation Status and Availability

SUPPORTED (Fully Validated):
- Concurrent Load Testing (Phase 1 & Phase 2) - Ready for use
- Playbooks: `llm-benchmark-concurrent-load.yml`, `llm-benchmark-auto.yml`
- Documentation: tests/concurrent-load/concurrent-load.md
NOT YET SUPPORTED (Blocked from End User Execution):
The following test suites are work in progress and are automatically blocked to prevent end users from running them until they are fully validated:
Scalability - Work in progress; blocked by default
- Playbook: `llm-core-sweep-auto.yml` (will fail with an error message)
- Contains: sweep, synchronous, poisson tests
Embedding Models - Work in progress; blocked by default
- Playbook: `embedding-benchmark.yml` (will fail with an error message)
- Scripts: `run-baseline.sh`, `run-latency.sh`, `run-all.sh` (will exit with an error)

Resource Contention - Planned; not yet implemented
- No test files exist yet
Bypass for Development/Testing Only:
If you need to run unsupported tests for development or testing purposes:
- Ansible: Add `-e "allow_unsupported_tests=true"` to your playbook command
- Bash: Export `ALLOW_UNSUPPORTED_TESTS=true` before running scripts

Note: Unsupported tests are provided as-is, with no guarantee that they will work without modification. Only use them if you understand the risks and are willing to troubleshoot issues independently.
Tests model performance under various concurrent request loads.
- Concurrency levels: 1, 2, 4, 8, 16, 32
- 8 LLM generative models (embedding models not yet supported)
- Focus: P95 latency, TTFT, throughput scaling
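The single-user baseline is what makes scaling efficiency computable: at concurrency c, efficiency is throughput(c) divided by c times the concurrency=1 throughput. A minimal sketch of that arithmetic (the throughput numbers are placeholders, not measured results):

```shell
#!/bin/sh
# Scaling efficiency relative to the concurrency=1 baseline:
#   efficiency(c) = throughput(c) / (c * throughput(1))
# All throughput values below are made-up placeholders.
baseline_tps=12.0   # req/s measured at concurrency=1

efficiency() {
  c=$1; tps=$2
  awk -v c="$c" -v tps="$tps" -v base="$baseline_tps" \
      'BEGIN { printf "c=%-3s tps=%-6s efficiency=%.2f\n", c, tps, tps / (c * base) }'
}

efficiency 1 12.0    # perfect scaling by definition
efficiency 8 72.0
efficiency 32 150.0
```

An efficiency near 1.0 means throughput is still scaling linearly with concurrency; a sharp drop marks the saturation point.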
NOT YET SUPPORTED - This test suite is blocked by default. See the validation status above for details.
Characterizes maximum throughput and performance curves.
- Sweep tests for capacity discovery
- Synchronous baseline tests
- Poisson distribution tests
- Focus: Maximum capacity, saturation points
PLANNED - Not yet implemented.
Multi-tenant and resource sharing scenarios.
Current model coverage:
LLM Models (8 total):
- Llama-3.2 (1B, 3B) - Prefill-heavy
- TinyLlama-1.1B - Balanced small-scale
- OPT (125M, 1.3B) - Decode-heavy legacy baseline
- Granite-3.2-2B - Balanced enterprise
- Qwen3-0.6B, Qwen2.5-3B - High-efficiency balanced
Embedding Models:
NOT YET SUPPORTED - Embedding model tests are blocked by default. These models are defined, but testing is not yet validated.
- granite-embedding-english-r2
- granite-embedding-278m-multilingual
See models/models.md for complete model definitions, selection rationale, and how to add new models.
- CPU: Intel Xeon (Ice Lake or newer) or AMD EPYC
- Memory: 64GB+ RAM recommended
- OS: Ubuntu 22.04+, RHEL 9+, or Fedora 38+
- Storage: 500GB+ for models and results
- Python 3.10+
- Docker 24.0+ or Podman 4.0+
- Ansible 2.14+ (for automation)
- GuideLLM v0.5.0+
- vLLM
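A quick way to confirm the software prerequisites above is to probe PATH for each tool. This is a convenience sketch, not a script shipped in this repository:

```shell
#!/bin/sh
# Prerequisite check sketch (not part of the repository): reports
# whether each required tool is on PATH.
check() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok:      $1"
  else
    echo "missing: $1"
  fi
}

check python3             # Python 3.10+
check ansible-playbook    # Ansible 2.14+ (for automation)
check docker              # Docker 24.0+ ...
check podman              # ... or Podman 4.0+ (either runtime works)
```

Version checks (e.g. Ansible 2.14+ vs. whatever is installed) would still need to be done per tool; this only verifies presence.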
See Ansible Documentation for setup and configuration instructions.
This repository supports both Docker and Podman:
- Docker: Traditional container runtime
- Podman: Daemonless, rootless-capable alternative
- Auto-detection: Automatically uses available runtime
The Ansible playbooks automatically detect and use the available container runtime. For manual configuration, see the vllm_server role documentation.
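The auto-detection amounts to probing PATH for each runtime in turn. A sketch of the idea (not the playbooks' actual implementation, and the podman-first preference order is an assumption):

```shell
#!/bin/sh
# Sketch of container-runtime auto-detection (illustrative only):
# prefer podman if present, fall back to docker, else report none.
detect_runtime() {
  if command -v podman >/dev/null 2>&1; then
    echo podman
  elif command -v docker >/dev/null 2>&1; then
    echo docker
  else
    echo none
  fi
}

runtime=$(detect_runtime)
echo "Using container runtime: $runtime"
```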
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Run pre-commit checks: `pre-commit run --all-files`
- Submit a pull request
This repository uses pre-commit to ensure code quality.
```bash
# Install pre-commit
pip install pre-commit

# Install hooks
pre-commit install
pre-commit install --hook-type commit-msg

# Run manually
pre-commit run --all-files
```

[Add license information]
- Documentation: See docs/
- Issues: GitHub Issues
- Discussions: GitHub Discussions