A pure Python high-performance LLM inference engine.
```bash
prek install
uv venv
source .venv/bin/activate
uv pip install modal==1.3.5
modal setup
modal secret create huggingface-secret HF_TOKEN=<your_huggingface_token>
```

Serve an OpenAI-compatible API server. You can specify the following environment variables:
- `NNODES`: number of nodes (1..4, default=1)
- `N_GPU`: number of GPUs per node (1..8, default=1)
- `GPU_TYPE`: GPU type (`l4`, `l40s`, `a100`, `a100-40gb`, `a100-80gb`, `rtx-pro-6000`, `h100`/`h100!`, `h200`, `b200`/`b200+`, default=`a100`)
- `RDMA`: whether to use RDMA (`0` or `1`, default=`0`)
```bash
modal serve llmeng/api.py
```

Note that for multi-node deployment on Hopper and Blackwell chips, your Modal workspace must have RDMA support.
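For example, a two-node H100 deployment with RDMA could be launched as below. This is a sketch: it assumes `llmeng/api.py` reads these variables from your local environment when the app is served, the URL in the `curl` call is a placeholder for whatever `modal serve` actually prints, and the server is assumed to expose the standard `/v1/chat/completions` route.

```bash
# Serve across 2 nodes x 8 H100s with RDMA enabled (values from the list above).
NNODES=2 N_GPU=8 GPU_TYPE=h100 RDMA=1 modal serve llmeng/api.py

# Query the OpenAI-compatible endpoint that `modal serve` prints.
# The URL below is a placeholder, not the real one for your workspace.
curl https://<your-workspace>--llmeng-api.modal.run/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-32B", "messages": [{"role": "user", "content": "Hello"}]}'
```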
Run the benchmark suite. Some important parameters:
- `--engine`: engine to benchmark (`llmeng`, `minisgl`, `all`, default=`all`)
- `--model`: model to benchmark (`Qwen/Qwen3-32B`, `Qwen/Qwen3-30B-A3B`, `all`, default=`all`)
- `--data`: data workload (`in_out_128_1024`, `in_out_256_2048`, `in_out_512_512`, `in_out_512_4096`, `in_out_1024_128`, `in_out_1024_1024`, `in_out_2048_256`, `in_out_2048_2048`, `in_out_4096_512`, `all`, default=`in_out_128_1024`)
- `--rate-type`: request workload (`synchronous`, `throughput`, `constant`, `all`, default=`synchronous`)
- `--rate`: request rate per second; only used with the `constant` rate type (float, default=`None`)
The results will be saved to `benchmark/results.json` and `benchmark/results.html`.
Work is parallelized across engine/model servers. Within each server, data workloads and rate workloads run sequentially to avoid interference.
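As a sketch (the flag names and values come straight from the list above), a single targeted benchmark run could look like this:

```bash
# Benchmark only llmeng on Qwen3-32B with a 1024-in/1024-out workload
# at a constant rate of 2 requests per second.
modal run benchmark/main.py \
  --engine llmeng \
  --model Qwen/Qwen3-32B \
  --data in_out_1024_1024 \
  --rate-type constant \
  --rate 2.0
```

With no flags, the defaults listed above are used: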
```bash
modal run benchmark/main.py
```

Run the tests. You can specify the following parameters:
- `--target`: a specific test file or test function (e.g., `tests/kernel/test_radix.py::test_fast_compare_key_perf`).
- `--pytest-args`: additional pytest arguments (e.g., `-q`).
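For instance, to run just the radix-kernel comparison test with quiet pytest output (a sketch built from the example values above; `modal run` forwards these flags to the entrypoint):

```bash
modal run ci.py --target tests/kernel/test_radix.py::test_fast_compare_key_perf --pytest-args="-q"
```

Without arguments, the full suite runs: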
```bash
modal run ci.py
```

- port mini-sglang to Modal
- replace nccl with penny (reverted for now due to compatibility issues between nccl4py and nvshmem4py)
- rewrite C++/CUDA/Triton in Numba/CuTe-DSL/CuTile
- benchmark against mini-sglang properly
- add speculative speculative decoding (SSD)
- mini-sglang
- Penny, worklogs 1, 2, 3
- kernel benchmarking code
- Simon Veitner's blog posts
- Getting Memory-Bound Kernels to Speed-of-Light
- cuda-python, nvidia-cutlass-dsl, cuda-tile, nccl4py, nvshmem
- LLM Almanac and its code
- SSD, paper