llm-engine

A high-performance LLM inference engine written in pure Python.


Development

Installation

prek install

Setup

uv venv
source .venv/bin/activate
uv pip install modal==1.3.5
modal setup
modal secret create huggingface-secret HF_TOKEN=<your_huggingface_token>
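
To confirm the secret was created, you can list your workspace secrets with the Modal CLI:

modal secret list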

Commands

Serve an OpenAI-compatible API server. You can specify the following environment variables:

  • NNODES: number of nodes (1..4, default=1)
  • N_GPU: number of GPUs per node (1..8, default=1)
  • GPU_TYPE: GPU type (l4, l40s, a100, a100-40gb, a100-80gb, rtx-pro-6000, h100/h100!, h200, b200/b200+, default=a100)
  • RDMA: whether to use RDMA (0 or 1, default=0)
modal serve llmeng/api.py

Note that for multi-node deployments on Hopper and Blackwell GPUs, your Modal workspace must have RDMA support.
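
For example, to serve across two H100 nodes with RDMA enabled (illustrative values, using the variables above):

NNODES=2 N_GPU=8 GPU_TYPE=h100 RDMA=1 modal serve llmeng/api.py

Since the server is OpenAI-compatible, any OpenAI client should work against it. A sketch of a request, assuming the standard /v1/chat/completions route and one of the models from the benchmark section:

# replace <served-url> with the URL that `modal serve` prints
curl <served-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-32B", "messages": [{"role": "user", "content": "Hello!"}]}'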

Run the benchmark suite. Some important parameters:

  • --engine: engine to benchmark (llmeng, minisgl, all, default=all)
  • --model: model to benchmark (Qwen/Qwen3-32B, Qwen/Qwen3-30B-A3B, all, default=all)
  • --data: data workload (in_out_128_1024, in_out_256_2048, in_out_512_512, in_out_512_4096, in_out_1024_128, in_out_1024_1024, in_out_2048_256, in_out_2048_2048, in_out_4096_512, all, default=in_out_128_1024)
  • --rate-type: request workload (synchronous, throughput, constant, all, default=synchronous)
  • --rate: request rate per second; only used with --rate-type constant (float, default=None)

Results are saved to benchmark/results.json and benchmark/results.html. Work is parallelized across engine/model servers; within each server, data workloads and rate workloads run sequentially to avoid interference.

modal run benchmark/main.py
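
For example, to benchmark llmeng on Qwen/Qwen3-32B under a constant load of 4 requests per second (all flags documented above):

modal run benchmark/main.py --engine llmeng --model Qwen/Qwen3-32B --data in_out_512_512 --rate-type constant --rate 4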

Run the tests. You can specify the following parameters:

  • --target: a specific test file or test function (e.g., tests/kernel/test_radix.py::test_fast_compare_key_perf)
  • --pytest-args: additional pytest arguments (e.g., -q)
modal run ci.py
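
For example, to run a single kernel test with quiet pytest output (using the parameters above):

modal run ci.py --target tests/kernel/test_radix.py::test_fast_compare_key_perf --pytest-args -q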

Roadmap

  • port mini-sglang to Modal
  • replace nccl with penny
    • reverted for now due to compatibility issues between nccl4py and nvshmem4py
  • rewrite C++/CUDA/Triton in Numba/CuTe-DSL/CuTile
  • benchmark against mini-sglang properly
  • add self-speculative decoding (SSD)

Credit
