BrainDrain

LLM Denial-of-Service Research Framework — an interactive CLI for running sustained prompt-injection DoS attacks against LLM serving infrastructure and measuring server degradation in real time.

Research use only. Only target systems you own or have explicit written authorisation to test.

Overview

BrainDrain orchestrates concurrent attack requests designed to exhaust GPU KV-cache memory and compute on LLM serving backends (vLLM, llama.cpp, Ollama, managed APIs, etc.). It supports two attack techniques from published research, a suite of side-channel probes that measure server degradation while the attack runs, and several tools to help calibrate the right attack parameters for a given target.

Installation

git clone <repo>
cd BrainDrain
python -m venv venv && source venv/bin/activate
pip install -e .
# Optional: richer terminal output
pip install rich

Requirements: Python ≥ 3.10, aiohttp, numpy, cmaes.

Running the CLI

python BrainDrain.py

The CLI is fully interactive — it walks you through a wizard with sensible defaults at each step. Press Enter to accept the value shown in brackets.

Wizard walkthrough

1. Target Configuration

The first section configures the endpoint that will be attacked.

API format

API format (openai / anthropic / ollama / vertex / custom) [openai]:

Format	Endpoint path used	Notes
`openai`	`/v1/chat/completions`	Any OpenAI-compatible server (vLLM, llama.cpp, together.ai, …)
`anthropic`	`/v1/messages`	Direct Anthropic API or Claude-compatible servers
`ollama`	`/api/chat`	Local Ollama instances
`vertex`	Full model URL (Vertex style)	Google Vertex AI — see below
`custom`	User-defined	Paste a full JSON request descriptor — see below

Target URL / IP

Target URL / IP (e.g. http://1.2.3.4 or https://api.example.com):

Base URL of the serving endpoint. The scheme defaults to http:// if omitted. For Vertex AI, paste the complete model path URL (the model name is extracted automatically).
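
The scheme default described above can be pictured with a small sketch. The helper name `normalise_base_url` is hypothetical, and stripping the trailing slash is an assumption made here for tidiness; the tool's own handling may differ.

```python
def normalise_base_url(raw: str) -> str:
    """Prepend http:// when no scheme is given (hypothetical helper)."""
    raw = raw.strip()
    if "://" not in raw:
        raw = "http://" + raw
    # Assumption: a trailing slash is dropped so endpoint paths can be appended.
    return raw.rstrip("/")


print(normalise_base_url("1.2.3.4"))                   # http://1.2.3.4
print(normalise_base_url("https://api.example.com/"))  # https://api.example.com
```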

Model name

Model name (e.g. deepseek-r1-7b, gpt-4o):

The model identifier forwarded in every request. For Vertex AI this is parsed from the URL automatically.

API key

API key (leave blank if not required):

Bearer token sent as Authorization: Bearer <key>. Leave blank for unauthenticated local servers. For Vertex AI, pass the OAuth2 access token from gcloud auth print-access-token.

Request timeout

Request timeout (seconds, leave blank for no timeout) [N/A]:

Per-request HTTP timeout. Enter N/A or leave blank for no timeout (recommended for long-running attack requests that are expected to take minutes). Enter a number of seconds to cap individual requests.

stream_options support (Ollama only)

Server supports stream_options (OpenAI >=1.x)? (y/n) [y]:

Only asked for Ollama. Set to n if your Ollama version does not support the stream_options field in the request body (older versions will reject it).

Vertex AI format

When vertex is selected, enter the full Vertex AI endpoint URL:

https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/MODEL

Example:

https://us-central1-aiplatform.googleapis.com/v1/projects/my-project/locations/us-central1/publishers/google/models/gemini-2.5-pro

The model name is extracted from the last path segment automatically. Use gcloud auth print-access-token to obtain a short-lived OAuth2 token as the API key.
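
As a rough illustration of the "last path segment" rule, the sketch below uses only the standard library. The function name `model_from_vertex_url` is hypothetical, and the handling of a `:method` suffix (as in `:generateContent`) is an extra precaution assumed here rather than documented behaviour of the tool.

```python
from urllib.parse import urlparse


def model_from_vertex_url(url: str) -> str:
    """Return the model name from the last path segment (hypothetical helper)."""
    path = urlparse(url).path.rstrip("/")
    segment = path.rsplit("/", 1)[-1]
    # Assumption: drop a trailing ":method" suffix if the URL carries one.
    return segment.split(":", 1)[0]


url = (
    "https://us-central1-aiplatform.googleapis.com/v1/projects/my-project"
    "/locations/us-central1/publishers/google/models/gemini-2.5-pro"
)
print(model_from_vertex_url(url))  # gemini-2.5-pro
```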

Custom format

When custom is selected, paste a complete JSON request descriptor. The tool injects the attack prompt at runtime by replacing the sentinel string "__PROMPT__" in the body.

Format:

{
  "url": "https://...",
  "headers": {
    "Authorization": "Bearer ...",
    "anthropic-beta": "interleaved-thinking-2025-05-14"
  },
  "body": {
    "model": "claude-opus-4-5",
    "max_tokens": 16000,
    "thinking": { "type": "enabled", "budget_tokens": 10000 },
    "messages": [{"role": "user", "content": "__PROMPT__"}]
  }
}

url — the exact endpoint URL to POST to.
headers — all HTTP headers except Content-Type (set automatically).
body — the full request body. Replace the message content value with the literal string "__PROMPT__".
stream and max_tokens are injected automatically at runtime; you can still include them as defaults.
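
The sentinel substitution can be sketched as follows. This is an illustration only: `fill_prompt` is a hypothetical name, and replacing the sentinel on the parsed structure (rather than on the raw JSON text) is a design choice assumed here because it sidesteps JSON escaping problems when the prompt contains quotes or newlines.

```python
import json

SENTINEL = "__PROMPT__"


def fill_prompt(node, prompt):
    """Return a copy of node with every sentinel occurrence replaced."""
    if isinstance(node, dict):
        return {key: fill_prompt(value, prompt) for key, value in node.items()}
    if isinstance(node, list):
        return [fill_prompt(value, prompt) for value in node]
    if isinstance(node, str):
        return node.replace(SENTINEL, prompt)
    return node  # numbers, booleans and null pass through unchanged


descriptor = json.loads("""
{
  "url": "https://example.invalid/v1/messages",
  "headers": {},
  "body": {
    "model": "example-model",
    "max_tokens": 16000,
    "messages": [{"role": "user", "content": "__PROMPT__"}]
  }
}
""")

body = fill_prompt(descriptor["body"], 'Say "hello"\non two lines')
print(json.dumps(body["messages"], indent=2))
```

The original descriptor is left untouched, so the same template can be reused for every request.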

After pasting the JSON, press Enter on a blank line to finish. You will then be asked to choose a response stream format (for parsing only) and a display model name.

2. KV Cache Saturation Calculator (optional)

Run KV cache saturation calculator? (y / n) [n]:

An interactive calculator that estimates how many concurrent attacker instances are needed to completely fill the KV cache on a single GPU. You provide the GPU and model parameters and it outputs the minimum number of concurrent requests required.

Inputs:

Prompt	Description
GPU VRAM (GB)	Total VRAM of the GPU (e.g. `96` for an RTX Pro 6000)
Model weights on-device (GB)	Size after quantisation (e.g. `14` for DeepSeek-R1-7B in bf16)
GPU memory utilisation fraction	vLLM's `--gpu-memory-utilization` setting (vLLM's default is `0.9`; `0.85` is a common more conservative value)
Activation overhead (GB)	CUDA workspace + activation buffers; typically `2` for a 7B model
Number of layers	Transformer depth (e.g. `28` for Qwen2.5-7B)
Number of KV heads	GQA KV-head count, not Q-head count (e.g. `4` for Qwen2.5-7B)
Head dimension	`hidden_size / num_q_heads` (e.g. `128`)
KV cache dtype bytes	`2` for fp16/bf16, `1` for fp8
Prompt tokens per request	Approximate token count of the attack prompt
Max tokens per request	The `max_tokens` value you plan to use in the attack

Output: the minimum number of concurrent requests to fill the KV cache. Use this as the lower bound for the "Number of concurrent attacker instances" setting in the attack configuration.
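
The arithmetic behind such an estimate is the standard KV-cache sizing formula used for capacity planning. The sketch below is not the tool's implementation: the function names are hypothetical, "GB" inputs are treated as GiB, and each request is assumed to occupy `prompt_tokens + max_tokens` cache entries at its peak.

```python
import math

GIB = 1024 ** 3  # assumption: the "GB" inputs are interpreted as GiB


def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_capacity_tokens(vram_gb, weights_gb, util, overhead_gb, bytes_per_token):
    # Memory left for the KV cache after weights and activation overhead.
    budget_bytes = (vram_gb * util - weights_gb - overhead_gb) * GIB
    return max(0, int(budget_bytes // bytes_per_token))


def requests_to_fill(capacity_tokens, prompt_tokens, max_tokens):
    # Assumption: peak footprint per request is prompt plus full completion.
    return math.ceil(capacity_tokens / (prompt_tokens + max_tokens))


per_token = kv_bytes_per_token(layers=28, kv_heads=4, head_dim=128, dtype_bytes=2)
capacity = kv_capacity_tokens(96, 14, 0.85, 2, per_token)
print(per_token)  # 57344 bytes, i.e. 56 KiB per cached token
print(capacity, requests_to_fill(capacity, prompt_tokens=256, max_tokens=16384))
```

The same numbers are useful to operators sizing `--gpu-memory-utilization` and per-request token limits for their own deployments.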

3. Load Balancer Detection (optional)

Run load-balancer detection test before attacking? (y / n) [n]:

Sends pairs of identical long-prefix requests to the target and measures whether the second request in each pair receives an Automatic Prefix Caching (APC) cache hit (significantly lower TTFT). Consistent cache hits indicate the requests land on the same backend node; absent hits indicate a load balancer is distributing them across multiple backends.

Inputs:

Prompt	Description
Number of probe pairs	How many warm/cold request pairs to send (default `6`)
Prompt repeat factor	Controls prefix length — higher means a longer shared prefix and a more detectable cache hit (default `12`)

Output: verdict (SAME_BACKEND vs LOAD_BALANCED), cache hit rate, median warm/cold TTFT ratio, and a per-pair table.

Why this matters: If requests always land on the same backend, setting a shared system prompt pins all attack and probe requests to the same KV-cache slot via prefix-aware routing, which concentrates the pressure on a single node. If a load balancer is detected you may need to run multiple independent attack processes or increase the number of instances to cover all backends.

4. System Prompt and Request Prefix

System prompt (leave blank to omit):
Request prefix (leave blank to omit):

Both fields are sent with every attack and probe request.

System prompt — set as the system field (Anthropic format) or as a system-role message (OpenAI format).
Request prefix — prepended to the user message content of every request (attack and probe).

A shared prefix pins all concurrent requests to the same KV-cache slot via prefix-aware routing on vLLM (Automatic Prefix Caching). This maximises cache pressure on a single memory region, making the attack more effective.

5. Attack Configuration

Attack type

Attack type (reasoning_bomb / think_trap) [reasoning_bomb]:

Type	Description
`reasoning_bomb`	Feeds short math/logic puzzles to a Large Reasoning Model (LRM). The model's extended chain-of-thought produces a pathologically long reasoning trace, achieving a very high output/input token amplification ratio. Based on Liu et al., 2026. Requires a puzzle JSON file. Only effective against LRMs, i.e. models with an extended chain-of-thought/thinking capability such as DeepSeek-R1, QwQ, Gemini 2.5, and Claude 3.7 Sonnet or later. Standard instruction-tuned models without a reasoning mode will not produce the long traces the attack relies on, so the amplification ratio will be negligible.
`think_trap`	Injects pre-optimised adversarial prompts generated offline by a CMA-ES search that maximised output token count against a surrogate model. Based on Li et al., NDSS 2026. Requires a prompt cache JSON file.

Number of concurrent attacker instances

Number of concurrent attacker instances [4]:

How many independent request loops to run in parallel. Each instance continuously sends a new request as soon as the previous one completes. Higher values fill the KV cache faster. Use the KV saturation calculator output as the target minimum.

Token budget

Token budget (leave blank for unlimited):

Optional total output token cap across all instances combined. The attack stops automatically once the cumulative output + reasoning token count reaches this limit. Leave blank to run until you press Ctrl+C.

Max tokens per request

Max tokens per request (max_completion_tokens) [<auto-detected>]:

The max_tokens parameter sent in every request. This controls the upper bound on how many tokens the server will generate per response. The CLI probes the target automatically to detect its context window limit and suggests a value that accounts for the prompt size:

max_completion_tokens = context_window - prompt_tokens - 64

Set this as high as the model allows to maximise KV-cache occupancy per request.
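
The suggestion formula above is plain arithmetic. A minimal sketch, with a hypothetical function name and a prompt size of 64 tokens chosen only to illustrate the calculation:

```python
def suggest_max_completion_tokens(context_window, prompt_tokens, margin=64):
    """context_window - prompt_tokens - margin, floored at zero."""
    return max(0, context_window - prompt_tokens - margin)


print(suggest_max_completion_tokens(131072, 64))  # 130944
```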

Reasoning Bomb specific options

Path to puzzle JSON file [prompts/reasoningBomb_puzzles.json]:
Puzzle budget tier (128 / 256 / 512 / mixed) [256]:

Puzzle file: Path to the sample_puzzles.json from the ReasoningBomb GitHub repo.
Budget tier: The token-budget tier of puzzles to draw from. Puzzles in each tier are designed to induce a specific reasoning depth in the model.

Tier	Prompt length	Typical amplification
`128`	~128 tokens	Highest (shortest input → most amplification)
`256`	~256 tokens	High
`512`	~512 tokens	Moderate
`mixed`	Rotates 128/256/512	Distributes across tiers; each instance gets a different tier to prevent synchronised completions

Each concurrent instance is assigned a different puzzle index so requests do not share a prompt (which would allow the server to serve them from prefix cache rather than computing a full response).

ThinkTrap specific options

Path to ThinkTrap prompts JSON file [prompts/thinktrap_prompts.json]:

Path to the adversarial prompt cache produced by ThinkTrapAPG. This file must be generated offline before running the attack (see the "Generating ThinkTrap Prompts (Offline)" section).

6. Sustained Pressure Options

These settings control how the attack avoids the "wave pattern" — where all instances happen to finish at the same time, briefly releasing the KV cache and letting the scheduler recover.

Launch stagger

Launch stagger between instances (seconds, 0=off) [0.0]:

Delay between starting each successive instance. Staggered launch times cause instances to complete at different wall-clock times, maintaining continuous KV-cache pressure even across request boundaries.

Max-token spread

Max-token spread (%, 0=off, e.g. 25) [0.0]:

Assigns each instance a slightly different max_tokens value distributed linearly across the range [max_tokens × (1 - spread%), max_tokens]. This ensures that even after the initial stagger window has passed, instances continue to complete at different times because they hit their token limits at different points.

Example: with max_tokens=16000 and spread 25%, 4 instances get [12000, 13333, 14667, 16000] tokens.
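
The linear distribution in this example can be reproduced with a few lines. The function name is hypothetical and rounding to the nearest integer is an assumption; it happens to match the figures quoted above.

```python
def spread_max_tokens(max_tokens, spread_pct, instances):
    """Evenly space per-instance limits over [max_tokens * (1 - spread), max_tokens]."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    if instances == 1 or spread_pct == 0:
        return [max_tokens] * instances
    low = max_tokens * (1 - spread_pct / 100)
    step = (max_tokens - low) / (instances - 1)
    return [round(low + i * step) for i in range(instances)]


print(spread_max_tokens(16000, 25, 4))  # [12000, 13333, 14667, 16000]
```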

Stream read delay

Stream read delay per chunk (seconds, 0=off) [0.0]:

Inserts a sleep between consuming each streamed chunk from the server. This backs up the server's output queue, extending the effective KV-block occupancy of each request beyond its pure generation time. The connection stays open and the KV blocks remain allocated until all tokens are consumed.

Recommended range: 0.005 – 0.02 seconds (5–20 ms per chunk).

ITL probing

Enable ITL probing during attack? (y / n) [y]:

When enabled, the orchestrator fires lightweight probe requests every 10 seconds and displays real-time KV-cache utilisation estimates and infra state classifications in the live status panel. When disabled, no probe requests are sent (useful if the target rate-limits all traffic and you need to preserve the full quota for attack requests).

Live Status Display

During the attack a live panel shows:

Field	Description
Elapsed	Wall-clock time since launch
Instances	Number of active attack loops
Requests	Total requests completed so far
Output tokens	Cumulative completion + reasoning tokens generated (vs budget if set)
Input tokens	Cumulative prompt tokens consumed
Infra state	Current classified server state (see below)
KV est	Estimated KV-cache utilisation ∈ [0, 1]
Mean ITL	Mean inter-token latency across recent probe window (ms)
TTFT	Time-to-first-token from most recent probe (s)
Errors	Count of failed requests and last error message

Infra states

State	Meaning
`unknown`	No baseline calibrated or insufficient probe data
`idle`	Near-baseline ITL and TTFT; KV usage low
`filling`	KV usage rising; ITL trending up (1.5–3× baseline)
`saturated`	KV near capacity; ITL >> baseline (3×+)
`hol_blocked`	TTFT has spiked — WAITING queue is frozen (head-of-line blocking)
`thrashing`	High ITL variance — preemption/eviction loop detected
`degraded`	Significant degradation that does not match other categories
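
Only the `filling` and `saturated` rows give numeric thresholds, so the sketch below covers just the ITL-ratio part of the classification. It is a partial illustration with a hypothetical function name; the TTFT-based and variance-based states (`hol_blocked`, `thrashing`, `degraded`) are deliberately left out because their criteria are not specified here.

```python
def classify_by_itl(mean_itl_ms, baseline_itl_ms):
    """Map an ITL measurement to idle / filling / saturated, or unknown."""
    if not baseline_itl_ms or mean_itl_ms is None:
        return "unknown"  # no calibrated baseline or no probe data
    ratio = mean_itl_ms / baseline_itl_ms
    if ratio >= 3.0:
        return "saturated"
    if ratio >= 1.5:
        return "filling"
    return "idle"


for itl in (11.0, 20.0, 45.0):
    print(itl, classify_by_itl(itl, baseline_itl_ms=10.0))
```

The same ratio is a reasonable alerting signal for operators monitoring their own serving stack.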

Press Ctrl+C at any time to stop the attack and print the final summary.

Error Recovery

If 10 consecutive requests fail with the same classifiable error, the attack pauses automatically and prompts you to fix the configuration:

Error category	Recovery prompt
`token_limit`	Enter a lower `max_tokens` value
`auth`	Enter a new API key
`model_not_found`	Enter the correct model name
`rate_limit`	Enter an inter-request delay in seconds

After correcting the setting, choose y to restart the instances with the updated configuration, or n to stop.
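
The "10 consecutive failures with the same error" rule can be sketched as a small counter. The class name and the convention of passing `None` for a successful request are assumptions made for illustration.

```python
class ConsecutiveErrorTracker:
    """Signals a pause after `threshold` identical errors in a row."""

    def __init__(self, threshold=10):
        self.threshold = threshold
        self.category = None
        self.count = 0

    def record(self, category):
        """Record one outcome; category is None on success. Returns True to pause."""
        if category is None:
            self.category, self.count = None, 0
            return False
        if category == self.category:
            self.count += 1
        else:
            # A different error category restarts the streak.
            self.category, self.count = category, 1
        return self.count >= self.threshold


tracker = ConsecutiveErrorTracker()
outcomes = [tracker.record("auth") for _ in range(10)]
print(outcomes.index(True) + 1)  # 10
```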

Attack Summary

After stopping, the CLI prints a summary table:

Metric	Description
Wall clock	Total elapsed time
Total requests	Requests sent (successful + failed)
Successful	Requests that completed with a response
Timed out	Requests that hit the timeout limit
Cancelled	Requests cancelled on shutdown
Failed	Requests that returned an error
Mean amp ratio	Mean (output tokens) / (input tokens) amplification
P95 amp ratio	95th-percentile amplification ratio
Output tokens	Total completion + reasoning tokens generated
Input tokens	Total prompt tokens consumed
Total tokens	Sum of output and input
Tokens/s	Output token generation rate
Mean TTFT	Mean time-to-first-token across all requests
Mean duration	Mean total request duration
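
For readers who want to recompute the amplification figures from per-request token counts, here is a minimal sketch. The function name is hypothetical, and the nearest-rank percentile method is an assumption; the tool may use a different interpolation.

```python
import math


def amplification_stats(pairs):
    """pairs: iterable of (input_tokens, output_tokens). Returns (mean, p95)."""
    ratios = sorted(out / inp for inp, out in pairs if inp > 0)
    if not ratios:
        raise ValueError("no requests with a non-zero input token count")
    mean = sum(ratios) / len(ratios)
    # Nearest-rank percentile: smallest value with at least 95% of data at or below it.
    rank = math.ceil(0.95 * len(ratios))
    return mean, ratios[rank - 1]


samples = [(100, 20000), (200, 30000), (128, 25600), (256, 12800)]
mean_ratio, p95_ratio = amplification_stats(samples)
print(mean_ratio, p95_ratio)  # 150.0 200.0
```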

Generating ThinkTrap Prompts (Offline)

Before running a think_trap attack you must generate the adversarial prompt cache. This is a one-time offline step that queries the target (or a compatible surrogate) to find prompts that maximise output length via CMA-ES optimisation.

import asyncio
from attacks.think_trap import ThinkTrapAPG, ThinkTrapConfig, make_victim_fn, load_surrogate_embeddings
from core.base_attack import TargetConfig, APIFormat

target = TargetConfig(
    base_url="http://your-target:8000",
    model="your-model",
    api_format=APIFormat.OPENAI,
)

cfg = ThinkTrapConfig(
    surrogate_model="meta-llama/Llama-2-7b-hf",  # or use surrogate_embeddings_path
    prompt_length=20,
    latent_dim=20,
    query_budget=200,
    top_k_keep=10,
)

T, tokenizer = load_surrogate_embeddings(cfg.surrogate_model)
victim = make_victim_fn(target, max_tokens=4096, tokenizer=tokenizer)

apg = ThinkTrapAPG(victim_fn=victim, T_surrogate=T, config=cfg, tokenizer=tokenizer)
asyncio.run(apg.run())
apg.save("prompts/thinktrap_prompts.json")

To avoid downloading the full model weights every run, save the embedding matrix once:

import numpy as np
T, tokenizer = load_surrogate_embeddings("meta-llama/Llama-2-7b-hf")
np.save("llama2_embeddings.npy", T)
# Next time:
T, _ = load_surrogate_embeddings("", embeddings_path="llama2_embeddings.npy")

Script interface (non-interactive)

bdscript.py mirrors every wizard option as a command-line flag, suitable for automated or headless deployments where no human is present at the keyboard — cron jobs, CI pipelines, or distributed test rigs.

python bdscript.py --help

Flag reference

Flag	Wizard equivalent	Default
`-t / --target URL`	Target URL	— (required)
`-m / --model NAME`	Model name	— (required except vertex/custom)
`-f / --format FORMAT`	API format	`openai`
`-k / --api-key KEY`	API key	—
`--timeout SECONDS`	Request timeout	none
`--no-stream-options`	stream_options prompt (Ollama)	off
`--no-verify-ssl`	—	off
`-S / --system-prompt TEXT`	System prompt	—
`--request-prefix TEXT`	Request prefix	—
`--custom-descriptor PATH`	Custom JSON descriptor	—
`--response-format FORMAT`	Custom response format	`openai`
`-a / --attack TYPE`	Attack type	`reasoning_bomb`
`-n / --instances N`	Concurrent instances	`4`
`--max-tokens N`	Max tokens per request	auto-detected
`--budget TOKENS`	Token budget	unlimited
`--no-preflight`	—	off
`--puzzle-file PATH`	Puzzle JSON file	`prompts/reasoningBomb_puzzles.json`
`--budget-tier TIER`	Budget tier	`256`
`--prompts-file PATH`	ThinkTrap prompts file	`prompts/thinktrap_prompts.json`
`--stagger SECONDS`	Launch stagger	`0`
`--spread PERCENT`	Max-token spread	`0`
`--stream-delay SECONDS`	Stream read delay	`0`
`--no-probe`	ITL probing toggle	off
`-o / --output PATH`	—	—

`--output` JSON format

When --output results.json is provided, a JSON file is written after the attack with two top-level keys:

{
  "config": {
    "target": "http://10.0.0.1:8000",
    "model": "deepseek-r1-7b",
    "format": "openai",
    "attack": "reasoning_bomb",
    "instances": 8,
    "max_tokens": 16384,
    "budget": null,
    "budget_tier": "256",
    "stagger_s": 8.0,
    "spread_pct": 25.0,
    "stream_delay_s": 0.01,
    "no_probe": false,
    "system_prompt": null,
    "request_prefix": null
  },
  "result": {
    "wall_clock_s": 120.5,
    "total_requests": 47,
    "successful": 45,
    "failed": 2,
    "mean_amplification_ratio": 312.4,
    "total_generated_tokens": 741280,
    "tokens_generated_per_second": 6151.0,
    "state_timeline": [[0.0, "idle"], [12.3, "filling"], [28.7, "saturated"]],
    "instance_stats": [...]
  }
}
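
A results file in this shape is easy to post-process. The sketch below assumes each `state_timeline` entry is `[seconds_since_start, state]`, which is inferred from the example rather than documented, and uses a trimmed inline sample so it runs on its own.

```python
import json

sample = """
{
  "result": {
    "wall_clock_s": 120.5,
    "total_requests": 47,
    "successful": 45,
    "state_timeline": [[0.0, "idle"], [12.3, "filling"], [28.7, "saturated"]]
  }
}
"""


def time_in_state(timeline, wall_clock_s):
    """Sum the seconds spent in each state, closing the last entry at wall_clock_s."""
    durations = {}
    ends = [entry[0] for entry in timeline[1:]] + [wall_clock_s]
    for (start, state), end in zip(timeline, ends):
        durations[state] = durations.get(state, 0.0) + (end - start)
    return durations


result = json.loads(sample)["result"]
success_rate = result["successful"] / result["total_requests"]
durations = time_in_state(result["state_timeline"], result["wall_clock_s"])
print(f"success rate: {success_rate:.1%}")
for state, seconds in durations.items():
    print(f"{state:10s} {seconds:6.1f} s")
```

For a real run, replace the inline sample with `json.load(open("results.json"))`.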

Examples

# Minimal — reasoning bomb, 4 instances, auto-detect tokens
python bdscript.py -t http://localhost:8000 -m deepseek-r1-7b

# ThinkTrap with full pressure options and JSON output
python bdscript.py \
    -t http://10.0.0.1:8000 -m qwen3-32b \
    --attack think_trap --instances 8 \
    --stagger 4.0 --spread 25 --stream-delay 0.01 \
    --system-prompt "You are a helpful assistant." \
    --output results/run_001.json

# Anthropic API target
python bdscript.py \
    -t https://api.anthropic.com -m claude-opus-4-5 \
    --format anthropic --api-key sk-ant-... \
    --attack reasoning_bomb --budget-tier 128 --instances 6

# Custom descriptor (e.g. Vertex AI with extended-thinking headers)
python bdscript.py \
    --format custom --custom-descriptor descriptor.json \
    --response-format anthropic --model claude-opus-4-5 \
    --attack reasoning_bomb --instances 4 --output results/vertex.json

# Run on multiple machines simultaneously (collect outputs centrally)
ssh node1 "cd BrainDrain && python bdscript.py -t http://target:8000 -m deepseek-r1-7b -n 8 -o /tmp/node1.json"
ssh node2 "cd BrainDrain && python bdscript.py -t http://target:8000 -m deepseek-r1-7b -n 8 -o /tmp/node2.json"

File layout

BrainDrain/
├── BrainDrain.py              # Interactive wizard CLI
├── bdscript.py                # Non-interactive script interface
├── attacks/
│   ├── reasoning_bomb.py      # ReasoningBomb attack module
│   └── think_trap.py          # ThinkTrap attack module (APG + DSA)
├── core/
│   └── base_attack.py         # Abstract base class, TargetConfig, AttackConfig
├── orchestration/
│   ├── attacker_instance.py   # Single sustained request loop
│   ├── registry.py            # Attack config factory helpers
│   └── result_collector.py    # Result aggregation and summary
├── probes/
│   ├── itl_probes.py          # ITL side-channel probe suite
│   └── kv_saturation_calculator.py
└── prompts/
    ├── reasoningBomb_puzzles.json   # Download from ReasoningBomb repo
    └── thinktrap_prompts.json       # Generated by ThinkTrapAPG

Quick example: attacking a local vLLM server

python BrainDrain.py

  API format: openai
  Target URL: http://localhost:8000
  Model name: deepseek-r1-7b
  API key:    (blank)
  Timeout:    N/A

  KV calculator: n
  LB detection:  n
  System prompt: (blank)
  Request prefix:(blank)

  Attack type:    reasoning_bomb
  Instances:      8
  Token budget:   (blank — unlimited)
  [auto-detected context limit: 131072 tokens]
  Max tokens:     130944

  Launch stagger: 8.0   (est. 60s request / 8 instances)
  Token spread:   25
  Stream delay:   0.01
  ITL probing:    y
