Hi, I built a small `llama.cpp` patchset for more graceful reasoning-length control. Instead of hard-cutting reasoning at token N, this patch progressively increases the logit of the `</think>` token as thinking gets longer.

What it adds:

- a `think_control` JSON field in the OpenAI-compatible request body

This was motivated by the same kind of problems discussed in:
Repository:
https://github.com/tls5657/think-control
If there is interest, I can also share benchmark results and tuning notes for Qwen-style reasoning models and VLM + structured-output workloads.
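To make the idea concrete, here is a minimal sketch of one way such a soft length control could be shaped. All names, parameters, and defaults below (`think_close_bias`, `start`, `ramp`, `max_bias`) are illustrative assumptions, not the patch's actual knobs:

```python
import math

def think_close_bias(thinking_tokens: int,
                     start: int = 512,
                     ramp: int = 1024,
                     max_bias: float = 8.0) -> float:
    """Return an additive logit bias for the </think> token.

    Zero until `start` thinking tokens have been emitted, then ramps
    smoothly up to `max_bias` over the next `ramp` tokens, so longer
    thoughts become increasingly likely to stop instead of being
    hard-cut at a fixed budget. Hypothetical sketch, not the patch.
    """
    if thinking_tokens <= start:
        return 0.0
    # Cosine ease-in: smooth progress in [0, 1] over the ramp window.
    t = min((thinking_tokens - start) / ramp, 1.0)
    return max_bias * (1.0 - math.cos(math.pi * t)) / 2.0
```

During sampling, the returned value would simply be added to the raw logit of the `</think>` token each step; a smooth ramp like this avoids the abrupt truncation mid-sentence that a hard token cap can cause.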