[Bug]: enable_block_reuse=True crashes Gemma 3 VSWA inflight batching with decoderFinishedEvent must be nullopt #13123

@stolorz

Description

System Info

  • CPU architecture: x86_64
  • GPU name: NVIDIA RTX PRO 4500 Blackwell
  • GPU memory size: 32623 MiB
  • NVIDIA driver version: 580.82.07
  • NVIDIA-SMI version: 580.82.07
  • CUDA version reported by nvidia-smi: 13.0
  • Python: 3.12.11
  • OS / container base: nvidia/cuda:13.0.0-cudnn-runtime-ubuntu24.04
  • PyTorch: 2.9.1
  • TensorRT-LLM versions tested:
    • 1.3.0rc11: engine build + inference both reproduced the issue
    • 1.2.0: issue still reproduced after downgrading to stable
  • Model: google/gemma-3-1b-it
  • Runtime path: Python LLM(...) API with async generation
  • Engine path: TensorRT engine (bfloat16_trt_llm_1_2_0)
  • Batching type: INFLIGHT
  • Relevant runtime config:
    • enable_block_reuse=True
    • max_seq_len=5500
    • max_attention_window=[512, 512, 512, 512, 512, 5500, 512, 512, 512, 512, 512, 5500, 512, 512, 512, 512, 512, 5500, 512, 512, 512, 512, 512, 5500, 512, 512]
  • GPU memory usage observed during failure: 30495 MiB / 32623 MiB

Who can help?

@laikhtewari

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

We are serving Gemma 3 1B through the TensorRT engine path using the Python LLM(...) API with async generation.
The crash is reproducible when all of the following are true:

  1. batching_type="INFLIGHT"
  2. KvCacheConfig(enable_block_reuse=True, max_attention_window=<VSWA list>)
  3. Gemma 3 engine built with max_seq_len=5500
The issue disappears when enable_block_reuse=False.

The VSWA list we use at runtime is:

[512, 512, 512, 512, 512, 5500, 512, 512, 512, 512, 512, 5500, 512, 512, 512, 512, 512, 5500, 512, 512, 512, 512, 512, 5500, 512, 512]

This is a 26-layer VSWA configuration: most layers use a local window of 512, layers 6, 12, 18, and 24 use the full window of 5500, and the final two layers also use 512.

Simplified runtime setup:
import asyncio

from transformers import AutoTokenizer
from tensorrt_llm import SamplingParams
from tensorrt_llm._tensorrt_engine import LLM
from tensorrt_llm.llmapi import KvCacheConfig

tokenizer = AutoTokenizer.from_pretrained("/path/to/gemma-3-1b-it")
attention_windows = [512, 512, 512, 512, 512, 5500, 512, 512, 512, 512, 512, 5500, 512, 512, 512, 512, 512, 5500, 512, 512, 512, 512, 512, 5500, 512, 512]

llm = LLM(
    "/path/to/engine_dir",
    tokenizer=tokenizer,
    kv_cache_config=KvCacheConfig(
        enable_block_reuse=True,
        max_attention_window=attention_windows,
    ),
    batching_type="INFLIGHT",
    max_batch_size=<engine build_config.max_batch_size>,
    max_seq_len=5500,
)

async def main():
    sampling_params = SamplingParams(max_tokens=64)
    output = await llm.generate_async(
        "test prompt",
        sampling_params=sampling_params,
    )
    return output

asyncio.run(main())

Expected behavior

Either enable_block_reuse=True should work for this Gemma 3 VSWA configuration, or TensorRT-LLM should reject the configuration cleanly if it is unsupported. It should not hit an internal assertion in forwardAsync.

Actual behavior

When enable_block_reuse=True, TensorRT-LLM crashes in inflight batching with:

Assertion failed: decoderFinishedEvent must be nullopt.

When enable_block_reuse=False, the exceptions disappear and the system is stable.

Observed failure:

Traceback (most recent call last):

  File "/gemma3_1b_service/.venv/lib/python3.12/site-packages/tensorrt_llm/llmapi/utils.py", line 40, in wrapper
    return func(*args, **kwargs)
  File "/root/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 81, in inner
    return func(*args, **kwds)
  File "/gemma3_1b_service/.venv/lib/python3.12/site-packages/tensorrt_llm/executor/result.py", line 474, in _handle_response
    handler(response.error_msg)
  File "/gemma3_1b_service/.venv/lib/python3.12/site-packages/tensorrt_llm/executor/executor.py", line 280, in _handle_background_error
    raise error
tensorrt_llm.executor.utils.RequestError: Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: decoderFinishedEvent must be nullopt. (../tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:1132)
1       0x7fef27d8197f tensorrt_llm::_v1::common::throwRuntimeError(char const*, int, char const*) + 76
2       0x7feee876a4a8 /gemma3_1b_service/.venv/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x1b6a4a8) [0x7feee876a4a8]
3       0x7feee9ec51ac tensorrt_llm::executor::Executor::Impl::forwardAsync(...) + 444
4       0x7feee9ecca7e tensorrt_llm::executor::Executor::Impl::executionLoop() + 1550

After the first failure, similar errors continue to be emitted for subsequent requests.

Additional notes

We reproduced the same issue on both tensorrt-llm==1.3.0rc11 and tensorrt-llm==1.2.0, so it is not specific to the release candidate.
The strongest signal so far is that the crash is specifically tied to KV cache block reuse. Disabling block reuse makes the service stable without other changes.
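The workaround we currently run with is the identical configuration with reuse disabled. Sketch of the changed fragment only (everything else from the reproduction above stays the same):

```python
# Workaround in use: same KvCacheConfig, but with KV cache block
# reuse disabled. This is the only change vs. the failing setup.
kv_cache_config = KvCacheConfig(
    enable_block_reuse=False,
    max_attention_window=attention_windows,
)
```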
GPU memory usage was high at the time of failure (30495 MiB / 32623 MiB), but the observed error is not an OOM; it is an internal assertion in the executor path.
This may be related to existing VSWA / KV-cache-reuse issues on the TRT engine path, but our case is a hard runtime failure rather than ineffective caching:
#12563 Prefix caching (block reuse) ineffective for VSWA models like Gemma 3 on TRT engine path
#2983 disable kv cache reuse when minimum window size is reached, instead of maximum window size

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.


    Labels

    KV-Cache Management, bug
