feat: support multimodal in KVBM #7637

Open
furionw wants to merge 1 commit into main from qiwa/kvbm-multimodal-verify

Conversation

@furionw
Contributor

@furionw furionw commented Mar 25, 2026

as title

Summary by CodeRabbit

Release Notes

Documentation

  • Added design documentation for multimodal prefix-caching improvements with a collision-safe hashing scheme to ensure block-level consistency
  • Provided reproduction script and diagnostic logs for troubleshooting a known multimodal request handling issue in the KV cache system

@github-actions github-actions bot added feat documentation Improvements or additions to documentation labels Mar 25, 2026
@coderabbitai
Contributor

coderabbitai bot commented Mar 25, 2026

Walkthrough

This PR introduces a design document, runtime log, and reproduction script. The plan.md document describes a two-stage hashing approach to update KVBM's multimodal prefix-caching semantics at block granularity. The vllm.txt log captures a vLLM engine failure with multimodal requests. The vllm_native_kvbm.sh script reproduces the KVBM multimodal failure in vanilla vLLM.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Design & Documentation**<br>`plan.md` | Specifies architectural changes for multimodal prefix-caching: introduces a two-stage collision-safe hashing scheme (`token_hash` + `extra_hash` → `block_hash`), outlines Rust-side structural extensions (`TokenBlockChunk`, `TokenBlockSequence`), and describes the Python-side helper (`mm_block_hashes.py`) and integration points for computing per-block multimodal hashes. |
| **Failure Evidence & Reproduction**<br>`vllm.txt`, `vllm_native_kvbm.sh` | Captures the vLLM engine crash triggered by multimodal requests and provides an executable Bash script that reproduces the failure by starting vLLM with the KVBM configuration and issuing mixed text/image requests. |
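The two-stage scheme summarized above can be illustrated with a small Rust sketch. This is an assumption-laden illustration, not the PR's implementation: the `block_hash` helper, the chaining order, and the choice of `DefaultHasher` are all hypothetical; the plan's real changes live in `TokenBlockChunk`/`TokenBlockSequence`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical two-stage, collision-safe block hash.
/// Stage 1 chains the parent block hash with the block's token IDs;
/// stage 2 mixes in an optional per-block multimodal `extra_hash`,
/// so identical token sequences with different images hash differently.
fn block_hash(parent: u64, tokens: &[u32], extra_hash: Option<u64>) -> u64 {
    let mut h = DefaultHasher::new();
    parent.hash(&mut h);
    tokens.hash(&mut h);
    let token_hash = h.finish(); // stage 1: text-only chain (unchanged semantics)
    match extra_hash {
        None => token_hash, // text-only block keeps the stage-1 hash
        Some(e) => {
            let mut h2 = DefaultHasher::new();
            token_hash.hash(&mut h2);
            e.hash(&mut h2); // stage 2: bind the multimodal content
            h2.finish()
        }
    }
}

fn main() {
    let text_only = block_hash(0, &[1, 2, 3], None);
    let with_image = block_hash(0, &[1, 2, 3], Some(0xDEAD_BEEF));
    // Same tokens, different multimodal content => different block hashes,
    // so a prefix-cache lookup cannot return the wrong image's KV blocks.
    assert_ne!(text_only, with_image);
}
```

Because each block hash chains through `parent`, two sequences sharing only a text prefix still share those leading blocks and diverge at the first block whose `extra_hash` differs.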

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is minimal and only states "as title", providing no substantive detail about the changes, implementation, or context required by the template. | Complete the description template with Overview (summary of multimodal KVBM support), Details (explain the two-stage hashing approach and file changes), Where to start (highlight `plan.md`, test files), and Related Issues (link any relevant GitHub issues). |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding multimodal support to KVBM, which is directly reflected in the documentation plan and related files. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage; skipping the check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (1)
vllm_native_kvbm.sh (1)

25-33: Harden startup timeout and cleanup semantics.

If readiness never succeeds, the script continues without a clear timeout failure path, and cleanup is not guaranteed on early exits. Add explicit timeout failure plus trap-based cleanup so the server process is always terminated.

Proposed script hardening:

```diff
 set -euo pipefail
 
 MODEL="Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"
-PORT=8000
+PORT="${PORT:-8000}"
+VLLM_PID=""
+
+cleanup() {
+  if [[ -n "${VLLM_PID}" ]]; then
+    kill "${VLLM_PID}" 2>/dev/null || true
+    wait "${VLLM_PID}" 2>/dev/null || true
+  fi
+}
+trap cleanup EXIT INT TERM
@@
 for i in $(seq 1 120); do
   if curl -s http://localhost:$PORT/v1/models > /dev/null 2>&1; then
     echo "Ready after ${i}s"
     break
   fi
   sleep 1
 done
+
+if ! curl -s http://localhost:$PORT/v1/models > /dev/null 2>&1; then
+  echo "Timed out waiting for vLLM on port ${PORT}" >&2
+  exit 1
+fi
@@
-# ── Cleanup ──
-echo ""
-echo "=== Cleanup ==="
-kill $VLLM_PID 2>/dev/null || true
-wait $VLLM_PID 2>/dev/null || true
-echo "Done."
+echo ""
+echo "=== Cleanup ==="
+echo "Done."
```

Also applies to: 69-74

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vllm_native_kvbm.sh` around lines 25 - 33, The readiness loop that polls
http://localhost:$PORT/v1/models (the for loop using seq 1 120 and curl) lacks a
failure path and has no guaranteed cleanup; modify it to track elapsed seconds (or
use a timeout counter) and if the loop exceeds the max (120s) exit with a
non-zero status and error message, and add a trap that captures EXIT (and
INT/TERM) to kill the background server PID variable (e.g., the PID you get when
launching vLLM) so the server is always terminated on early exit; apply the same
timeout+trap pattern to the second readiness block referenced around lines 69-74
to ensure symmetric behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plan.md`:
- Line 85: Tighten the contract for extra_block_hashes used by new() and
split_tokens(): ensure the vector has one entry per normalized block (use None
for text-only blocks) to avoid indexed-access hazards; implement a validation in
new() that either rejects mismatched lengths (return Err/panic) or normalizes
the input by expanding an empty or shorter extra_block_hashes into a
Vec<Option<u64>> of length num_blocks filled with None, and update
split_tokens() to rely on that validated/normalized vector instead of indexing
blindly (references: new(), split_tokens(), extra_block_hashes).
- Line 161: The fenced code block containing the example tokens and "Chained
seq_hash" lines is missing a language tag (MD040); update that fence to include
a language identifier (e.g., change ``` to ```text) so the block becomes ```text
... ```; locate the block that contains "Tokens: [sys×32, "What is", img_pad×48,
"?"]" and "Chained seq_hash: S0→S1 shared prefix, S2 diverges." and add the
language tag to the opening fence.
- Around line 49-50: The plan references the wrong Rust module path for the
tokens implementation; update any mentions that point to the nonexistent
llm-level tokens file to reference the actual tokens crate/module (the tokens
module used by the project, e.g., the tokens crate or tokens:: module and its
lib entry) and the block/token implementation symbol (the tokens module and its
block implementation) so guidance and TODOs target the real code; make the same
correction for the other occurrence noted in the plan.

In `@vllm_native_kvbm.sh`:
- Line 1: Add the required repository file header to the top of
vllm_native_kvbm.sh: insert the project's copyright/license/header boilerplate
as the first lines (above the shebang or replace the current first line if
policy expects header before shebang) so the file passes the repository
copyright/header check; ensure the header format matches other shell files in
the repo (same SPDX, copyright notices, and any required contributor lines).
- Line 9: The script currently hard-codes PORT=8000 which breaks reproducibility
on shared hosts; change the script to allocate a free port dynamically (or call
the repo’s port allocator utility) and assign it to the PORT variable, then use
that PORT when launching the server and in any curl calls; update all
occurrences of the literal 8000 in this script (the PORT variable and any
server/curl invocations) to reference the new PORT variable so the same
allocated port is consistently passed through.

In `@vllm.txt`:
- Line 1: The file vllm.txt is missing the standard repository file header which
is causing CI to fail; add the project's standard header block (the same header
used across other source files—license, copyright, and any required metadata
lines) to the very top of vllm.txt so the file matches the repository header
format expected by CI.
- Line 7: The log in utils.py (around the non-default args logging at
utils.py:261) is emitting environment-specific identifiers (e.g., engine_id,
kv_ip, kv_port and any other connector-specific fields) — update the code that
formats or emits the non-default args to redact or replace those fields with
stable placeholders (e.g., "<ENGINE_ID_REDACTED>", "<KV_IP_REDACTED>",
"<KV_PORT_REDACTED>") before logging; implement this by filtering or mapping
keys in the dict used for the log (the non-default args object) or by calling a
small sanitizer function (e.g., redact_env_identifiers) prior to the logging
call so engine_id, kv_connector_module_path, network endpoints and similar
infra-specific values are not committed in plaintext.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7d05730a-1438-4817-94e4-26065d300c0e

📥 Commits

Reviewing files that changed from the base of the PR and between 02b1c58 and 329f3ad.

📒 Files selected for processing (3)
  • plan.md
  • vllm.txt
  • vllm_native_kvbm.sh

Comment on lines +49 to +50
**File**: `lib/llm/src/tokens.rs`

Contributor


⚠️ Potential issue | 🟠 Major

Fix the Rust path references to the actual token module.

The plan points to lib/llm/src/tokens.rs, but the current token/block implementation is in lib/tokens/src/lib.rs. This mismatch can send implementation work to the wrong location.

Also applies to: 179-179



`new()` and `split_tokens()` accept `extra_block_hashes: Vec<Option<u64>>`. In `split_tokens()`, Rayon parallel map passes `extra_block_hashes[i]` to each chunk.


⚠️ Potential issue | 🟠 Major

Tighten extra_block_hashes length contract to avoid index hazards.

The plan currently permits an empty extra_block_hashes while also describing per-block indexed access. Please require one normalized entry per block (None for text-only blocks) or explicitly define a safe fallback for when the vector is empty.

Also applies to: 104-104, 119-129
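One way to satisfy that tightened contract is sketched below, assuming the plan's `Vec<Option<u64>>` shape. The function name `normalize_extra_hashes` and the `String` error type are hypothetical, not part of the plan:

```rust
/// Hypothetical validator/normalizer for the `extra_block_hashes` contract:
/// expand an empty or short vector to exactly one `Option<u64>` entry per
/// block, so downstream indexed access (`extra_block_hashes[i]`) cannot panic.
fn normalize_extra_hashes(
    mut extra: Vec<Option<u64>>,
    num_blocks: usize,
) -> Result<Vec<Option<u64>>, String> {
    if extra.len() > num_blocks {
        return Err(format!(
            "extra_block_hashes has {} entries for {} blocks",
            extra.len(),
            num_blocks
        ));
    }
    // Text-only blocks carry no extra hash; pad the tail with None.
    extra.resize(num_blocks, None);
    Ok(extra)
}

fn main() {
    // Empty input is padded to one None per block.
    assert_eq!(normalize_extra_hashes(vec![], 3).unwrap(), vec![None, None, None]);
    // Over-long input is rejected rather than silently truncated.
    assert!(normalize_extra_hashes(vec![None; 4], 3).is_err());
}
```

With this shape, `new()` could normalize once up front, letting `split_tokens()` index `extra_block_hashes[i]` without bounds concerns.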



## Block-Level Example



⚠️ Potential issue | 🟡 Minor

Add a language tag to the fenced code block (MD040).

This warning is valid from markdownlint; specify a fence language (for example, text).

Proposed doc fix:

````diff
-```
+```text
 Tokens: [sys×32, "What is", img_pad×48, "?"]
 block_size = 16
 ...
-Chained seq_hash: S0→S1 shared prefix, S2 diverges.
+Chained seq_hash: S0→S1 shared prefix, S2 diverges.
````
🧰 Tools: 🪛 markdownlint-cli2 (0.21.0)

[warning] 161-161: Fenced code blocks should have a language specified (MD040, fenced-code-language)



Labels

`documentation` (Improvements or additions to documentation), `feat`, `size/XL`


1 participant