
Garbled / gibberish output after serving Kimi-K2.5 with vLLM on 8×H200 (INT4) for some time #23

@momaek

Description


I’m serving Kimi-K2.5 on a single machine with 8× NVIDIA H200 GPUs using vLLM (OpenAI-compatible server). The service runs normally at first, but after running for a while the model sometimes starts returning garbled / nonsensical text: random multilingual fragments, broken tokens, and junk characters. The corruption shows up in the reasoning field, and the response becomes unreadable / meaningless.

The deployment follows the Kimi-K2.5 recommended inference setup (vLLM is listed as a recommended engine in the repo README).

Env

  • GPUs: 8× NVIDIA H200
  • Serving: vLLM OpenAI server
  • Container image: vllm-openai:nightly-8fae54faff485e446dc8d1a700417f07659ef89e
  • CUDA libs mounted via LD_LIBRARY_PATH=/usr/local/cuda-12.9/compat:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
  • Model: moonshotai/Kimi-K2.5 (local volume mount)

docker-compose

version: "3.9"

services:
  kimi_k25_int4:
    image: vllm-openai:nightly-8fae54faff485e446dc8d1a700417f07659ef89e
    container_name: kimi-k25
    ipc: host
    ports:
      - "40000:8000"
    environment:
      - LD_LIBRARY_PATH=/usr/local/cuda-12.9/compat:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - /data3/models/Kimi-K2.5:/model:ro
    command: >
      --host 0.0.0.0
      --port 8000
      --model /model
      --served-model-name kimi-k2.5
      --tensor-parallel-size 8
      --tool-call-parser kimi_k2
      --reasoning-parser kimi_k2
      --mm-encoder-tp-mode data
      --trust-remote-code
      --enable-auto-tool-choice

Steps to reproduce

  1. Start the server with the configuration above.
  2. Send chat completion requests normally (with reasoning enabled / returned by the server).
  3. After the server has been running for some time (and under ongoing requests), responses occasionally become garbled.
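For reference, step 2 sends ordinary chat-completion requests. A minimal sketch of the payload, assuming the compose file above (host port 40000, served model name kimi-k2.5, vLLM's standard OpenAI-compatible /v1/chat/completions route) — the prompt text here is just an example, not the exact traffic that triggered the bug:

```python
import json

def build_chat_request(prompt: str) -> dict:
    """Build a chat-completions payload matching the deployment above."""
    return {
        "model": "kimi-k2.5",  # from --served-model-name in the compose command
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }

if __name__ == "__main__":
    body = build_chat_request("Summarize the trade-offs of INT4 quantization.")
    # POST this body to http://<host>:40000/v1/chat/completions
    print(json.dumps(body, ensure_ascii=False, indent=2))
```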

Expected behavior

Responses (including reasoning) remain coherent and readable.

Actual behavior

The reasoning content becomes unreadable / looks like corrupted tokens. Example:

灵性土。稍地

-elect. In‍oJC. After。鸽0 Bloodh o199h18wmm4 o @ has been.3.
A more.

|dc I. .ah
00AACY undning0000

GThe fluoride
B在邓·王e要求:Orcle rock whiskeyTheaypal solar.  pick by barDear user通过短流为。这个gg
3lit coni.

Example: Digu1.stice oil comesk7 aerobic i-s.控件J4rab2 When office:λ

D radiation

00h8 a&#81488618 blog 005723O.
003007NH) is wedding. Thermal equipment virus serum
  December患失_APPROXry3388那那狗 .

Sel"
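To detect when the server has entered this state, I've been using a rough heuristic (my own helper, not part of vLLM or the Kimi repo): coherent prose rarely alternates rapidly between Latin and CJK scripts, but the corrupted output above does so constantly.

```python
def looks_garbled(text: str, max_transitions: int = 3) -> bool:
    """Return True if `text` switches between Latin and CJK scripts more
    than `max_transitions` times -- a pattern typical of the corrupted
    sampling output shown above, but rare in coherent prose."""
    scripts = []
    for ch in text:
        if "a" <= ch.lower() <= "z":
            scripts.append("latin")
        elif "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs
            scripts.append("cjk")
    transitions = sum(1 for a, b in zip(scripts, scripts[1:]) if a != b)
    return transitions > max_transitions
```

A line like `B在邓·王e要求:Orcle rock whiskey...` trips the check, while normal monolingual output (English or Chinese) does not; a proper fix obviously still needs to happen server-side.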
