[Bug]: under long hour testing, MTP with enable_schedule_overlap=false shows request prompt is too long #1127

@phantomlei3

Description

Your environment

MLU
commit id: 3011eb4c7a0f4722204878298bafe04c66164ea0

🐛 Describe the bug

test model: GLM-5-W8A8
start command:

for ((i = 0; i < NNODES; i++)); do
    DEVICE=$((START_DEVICE + i))
    LOG_FILE="${LOG_DIR}/node_${i}.log"
    # node_rank = server_rank * NNODES + i
    NODE_RANK=$((SERVER_RANK * NNODES + i))
    xllm \
        --model "${MODEL_PATH}" \
        --devices="mlu:${DEVICE}" \
        --draft_model "${MODEL_PATH}" --draft_devices="mlu:${DEVICE}" --num_speculative_tokens 1 \
        --port "${PORT}" \
        --host="0.0.0.0" \
        --master_node_addr="${MASTER_NODE_ADDR}" \
        --nnodes="${WORLD_SIZE}" \
        --max_memory_utilization=0.84 \
        --max_tokens_per_batch="${max_tokens_per_batch}" \
        --max_seqs_per_batch="${max_seqs_per_batch}" \
        --block_size=16 \
        --max_cache_size=0 \
        --enable_prefix_cache=true \
        --enable_chunked_prefill=true \
        --enable_schedule_overlap=false \
        --enable_prefill_sp=false \
        --node_rank="${NODE_RANK}" \
        --enable_shm=false \
        --enable_graph=false \
        --random_seed=42 \
        --reasoning_parser glm5 \
        --tool_call_parser glm5 \
        --expert_parallel_degree=2 \
        --dp_size=4 \
        --ep_size=${WORLD_SIZE} \
        > "${LOG_FILE}" 2>&1 &
done
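
For reference, the `node_rank` arithmetic used in the loop can be checked in isolation. This is a minimal sketch with hypothetical values (`SERVER_RANK=1`, `NNODES=4`); it only illustrates the mapping from local device index to global rank, not the real deployment:

```shell
# Hypothetical values illustrating the rank mapping in the launch loop:
# node_rank = server_rank * NNODES + local_index
SERVER_RANK=1
NNODES=4
for ((i = 0; i < NNODES; i++)); do
    NODE_RANK=$((SERVER_RANK * NNODES + i))
    echo "local device ${i} -> node_rank ${NODE_RANK}"
done
```

With these values the second server's nodes get ranks 4 through 7, so ranks are globally unique across servers.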

Description:
After long-hour testing, prefill_only_scheduler consistently reports "Request prompt is too long, no enough memory to schedule a single sequence." even when I send a small prompt and there are no other pending requests.

Log excerpt:

I20260329 21:33:44.348765 28607 request.cpp:92] x-request-id: , x-request-time: , request_id: chatcmpl-7722415046857869456-MKtu3aAF95WPirBSMLMrGj, sequence 0, max_tokens: 32768, temperature: 1, finish_reason: stop, prompt_tokens: 32792, generated_tokens: 24, ttft: 25476.0ms, total_latency: 26504.9ms, avg tpot: 44.7ms, generation speed: 23.3 tokens/s
I20260329 21:34:05.048291 28608 request.cpp:92] x-request-id: , x-request-time: , request_id: chatcmpl-7722415046857869456-ZwUgB4NC5EFmck9V5YPobM, sequence 0, max_tokens: 32768, temperature: 1, finish_reason: stop, prompt_tokens: 13723, generated_tokens: 514, ttft: 19702.0ms, total_latency: 40283.4ms, avg tpot: 40.1ms, generation speed: 25.0 tokens/s
I20260329 21:34:31.200860 28609 request.cpp:92] x-request-id: , x-request-time: , request_id: chatcmpl-7722415046857869456-RqJFEM7XjtCorrue24fhGz, sequence 0, max_tokens: 32768, temperature: 1, finish_reason: stop, prompt_tokens: 90582, generated_tokens: 631, ttft: 38378.0ms, total_latency: 63884.9ms, avg tpot: 40.5ms, generation speed: 24.7 tokens/s
E20260329 21:34:33.941910 28640 prefill_only_scheduler.cpp:222] Request prompt is too long, no enough memory to schedule a single sequence.
E20260329 21:34:33.942020 28640 prefill_only_scheduler.cpp:222] Request prompt is too long, no enough memory to schedule a single sequence.
E20260329 21:34:33.942056 28640 prefill_only_scheduler.cpp:222] Request prompt is too long, no enough memory to schedule a single sequence.
I20260329 21:34:33.942041 28610 request.cpp:118] x-request-id: , x-request-time: , request_id: chatcmpl-7722415046857869456-xRTKq8FZFo92LDQD5tbz7A, sequence 0, max_tokens: 32768, temperature: 1, prompt_tokens: 5720, status_code : 5, status_msg : No enough memory to schedule single sequence
E20260329 21:34:33.942098 28640 prefill_only_scheduler.cpp:222] Request prompt is too long, no enough memory to schedule a single sequence.
I20260329 21:34:33.942086 28608 request.cpp:118] x-request-id: , x-request-time: , request_id: chatcmpl-7722415046857869456-q7LGa6hn46ZBZyd8x9AQsY, sequence 0, max_tokens: 32768, temperature: 1, prompt_tokens: 30746, status_code : 5, status_msg : No enough memory to schedule single sequence
E20260329 21:34:33.942142 28640 prefill_only_scheduler.cpp:222] Request prompt is too long, no enough memory to schedule a single sequence.
I20260329 21:34:33.942102 28609 request.cpp:118] x-request-id: , x-request-time: , request_id: chatcmpl-7722415046857869456-HUyfF9NJH7DndYQDT49Gue, sequence 0, max_tokens: 32768, temperature: 1, prompt_tokens: 33069, status_code : 5, status_msg : No enough memory to schedule single sequence
E20260329 21:34:33.942168 28640 prefill_only_scheduler.cpp:222] Request prompt is too long, no enough memory to schedule a single sequence.
E20260329 21:34:33.942193 28640 prefill_only_scheduler.cpp:222] Request prompt is too long, no enough memory to schedule a single sequence.
I20260329 21:34:33.942723 28607 request.cpp:92] x-request-id: , x-request-time: , request_id: chatcmpl-7722415046857869456-bAz5smbowDGcwYhymnVjsm, sequence 0, max_tokens: 32768, temperature: 1, finish_reason: stop, prompt_tokens: 8142, generated_tokens: 61, ttft: 62676.0ms, total_latency: 65106.9ms, avg tpot: 40.5ms, generation speed: 25.1 tokens/s
I20260329 21:34:33.943156 28610 request.cpp:118] x-request-id: , x-request-time: , request_id: chatcmpl-7722415046857869456-8aJPb3GxWAKoK6nnjdQwbR, sequence 0, max_tokens: 32768, temperature: 1, prompt_tokens: 24306, status_code : 5, status_msg : No enough memory to schedule single sequence
I20260329 21:34:33.943763 28609 request.cpp:118] x-request-id: , x-request-time: , request_id: chatcmpl-7722415046857869456-tq7MeL6F2pB3i5TRQHqSDe, sequence 0, max_tokens: 32768, temperature: 1, prompt_tokens: 91555, status_code : 5, status_msg : No enough memory to schedule single sequence
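
To pinpoint when the rejections begin during a long run, the scheduler errors can be bucketed per minute. A minimal sketch, assuming the glog-style line format shown above (field 2 is the `HH:MM:SS.usec` timestamp); the heredoc stands in for a real node log such as one of the `node_*.log` files:

```shell
# Bucket "no enough memory" scheduler errors by minute; for a real run,
# pipe a node log into count_rejections instead of the sample heredoc.
count_rejections() {
    grep 'no enough memory to schedule' \
      | awk '{ print substr($2, 1, 5) }' \
      | sort | uniq -c
}
count_rejections <<'EOF'
E20260329 21:34:33.941910 28640 prefill_only_scheduler.cpp:222] Request prompt is too long, no enough memory to schedule a single sequence.
E20260329 21:34:33.942020 28640 prefill_only_scheduler.cpp:222] Request prompt is too long, no enough memory to schedule a single sequence.
EOF
```

If the counts jump from zero to a steady stream at some minute and never recover, that supports a gradual resource leak rather than a single oversized request.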
