[Bug]: under long hour testing, MTP with enable_schedule_overlap=false shows request prompt is too long #1127
Labels
bug: Something isn't working
Description
Your environment
MLU
commit id: 3011eb4c7a0f4722204878298bafe04c66164ea0
🐛 Describe the bug
test model: GLM-5-W8A8
start command:
for ((i = 0; i < NNODES; i++)); do
DEVICE=$((START_DEVICE + i))
LOG_FILE="${LOG_DIR}/node_${i}.log"
# node_rank = server_rank * NNODES + i
NODE_RANK=$((SERVER_RANK * NNODES + i))
xllm \
--model "${MODEL_PATH}" \
--devices="mlu:${DEVICE}" \
--draft_model "${MODEL_PATH}" --draft_devices="mlu:${DEVICE}" --num_speculative_tokens 1 \
--port "${PORT}" \
--host="0.0.0.0" \
--master_node_addr="${MASTER_NODE_ADDR}" \
--nnodes="${WORLD_SIZE}" \
--max_memory_utilization=0.84 \
--max_tokens_per_batch="${max_tokens_per_batch}" \
--max_seqs_per_batch="${max_seqs_per_batch}" \
--block_size=16 \
--max_cache_size=0 \
--enable_prefix_cache=true \
--enable_chunked_prefill=true \
--enable_schedule_overlap=false \
--enable_prefill_sp=false \
--node_rank="${NODE_RANK}" \
--enable_shm=false \
--enable_graph=false \
--random_seed=42 \
--reasoning_parser glm5 \
--tool_call_parser glm5 \
--expert_parallel_degree=2 \
--dp_size=4 \
--ep_size=${WORLD_SIZE} \
> "${LOG_FILE}" 2>&1 &
done
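For reference, the `node_rank = server_rank * NNODES + i` mapping used in the launch loop above can be sketched in isolation; the `SERVER_RANK` and `NNODES` values here are hypothetical stand-ins for the real deployment environment.

```shell
# Sketch of the node_rank mapping from the launch loop above.
# SERVER_RANK=1 and NNODES=4 are illustrative values only.
SERVER_RANK=1
NNODES=4
for ((i = 0; i < NNODES; i++)); do
  # node_rank = server_rank * NNODES + i
  NODE_RANK=$((SERVER_RANK * NNODES + i))
  echo "node ${i} -> node_rank=${NODE_RANK}"
done
```

With these values the loop prints node_rank 4 through 7, so each server contributes a contiguous, non-overlapping rank range.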
Description
After long-hour testing, prefill_only_scheduler consistently reports "Request prompt is too long, no enough memory to schedule a single sequence." even when there is no request load and I send only a small prompt.
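A minimal probe like the following can confirm the failure reproduces for a tiny prompt; the `/v1/chat/completions` endpoint path and the model name are assumptions about the serving API, not taken from the logs below.

```python
import json

# Hypothetical small-prompt probe: a request body with a few-token prompt.
# If the scheduler still rejects this with "prompt is too long", the error
# is unrelated to the actual prompt size.
payload = {
    "model": "GLM-5-W8A8",
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 16,
}
body = json.dumps(payload)
print(body)

# Send with e.g.:
#   curl -s http://127.0.0.1:${PORT}/v1/chat/completions \
#        -H 'Content-Type: application/json' -d "${BODY}"
```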
Log output:
I20260329 21:33:44.348765 28607 request.cpp:92] x-request-id: , x-request-time: , request_id: chatcmpl-7722415046857869456-MKtu3aAF95WPirBSMLMrGj, sequence 0, max_tokens: 32768, temperature: 1, finish_reason: stop, prompt_tokens: 32792, generated_tokens: 24, ttft: 25476.0ms, total_latency: 26504.9ms, avg tpot: 44.7ms, generation speed: 23.3 tokens/s
I20260329 21:34:05.048291 28608 request.cpp:92] x-request-id: , x-request-time: , request_id: chatcmpl-7722415046857869456-ZwUgB4NC5EFmck9V5YPobM, sequence 0, max_tokens: 32768, temperature: 1, finish_reason: stop, prompt_tokens: 13723, generated_tokens: 514, ttft: 19702.0ms, total_latency: 40283.4ms, avg tpot: 40.1ms, generation speed: 25.0 tokens/s
I20260329 21:34:31.200860 28609 request.cpp:92] x-request-id: , x-request-time: , request_id: chatcmpl-7722415046857869456-RqJFEM7XjtCorrue24fhGz, sequence 0, max_tokens: 32768, temperature: 1, finish_reason: stop, prompt_tokens: 90582, generated_tokens: 631, ttft: 38378.0ms, total_latency: 63884.9ms, avg tpot: 40.5ms, generation speed: 24.7 tokens/s
E20260329 21:34:33.941910 28640 prefill_only_scheduler.cpp:222] Request prompt is too long, no enough memory to schedule a single sequence.
E20260329 21:34:33.942020 28640 prefill_only_scheduler.cpp:222] Request prompt is too long, no enough memory to schedule a single sequence.
E20260329 21:34:33.942056 28640 prefill_only_scheduler.cpp:222] Request prompt is too long, no enough memory to schedule a single sequence.
I20260329 21:34:33.942041 28610 request.cpp:118] x-request-id: , x-request-time: , request_id: chatcmpl-7722415046857869456-xRTKq8FZFo92LDQD5tbz7A, sequence 0, max_tokens: 32768, temperature: 1, prompt_tokens: 5720, status_code : 5, status_msg : No enough memory to schedule single sequence
E20260329 21:34:33.942098 28640 prefill_only_scheduler.cpp:222] Request prompt is too long, no enough memory to schedule a single sequence.
I20260329 21:34:33.942086 28608 request.cpp:118] x-request-id: , x-request-time: , request_id: chatcmpl-7722415046857869456-q7LGa6hn46ZBZyd8x9AQsY, sequence 0, max_tokens: 32768, temperature: 1, prompt_tokens: 30746, status_code : 5, status_msg : No enough memory to schedule single sequence
E20260329 21:34:33.942142 28640 prefill_only_scheduler.cpp:222] Request prompt is too long, no enough memory to schedule a single sequence.
I20260329 21:34:33.942102 28609 request.cpp:118] x-request-id: , x-request-time: , request_id: chatcmpl-7722415046857869456-HUyfF9NJH7DndYQDT49Gue, sequence 0, max_tokens: 32768, temperature: 1, prompt_tokens: 33069, status_code : 5, status_msg : No enough memory to schedule single sequence
E20260329 21:34:33.942168 28640 prefill_only_scheduler.cpp:222] Request prompt is too long, no enough memory to schedule a single sequence.
E20260329 21:34:33.942193 28640 prefill_only_scheduler.cpp:222] Request prompt is too long, no enough memory to schedule a single sequence.
I20260329 21:34:33.942723 28607 request.cpp:92] x-request-id: , x-request-time: , request_id: chatcmpl-7722415046857869456-bAz5smbowDGcwYhymnVjsm, sequence 0, max_tokens: 32768, temperature: 1, finish_reason: stop, prompt_tokens: 8142, generated_tokens: 61, ttft: 62676.0ms, total_latency: 65106.9ms, avg tpot: 40.5ms, generation speed: 25.1 tokens/s
I20260329 21:34:33.943156 28610 request.cpp:118] x-request-id: , x-request-time: , request_id: chatcmpl-7722415046857869456-8aJPb3GxWAKoK6nnjdQwbR, sequence 0, max_tokens: 32768, temperature: 1, prompt_tokens: 24306, status_code : 5, status_msg : No enough memory to schedule single sequence
I20260329 21:34:33.943763 28609 request.cpp:118] x-request-id: , x-request-time: , request_id: chatcmpl-7722415046857869456-tq7MeL6F2pB3i5TRQHqSDe, sequence 0, max_tokens: 32768, temperature: 1, prompt_tokens: 91555, status_code : 5, status_msg : No enough memory to schedule single sequence