I'm not sure whether this is a bug, but it looks like one.
I'm using the 0.2.0 pre-compiled release for Windows with the following parameters (taken from the quickstart).
I have a 4090 and use the CUDA 12 build. I had about 22-23 GB of VRAM reserved.
.\llama-server.exe `
-m "models\gemma-4-31B-it-Q4_K_S.gguf" `
--spec-draft-model "models\gemma4-31b-it-dflash-Q4_K_M.gguf" `
--spec-type dflash `
--spec-dflash-cross-ctx 1024 `
--port 8082 `
-np 1 `
--kv-unified `
-ngl all `
--spec-draft-ngl all `
-b 2048 -ub 512 `
--ctx-size 32768 `
--cache-type-k q5_0 --cache-type-v q4_1 `
--flash-attn on `
--cache-ram 0 `
--jinja `
--no-mmap --mlock `
--no-host `
--reasoning on `
--temp 1.0 --top-k 64 --top-p 0.95 --min-p 0.0
After about two or three generations I get the following:
slot operator (): id 0 | task 467 | adaptive dm profit: cur=0 recommended=16 score=15.4 action=apply
sched_reserve: reserving ...
sched_reserve: CUDA0 compute buffer size = 139.88 MiB
sched_reserve: CUDA_Host compute buffer size = 5.44 MiB
sched_reserve: graph nodes = 185
sched_reserve: graph splits = 2
sched_reserve: reserve took 1.99 ms, sched copies = 1
slot operator (): id 0 | task 467 | adaptive dm profit: cur=16 recommended=14 score=22.0 action=apply
slot operator (): id 0 | task 467 | adaptive dm profit: cur=14 recommended=12 score=24.4 action=apply
slot operator (): id 0 | task 467 | adaptive dm profit: cur=12 recommended=10 score=24.6 action=apply
sched_reserve: reserving ...
sched_reserve: CUDA0 compute buffer size = 139.88 MiB
sched_reserve: CUDA_Host compute buffer size = 5.50 MiB
sched_reserve: graph nodes = 185
sched_reserve: graph splits = 2
sched_reserve: reserve took 2.31 ms, sched copies = 1
slot operator (): id 0 | task 467 | adaptive dm profit: cur=10 recommended=8 score=22.8 action=apply
slot operator (): id 0 | task 467 | adaptive dm profit: cur=8 recommended=7 score=23.2 action=apply
slot operator (): id 0 | task 467 | adaptive dm profit: cur=7 recommended=6 score=25.4 action=apply
slot operator (): id 0 | task 467 | adaptive dm profit: cur=6 recommended=5 score=25.5 action=apply
slot operator (): id 0 | task 467 | adaptive dm profit: cur=5 recommended=4 score=23.6 action=apply
sched_reserve: reserving ...
sched_reserve: CUDA0 compute buffer size = 139.88 MiB
sched_reserve: CUDA_Host compute buffer size = 5.57 MiB
sched_reserve: graph nodes = 185
sched_reserve: graph splits = 2
sched_reserve: reserve took 2.05 ms, sched copies = 1
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=386 cross_len=386)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=387 cross_len=387)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=388 cross_len=388)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=389 cross_len=389)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=390 cross_len=390)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=391 cross_len=391)
dflash: invalid reduced-logits token -1 in draft at row=1/6 (top_k=1 committed=392 cross_len=392)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=393 cross_len=393)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=394 cross_len=394)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=395 cross_len=395)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=396 cross_len=396)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=397 cross_len=397)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=398 cross_len=398)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=399 cross_len=399)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=400 cross_len=400)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=401 cross_len=401)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=402 cross_len=402)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=403 cross_len=403)
dflash: invalid reduced-logits token -1 in draft at row=1/6 (top_k=1 committed=404 cross_len=404)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=405 cross_len=405)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=406 cross_len=406)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=407 cross_len=407)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=408 cross_len=408)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=409 cross_len=409)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=410 cross_len=410)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=411 cross_len=411)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=412 cross_len=412)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=413 cross_len=413)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=414 cross_len=414)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=415 cross_len=415)
dflash: invalid reduced-logits token -1 in draft at row=1/16 (top_k=1 committed=416 cross_len=416)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=417 cross_len=417)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=418 cross_len=418)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=419 cross_len=419)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=420 cross_len=420)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=421 cross_len=421)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=422 cross_len=422)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=423 cross_len=423)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=424 cross_len=424)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=425 cross_len=425)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=426 cross_len=426)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=427 cross_len=427)
dflash: invalid reduced-logits token -1 in draft at row=1/6 (top_k=1 committed=428 cross_len=428)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=429 cross_len=429)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=430 cross_len=430)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=431 cross_len=431)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=432 cross_len=432)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=433 cross_len=433)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=434 cross_len=434)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=435 cross_len=435)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=436 cross_len=436)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=437 cross_len=437)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=438 cross_len=438)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=439 cross_len=439)
dflash: invalid reduced-logits token -1 in draft at row=1/6 (top_k=1 committed=440 cross_len=440)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=441 cross_len=441)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=442 cross_len=442)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=443 cross_len=443)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=444 cross_len=444)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=445 cross_len=445)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=446 cross_len=446)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=447 cross_len=447)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=448 cross_len=448)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=449 cross_len=449)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=450 cross_len=450)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=451 cross_len=451)
dflash: invalid reduced-logits token -1 in draft at row=1/6 (top_k=1 committed=452 cross_len=452)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=453 cross_len=453)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=454 cross_len=454)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=455 cross_len=455)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=456 cross_len=456)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=457 cross_len=457)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=458 cross_len=458)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=459 cross_len=459)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=460 cross_len=460)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=461 cross_len=461)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=462 cross_len=462)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=463 cross_len=463)
dflash: invalid reduced-logits token -1 in draft at row=1/16 (top_k=1 committed=464 cross_len=464)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=465 cross_len=465)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=466 cross_len=466)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=467 cross_len=467)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=468 cross_len=468)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=469 cross_len=469)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=470 cross_len=470)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=471 cross_len=471)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=472 cross_len=472)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=473 cross_len=473)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=474 cross_len=474)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=475 cross_len=475)
dflash: invalid reduced-logits token -1 in draft at row=1/6 (top_k=1 committed=476 cross_len=476)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=477 cross_len=477)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=478 cross_len=478)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=479 cross_len=479)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=480 cross_len=480)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=481 cross_len=481)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=482 cross_len=482)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=483 cross_len=483)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=484 cross_len=484)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=485 cross_len=485)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=486 cross_len=486)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=487 cross_len=487)
dflash: invalid reduced-logits token -1 in draft at row=1/6 (top_k=1 committed=488 cross_len=488)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=489 cross_len=489)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=490 cross_len=490)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=491 cross_len=491)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=492 cross_len=492)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=493 cross_len=493)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=494 cross_len=494)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=495 cross_len=495)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=496 cross_len=496)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=497 cross_len=497)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=498 cross_len=498)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=499 cross_len=499)
dflash: invalid reduced-logits token -1 in draft at row=1/6 (top_k=1 committed=500 cross_len=500)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=501 cross_len=501)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=502 cross_len=502)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=503 cross_len=503)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=504 cross_len=504)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=505 cross_len=505)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=506 cross_len=506)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=507 cross_len=507)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=508 cross_len=508)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=509 cross_len=509)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=510 cross_len=510)
dflash: invalid reduced-logits token -1 in draft at row=1/5 (top_k=1 committed=511 cross_len=511)
dflash: invalid reduced-logits token -1 in draft at row=1/16 (top_k=1 committed=512 cross_len=512)
sched_reserve: reserving ...
sched_reserve: CUDA0 compute buffer size = 139.88 MiB
sched_reserve: CUDA_Host compute buffer size = 5.63 MiB
sched_reserve: graph nodes = 185
sched_reserve: graph splits = 2
sched_reserve: reserve took 1.91 ms, sched copies = 1
Is this normal behavior? The last part repeats three times.
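In case it helps with triage, here is a small sketch for condensing the repeated `dflash: invalid reduced-logits token` lines into a summary instead of pasting all of them. It only relies on the message format exactly as it appears in the log above. The function name `summarize` and the key `row_totals` are names made up for this example, and I'm not assuming anything about what the second number in `row=1/N` means internally; the script just counts how often each value shows up.

```python
import re
from collections import Counter

# Warning format copied from the log above; nothing here comes from the server's source code.
WARN = re.compile(
    r"dflash: invalid reduced-logits token (-?\d+) in draft at "
    r"row=(\d+)/(\d+) \(top_k=(\d+) committed=(\d+) cross_len=(\d+)\)"
)
# Strip ANSI color resets such as ESC[0m that the console adds to the output.
ANSI = re.compile(r"\x1b\[[0-9;]*m")


def summarize(log_text):
    """Count dflash warnings and report the range of 'committed' values."""
    row_totals = Counter()  # how often each N in row=1/N occurs
    committed = []
    for line in log_text.splitlines():
        match = WARN.search(ANSI.sub("", line))
        if match is None:
            continue  # not a dflash warning line
        row_totals[int(match.group(3))] += 1
        committed.append(int(match.group(5)))
    return {
        "warnings": len(committed),
        "committed_min": min(committed) if committed else None,
        "committed_max": max(committed) if committed else None,
        "row_totals": dict(row_totals),
    }
```

To use it, save the server output to a text file, read it into a string, and pass that string to `summarize`.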