Replies: 5 comments
-
You’ve identified a classic determinism failure: ProblemMap No.15, “Inference seed drift & non-reproducibility.”
This means that even with the same chat_template, you’ll hit small, often invisible, numerical differences that amplify down the decoding path. Problem details and mitigation tips are mapped in this public index:
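To make “amplify down the decoding path” concrete, here is a minimal, self-contained sketch (mine, not from the linked index): a sub-epsilon logit difference at a near-tie flips the greedy argmax, and every later token is then conditioned on a different prefix.

```python
import numpy as np

# Two runs of the "same" forward pass, differing only by float noise
# (e.g. a different kernel or reduction order). Values are made up.
logits_run_a = np.array([2.500001, 2.500000, 0.1])
logits_run_b = np.array([2.500000, 2.500001, 0.1])

# Greedy decoding (temperature=0) takes the argmax, so a ~1e-6 difference
# at a near-tie picks a different token...
print(np.argmax(logits_run_a))  # -> 0
print(np.argmax(logits_run_b))  # -> 1
# ...and from that step on the two generations diverge, because each
# subsequent distribution is conditioned on a different prefix.
```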
-
Reproducibility with LLMs is genuinely hard! At RevolutionAI (https://revolutionai.io), we have wrestled with this extensively. Here is what actually works.

Why temperature=0 is not enough:

What actually helps:

```python
# Add ALL of these:
temperature=0
top_p=1.0
seed=42
do_sample=False  # Critical!
use_cache=True
```

Additional measures:

Reality check: Even with all this, expect ~95-99% reproducibility, not 100%. For production, we test tolerance ranges rather than exact matches.

Pro tip: If you NEED exact reproducibility, cache responses for known inputs.

What is your use case: testing, compliance, or something else?
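As a rough illustration of the caching tip, a sketch along these lines works (names here are hypothetical; `generate_fn` stands in for whatever client you call, vLLM offline or an HTTP endpoint):

```python
import hashlib
import json

_response_cache: dict[str, str] = {}

def cached_generate(prompt: str, params: dict, generate_fn) -> str:
    # Key on the exact prompt plus the full sampling config, so any change
    # to either produces a cache miss instead of a silently different reply.
    key = hashlib.sha256(
        json.dumps({"prompt": prompt, "params": params}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _response_cache:
        _response_cache[key] = generate_fn(prompt, **params)
    return _response_cache[key]
```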
-
Deterministic outputs are surprisingly hard with LLMs! Here's why and how to get closer.

Why temperature=0, top_p=1, seed=42 isn't enough:

What actually helps:

```python
from vllm import LLM, SamplingParams

sampling = SamplingParams(
    temperature=0,  # Greedy
    top_p=1.0,
    seed=42,
    max_tokens=100,
    # Key addition:
    use_beam_search=False,
    best_of=1,
)
```

Additional steps:

For true determinism:

```
CUBLAS_WORKSPACE_CONFIG=:4096:8 python ...
```

or

```python
torch.use_deterministic_algorithms(True)
```

We've chased determinism at RevolutionAI for regulated clients. Often "close enough" (99.9% match) is acceptable; perfect determinism is very costly. What's your use case requiring exact reproducibility?
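If you want to try both knobs together, a minimal sketch is below. This is my assumption about the wiring, not a verified recipe: the env var must be set before cuBLAS is initialized, and `torch.use_deterministic_algorithms(True)` can raise for ops without deterministic implementations (vLLM's custom kernels are not covered by it).

```python
import os

# Must be set before any cuBLAS handle exists, i.e. before importing torch.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch
from vllm import LLM, SamplingParams

# Prefer deterministic kernels; raises if an op only has a non-deterministic one.
torch.use_deterministic_algorithms(True)

llm = LLM(model="Qwen/Qwen2.5-14B-Instruct")
params = SamplingParams(temperature=0, top_p=1.0, seed=42, max_tokens=100)
print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)
```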
-
Perfect determinism across different hardware/configs is unfortunately not achievable with LLMs. Here is why.

Why identical params still vary:

What you CAN control:

Practical approach: We run reproducibility-sensitive workloads at Revolution AI; pinning hardware and versions is the only reliable path.
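One way to make "pin hardware + versions" actionable is to record an environment fingerprint next to every batch of outputs, so two runs are only compared when the fingerprints match. A sketch (the helper name is mine):

```python
import subprocess
import torch
import vllm

def environment_fingerprint() -> dict:
    """Collect the details that most often explain output drift."""
    return {
        "vllm": vllm.__version__,
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
        "gpu_count": torch.cuda.device_count(),
        "driver": subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True,
        ).stdout.strip(),
    }

# Store this dict alongside each batch of generations; only expect two runs
# to match when their fingerprints match.
```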
-
LLM reproducibility is surprisingly hard! Here is why each factor matters.

Why outputs differ:

How to maximize reproducibility:

```python
from vllm import SamplingParams

# Pin everything
sampling_params = SamplingParams(
    temperature=0,
    top_p=1.0,
    top_k=1,  # Add this!
    seed=42,
    use_beam_search=False,
)
```

```
# Disable speculative decoding
--speculative-model None
# Force single GPU
--tensor-parallel-size 1
# Disable CUDA graphs
--enforce-eager
```

True reproducibility: We deploy vLLM at Revolution AI; for prod, we version-lock everything and accept minor variance.
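Accepting minor variance in practice usually means asserting similarity rather than equality. A small sketch of how a regression test might tolerate drift (the threshold is arbitrary; pick one that matches your risk tolerance):

```python
import difflib

def outputs_close_enough(reference: str, candidate: str, threshold: float = 0.99) -> bool:
    # Ratio of matching characters between the two generations (0.0 - 1.0).
    ratio = difflib.SequenceMatcher(None, reference, candidate).ratio()
    return ratio >= threshold

# Example: fail only when the new build drifts noticeably from the pinned
# reference output captured with the same prompt and parameters.
assert outputs_close_enough("The capital of France is Paris.",
                            "The capital of France is Paris.")
```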
-
We are using Qwen2.5-14B-Instruct with vLLM. However, we found that the following things can make the output different, even when we set temperature=0, top_p=1, seed=42:
- vllm serve is different from vLLM offline inference, using the same chat_template
- vllm serve with a different number of cards
That is strange. Can someone tell me why, and how can I fix the output when changing inference environments?