Replies: 5 comments
-
You’ve identified a classic determinism failure: ProblemMap No.15, “Inference seed drift & non-reproducibility.”
This means that even with the same chat_template, you’ll hit small, often invisible, numerical differences that amplify down the decoding path. Problem details and mitigation tips are mapped in this public index:
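To make “amplify down the decoding path” concrete, here is a minimal, self-contained sketch (mine, not from the linked index): a sub-epsilon logit difference at a near-tie flips the greedy argmax, and every later token is then conditioned on a different prefix.

```python
import numpy as np

# Two runs of the "same" forward pass, differing only by float noise
# (e.g. a different kernel or reduction order). Values are made up.
logits_run_a = np.array([2.500001, 2.500000, 0.1])
logits_run_b = np.array([2.500000, 2.500001, 0.1])

# Greedy decoding (temperature=0) takes the argmax, so a ~1e-6 difference
# at a near-tie picks a different token...
print(np.argmax(logits_run_a))  # -> 0
print(np.argmax(logits_run_b))  # -> 1
# ...and from that step on the two generations diverge, because each
# subsequent distribution is conditioned on a different prefix.
```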
-
Reproducibility with LLMs is genuinely hard! At RevolutionAI (https://revolutionai.io), we have wrestled with this extensively. Here is what actually works.

Why temperature=0 is not enough:

What actually helps:

```python
# Add ALL of these:
temperature=0
top_p=1.0
seed=42
do_sample=False  # Critical!
use_cache=True
```

Additional measures:

Reality check: Even with all this, expect ~95-99% reproducibility, not 100%. For production, we test tolerance ranges rather than exact matches.

Pro tip: If you NEED exact reproducibility, cache responses for known inputs.

What is your use case: testing, compliance, or something else?
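As a rough illustration of the caching tip, a sketch along these lines works (names here are hypothetical; `generate_fn` stands in for whatever client you call, vLLM offline or an HTTP endpoint):

```python
import hashlib
import json

_response_cache: dict[str, str] = {}

def cached_generate(prompt: str, params: dict, generate_fn) -> str:
    # Key on the exact prompt plus the full sampling config, so any change
    # to either produces a cache miss instead of a silently different reply.
    key = hashlib.sha256(
        json.dumps({"prompt": prompt, "params": params}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _response_cache:
        _response_cache[key] = generate_fn(prompt, **params)
    return _response_cache[key]
```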
-
Deterministic outputs are surprisingly hard with LLMs! Here's why and how to get closer.

Why temperature=0, top_p=1, seed=42 isn't enough:

What actually helps:

```python
from vllm import LLM, SamplingParams

sampling = SamplingParams(
    temperature=0,  # Greedy
    top_p=1.0,
    seed=42,
    max_tokens=100,
    # Key addition:
    use_beam_search=False,
    best_of=1,
)
```

Additional steps:

For true determinism:

```
CUBLAS_WORKSPACE_CONFIG=:4096:8 python ...
```

or

```python
torch.use_deterministic_algorithms(True)
```

We've chased determinism at RevolutionAI for regulated clients. Often "close enough" (99.9% match) is acceptable; perfect determinism is very costly. What's your use case requiring exact reproducibility?
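If you want to try both knobs together, a minimal sketch is below. This is my assumption about the wiring, not a verified recipe: the env var must be set before cuBLAS is initialized, and `torch.use_deterministic_algorithms(True)` can raise for ops without deterministic implementations (vLLM's custom kernels are not covered by it).

```python
import os

# Must be set before any cuBLAS handle exists, i.e. before importing torch.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch
from vllm import LLM, SamplingParams

# Prefer deterministic kernels; raises if an op only has a non-deterministic one.
torch.use_deterministic_algorithms(True)

llm = LLM(model="Qwen/Qwen2.5-14B-Instruct")
params = SamplingParams(temperature=0, top_p=1.0, seed=42, max_tokens=100)
print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)
```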
-
Perfect determinism across different hardware/configs is unfortunately not achievable with LLMs. Here is why.

Why identical params still vary:

What you CAN control:

Practical approach: We run reproducibility-sensitive workloads at Revolution AI; pinning hardware and versions is the only reliable path.
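One way to make "pin hardware + versions" actionable is to record an environment fingerprint next to every batch of outputs, so two runs are only compared when the fingerprints match. A sketch (the helper name is mine):

```python
import subprocess
import torch
import vllm

def environment_fingerprint() -> dict:
    """Collect the details that most often explain output drift."""
    return {
        "vllm": vllm.__version__,
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
        "gpu_count": torch.cuda.device_count(),
        "driver": subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True,
        ).stdout.strip(),
    }

# Store this dict alongside each batch of generations; only expect two runs
# to match when their fingerprints match.
```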
-
LLM reproducibility is surprisingly hard! Here is why each factor matters.

Why outputs differ:

How to maximize reproducibility:

```python
from vllm import SamplingParams

# Pin everything
sampling_params = SamplingParams(
    temperature=0,
    top_p=1.0,
    top_k=1,  # Add this!
    seed=42,
    use_beam_search=False,
)
```

```
# Disable speculative decoding
--speculative-model None
# Force single GPU
--tensor-parallel-size 1
# Disable CUDA graphs
--enforce-eager
```

True reproducibility: We deploy vLLM at Revolution AI; for prod, we version-lock everything and accept minor variance.
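Accepting minor variance in practice usually means asserting similarity rather than equality. A small sketch of how a regression test might tolerate drift (the threshold is arbitrary; pick one that matches your risk tolerance):

```python
import difflib

def outputs_close_enough(reference: str, candidate: str, threshold: float = 0.99) -> bool:
    # Ratio of matching characters between the two generations (0.0 - 1.0).
    ratio = difflib.SequenceMatcher(None, reference, candidate).ratio()
    return ratio >= threshold

# Example: fail only when the new build drifts noticeably from the pinned
# reference output captured with the same prompt and parameters.
assert outputs_close_enough("The capital of France is Paris.",
                            "The capital of France is Paris.")
```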
-
We are using Qwen2.5-14B-Instruct with vLLM. However, we found that the following things can make the output different, even when we set temperature=0, top_p=1, seed=42:
- vllm serve is different from vLLM offline inference, using the same chat_template
- vllm serve with a different number of cards
That is strange. Can someone tell me why, and how can I fix the output when changing inference environments?