This document explains the design and implementation of unified generation parameters (temperature, seed, thinking/reasoning) across all AbstractCore providers.
Parameters are declared at the AbstractCoreInterface level, ensuring:
- Consistent API contract across all providers
- Type safety and documentation at the interface level
- Automatic inheritance by all provider implementations
Common parameters are handled centrally to avoid:
- Code duplication across providers
- Inconsistent parameter handling
- Maintenance overhead for parameter changes
Providers that don't support certain parameters:
- Accept the parameters without error
- Issue appropriate warnings (e.g., Anthropic seed warning)
- Maintain consistent API behavior
- Provide fallback mechanisms where possible
```
AbstractCoreInterface (interface.py)
├── temperature: float = 0.7             # Interface-level default
├── seed: Optional[int] = None           # Interface-level default
├── thinking: Optional[bool|str] = None  # Unified thinking/reasoning control (best-effort)
└── _validate_parameters()               # Validation logic

BaseProvider (base.py)
├── _prepare_generation_kwargs()   # Unified parameter processing
├── _extract_generation_params()   # Parameter extraction helper
├── _apply_thinking_request()      # Provider-agnostic + provider-specific thinking mapping
└── Parameter fallback hierarchy   # kwargs → instance → defaults

Individual Providers
├── Provider-specific parameters only (top_p, frequency_penalty, etc.)
├── Provider-specific parameter mapping
└── Native API integration
```

Parameters follow a clear precedence order:
1. Method-level kwargs (highest priority)

   ```python
   llm.generate("Hello", temperature=0.9, seed=123)
   ```

2. Instance-level parameters

   ```python
   llm = create_llm("openai", temperature=0.5, seed=42)
   ```

3. Interface defaults (lowest priority)

   ```python
   # temperature=0.7, seed=None (from AbstractCoreInterface)
   ```
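This precedence can be sketched as a small resolution helper (illustrative only; the real logic lives in `BaseProvider._extract_generation_params()`):

```python
INTERFACE_DEFAULTS = {"temperature": 0.7, "seed": None}

def resolve_param(name, call_kwargs, instance_params):
    # 1. Method-level kwargs win...
    if name in call_kwargs:
        return call_kwargs[name]
    # 2. ...then instance-level values...
    if instance_params.get(name) is not None:
        return instance_params[name]
    # 3. ...then interface defaults.
    return INTERFACE_DEFAULTS[name]

# kwargs beat the instance value, which beats the interface default:
print(resolve_param("temperature", {"temperature": 0.9}, {"temperature": 0.5}))  # 0.9
print(resolve_param("temperature", {}, {"temperature": 0.5}))                    # 0.5
print(resolve_param("temperature", {}, {}))                                      # 0.7
```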
```python
# Direct parameter mapping to provider API
call_params["temperature"] = params["temperature"]
if "seed" in params:
    call_params["seed"] = params["seed"]
```

```python
# Accept parameters but log limitation
if "seed" in params:
    self.logger.debug(f"Seed {params['seed']} requested but not supported - logged for debugging")
```

Portkey is a routing gateway that forwards payloads to many backends (OpenAI, Anthropic, Gemini, Grok, etc.). To avoid sending defaults that strict models reject, the Portkey provider:
- Forwards optional generation parameters only when explicitly set by the user (constructor or `generate()` kwargs).
- Drops unsupported parameters for OpenAI reasoning families (gpt-5/o1), and uses `max_completion_tokens` instead of `max_tokens` for those models.
- Keeps legacy `max_tokens` for non-reasoning families to preserve compatibility with older backends.
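A sketch of that forwarding rule (hypothetical helper; exactly which sampling parameters count as unsupported for reasoning families is an assumption here, not taken from the list above):

```python
def build_portkey_payload(model: str, messages: list, **explicit):
    """Include optional generation params only when the caller explicitly set them."""
    payload = {"model": model, "messages": messages}
    reasoning_family = model.startswith(("gpt-5", "o1"))
    for key, value in explicit.items():
        if value is None:
            continue  # unset defaults are never forwarded to strict backends
        if key == "max_tokens" and reasoning_family:
            payload["max_completion_tokens"] = value  # reasoning models reject max_tokens
        elif key == "temperature" and reasoning_family:
            continue  # assumed-unsupported sampling param, dropped for these families
        else:
            payload[key] = value
    return payload

print(build_portkey_payload("gpt-5", [], max_tokens=100, temperature=0.7))
# {'model': 'gpt-5', 'messages': [], 'max_completion_tokens': 100}
```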
Modern models may expose “thinking”/“reasoning effort” as either:
- a request-side control (enable/disable or low/medium/high), and/or
- a separate output channel (provider fields or inline tags).
AbstractCore exposes a single best-effort parameter:
```python
response = llm.generate("Solve this", thinking=None)      # auto (provider/model default)
response = llm.generate("Solve this", thinking="off")     # try to reduce/disable thinking
response = llm.generate("Solve this", thinking="none")    # alias for "off"
response = llm.generate("Solve this", thinking="on")      # enable thinking
response = llm.generate("Solve this", thinking="low")     # lower effort / smaller budgets (when supported)
response = llm.generate("Solve this", thinking="medium")  # balanced (when supported)
response = llm.generate("Solve this", thinking="high")    # higher effort / larger budgets (when supported)

print(response.reasoning)
```

Accepted values: `None | "auto" | "on" | "off" | "none" | True | False | "low" | "medium" | "high"` (legacy aliases: `"minimal"`, `"xhigh"`, `"extra high"`).
Best-effort mappings (as of Mar 2026):

- OpenAI (`OpenAIProvider`): Chat Completions `reasoning_effort` (values come from `reasoning_levels` in `model_capabilities.json`). `thinking="off"` maps to `reasoning_effort="none"` when supported; otherwise it falls back to the minimum supported effort with a warning (e.g., `gpt-5.2-pro` → `"medium"`).
- Anthropic (`AnthropicProvider`): Messages API `thinking` + (for Claude 4.6 adaptive thinking) `output_config.effort`.
  - Unified levels map to effort: `low|medium|high|xhigh` → `low|medium|high|max` (when supported); `xhigh` falls back to `high` with a warning when `max` is unavailable.
  - For older models, AbstractCore falls back to manual thinking budgets via `thinking: {type: "enabled", budget_tokens: ...}` (best-effort; newer models deprecate this).
- LM Studio / OpenAI-compatible local servers (`LMStudioProvider`, `OpenAICompatibleProvider`):
  - Qwen3 / Qwen3.5 / Nemotron: `chat_template_kwargs.enable_thinking` (and `enableThinking` for LM Studio's custom-field naming).
    - This is the "clean" LM Studio approach: it matches the model's own `Enable Thinking` custom field and does not rely on system-prompt injection.
    - LM Studio robustness note (Qwen3/Qwen3.5): some LM Studio runtimes do not consistently honor `chat_template_kwargs` for all model formats. As a fallback for `thinking="off"/"none"`, AbstractCore can append an empty Qwen think block right before generation (`<think>\n\n</think>\n\n`) to hard-disable thinking without polluting the system prompt.
    - Qwen also supports a "soft" `/no_think` / `/think` instruction (stateful across turns), but AbstractCore prefers the stateless hard-switch where needed. See `docs/fallbacks.md`.
    - Effort levels: for Qwen3/Qwen3.5 on LM Studio, `thinking="low|medium|high"` currently maps to "thinking enabled" (boolean). Most templates do not expose a stable per-effort budget knob, so effort scaling is best-effort and may be a no-op beyond on/off.
    - Nemotron: `thinking="low"` additionally maps to `chat_template_kwargs.low_effort=True` when supported by the template.
  - Seed-OSS: `chat_template_kwargs.thinking_budget` (levels map to budgets: low=512, medium=1024, high=4096, xhigh=8192; `off` → 0).
- HuggingFace (GGUF / llama-cpp-python) (`HuggingFaceProvider` with GGUF models):
  - llama.cpp's CLI/server supports template kwargs (e.g., `--chat-template-kwargs '{"enable_thinking":false}'`), but `llama-cpp-python`'s `Llama.create_chat_completion()` does not currently expose/forward per-request template kwargs like `enable_thinking`. As a result, Qwen3/Qwen3.5 `thinking="off"/"none"` uses the Qwen hard-switch marker (`<think>\n\n</think>\n\n`) as a robust input-side control.
  - `thinking="low|medium|high"` is treated as "thinking enabled" (best-effort) and may be a no-op beyond on/off for Qwen templates.
  - Local context note: model cards may advertise extremely large context windows (e.g. 262k). For GGUF loads, AbstractCore will first try the advertised `max_tokens` (context window); if allocation fails locally, it retries with smaller llama.cpp `n_ctx` values (best-effort). Pass `max_tokens=...` to `HuggingFaceProvider()` to explicitly control the runtime `n_ctx`.
- vLLM: `extra_body.chat_template_kwargs.enable_thinking` (commonly used by Qwen3/Qwen3.5 templates)
  - When `thinking` is a level (`low|medium|high|xhigh`), AbstractCore also sets `extra_body.thinking_token_budget` (vLLM reasoning-budget feature).
- Ollama: request field `think` (bool for most models; `"low"|"medium"|"high"` for GPT-OSS)
- GPT-OSS (Harmony): inject system line `Reasoning: low|medium|high` (traces can't be fully disabled; `"off"` maps to `"low"` with a warning)
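As an illustration, the alias normalization for the unified parameter could look like this (aliases taken from the accepted-values list above, except that `"minimal"` is left untouched because it also appears as an effort level; this is a sketch, not the actual implementation):

```python
# Canonical forms for the unified 'thinking' value (sketch).
ALIASES = {None: "auto", True: "on", False: "off",
           "none": "off", "extra high": "xhigh"}

def normalize_thinking(value):
    """Map any accepted 'thinking' value onto a canonical control string."""
    if isinstance(value, str):
        value = value.strip().lower()
    return ALIASES.get(value, value)

print(normalize_thinking("NONE"))  # off
print(normalize_thinking(True))    # on
print(normalize_thinking(None))    # auto
print(normalize_thinking("high"))  # high
```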
Output semantics: when a provider/model exposes reasoning, AbstractCore normalizes it into `GenerateResponse.metadata["reasoning"]` and keeps `GenerateResponse.content` clean using `abstractcore/architectures/response_postprocessing.py` (asset-driven via `assets/model_capabilities.json` + `assets/architecture_formats.json`).
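The content/reasoning split for inline tags can be sketched like this (assuming Qwen-style `<think>` markers; the real implementation is asset-driven as described above):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(raw: str) -> tuple:
    """Separate inline reasoning tags from the user-visible content."""
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(raw))
    content = THINK_RE.sub("", raw).strip()
    return content, reasoning

content, reasoning = split_reasoning("<think>2+2=4</think>The answer is 4.")
print(content)    # The answer is 4.
print(reasoning)  # 2+2=4
```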
When a requested thinking mode is not supported by a model/provider, AbstractCore emits a RuntimeWarning and applies a best-effort approximation:
- If the model advertises `reasoning_levels`, AbstractCore maps the requested level to the nearest supported level (generic ordering: `minimal < low < medium < high < xhigh`) and reports the effective level in the warning.
- If a provider/model can only toggle reasoning on/off (no effort scaling), AbstractCore still enables reasoning for level requests and warns that the requested effort level may be ignored.
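The nearest-level mapping can be sketched as follows (generic ordering from above; illustrative, not the actual implementation):

```python
ORDER = ["minimal", "low", "medium", "high", "xhigh"]  # generic effort ordering

def nearest_level(requested: str, supported: list) -> str:
    """Clamp a requested effort level to the closest level the model advertises."""
    idx = ORDER.index(requested)
    return min(supported, key=lambda level: abs(ORDER.index(level) - idx))

# A model advertising only low/medium/high clamps an "xhigh" request to "high":
print(nearest_level("xhigh", ["low", "medium", "high"]))  # high
```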
When `thinking=` is provided, AbstractCore records the requested and effective thinking mode in `GenerateResponse.metadata`:

- `thinking_requested`: normalized unified request (`"off"`, `"on"`, or a level like `"high"`)
- `thinking_effective`: effective unified control after mappings (for example `"xhigh"` → `"high"` for a model that only supports up to `"high"`)
- `thinking_level_requested` / `thinking_level_effective`: effort-level details when applicable
- `thinking_handled_enable_disable` / `thinking_handled_level`: whether the provider/model actually implemented the on/off toggle and/or the effort scaling knob
- `thinking_supported_levels`: model-advertised effort enum when available (from assets)
- `thinking_supports_output` / `thinking_supports_control`: asset-driven capability split (model emits reasoning vs model exposes a request-side knob)
These fields make it easier to debug best-effort fallbacks without relying only on warnings.
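For example, a caller can detect a best-effort fallback from those fields (the field names are from the list above; the helper itself is hypothetical):

```python
from typing import Optional

def explain_thinking_fallback(metadata: dict) -> Optional[str]:
    """Summarize a best-effort thinking fallback recorded in response metadata."""
    requested = metadata.get("thinking_requested")
    effective = metadata.get("thinking_effective")
    if requested == effective:
        return None  # no fallback occurred
    supported = metadata.get("thinking_supported_levels")
    return f"thinking={requested!r} was mapped to {effective!r} (supported: {supported})"

print(explain_thinking_fallback({
    "thinking_requested": "xhigh",
    "thinking_effective": "high",
    "thinking_supported_levels": ["low", "medium", "high"],
}))
```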
Sessions maintain persistent parameters across conversations:
```python
session = BasicSession(
    provider=llm,
    temperature=0.5,  # Default for all messages
    seed=42           # Consistent across conversation
)

# Uses session defaults
response1 = session.generate("Hello")

# Override for specific message
response2 = session.generate("Be creative!", temperature=0.9)
```

For prompt-cache-aware long chats, use `CachedSession` (see `docs/prompt-caching.md`):

```python
from abstractcore import CachedSession

session = CachedSession(provider=llm, system_prompt="You are helpful.", prompt_cache_strategy="auto")
session.generate("Hello")
```

```python
# Before: duplicated in each provider
self.temperature = kwargs.get("temperature", 0.7)
self.seed = kwargs.get("seed", None)
# ... parameter extraction logic repeated per provider
```

```python
# After: declared once in AbstractCoreInterface:
def __init__(self, ..., temperature: float = 0.7, seed: Optional[int] = None):
    self.temperature = temperature
    self.seed = seed

# ...and resolved once in BaseProvider (kwargs win; instance value is the fallback):
def _extract_generation_params(self, **kwargs) -> Dict[str, Any]:
    return {
        "temperature": kwargs.get("temperature", self.temperature),
        "seed": kwargs.get("seed", self.seed),
    }
```

Adding new parameters requires only:
- Declaration in `AbstractCoreInterface`
- Logic in `BaseProvider._extract_generation_params()`
- Provider-specific mapping where supported

No changes needed in individual provider `__init__` methods.
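For example, adding a hypothetical `top_k` parameter would touch only the interface declaration and the central extraction helper (a sketch mirroring the snippets above, not the actual classes):

```python
from typing import Any, Dict, Optional

class AbstractCoreInterface:
    def __init__(self, temperature: float = 0.7, seed: Optional[int] = None,
                 top_k: Optional[int] = None):  # 1. declare at the interface
        self.temperature = temperature
        self.seed = seed
        self.top_k = top_k

class BaseProvider(AbstractCoreInterface):
    def _extract_generation_params(self, **kwargs) -> Dict[str, Any]:
        return {
            "temperature": kwargs.get("temperature", self.temperature),
            "seed": kwargs.get("seed", self.seed),
            "top_k": kwargs.get("top_k", self.top_k),  # 2. central extraction
        }

# 3. Providers that support top_k map it natively; others accept and ignore/warn.
provider = BaseProvider(top_k=40)
print(provider._extract_generation_params(temperature=0.2))
# {'temperature': 0.2, 'seed': None, 'top_k': 40}
```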
Parameters are tested at multiple levels:
- Interface level: Parameter inheritance and defaults
- Provider level: Native API integration and fallback behavior
- Session level: Parameter persistence and override behavior
- Integration level: End-to-end parameter flow
- Minimal Overhead: Parameter extraction happens once per generation call
- Memory Efficient: No parameter duplication across providers
- CPU Efficient: Simple dictionary operations for parameter resolution
All changes are fully backward compatible:
- Existing code continues to work unchanged
- New parameters are optional with sensible defaults
- Provider behavior remains consistent for existing use cases
Determinism across LLM providers is not guaranteed. When supported, AbstractCore passes seed-like
controls to providers/backends and recommends temperature=0 to reduce randomness, but results can
still vary with backend settings, hardware, and model/server updates.
To verify determinism for your exact provider/model/backend, run:
```bash
python tests/manual_seed_verification.py
```

Provider-Specific Implementations:
- OpenAI: Native `seed` parameter in API
- MLX: `mx.random.seed()` before generation
- Ollama: `seed` in options payload
- HuggingFace: `torch.manual_seed()` + GGUF native seed
- LMStudio: OpenAI-compatible `seed` parameter
- Anthropic: Issues `UserWarning` when seed provided
Testing Commands:
```bash
# Verify determinism across providers
python tests/manual_seed_verification.py

# Test specific provider
python tests/manual_seed_verification.py --provider openai
```