49 changes: 46 additions & 3 deletions docs/guides/multiturn.md
@@ -40,13 +40,16 @@ When executing a multiturn benchmark, GuideLLM:
4. **Sends the next turn** (prompt_i) along with the conversation history
5. **Repeat from (2)** for the `n` given turns

For `/v1/chat/completions`, the conversation history is passed as a `messages` array with alternating user and assistant roles. For `/v1/responses`, the history is either passed as an `input` array with alternating user and assistant roles, or referenced server-side via a previous response ID. For `/v1/completions`, the history is concatenated into a single prompt string.

For more information see [Request Formatting](#request-formatting) and [Server-Side Conversation History](#server-side-conversation-history-v1responses-only).
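As a rough sketch, the client-side replay described in the steps above can be modeled like this (hypothetical helper names, not GuideLLM's actual internals):

```python
# Illustrative sketch of client-side multiturn history replay:
# each turn resends the full conversation so far plus the next prompt.

def build_messages(turns_so_far, next_prompt, prefix=None):
    """Assemble the messages array sent on a single turn."""
    messages = []
    if prefix:
        # Prefix content becomes a system message (chat completions style)
        messages.append({"role": "system", "content": prefix})
    for prompt, response in turns_so_far:
        messages.append({"role": "user", "content": prompt})
        messages.append({"role": "assistant", "content": response})
    messages.append({"role": "user", "content": next_prompt})
    return messages

def run_conversation(prompts, send, prefix=None):
    """Send each prompt with the accumulated history, as in steps 1-5 above."""
    history = []
    for prompt in prompts:
        response = send(build_messages(history, prompt, prefix))
        history.append((prompt, response))
    return history
```

Note that with client-side history, each turn's request grows with the full conversation so far, which is exactly what server-side history (below) avoids.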

### Prefix Columns and System Prompts

Prefix columns (if present) are treated specially:

- In `/v1/chat/completions`, the prefix becomes a system message in the conversation array
- In `/v1/responses`, the prefix becomes the `instructions` field
- In `/v1/completions`, the prefix is prepended to the turn's prompt
- Prefixes can be specified with a turn index if desired; however, the recommended use case is a single prefix on the first turn
- Synthetic data only supports a prefix on the first turn
@@ -108,7 +111,7 @@ Multiturn conversations are formatted differently depending on the request format

#### Chat Completions (`/v1/chat/completions`)

For chat completions, GuideLLM creates a `messages` array with the conversation history:

```json
{
  "messages": [
    {"role": "system", "content": "prefix content"},
    {"role": "user", "content": "prompt_0 content"},
    {"role": "assistant", "content": "response to prompt_0"},
    {"role": "user", "content": "prompt_1 content"},
    {"role": "assistant", "content": "response to prompt_1"},
    {"role": "user", "content": "prompt_2 content"}
  ]
}
```

#### Responses API (`/v1/responses`)

For the Responses API with `server_history` disabled, GuideLLM creates an `input` array with the conversation history and sets the prefix as `instructions`:

```json
{
"instructions": "prefix content",
"input": [
{"role": "user", "content": [{"type": "input_text", "text": "prompt_0 content"}]},
{"role": "assistant", "content": "response to prompt_0"},
{"role": "user", "content": [{"type": "input_text", "text": "prompt_1 content"}]},
{"role": "assistant", "content": "response to prompt_1"},
{"role": "user", "content": [{"type": "input_text", "text": "prompt_2 content"}]}
]
}
```

#### Text Completions (`/v1/completions`)

For text completions, the conversation history is concatenated:
```
prefix content prompt_0 content response to prompt_0 prompt_1 content response to prompt_1 prompt_2 content
```
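As an illustration, the concatenation above can be sketched as follows (the exact separator and whitespace GuideLLM uses may differ):

```python
# Illustrative sketch of the concatenated-history prompt used for
# /v1/completions; the exact joining GuideLLM performs may differ.

def build_prompt(history, next_prompt, prefix=""):
    """Join prefix, prior prompt/response pairs, and the next prompt."""
    parts = [prefix] if prefix else []
    for prompt, response in history:
        parts.extend([prompt, response])
    parts.append(next_prompt)
    return " ".join(parts)
```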

### Server-Side Conversation History (`/v1/responses` only)

By default, GuideLLM replays the full conversation history in each request (client-side history). For the Responses API, you can instead use **server-side history** via the `previous_response_id` field, where the server stores and manages conversation context.

Enable it with `--backend-kwargs`:

```bash
guidellm benchmark run \
--target "http://localhost:8000" \
--request-format /v1/responses \
--backend-kwargs '{"server_history": true}' \
--data "prompt_tokens=200,output_tokens=100,turns=3"
```

When enabled, GuideLLM sends only the current turn's input and references the previous response by ID. The server reconstructs the full conversation context internally.
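For example, a second-turn request body might look like the following (field values illustrative; `resp_abc123` is a hypothetical response ID returned by the server on the previous turn):

```json
{
  "previous_response_id": "resp_abc123",
  "input": [
    {"role": "user", "content": [{"type": "input_text", "text": "prompt_1 content"}]}
  ]
}
```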

**Requirements:**

- The server must support `previous_response_id` with response storage enabled. For vLLM, set the `VLLM_ENABLE_RESPONSES_API_STORE=1` environment variable when starting the server.
- If the server does not support response storage, requests on turn 2+ will fail with an error (typically a 404).
- This option is only valid with `/v1/responses`. Using it with other request formats raises an error at startup.

## The TurnPivot Preprocessor

GuideLLM supports passing multiple `--data` options, each pointing to a separate dataset. If the same column type matches across multiple datasets, the matches are treated as separate batches. Normally this is useful for layering columns from different datasets within the same request: for example, adding a text column from one dataset to another dataset that contains images, or combining multiple normally-distributed synthetic datasets into a multimodal distribution. The **TurnPivot** preprocessor can instead be used to transpose turn columns and dataset batches.
@@ -371,7 +413,8 @@ guidellm benchmark run \
Multiturn benchmarking is currently supported for:

- `/v1/chat/completions` - Utilizing chat template formatting
- `/v1/responses` - Using the OpenAI Responses API input format
- `/v1/completions` - With basic concatenated history

Audio endpoints (`/v1/audio/transcriptions`, `/v1/audio/translations`) do not support multiturn benchmarking.

19 changes: 19 additions & 0 deletions src/guidellm/backends/openai/http.py
@@ -75,6 +75,13 @@ class OpenAIHttpBackendArgs(BackendArgs):
)
},
)
server_history: bool = Field(
default=False,
description=(
"Use server-side conversation history (previous_response_id) for "
"multi-turn requests. Only supported with /v1/responses."
),
)

@field_validator("request_format")
@classmethod
@@ -166,6 +173,7 @@ def __init__(
extras: dict[str, Any] | GenerationRequestArguments | None = None,
max_tokens: int | None = None,
max_completion_tokens: int | None = None,
server_history: bool = False,
):
"""
Initialize OpenAI HTTP backend with server configuration.
@@ -180,6 +188,8 @@ def __init__(
:param follow_redirects: Follow HTTP redirects automatically
:param verify: Enable SSL certificate verification
:param validate_backend: Backend validation configuration
:param server_history: Use server-side conversation history
(previous_response_id) for multi-turn. Only with /v1/responses.
"""
super().__init__(type_="openai_http")

@@ -202,6 +212,14 @@ def __init__(
f"{', '.join(valid_formats)}"
)
self.request_type = request_format
self.server_history = server_history

if self.server_history and self.request_type != "/v1/responses":
raise ValueError(
"server_history=True is only supported with the Responses API "
"(/v1/responses). Current request format: "
f"'{self.request_type}'"
)

# Store configuration
self.api_routes = api_routes or DEFAULT_API_PATHS
@@ -381,6 +399,7 @@ async def resolve( # type: ignore[override, misc]
stream=self.stream,
extras=self.extras,
max_tokens=self.max_tokens,
server_history=self.server_history,
)

request_url = f"{self.target}/{request_path}"
9 changes: 8 additions & 1 deletion src/guidellm/backends/openai/request_handlers.py
@@ -849,15 +849,22 @@ def format(
history: HistoryT[GenerationRequest, GenerationResponse] | None = None,
**kwargs,
) -> GenerationRequestArguments:
use_server_history = kwargs.get("server_history") and history

prev_requests: list[GenerationRequestArguments] = []
if history:
if history and not use_server_history:
prev_requests = [
self.format(req, response=res, **kwargs) for req, res in history
]

arguments = GenerationRequestArguments()
arguments.body = {}

if use_server_history:
_, last_response = history[-1] # type: ignore[index]
if last_response and last_response.response_id:
arguments.body["previous_response_id"] = last_response.response_id

if kwargs.get("model") is not None:
arguments.body["model"] = kwargs["model"]

28 changes: 28 additions & 0 deletions tests/unit/backends/openai/test_http.py
@@ -140,6 +140,34 @@ def test_invalid_validate_backend_parameter(self):
validate_backend=123, # type: ignore[arg-type]
)

@pytest.mark.sanity
def test_server_history_requires_responses_api(self):
"""
Test server_history=True raises ValueError for non-responses request formats.

## WRITTEN BY AI ##
"""
with pytest.raises(ValueError, match="server_history.*only supported"):
OpenAIHTTPBackend(
target="http://localhost:8000",
request_format="/v1/chat/completions",
server_history=True,
)

@pytest.mark.sanity
def test_server_history_with_responses_api(self):
"""
Test server_history=True is accepted with /v1/responses.

## WRITTEN BY AI ##
"""
backend = OpenAIHTTPBackend(
target="http://localhost:8000",
request_format="/v1/responses",
server_history=True,
)
assert backend.server_history is True

@pytest.mark.smoke
def test_factory_registration(self):
"""Test that OpenAIHTTPBackend is registered with Backend factory."""
51 changes: 51 additions & 0 deletions tests/unit/backends/openai/test_request_handlers.py
@@ -2569,6 +2569,57 @@ def test_format_with_history(self, valid_instances):
assert input_items[1]["content"] == "4"
assert input_items[2]["role"] == "user"

@pytest.mark.sanity
def test_format_with_server_history(self, valid_instances):
"""
Test format uses previous_response_id instead of replaying history
when server_history is enabled.

## WRITTEN BY AI ##
"""
instance = valid_instances

prev_request = GenerationRequest(
columns={"text_column": ["What is 2+2?"]},
)
prev_response = GenerationResponse(
request_id="prev", request_args=None, text="4", response_id="resp_abc123"
)

data = GenerationRequest(
columns={"text_column": ["What is 3+3?"]},
)

result = instance.format(
data, history=[(prev_request, prev_response)], server_history=True
)

assert result.body["previous_response_id"] == "resp_abc123"
input_items = result.body["input"]
assert len(input_items) == 1
assert input_items[0]["role"] == "user"

@pytest.mark.sanity
def test_format_with_server_history_first_turn(self, valid_instances):
"""
Test format does not set previous_response_id on the first turn
(no history) even when server_history is enabled.

## WRITTEN BY AI ##
"""
instance = valid_instances

data = GenerationRequest(
columns={"text_column": ["Hello!"]},
)

result = instance.format(data, server_history=True)

assert "previous_response_id" not in result.body
input_items = result.body["input"]
assert len(input_items) == 1
assert input_items[0]["role"] == "user"

# Tool call response handling tests

@pytest.mark.sanity