49 changes: 46 additions & 3 deletions docs/guides/multiturn.md
@@ -40,13 +40,16 @@ When executing a multiturn benchmark, GuideLLM:
4. **Sends the next turn** (prompt_i) along with the conversation history
5. **Repeat from (2)** for the `n` given turns

For `/v1/chat/completions`, the conversation history is passed as a `messages` array with alternating user and assistant roles. For `/v1/responses`, the history is either passed as an `input` array with alternating user and assistant roles, or referenced server-side via a previous response ID. For `/v1/completions`, the history is concatenated into a single prompt string.

For more information see [Request Formatting](#request-formatting) and [Server-Side Conversation History](#server-side-conversation-history-v1responses-only).
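As a rough sketch, the client-side replay described in the steps above can be modeled like this (hypothetical helper names, not GuideLLM's actual internals):

```python
# Illustrative sketch of client-side multiturn history replay:
# each turn resends the full conversation so far plus the next prompt.

def build_messages(turns_so_far, next_prompt, prefix=None):
    """Assemble the messages array sent on a single turn."""
    messages = []
    if prefix:
        # Prefix content becomes a system message (chat completions style)
        messages.append({"role": "system", "content": prefix})
    for prompt, response in turns_so_far:
        messages.append({"role": "user", "content": prompt})
        messages.append({"role": "assistant", "content": response})
    messages.append({"role": "user", "content": next_prompt})
    return messages

def run_conversation(prompts, send, prefix=None):
    """Send each prompt with the accumulated history, as in steps 1-5 above."""
    history = []
    for prompt in prompts:
        response = send(build_messages(history, prompt, prefix))
        history.append((prompt, response))
    return history
```

Note that with client-side history, each turn's request grows with the full conversation so far, which is exactly what server-side history (below) avoids.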

### Prefix Columns and System Prompts

Prefix columns (if present) are treated specially:

- In `/v1/chat/completions`, the prefix becomes a system message in the conversation array
- In `/v1/responses`, the prefix becomes the `instructions` field
- In `/v1/completions`, the prefix is prepended to the turn's prompt
- Prefixes can be specified with a turn index if desired; however, the recommended use case is a single prefix on the first turn
- Synthetic data only supports a prefix on the first turn
@@ -108,7 +111,7 @@ Multiturn conversations are formatted differently depending on the request format

#### Chat Completions (`/v1/chat/completions`)

For chat completions, GuideLLM creates a `messages` array with the conversation history:

```json
{
  "messages": [
    {"role": "system", "content": "prefix content"},
    {"role": "user", "content": "prompt_0 content"},
    {"role": "assistant", "content": "response to prompt_0"},
    {"role": "user", "content": "prompt_1 content"},
    {"role": "assistant", "content": "response to prompt_1"},
    {"role": "user", "content": "prompt_2 content"}
  ]
}
```

#### Responses API (`/v1/responses`)

For the Responses API with `server_history` disabled, GuideLLM creates an `input` array with the conversation history and sets the prefix as `instructions`:

```json
{
"instructions": "prefix content",
"input": [
{"role": "user", "content": [{"type": "input_text", "text": "prompt_0 content"}]},
{"role": "assistant", "content": "response to prompt_0"},
{"role": "user", "content": [{"type": "input_text", "text": "prompt_1 content"}]},
{"role": "assistant", "content": "response to prompt_1"},
{"role": "user", "content": [{"type": "input_text", "text": "prompt_2 content"}]}
]
}
```

#### Text Completions (`/v1/completions`)

For text completions, the conversation history is concatenated:
```
prefix content prompt_0 content response to prompt_0 prompt_1 content response to prompt_1 prompt_2 content
```
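As an illustration, the concatenation above can be sketched as follows (the exact separator and whitespace GuideLLM uses may differ):

```python
# Illustrative sketch of the concatenated-history prompt used for
# /v1/completions; the exact joining GuideLLM performs may differ.

def build_prompt(history, next_prompt, prefix=""):
    """Join prefix, prior prompt/response pairs, and the next prompt."""
    parts = [prefix] if prefix else []
    for prompt, response in history:
        parts.extend([prompt, response])
    parts.append(next_prompt)
    return " ".join(parts)
```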

### Server-Side Conversation History (`/v1/responses` only)

By default, GuideLLM replays the full conversation history in each request (client-side history). For the Responses API, you can instead use **server-side history** via the `previous_response_id` field, where the server stores and manages conversation context.

Enable it with `--backend-kwargs`:

```bash
guidellm benchmark run \
--target "http://localhost:8000" \
--request-format /v1/responses \
--backend-kwargs '{"server_history": true}' \
--data "prompt_tokens=200,output_tokens=100,turns=3"
```

When enabled, GuideLLM sends only the current turn's input and references the previous response by ID. The server reconstructs the full conversation context internally.
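For example, a second-turn request body might look like the following (field values illustrative; `resp_abc123` is a hypothetical response ID returned by the server on the previous turn):

```json
{
  "previous_response_id": "resp_abc123",
  "input": [
    {"role": "user", "content": [{"type": "input_text", "text": "prompt_1 content"}]}
  ]
}
```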

**Requirements:**

- The server must support `previous_response_id` with response storage enabled. For vLLM, set the `VLLM_ENABLE_RESPONSES_API_STORE=1` environment variable when starting the server.
- If the server does not support response storage, requests on turn 2+ will fail with an error (typically a 404).
- This option is only valid with `/v1/responses`. Using it with other request formats raises an error at startup.

## The TurnPivot Preprocessor

GuideLLM supports passing multiple `--data` options, each pointing to a separate dataset. If the same column type matches across multiple datasets, the matches are treated as separate batches. Normally this is useful for layering columns from different datasets within the same request: for example, adding a text column from one dataset to another dataset that contains images, or combining multiple normally-distributed synthetic datasets into a multimodal distribution. The **TurnPivot** preprocessor can instead be used to transpose turn columns and dataset batches.
@@ -371,7 +413,8 @@ guidellm benchmark run \
Multiturn benchmarking is currently supported for:

- `/v1/chat/completions` - Utilizing chat template formatting
- `/v1/responses` - Using the OpenAI Responses API input format
- `/v1/completions` - With basic concatenated history

Audio endpoints (`/v1/audio/transcriptions`, `/v1/audio/translations`) do not support multiturn benchmarking.

19 changes: 19 additions & 0 deletions src/guidellm/backends/openai/http.py
@@ -75,6 +75,13 @@ class OpenAIHttpBackendArgs(BackendArgs):
)
},
)
server_history: bool = Field(
default=False,
description=(
"Use server-side conversation history (previous_response_id) for "
"multi-turn requests. Only supported with /v1/responses."
),
)

@field_validator("request_format")
@classmethod
@@ -166,6 +173,7 @@ def __init__(
extras: dict[str, Any] | GenerationRequestArguments | None = None,
max_tokens: int | None = None,
max_completion_tokens: int | None = None,
server_history: bool = False,
):
"""
Initialize OpenAI HTTP backend with server configuration.
@@ -180,6 +188,8 @@ def __init__(
:param follow_redirects: Follow HTTP redirects automatically
:param verify: Enable SSL certificate verification
:param validate_backend: Backend validation configuration
:param server_history: Use server-side conversation history
(previous_response_id) for multi-turn. Only with /v1/responses.
"""
super().__init__(type_="openai_http")

@@ -202,6 +212,14 @@ def __init__(
f"{', '.join(valid_formats)}"
)
self.request_type = request_format
self.server_history = server_history

if self.server_history and self.request_type != "/v1/responses":
raise ValueError(
"server_history=True is only supported with the Responses API "
"(/v1/responses). Current request format: "
f"'{self.request_type}'"
)

# Store configuration
self.api_routes = api_routes or DEFAULT_API_PATHS
@@ -381,6 +399,7 @@ async def resolve( # type: ignore[override, misc]
stream=self.stream,
extras=self.extras,
max_tokens=self.max_tokens,
server_history=self.server_history,
)

request_url = f"{self.target}/{request_path}"
9 changes: 8 additions & 1 deletion src/guidellm/backends/openai/request_handlers.py
@@ -849,15 +849,22 @@ def format(
history: HistoryT[GenerationRequest, GenerationResponse] | None = None,
**kwargs,
) -> GenerationRequestArguments:
use_server_history = kwargs.get("server_history") and history

prev_requests: list[GenerationRequestArguments] = []
if history:
if history and not use_server_history:
prev_requests = [
self.format(req, response=res, **kwargs) for req, res in history
]

arguments = GenerationRequestArguments()
arguments.body = {}

if use_server_history:
_, last_response = history[-1] # type: ignore[index]
if last_response and last_response.response_id:
arguments.body["previous_response_id"] = last_response.response_id

if kwargs.get("model") is not None:
arguments.body["model"] = kwargs["model"]

28 changes: 28 additions & 0 deletions tests/unit/backends/openai/test_http.py
@@ -140,6 +140,34 @@ def test_invalid_validate_backend_parameter(self):
validate_backend=123, # type: ignore[arg-type]
)

@pytest.mark.sanity
def test_server_history_requires_responses_api(self):
"""
Test server_history=True raises ValueError for non-responses request formats.

## WRITTEN BY AI ##
"""
with pytest.raises(ValueError, match="server_history.*only supported"):
OpenAIHTTPBackend(
target="http://localhost:8000",
request_format="/v1/chat/completions",
server_history=True,
)

@pytest.mark.sanity
def test_server_history_with_responses_api(self):
"""
Test server_history=True is accepted with /v1/responses.

## WRITTEN BY AI ##
"""
backend = OpenAIHTTPBackend(
target="http://localhost:8000",
request_format="/v1/responses",
server_history=True,
)
assert backend.server_history is True

@pytest.mark.smoke
def test_factory_registration(self):
"""Test that OpenAIHTTPBackend is registered with Backend factory."""
51 changes: 51 additions & 0 deletions tests/unit/backends/openai/test_request_handlers.py
@@ -2569,6 +2569,57 @@ def test_format_with_history(self, valid_instances):
assert input_items[1]["content"] == "4"
assert input_items[2]["role"] == "user"

@pytest.mark.sanity
def test_format_with_server_history(self, valid_instances):
"""
Test format uses previous_response_id instead of replaying history
when server_history is enabled.

## WRITTEN BY AI ##
"""
instance = valid_instances

prev_request = GenerationRequest(
columns={"text_column": ["What is 2+2?"]},
)
prev_response = GenerationResponse(
request_id="prev", request_args=None, text="4", response_id="resp_abc123"
)

data = GenerationRequest(
columns={"text_column": ["What is 3+3?"]},
)

result = instance.format(
data, history=[(prev_request, prev_response)], server_history=True
)

assert result.body["previous_response_id"] == "resp_abc123"
input_items = result.body["input"]
assert len(input_items) == 1
assert input_items[0]["role"] == "user"

@pytest.mark.sanity
def test_format_with_server_history_first_turn(self, valid_instances):
"""
Test format does not set previous_response_id on the first turn
(no history) even when server_history is enabled.

## WRITTEN BY AI ##
"""
instance = valid_instances

data = GenerationRequest(
columns={"text_column": ["Hello!"]},
)

result = instance.format(data, server_history=True)

assert "previous_response_id" not in result.body
input_items = result.body["input"]
assert len(input_items) == 1
assert input_items[0]["role"] == "user"

# Tool call response handling tests

@pytest.mark.sanity