Skip to content

reverseproxy: add lb_retry_match condition on response status#7569

Open
seroperson wants to merge 1 commit intocaddyserver:masterfrom
seroperson:i6267-reverse-proxy-retry
Open

reverseproxy: add lb_retry_match condition on response status#7569
seroperson wants to merge 1 commit intocaddyserver:masterfrom
seroperson:i6267-reverse-proxy-retry

Conversation

@seroperson
Copy link
Copy Markdown

@seroperson seroperson commented Mar 14, 2026

Closes #6267

  • lb_retry_match now handles expression option which evaluates against the upstream response via {rp.status_code}, {rp.status_text}, and {rp.header.X-Header} placeholders in CEL expressions
  • Added new helper function to CEL expressions isTransportError()
  • Using both expression + old syntax (I mean just a plain old lb_retry_match options) isn't allowed
  • Fully backward-compatible

It wasn't possible to implement it like lb_retry_match `path('/foo*') && {rp.status_code} in [502, 503]` because of how lb_retry_match is handled, but you can use a oneliner like lb_retry_match expression `path('/foo*') && {rp.status_code} in [502, 503]`

Everything is covered with unit-tests, e2e tests and caddyfile tests.

⚠️ Initially I wanted to preserve both old syntax + new syntax to make everything AND'ed, but it wasn't possible because of how tough the logic is: some things rely on existence of lb_retry_match options, some things not, and sometimes it's unclear whether something would be retried or not.

For example, the previous logic:

  reverse_proxy upstream1 upstream2 {
      lb_retries 3
  }

  ┌───────────────────────┬───────────────┐
  │       Scenario        │   Behavior    │
  ├───────────────────────┼───────────────┤
  │ Dial error            │ retried       │
  ├───────────────────────┼───────────────┤
  │ Transport error, GET  │ retried       │
  ├───────────────────────┼───────────────┤
  │ Transport error, POST │ not retried   │
  ├───────────────────────┼───────────────┤
  │ Transport error, PUT  │ not retried   │
  ├───────────────────────┼───────────────┤
  │ Any response          │ proxied as-is │
  └───────────────────────┴───────────────┘
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          method PUT
      }
  }

  ┌───────────────────────┬───────────────┐
  │       Scenario        │   Behavior    │
  ├───────────────────────┼───────────────┤
  │ Dial error            │ retried       │
  ├───────────────────────┼───────────────┤
  │ Transport error, GET  │ not retried   │
  ├───────────────────────┼───────────────┤
  │ Transport error, POST │ not retried   │
  ├───────────────────────┼───────────────┤
  │ Transport error, PUT  │ retried       │
  ├───────────────────────┼───────────────┤
  │ Any response          │ proxied as-is │
  └───────────────────────┴───────────────┘

GET loses its' default transport error retry because RetryMatch is non-nil and only PUT is defined. Adding expression handling would result into even more unclear logic. That's why using expression with plain options isn't allowed.

Just to remind what transport error is (roughly):

  • Upstream accepted the connection then reset it
  • TLS handshake succeeded but upstream closed during request/response transfer
  • Upstream timed out while sending the response body
  • Upstream sent malformed HTTP response headers

Now examples of how the new expression can be used:

  • Retry 502/503 responses only
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `{rp.status_code} in [502, 503]`
      }
  }

  ┌──────────────────────────────┬───────────────┐
  │           Scenario           │   Behavior    │
  ├──────────────────────────────┼───────────────┤
  │ Dial error, any method       │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ Transport error, any method  │ not retried   │
  ├──────────────────────────────┼───────────────┤
  │ 502/503 response, any method │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ 200/500 response             │ proxied as-is │
  └──────────────────────────────┴───────────────┘
  • Retry on response header only
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `{rp.header.X-Upstream-Retry} == "true"`
      }
  }

  ┌──────────────────────────────────────┬───────────────┐
  │               Scenario               │   Behavior    │
  ├──────────────────────────────────────┼───────────────┤
  │ Dial error, any method               │ retried       │
  ├──────────────────────────────────────┼───────────────┤
  │ Transport error, any method          │ not retried   │
  ├──────────────────────────────────────┼───────────────┤
  │ Response with X-Upstream-Retry: true │ retried       │
  ├──────────────────────────────────────┼───────────────┤
  │ Response without header              │ proxied as-is │
  └──────────────────────────────────────┴───────────────┘
  • Retry 503 only for POST
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `method('POST') && {rp.status_code} == 503`
      }
  }

  ┌─────────────────────────────┬───────────────┐
  │          Scenario           │   Behavior    │
  ├─────────────────────────────┼───────────────┤
  │ Dial error, any method      │ retried       │
  ├─────────────────────────────┼───────────────┤
  │ Transport error, any method │ not retried   │
  ├─────────────────────────────┼───────────────┤
  │ 503 POST                    │ retried       │
  ├─────────────────────────────┼───────────────┤
  │ 503 GET                     │ not retried   │
  ├─────────────────────────────┼───────────────┤
  │ 502 POST                    │ not retried   │
  ├─────────────────────────────┼───────────────┤
  │ 200 POST                    │ proxied as-is │
  └─────────────────────────────┴───────────────┘
  • Retry 5xx for /api* paths only
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `path('/api*') && {rp.status_code} >= 500`
      }
  }

  ┌─────────────────────────────┬───────────────┐
  │          Scenario           │   Behavior    │
  ├─────────────────────────────┼───────────────┤
  │ Dial error, any method      │ retried       │
  ├─────────────────────────────┼───────────────┤
  │ Transport error, any method │ not retried   │
  ├─────────────────────────────┼───────────────┤
  │ 500 GET /api/foo            │ retried       │
  ├─────────────────────────────┼───────────────┤
  │ 502 POST /api/foo           │ retried       │
  ├─────────────────────────────┼───────────────┤
  │ 500 GET /other              │ not retried   │
  ├─────────────────────────────┼───────────────┤
  │ 200 GET /api/foo            │ proxied as-is │
  └─────────────────────────────┴───────────────┘
  • Retry all transport errors + 502/503 responses
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `isTransportError() || {rp.status_code} in [502, 503]`
      }
  }

  ┌──────────────────────────────┬───────────────┐
  │           Scenario           │   Behavior    │
  ├──────────────────────────────┼───────────────┤
  │ Dial error, any method       │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ Transport error, any method  │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ 502/503 response, any method │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ 200/500 response             │ proxied as-is │
  └──────────────────────────────┴───────────────┘
  • Retry GET transport errors + 502/503 responses
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `(isTransportError() && method('GET')) || {rp.status_code} in [502, 503]`
      }
  }

  ┌──────────────────────────────┬───────────────┐
  │           Scenario           │   Behavior    │
  ├──────────────────────────────┼───────────────┤
  │ Dial error, any method       │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ Transport error, GET         │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ Transport error, POST        │ not retried   │
  ├──────────────────────────────┼───────────────┤
  │ 502/503 response, any method │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ 200 response                 │ proxied as-is │
  └──────────────────────────────┴───────────────┘
  • Retry PUT transport errors to /api* + any 502
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `(isTransportError() && method('PUT') && path('/api*')) || {rp.status_code} == 502`
      }
  }

  ┌────────────────────────────────────┬───────────────┐
  │              Scenario              │   Behavior    │
  ├────────────────────────────────────┼───────────────┤
  │ Dial error, any method             │ retried       │
  ├────────────────────────────────────┼───────────────┤
  │ Transport error, PUT /api/foo      │ retried       │
  ├────────────────────────────────────┼───────────────┤
  │ Transport error, PUT /other        │ not retried   │
  ├────────────────────────────────────┼───────────────┤
  │ Transport error, GET /api/foo      │ not retried   │
  ├────────────────────────────────────┼───────────────┤
  │ 502 response, any method, any path │ retried       │
  ├────────────────────────────────────┼───────────────┤
  │ 200 response                       │ proxied as-is │
  └────────────────────────────────────┴───────────────┘
  • Retry POST on 503, GET on transport errors
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `method('POST') && {rp.status_code} == 503`
      }
      lb_retry_match {
          expression `isTransportError() && method('GET')`
      }
  }

  ┌────────────────────────┬───────────────────┐
  │        Scenario        │     Behavior      │
  ├────────────────────────┼───────────────────┤
  │ Dial error, any method │ retried           │
  ├────────────────────────┼───────────────────┤
  │ Transport error, GET   │ retried (block 2) │
  ├────────────────────────┼───────────────────┤
  │ Transport error, POST  │ not retried       │
  ├────────────────────────┼───────────────────┤
  │ 503 POST               │ retried (block 1) │
  ├────────────────────────┼───────────────────┤
  │ 503 GET                │ not retried       │
  ├────────────────────────┼───────────────────┤
  │ 502 POST               │ not retried       │
  └────────────────────────┴───────────────────┘
  • Retry idempotent transport errors + 502 for specific host
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `isTransportError() && (method('GET') || method('PUT') || method('DELETE'))`
      }
      lb_retry_match {
          expression `host('api.example.com') && {rp.status_code} == 502`
      }
  }

  ┌──────────────────────────┬───────────────────┐
  │         Scenario         │     Behavior      │
  ├──────────────────────────┼───────────────────┤
  │ Dial error, any method   │ retried           │
  ├──────────────────────────┼───────────────────┤
  │ Transport error, GET     │ retried (block 1) │
  ├──────────────────────────┼───────────────────┤
  │ Transport error, PUT     │ retried (block 1) │
  ├──────────────────────────┼───────────────────┤
  │ Transport error, DELETE  │ retried (block 1) │
  ├──────────────────────────┼───────────────────┤
  │ Transport error, POST    │ not retried       │
  ├──────────────────────────┼───────────────────┤
  │ 502 to api.example.com   │ retried (block 2) │
  ├──────────────────────────┼───────────────────┤
  │ 502 to other.example.com │ not retried       │
  ├──────────────────────────┼───────────────────┤
  │ 200 response             │ proxied as-is     │
  └──────────────────────────┴───────────────────┘
  • Retry transport errors for multiple methods + 5xx for /api*
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `isTransportError() && (method('GET') || method('PUT'))`
      }
      lb_retry_match {
          expression `path('/api*') && {rp.status_code} >= 500`
      }
  }

  ┌────────────────────────┬───────────────────┐
  │        Scenario        │     Behavior      │
  ├────────────────────────┼───────────────────┤
  │ Dial error, any method │ retried           │
  ├────────────────────────┼───────────────────┤
  │ Transport error, GET   │ retried (block 1) │
  ├────────────────────────┼───────────────────┤
  │ Transport error, PUT   │ retried (block 1) │
  ├────────────────────────┼───────────────────┤
  │ Transport error, POST  │ not retried       │
  ├────────────────────────┼───────────────────┤
  │ 500 GET /api/foo       │ retried (block 2) │
  ├────────────────────────┼───────────────────┤
  │ 502 POST /api/foo      │ retried (block 2) │
  ├────────────────────────┼───────────────────┤
  │ 500 GET /other         │ not retried       │
  ├────────────────────────┼───────────────────┤
  │ 200 response           │ proxied as-is     │
  └────────────────────────┴───────────────────┘
  • Retry on response header + all transport errors
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `isTransportError() || {rp.header.X-Upstream-Retry} == "true"`
      }
  }

  ┌─────────────────────────────────┬───────────────┐
  │            Scenario             │   Behavior    │
  ├─────────────────────────────────┼───────────────┤
  │ Dial error, any method          │ retried       │
  ├─────────────────────────────────┼───────────────┤
  │ Transport error, any method     │ retried       │
  ├─────────────────────────────────┼───────────────┤
  │ 200 with X-Upstream-Retry: true │ retried       │
  ├─────────────────────────────────┼───────────────┤
  │ 502 with X-Upstream-Retry: true │ retried       │
  ├─────────────────────────────────┼───────────────┤
  │ 200 without header              │ proxied as-is │
  └─────────────────────────────────┴───────────────┘

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Mar 14, 2026

CLA assistant check
All committers have signed the CLA.

@seroperson
Copy link
Copy Markdown
Author

seroperson commented Mar 16, 2026

I've revisited this PR to make it simpler. Shortly, the only thing left is retrying based on status, which I moved to the lb_retry_match block.

I've removed retrying based on conditions as it messes with default behavior too much and it makes everything too complicated and unclear. I still can implement them in this PR, but feels like you should point me on how they must be defined and handled.

Updated the PR's description, previous one is now hidden under "Initial version" spoiler.

@seroperson seroperson changed the title feat(reverse_proxy): lb_retry_on conditions feat(reverse_proxy): lb_retry_match now contains status field Mar 16, 2026
@seroperson seroperson force-pushed the i6267-reverse-proxy-retry branch from b8aaa26 to 4e31ada Compare March 16, 2026 14:50
@seroperson
Copy link
Copy Markdown
Author

@mohammed90 Hello! Sorry for ping, but would you mind reviewing this?

@seroperson seroperson force-pushed the i6267-reverse-proxy-retry branch from eeb231c to d106d8c Compare March 25, 2026 00:16
@francislavoie
Copy link
Copy Markdown
Member

I find this quite awkward. As documented, the config layer says lb_retry_match takes a request matcher https://caddyserver.com/docs/caddyfile/directives/reverse_proxy#lb_retry_match. But with this, it's a weird mix of request and response matchers. I'm not sure I like that. I understand the motivation though.

Could we use response matchers https://caddyserver.com/docs/caddyfile/response-matchers directly instead of statusCodes? It would allow the status code matching to come "for free", and add header matching support. I don't know if header matching is that useful, but it might be if the backend explicitly responds with some X-Retry-Me: true header or something lol

This also needs Caddyfile adapt tests. See .caddyfiletest files, there should be coverage for various patterns to cover backwards compat and new config.

@francislavoie francislavoie added the feature ⚙️ New feature or request label Mar 25, 2026
@francislavoie francislavoie changed the title feat(reverse_proxy): lb_retry_match now contains status field reverseproxy: add lb_retry_match condition on response status Mar 25, 2026
@seroperson
Copy link
Copy Markdown
Author

seroperson commented Mar 25, 2026

@francislavoie yea, the parsing looks weird in such scenario.

The reason why I did it exactly like this is that I wanted to preserve the ability to write the spec like this:

lb_retry_match {
  method POST
  status 503
}

This way it looks understandable for an end user and reads exactly like "retry on match method POST and status 503". We can preserve this approach and just re-use some parsing from response-matcher, so it will look like:

lb_retry_match {
  <request-matcher>
  <response-matcher>
}
// or like this
lb_retry_match {
  <request-matcher>
  on_response {
    <response-matcher>
  }
}

It will be still the same mix, but with re-used parsing logic.


OR we can split the matchers, so it will look like:

lb_retry_match {
  <request-matcher>
}
lb_retry_match_response {
  <response-matcher>
}

This way the code is cleaner and there is a strict separation of matching requests and responses, but this approach a bit less flexible because of how we treat conditions:

  • Currently every lb_retry_match is ORed and nested fields are ANDed, like lb_retry_match { ... AND ... } OR lb_retry_match { ... AND ... }.
  • With both lb_retry_match and lb_retry_match_response it intuitively feels like this: (lb_retry_match { ... AND ... } OR lb_retry_match { ... AND ... }) AND (lb_retry_match_response { ... AND ... } OR lb_retry_match_response { ... AND ... })

But then we will be unable to implement something like "retry on (DELETE and 401) or (POST and 503)". It will be only "retry on (DELETE or POST) and (401 or 503)".

Maybe I'm missing something.


@francislavoie @mohammed90 What do you think? Which approach to implement then?

@francislavoie
Copy link
Copy Markdown
Member

francislavoie commented Mar 27, 2026

Hmm.

I think we can just reframe this.

What we want is to match on response status or headers.

We can actually do this using CEL expressions already, as long as the Replacer is filled with properties of the response, placeholders can be used in the CEL expression to do complex conditions involving those.

We document in https://caddyserver.com/docs/caddyfile/directives/reverse_proxy#intercepting-responses that {rp.status_code}, {rp.status_text}, and {rp.header.*} can be used in handle_response, we can ensure those are populated as well in the lb_retry codepath if it's not already. Then writing config like this should be possible:

lb_retry_match `path('/foo*') && {rp.status_code} in [502, 503]`

The backtick syntax might not work as-is in lb_retry_match, I don't remember how I implemented that, whether it only works on @named-matchers or if it would also work on subdirectives that take matchers too, but if not then the expression matcher name would need to be put in front of it.

We'll also need some .caddyfiletest Caddyfile -> JSON adapt fixtures for this functionality to prove is encodes properly.

@seroperson seroperson force-pushed the i6267-reverse-proxy-retry branch 2 times, most recently from 03fcc72 to 8b1f9c6 Compare March 28, 2026 15:26
@seroperson
Copy link
Copy Markdown
Author

seroperson commented Mar 28, 2026

@francislavoie thank you for the review and pointing in the right direction! I've revisited changes and re-implemented them.

Now:

  • lb_retry_match now handles expression option which evaluates against the upstream response via {rp.status_code}, {rp.status_text}, and {rp.header.X-Header} placeholders in CEL expressions
  • Added new helper function to CEL expressions isTransportError()
  • Using both expression + old syntax (I mean just a plain old lb_retry_match options) isn't allowed
  • Fully backward-compatible

It wasn't possible to implement it like lb_retry_match `path('/foo*') && {rp.status_code} in [502, 503]` because of how lb_retry_match is handled, but you can use a oneliner like lb_retry_match expression `path('/foo*') && {rp.status_code} in [502, 503]`

Everything is covered with unit-tests, e2e tests and caddyfile tests.

⚠️ Initially I wanted to preserve both old syntax + new syntax to make everything AND'ed, but it wasn't possible because of how tough the logic is: some things rely on existence of lb_retry_match options, some things not, and sometimes it's unclear whether something would be retried or not.

For example, the previous logic:

  reverse_proxy upstream1 upstream2 {
      lb_retries 3
  }

  ┌───────────────────────┬───────────────┐
  │       Scenario        │   Behavior    │
  ├───────────────────────┼───────────────┤
  │ Dial error            │ retried       │
  ├───────────────────────┼───────────────┤
  │ Transport error, GET  │ retried       │
  ├───────────────────────┼───────────────┤
  │ Transport error, POST │ not retried   │
  ├───────────────────────┼───────────────┤
  │ Transport error, PUT  │ not retried   │
  ├───────────────────────┼───────────────┤
  │ Any response          │ proxied as-is │
  └───────────────────────┴───────────────┘
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          method PUT
      }
  }

  ┌───────────────────────┬───────────────┐
  │       Scenario        │   Behavior    │
  ├───────────────────────┼───────────────┤
  │ Dial error            │ retried       │
  ├───────────────────────┼───────────────┤
  │ Transport error, GET  │ not retried   │
  ├───────────────────────┼───────────────┤
  │ Transport error, POST │ not retried   │
  ├───────────────────────┼───────────────┤
  │ Transport error, PUT  │ retried       │
  ├───────────────────────┼───────────────┤
  │ Any response          │ proxied as-is │
  └───────────────────────┴───────────────┘

GET loses its' default transport error retry because RetryMatch is non-nil and only PUT is defined. Adding expression handling would result into even more unclear logic. That's why using expression with plain options isn't allowed.

Just to remind what transport error is (roughly):

  • Upstream accepted the connection then reset it
  • TLS handshake succeeded but upstream closed during request/response transfer
  • Upstream timed out while sending the response body
  • Upstream sent malformed HTTP response headers

Now examples of how the new expression can be used:

  • Retry 502/503 responses only
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `{rp.status_code} in [502, 503]`
      }
  }

  ┌──────────────────────────────┬───────────────┐
  │           Scenario           │   Behavior    │
  ├──────────────────────────────┼───────────────┤
  │ Dial error, any method       │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ Transport error, any method  │ not retried   │
  ├──────────────────────────────┼───────────────┤
  │ 502/503 response, any method │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ 200/500 response             │ proxied as-is │
  └──────────────────────────────┴───────────────┘
  • Retry on response header only
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `{rp.header.X-Upstream-Retry} == "true"`
      }
  }

  ┌──────────────────────────────────────┬───────────────┐
  │               Scenario               │   Behavior    │
  ├──────────────────────────────────────┼───────────────┤
  │ Dial error, any method               │ retried       │
  ├──────────────────────────────────────┼───────────────┤
  │ Transport error, any method          │ not retried   │
  ├──────────────────────────────────────┼───────────────┤
  │ Response with X-Upstream-Retry: true │ retried       │
  ├──────────────────────────────────────┼───────────────┤
  │ Response without header              │ proxied as-is │
  └──────────────────────────────────────┴───────────────┘
  • Retry 503 only for POST
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `method('POST') && {rp.status_code} == 503`
      }
  }

  ┌─────────────────────────────┬───────────────┐
  │          Scenario           │   Behavior    │
  ├─────────────────────────────┼───────────────┤
  │ Dial error, any method      │ retried       │
  ├─────────────────────────────┼───────────────┤
  │ Transport error, any method │ not retried   │
  ├─────────────────────────────┼───────────────┤
  │ 503 POST                    │ retried       │
  ├─────────────────────────────┼───────────────┤
  │ 503 GET                     │ not retried   │
  ├─────────────────────────────┼───────────────┤
  │ 502 POST                    │ not retried   │
  ├─────────────────────────────┼───────────────┤
  │ 200 POST                    │ proxied as-is │
  └─────────────────────────────┴───────────────┘
  • Retry 5xx for /api* paths only
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `path('/api*') && {rp.status_code} >= 500`
      }
  }

  ┌─────────────────────────────┬───────────────┐
  │          Scenario           │   Behavior    │
  ├─────────────────────────────┼───────────────┤
  │ Dial error, any method      │ retried       │
  ├─────────────────────────────┼───────────────┤
  │ Transport error, any method │ not retried   │
  ├─────────────────────────────┼───────────────┤
  │ 500 GET /api/foo            │ retried       │
  ├─────────────────────────────┼───────────────┤
  │ 502 POST /api/foo           │ retried       │
  ├─────────────────────────────┼───────────────┤
  │ 500 GET /other              │ not retried   │
  ├─────────────────────────────┼───────────────┤
  │ 200 GET /api/foo            │ proxied as-is │
  └─────────────────────────────┴───────────────┘
  • Retry all transport errors + 502/503 responses
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `isTransportError() || {rp.status_code} in [502, 503]`
      }
  }

  ┌──────────────────────────────┬───────────────┐
  │           Scenario           │   Behavior    │
  ├──────────────────────────────┼───────────────┤
  │ Dial error, any method       │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ Transport error, any method  │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ 502/503 response, any method │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ 200/500 response             │ proxied as-is │
  └──────────────────────────────┴───────────────┘
  • Retry GET transport errors + 502/503 responses
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `(isTransportError() && method('GET')) || {rp.status_code} in [502, 503]`
      }
  }

  ┌──────────────────────────────┬───────────────┐
  │           Scenario           │   Behavior    │
  ├──────────────────────────────┼───────────────┤
  │ Dial error, any method       │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ Transport error, GET         │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ Transport error, POST        │ not retried   │
  ├──────────────────────────────┼───────────────┤
  │ 502/503 response, any method │ retried       │
  ├──────────────────────────────┼───────────────┤
  │ 200 response                 │ proxied as-is │
  └──────────────────────────────┴───────────────┘
  • Retry PUT transport errors to /api* + any 502
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `(isTransportError() && method('PUT') && path('/api*')) || {rp.status_code} == 502`
      }
  }

  ┌────────────────────────────────────┬───────────────┐
  │              Scenario              │   Behavior    │
  ├────────────────────────────────────┼───────────────┤
  │ Dial error, any method             │ retried       │
  ├────────────────────────────────────┼───────────────┤
  │ Transport error, PUT /api/foo      │ retried       │
  ├────────────────────────────────────┼───────────────┤
  │ Transport error, PUT /other        │ not retried   │
  ├────────────────────────────────────┼───────────────┤
  │ Transport error, GET /api/foo      │ not retried   │
  ├────────────────────────────────────┼───────────────┤
  │ 502 response, any method, any path │ retried       │
  ├────────────────────────────────────┼───────────────┤
  │ 200 response                       │ proxied as-is │
  └────────────────────────────────────┴───────────────┘
  • Retry POST on 503, GET on transport errors
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `method('POST') && {rp.status_code} == 503`
      }
      lb_retry_match {
          expression `isTransportError() && method('GET')`
      }
  }

  ┌────────────────────────┬───────────────────┐
  │        Scenario        │     Behavior      │
  ├────────────────────────┼───────────────────┤
  │ Dial error, any method │ retried           │
  ├────────────────────────┼───────────────────┤
  │ Transport error, GET   │ retried (block 2) │
  ├────────────────────────┼───────────────────┤
  │ Transport error, POST  │ not retried       │
  ├────────────────────────┼───────────────────┤
  │ 503 POST               │ retried (block 1) │
  ├────────────────────────┼───────────────────┤
  │ 503 GET                │ not retried       │
  ├────────────────────────┼───────────────────┤
  │ 502 POST               │ not retried       │
  └────────────────────────┴───────────────────┘
  • Retry idempotent transport errors + 502 for specific host
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `isTransportError() && (method('GET') || method('PUT') || method('DELETE'))`
      }
      lb_retry_match {
          expression `host('api.example.com') && {rp.status_code} == 502`
      }
  }

  ┌──────────────────────────┬───────────────────┐
  │         Scenario         │     Behavior      │
  ├──────────────────────────┼───────────────────┤
  │ Dial error, any method   │ retried           │
  ├──────────────────────────┼───────────────────┤
  │ Transport error, GET     │ retried (block 1) │
  ├──────────────────────────┼───────────────────┤
  │ Transport error, PUT     │ retried (block 1) │
  ├──────────────────────────┼───────────────────┤
  │ Transport error, DELETE  │ retried (block 1) │
  ├──────────────────────────┼───────────────────┤
  │ Transport error, POST    │ not retried       │
  ├──────────────────────────┼───────────────────┤
  │ 502 to api.example.com   │ retried (block 2) │
  ├──────────────────────────┼───────────────────┤
  │ 502 to other.example.com │ not retried       │
  ├──────────────────────────┼───────────────────┤
  │ 200 response             │ proxied as-is     │
  └──────────────────────────┴───────────────────┘
  • Retry transport errors for multiple methods + 5xx for /api*
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `isTransportError() && (method('GET') || method('PUT'))`
      }
      lb_retry_match {
          expression `path('/api*') && {rp.status_code} >= 500`
      }
  }

  ┌────────────────────────┬───────────────────┐
  │        Scenario        │     Behavior      │
  ├────────────────────────┼───────────────────┤
  │ Dial error, any method │ retried           │
  ├────────────────────────┼───────────────────┤
  │ Transport error, GET   │ retried (block 1) │
  ├────────────────────────┼───────────────────┤
  │ Transport error, PUT   │ retried (block 1) │
  ├────────────────────────┼───────────────────┤
  │ Transport error, POST  │ not retried       │
  ├────────────────────────┼───────────────────┤
  │ 500 GET /api/foo       │ retried (block 2) │
  ├────────────────────────┼───────────────────┤
  │ 502 POST /api/foo      │ retried (block 2) │
  ├────────────────────────┼───────────────────┤
  │ 500 GET /other         │ not retried       │
  ├────────────────────────┼───────────────────┤
  │ 200 response           │ proxied as-is     │
  └────────────────────────┴───────────────────┘
  • Retry on response header + all transport errors
  reverse_proxy upstream1 upstream2 {
      lb_retries 3
      lb_retry_match {
          expression `isTransportError() || {rp.header.X-Upstream-Retry} == "true"`
      }
  }

  ┌─────────────────────────────────┬───────────────┐
  │            Scenario             │   Behavior    │
  ├─────────────────────────────────┼───────────────┤
  │ Dial error, any method          │ retried       │
  ├─────────────────────────────────┼───────────────┤
  │ Transport error, any method     │ retried       │
  ├─────────────────────────────────┼───────────────┤
  │ 200 with X-Upstream-Retry: true │ retried       │
  ├─────────────────────────────────┼───────────────┤
  │ 502 with X-Upstream-Retry: true │ retried       │
  ├─────────────────────────────────┼───────────────┤
  │ 200 without header              │ proxied as-is │
  └─────────────────────────────────┴───────────────┘

@seroperson seroperson force-pushed the i6267-reverse-proxy-retry branch from 8b1f9c6 to f4ec226 Compare March 28, 2026 15:38
@francislavoie
Copy link
Copy Markdown
Member

I'm confused. That looks like way too many changes for this. Why do we need "isTransportError" and such? That seems very over engineered. It's also changing how the placeholders are being set outside of the retries, I don't understand why. And why is it blocking other matchers to be alongside expression? That's very arbitrary, nothing else does that.

In my mind, the change should've just been "make sure placeholders are set during the retry flow, and add tests". That's it really. Possibly also "allow omitting expression token if the first input to lb_retry_match is a backtick-wrapped token".

@seroperson
Copy link
Copy Markdown
Author

seroperson commented Mar 28, 2026

@francislavoie thank you for a quick response

It's also changing how the placeholders are being set outside of the retries, I don't understand why.

It's now set earlier because the old place executes after the place where we decide to retry. Without the moving these variables would be unset in retry logic, which must go before HandleResponse, as HandleResponse is actually writes bytes to a downstream. It doesn't change the HandleResponse behaviour, just variables are set earlier.

Also I've added header_keys tracking to clear stale headers from previous attempts and prevent situations when, for example, upstream A returned header A, but upstream B didn't and read that previously set header A instead.

Why do we need "isTransportError" and such?

Everything else (path, query, protocol etc) already available in expression out-of-box, there is nothing new added besides isTransportError. But let's imagine we don't have isTransportError. When user writes something like this:

lb_retry_match { expression `{rp.status_code} in [502, 503]` }

Then in case of transport error, that's what happening in retry logic:

if lb.RetryMatch == nil && req.Method != "GET" {
// by default, don't retry requests if they aren't GET
return false
}
match, err := lb.RetryMatch.AnyMatchWithError(req)
if err != nil {
logger.Error("error matching request for retry", zap.Error(err))
return false
}
if !match {
return false
}

We don't get into the very first if because lb.RetryMatch != nil (lb_retry_match is set), and this match, err := lb.RetryMatch.AnyMatchWithError(req) evaluates into false because: rp.status_code is nil at the moment (got transport error, so no response from an upstream at all), so nil in [502,503] returns false.

It makes all transport error non-retriable if expression is set and contains rp.* checks (they always likely will be false). Which makes the whole transport error retrying logic even more unclear for user:

  • Default behavior: on transport error only GET is retriable, other methods are not
  • lb_retry_match { method PUT }: on transport error only PUT is retriable
  • lb_retry_match { method PUT; expression {rp.status_code} in [502, 503] } (or just only expression): on transport error nothing is retriable

I didn't found easy ways of fixing this + not break the existing logic + not code workarounds + let API remain simple and not confusing for user.

And why is it blocking other matchers to be alongside expression?

That's why I separated expression from existing retry logic:

  • This way everything remains backward compatible
  • Transport errors now have a transparent way of handling with isTransportError()
  • Less footguns for users with mixing both plain options and expression (which also allows to use method, path etc)

If we somehow deal with transport error handling, then I guess there is no point to restrict mixing. Actually if we stick to isTransportError, then there is already no strict necessity, just I thought that it will be more clear for users to define everything either using old or new approach, but not mixing them (because options-only approach has its' own way of handling transport errors).

"make sure placeholders are set during the retry flow, and add tests". That's it really.

I also thought that initially, but it's not that simple. Besides things described above, I've also added additional handling of failure counting for health checks (so upstream marked unhealthy only in case of dial/transport errors), and the actual response matcher handling in reverseProxy: the pre-existed retrying logic is done only in tryAgain where response isn't available.

I can revert the isTransportError() and the mixing restriction, but then transport error handling will be unpredictable for users. The rest is necessary for things to work, though maybe I'm missing something and some things may be simplified.

@seroperson seroperson force-pushed the i6267-reverse-proxy-retry branch from f4ec226 to 0283a77 Compare March 28, 2026 23:59
Comment on lines +330 to +336
// expression matchers and non-expression matchers cannot be
// mixed in the same block - expression matchers evaluate
// against responses while non-expression matchers evaluate
// against transport errors
if _, hasExpr := matcherSet["expression"]; hasExpr && len(matcherSet) > 1 {
return d.Errf("lb_retry_match: expression cannot be mixed with other matchers in the same block; use either an expression or request matchers, not both")
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this condition makes sense, it's okay if users mix.

Comment on lines +129 to +132
// rewrite isTransportError() to isTransportError(req) so the
// function receives the request context
m.expandedExpr = transportErrorRegexp.ReplaceAllString(m.expandedExpr, transportErrorExpansion)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want matchers to be proxy/transport aware. That's a very leaky abstraction. I think you only added this as convenience but I don't think it's a good idea, let users just use a placeholder for the transport error check, don't add a new CEL function.


reverse_proxy 127.0.0.1:65535 {
lb_retries 3
lb_retry_match expression `{rp.status_code} in [502, 503]`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This syntax should be possible too (like named matchers also allow it)

Suggested change
lb_retry_match expression `{rp.status_code} in [502, 503]`
lb_retry_match `{rp.status_code} in [502, 503]`

Basically just need to get the first argument and if it's a quoted arg, assume it's an expression. You can find the existing code that does that for named matcher parsing

Comment on lines +1060 to +1066
if prev, ok := repl.Get("http.reverse_proxy.header_keys"); ok {
if keys, ok := prev.([]string); ok {
for _, k := range keys {
repl.Delete("http.reverse_proxy.header." + k)
}
}
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there must be a way to do this without filing in a map with all the header keys, right? Can we make a like DeleteByPrefix() maybe which takes in foo.bar. to delete everything starting with that string from the replacer?

Comment on lines +1081 to +1083
// {rp.header.*}). Pure request-only matchers (method, path, etc.)
// are skipped to avoid retrying every response that matches a
// request condition
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh... Is that why you did it with a cel function? That's so gnarly 😬 that'll take some more thinking. I'd much rather not leak into CEL like that.

I don't remember if I had already done this, do we have a way to register a CEL function from another package? I think we did that for the file matcher, I'd be more ok with it if we defined the function in the proxy package instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature ⚙️ New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Custom conditions for retrying proxy requests

4 participants