sql: Add design doc for addressing the "optimizer/customer trade-off problem"#35441

Open
mgree wants to merge 3 commits into MaterializeInc:main from mgree:design-doc-optimizer-versioning

Conversation

@mgree mgree commented Mar 11, 2026

When we make changes to the optimizer, we hope that everyone will have a good time (more freshness, less memory usage, etc.). But not every customer's workload is the same, and there's the potential for a customer to have a bad time.

It would be good to ensure that people can continue to have the kind of time they're having, even as we make changes to the optimizer.

This design doc (rendered) explores the space of solutions, and proposes a particular one: plan pinning by way of freezing clusters.

@github-actions

Thanks for opening this PR! Here are a few tips to help make the review process smooth for everyone.

PR title guidelines

  • Use imperative mood: "Fix X" not "Fixed X" or "Fixes X"
  • Be specific: "Fix panic in catalog sync when controller restarts" not "Fix bug" or "Update catalog code"
  • Prefix with area if helpful: compute:, storage:, adapter:, sql:

Pre-merge checklist

  • The PR title is descriptive and will make sense in the git log.
  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).

@mgree mgree force-pushed the design-doc-optimizer-versioning branch from 5801a05 to 7fbd51c Compare March 18, 2026 20:28
@mgree mgree marked this pull request as ready for review March 25, 2026 14:31
- Feature flags apply to entire environments, not individual clusters. If different teams need different flags, they are out of luck.
- Exponentially many configurations---we can't test every combination of flags, and flags interact.
- Unknown support windows.


Isn't this missing: not everything is feature-flaggable (or if it is, it might be VERY hard), since the intro motivated this with the repr type work.

@ggevay ggevay left a comment

I have a lot of questions

@mgree mgree changed the title sql: Add design doc for optimizer versioning sql: Add design doc for addressing the "optimizer/customer trade-off problem" Apr 1, 2026

```sql
ALTER CLUSTER foo FREEZE;
ALTER CLUSTER foo UNFREEZE;
```

So this means to pin a plan, I freeze the cluster? As soon as I unfreeze, the plan immediately goes to the latest LIR plan?


> As soon as I unfreeze, the plan immediately goes to the latest LIR plan?

I'm not sure, but I would imagine that the typical procedure for moving away from a pinned plan would be to do either a blue-green deploy or an ALTER MV-like workflow, so that you can see whether the new plan would work well. We should think more about the details of this, e.g., what it would look like when the "ALTER MV-like workflow" for moving to a newer plan needs to happen at the cluster level.
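To make the blue-green route concrete, here is a sketch of how moving off a pinned plan might go. Everything beyond plain `CREATE CLUSTER`/`CREATE MATERIALIZED VIEW` is an assumption: the `SWAP`-style cutover and all object names are illustrative, not settled syntax or workflow.

```sql
-- Sketch of a blue-green move off a pinned plan (hypothetical workflow).
-- `foo` is frozen and still serving the old, pinned LIR plans.
CREATE CLUSTER foo_green (SIZE = '100cc');

-- Recreate the dataflows on foo_green so they are planned by the
-- current optimizer, then watch hydration, memory, and freshness.
CREATE MATERIALIZED VIEW mv2_green IN CLUSTER foo_green AS
    SELECT key, count(*) AS n FROM t GROUP BY key;

-- If the new plans behave well, cut over and retire the frozen cluster.
ALTER CLUSTER foo SWAP WITH foo_green;  -- hypothetical cutover step
DROP CLUSTER foo_green CASCADE;
```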


There's a similar question here about what happens during a blue-green dbt deploy. We spin up & spin down new clusters during a blue-green deploy, which means that your plan pinning would disappear!

Is there a way for us to specify the plan version explicitly? Something like `CREATE CLUSTER WITH OPTIMIZER PLAN v1`?

No changes can be made to `foo`: no new dataflows, no removals.
It will not be part of this work, but it seems sensible to limit other actions on frozen clusters, e.g., you may only run fast-path `SELECT`s and `SUBSCRIBE`s (with the possible exception of queries that touch introspection sources).
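Concretely, the intended behavior on a frozen cluster might look like the following sketch. The `FREEZE` statement comes from the proposal itself; the rejection behavior is an assumption, and `some_indexed_view` is a hypothetical stand-in for an existing indexed object.

```sql
ALTER CLUSTER foo FREEZE;

SET cluster = foo;

-- Rejected: this would install a new dataflow on a frozen cluster.
CREATE MATERIALIZED VIEW mv_new AS SELECT 1 AS x;

-- Still allowed: a fast-path read against an existing arrangement.
SELECT * FROM some_indexed_view WHERE key = 42;
```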

### Why at the cluster level?

👍

@mtabebe mtabebe left a comment

I think this is a much stronger document and proposal. Nice work

@ggevay ggevay left a comment

I'm liking this new version quite a bit! My biggest concern with plan pinning was

> Any changes to the plan and you lose your pin.

(I'd formulate it as "Any changes to the query and you lose your pin.")

But

> Mitigation: use MVs to separate the units you care about.

might be enough of a mitigation.

If we freeze `C3`, we certainly can't make changes to `C3`.
What about `V2` (which is inlined into the definition of `MV2`)?
What about `MV1` (which is read from persist)?
What about `S1`?

I'd say the only optimizer-relevant question here is a potential change in V2. This is because changes in MV1 and S1 would change the persist-level input of MV2, so no direct effect on the plan of MV2. (And avoiding downstream surprises is an active topic of discussion in the ALTER MV workstream.)

For V2, actually, our current ALTER VIEW can't change the view's query (can only change the name and owner), so we don't have an issue with this for now.

- Who flips the bits? If it's us: high support burden. If it's someone else: what if they break things?
- Unknown support windows.

### `plan-pinning`

A thing that is unclear to me with plan-pinning is how will users know when they'll have to migrate away from an old pinned plan? With optimizer-versions, this is more clear: We can warn them some time before the scheduled removal of an old optimizer version, and it should be easy even for self-managed users to see whether they'll be affected or not. But with plan-pinning, a breaking change that would make a pinned plan impossible to render anymore would not have an obvious warning before it.

To solve this, we might want to add some code, some time before making a breaking change, that hunts down the pinned plans that would be affected; it would not break them yet, but would warn the user loudly that they should try migrating away from the pinned plan, also telling them when the actual breaking change is scheduled.
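One way that warning could surface is via an introspection relation that users (including self-managed ones) can poll. Everything below is hypothetical: `mz_internal.mz_pinned_plan_deprecations` and its columns do not exist today and are only a sketch of the idea.

```sql
-- Hypothetical: list pinned plans that an upcoming breaking change
-- would make unrenderable, plus when the change is scheduled to land.
SELECT cluster_name, object_name, breaking_change, scheduled_for
FROM mz_internal.mz_pinned_plan_deprecations
ORDER BY scheduled_for;
```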


Pros:

+ Ties in neatly with related ideas of "production clusters", guarantees, and auto-scaling.

Yes, I feel more and more momentum building lately behind the "production clusters" idea.


+ Ties in neatly with related ideas of "production clusters", guarantees, and auto-scaling.
+ Ties in neatly with related ideas of "DDIR" or some other stable, low-level interface.
+ Offers the most reliable possible experience---a fixed LIR plan would be stable even if bugfixes in MIR cause queries to change.

Well, optimizer-versions would be even more reliable, in that it would also solve the query-tweaking problem, i.e., when a tiny query tweak causes a big plan change because the tweaked query is planned with different optimizer code.


There are two closely related but not identical problems:
1. **`our-bad`** MZ optimizer changed and it broke on redeploy.
2. **`your-bad`** You changed something and it broke in staging.

Well, it's kinda debatable: when a user tweaks a query and it has a big plan change, not because the change is big at the SQL level or because of a discontinuity in the optimizer, but because different optimizer code is running, is it our fault (optimizer code change) or their fault (query change)?

But the "Mitigation: use MVs to separate the units you care about." might mitigate this problem enough.



@antiguru antiguru left a comment

Leaving some comments. I'm not yet convinced that durably recording LIR is a good idea!


Cons:

- Code duplication. (Somewhat mitigated by `git subtree`.)

Please do not use subtrees; they're too much of a headache.

Cons:

- Code duplication. (Somewhat mitigated by `git subtree`.)
- We do not know what kind of support window we will want, and may get backed into things we end up disliking.

That's a product question, not an engineering problem, I think!


Cons:

- LIR is not a stable interface. DDIR does not actually exist.

I think we're treating LIR as the stable boundary between the optimizer and rendering. It can change, but changes require adjustment in all optimizer versions. Changes to rendering are not part of this design doc (but they essentially suffer from the same problems explained here).


Cons:

- LIR is not a stable interface. DDIR does not actually exist.

Well, DDIR doesn't exist, so hard to say, but I think this:

> When evolving it, we can just add a new field that does a new thing.

is potentially dangerous: LIR should have less special-casing in the future, not more. The fact that it has special-casing doesn't justify piling more onto it, but should rather motivate figuring out what's wrong.

Overall, I'd say this design should stop at LIR, especially given we're not going to invest in a DDIR right now.

+ Ties in neatly with related ideas of "DDIR" or some other stable, low-level interface.
+ Offers the most reliable possible experience---a fixed LIR plan would be stable even if bugfixes in MIR cause queries to change.

Cons:

Another con: more durable state.

+ The finest-grained control.
+ Avoids/defers the need to have smart query planning.

Cons:

Another con: it feels like a trap door; undoing hints is hard once people adopt it.


## Solution Proposal

We propose using **`plan-pinning`**.

I think we shouldn't underestimate the burden to commit to a durable LIR format. What's the process of evolving it? How do we detect changes? How do we prevent silent incompatibilities?

LIR depends on large parts of Mz's implementation, relation expressions, scalars, etc. and I'm not sure all parts have committed to a stable representation.

```sql
ALTER CLUSTER foo UNFREEZE;
```

We will store the LIR for all of the dataflows on `foo`, and automatically use those LIR plans on reboot.

Where will you store the LIR?
