
Conversation

@KaspariK
Member

@KaspariK KaspariK commented Jan 29, 2026

Mix of human and Claude, fyi.

The crux of the work here is adding compression. This is made somewhat more painful by supporting both S (string) and B (binary) JSON attributes (for now). I manually tested restores against a few tables:

  • devc
    • First restore (no compressed items, restoring from existing data)
    • Second restore (mixed items, restoring from both)
      • Forgot to fluffy this before I blew it away, but it looked normal
  • prod
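For illustration, here's a minimal sketch of what the compressed save/restore path could look like. This is not Tron's actual storage code; the function names are hypothetical, and the real implementation deals with DynamoDB attribute types and partitioning that this glosses over. The key idea is that compressed items are stored as binary (B) while legacy items remain plain-JSON strings (S), so the restore path has to accept both:

```python
import json
import zlib


def compress_state(state: dict) -> bytes:
    """Serialize a state dict and compress it for storage as a binary (B) attribute."""
    return zlib.compress(json.dumps(state).encode("utf-8"))


def restore_state(raw) -> dict:
    """Restore from either compressed bytes (B) or a legacy plain-JSON string (S)."""
    if isinstance(raw, (bytes, bytearray)):
        return json.loads(zlib.decompress(raw).decode("utf-8"))
    return json.loads(raw)  # legacy uncompressed JSON
```

The mixed-format restore is what the devc "second restore" above was exercising: some items written before the change, some after.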

This has also been running in tron-infrastage for a few days now, though the jobs there only exercise the non-partitioned saves (that's like 99% of jobs, but 0% of problematic jobs :p). You can see some mixed state here.

In terms of revert safety, I'm just counting on pickles in the worst-case scenario (i.e. read_json false). The alternatives are not great (multiple stages of read + write rollouts / adding a 3rd json_c attribute or something) and are part of why I haven't rolled this out in prior unhacks.
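The revert story above boils down to one flag on the read path. A rough sketch, assuming a `read_json` toggle like the one mentioned (the function name and shape are hypothetical, not the actual Tron code):

```python
import json
import pickle


def deserialize_item(raw: bytes, read_json: bool):
    """Prefer JSON when enabled; fall back to pickle.

    Flipping read_json to False reverts entirely to the pickle path,
    which is the worst-case escape hatch.
    """
    if read_json:
        try:
            return json.loads(raw)
        except (ValueError, UnicodeDecodeError):
            pass  # not valid JSON; fall through to pickle
    return pickle.loads(raw)
```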

For impact on partitions, see a sample large job's state.

Note: I should probably add a script here for compressing already saved items. Similar to de-pickling work that would allow us to speed through validating things and get everything to a compressed state quickly.
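The backfill script mentioned above could be as simple as: walk every stored item, rewrite any legacy plain-JSON string as compressed bytes, and leave already-compressed items alone. A hedged sketch against an in-memory table (the real script would scan the actual store and batch its writes):

```python
import json
import zlib


def backfill_compress(table: dict) -> int:
    """Compress every legacy (plain-JSON string) item in place.

    Returns the number of items converted; already-compressed bytes
    are left untouched, so the script is safe to re-run.
    """
    converted = 0
    for key, raw in table.items():
        if not isinstance(raw, (bytes, bytearray)):
            table[key] = zlib.compress(raw.encode("utf-8"))
            converted += 1
    return converted
```

Being idempotent matters here: like the de-pickling work, you want to be able to run it repeatedly while validating, until everything is in the compressed state.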

Extra things outside of compression

Metrics

Added a counter and a gauge for save-loop errors. Imo this is the first step towards maybe not-"safe"-crashing after some arbitrary number of errors. I still think in most cases we can't really do anything to help, but at least we'd alert on save issues instead of waiting for some other signal to alert (like staleness or API availability).
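The counter/gauge split could look something like the following (a hypothetical sketch, not the actual metrics wiring): a monotonic counter for total save errors, plus a gauge of consecutive failures that resets on success, which is the natural input for a future "crash after N errors" threshold:

```python
class SaveLoopMetrics:
    """Sketch of save-loop error metrics.

    save_errors_total is a counter (never resets); consecutive_failures
    is a gauge that resets on success, so alerting (or an eventual crash
    threshold) can distinguish a blip from a stuck save loop.
    """

    def __init__(self) -> None:
        self.save_errors_total = 0
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.save_errors_total += 1
        self.consecutive_failures += 1

    def record_success(self) -> None:
        self.consecutive_failures = 0
```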

Logging

I can't recall the ticket number, but most of us have run into this verbose logging, so I tried to minimize it a bit. I moved some logs up to statemanager so they're not emitted for every restored object, and got rid of some unhelpful logs elsewhere.

Tests

This mostly addresses personal pet peeves. I organized things a bit (delete, restore, and save tests grouped together; removed unused things, etc.). I think these could still use love, but the end is nigh :meow-pensive-pray:.

I did add a test for "legacy" JSON + renamed and split some of the tests around saving and restoring for consistency reasons.

Types

Fixed some mypy issues. Way back when, we set the return types on some of the serialization funcs as Optional[str] (and eventually str | None); I don't think this was ever correct. I'm dumb? (Only a little)
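In other words, if the serializer can no longer return None, the honest signature drops the Optional. A tiny illustrative example (hypothetical names, not the actual Tron functions):

```python
import json
from typing import Any


# Before: declared as -> Optional[str] / str | None, but None was never
# actually a meaningful return value. After: just str, which is what
# mypy can then hold callers to.
def serialize(state: dict[str, Any]) -> str:
    return json.dumps(state)
```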

@KaspariK KaspariK marked this pull request as ready for review February 2, 2026 19:57
@KaspariK KaspariK requested a review from a team as a code owner February 2, 2026 19:57
@KaspariK KaspariK changed the title [WIPWIPWIPWIPWIPWIP] U/kkasp/tron 2452 compress json Compress json Feb 2, 2026
Member

@nemacysts nemacysts left a comment


(i think this is the mypy error we were talking about earlier - i can ship once that's resolved)

```python
vals = store.restore([key])
assert key not in vals
vals = store.restore(keys)
assert vals == {keys[1]: small_object}
```

i would probably just inline "job_run_state" rather than keys[1] - but that's just me :D

(i did spend longer than i'd like to admit trying to figure out if i was missing something though :p)

@KaspariK
Member Author

KaspariK commented Feb 4, 2026

(i think this is the mypy error we were talking about earlier - i can ship once that's resolved)

I opted to remove Persistable. I'm not strongly opinionated on this since I think it's pretty funky either way. If you prefer returning "{}" I can go that route. I could also flat out remove NoActionRunner in another branch 🤷

Contributor

@jfongatyelp jfongatyelp left a comment


(that's like 99% of jobs, but 0% of problematic jobs :p)

I do think it might be worth getting one of the silly spark jobs to run on infrastage, since that's where most of the jobrun state bloat is coming from -- i'm kind of betting even the simplest of executor: spark jobs still needs partitioning, and so would be a good test?

Note: I should probably add a script here for compressing already saved items. Similar to de-pickling work that would allow us to speed through validating things and get everything to a compressed state quickly.

Is the plan that we aren't going to roll out until we have a tool to update existing data, or are we fine just rolling it out and letting it write compressed, restore from either+fallback to pickles for a little bit?

@KaspariK
Member Author

KaspariK commented Feb 5, 2026

(that's like 99% of jobs, but 0% of problematic jobs :p)

I do think it might be worth getting one of the silly spark jobs to run on infrastage, since that's where most of the jobrun state bloat is coming from -- i'm kind of betting even the simplest of executor: spark jobs still needs partitioning, and so would be a good test?

I did test with a large job locally by creating a dag with excessive actions (~50 actions of this); it just seemed especially gross to commit to yelpsoa or here. I can take a peek at what actually gets partitioned these days, but I suspect you're right.

Note: I should probably add a script here for compressing already saved items. Similar to de-pickling work that would allow us to speed through validating things and get everything to a compressed state quickly.

Is the plan that we aren't going to roll out until we have a tool to update existing data, or are we fine just rolling it out and letting it write compressed, restore from either+fallback to pickles for a little bit?

My vague plan was to try and finish it off during my on-call since the previous slow rollouts are partly why this is still in a Roman riding state. Technically this should be ok as-is, but the sooner we convert everything the sooner we can move on :p. I meant for the script to already be in here, but ran out of time to test it yesterday. Coming soon!

@KaspariK KaspariK merged commit b4cdf8b into master Feb 9, 2026
5 of 6 checks passed
