
Conversation

@KaspariK
Member

@KaspariK KaspariK commented Jan 29, 2026

Mix of human and Claude, fyi.

The crux of the work here is adding compression. This is made somewhat more painful by supporting both S (string) and B (binary) JSON attributes (for now). I manually tested restores against a few tables:

  • devc
    • First restore (no compressed items, restoring from existing data)
    • Second restore (mixed items, restoring from both)
      • Forgot to fluffy this before I blew it away, but it looked normal
  • prod
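For illustration, here's a minimal sketch of what the compressed save/restore path could look like. This is not Tron's actual storage code; the function names are hypothetical, and the real implementation deals with DynamoDB attribute types and partitioning that this glosses over. The key idea is that compressed items are stored as binary (B) while legacy items remain plain-JSON strings (S), so the restore path has to accept both:

```python
import json
import zlib


def compress_state(state: dict) -> bytes:
    """Serialize a state dict and compress it for storage as a binary (B) attribute."""
    return zlib.compress(json.dumps(state).encode("utf-8"))


def restore_state(raw) -> dict:
    """Restore from either compressed bytes (B) or a legacy plain-JSON string (S)."""
    if isinstance(raw, (bytes, bytearray)):
        return json.loads(zlib.decompress(raw).decode("utf-8"))
    return json.loads(raw)  # legacy uncompressed JSON
```

The mixed-format restore is what the devc "second restore" above was exercising: some items written before the change, some after.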

This has also been running in tron-infrastage for a few days now, though the jobs there only exercise the non-partitioned saves (that's like 99% of jobs, but 0% of problematic jobs :p). You can see some mixed state here.

In terms of revert safety, I'm just counting on pickles in the worst-case scenario (i.e. read_json false). The alternatives are not great (multiple stages of read + write rollouts / adding a 3rd json_c attribute or something) and are part of why I haven't rolled this out in prior unhacks.
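The revert story above boils down to one flag on the read path. A rough sketch, assuming a `read_json` toggle like the one mentioned (the function name and shape are hypothetical, not the actual Tron code):

```python
import json
import pickle


def deserialize_item(raw: bytes, read_json: bool):
    """Prefer JSON when enabled; fall back to pickle.

    Flipping read_json to False reverts entirely to the pickle path,
    which is the worst-case escape hatch.
    """
    if read_json:
        try:
            return json.loads(raw)
        except (ValueError, UnicodeDecodeError):
            pass  # not valid JSON; fall through to pickle
    return pickle.loads(raw)
```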

For impact on partitions, see a sample large job's state.

Note: I should probably add a script here for compressing already saved items. Similar to de-pickling work that would allow us to speed through validating things and get everything to a compressed state quickly.
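The backfill script mentioned above could be as simple as: walk every stored item, rewrite any legacy plain-JSON string as compressed bytes, and leave already-compressed items alone. A hedged sketch against an in-memory table (the real script would scan the actual store and batch its writes):

```python
import json
import zlib


def backfill_compress(table: dict) -> int:
    """Compress every legacy (plain-JSON string) item in place.

    Returns the number of items converted; already-compressed bytes
    are left untouched, so the script is safe to re-run.
    """
    converted = 0
    for key, raw in table.items():
        if not isinstance(raw, (bytes, bytearray)):
            table[key] = zlib.compress(raw.encode("utf-8"))
            converted += 1
    return converted
```

Being idempotent matters here: like the de-pickling work, you want to be able to run it repeatedly while validating, until everything is in the compressed state.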

Extra things outside of compression

Metrics

Added a counter and a gauge for save-loop errors. Imo this is the first step towards maybe not-"safe"-crashing after some arbitrary number of errors. I still think in most cases we can't really do anything to help, but at least we'd alert on save issues instead of waiting for some other signal to alert (like staleness or API availability).
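The counter/gauge split could look something like the following (a hypothetical sketch, not the actual metrics wiring): a monotonic counter for total save errors, plus a gauge of consecutive failures that resets on success, which is the natural input for a future "crash after N errors" threshold:

```python
class SaveLoopMetrics:
    """Sketch of save-loop error metrics.

    save_errors_total is a counter (never resets); consecutive_failures
    is a gauge that resets on success, so alerting (or an eventual crash
    threshold) can distinguish a blip from a stuck save loop.
    """

    def __init__(self) -> None:
        self.save_errors_total = 0
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.save_errors_total += 1
        self.consecutive_failures += 1

    def record_success(self) -> None:
        self.consecutive_failures = 0
```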

Logging

I can't recall the ticket number, but most of us have run into this verbose logging, so I tried to minimize it a bit. I moved some logs up to statemanager so they're not emitted for every restored object, and got rid of some unhelpful logs elsewhere.

Tests

This mostly addresses personal pet peeves. I organized things a bit (delete, restore, and save tests grouped together; removed unused things, etc.). I think these could still use love, but the end is nigh :meow-pensive-pray:.

I did add a test for "legacy" JSON + renamed and split some of the tests around saving and restoring for consistency reasons.

Types

Fixed some mypy issues. Way back when, we set the return types on some of the serialization funcs as Optional[str] (and eventually str | None); I don't think this was ever correct. I'm dumb? (Only a little)
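In other words, if the serializer can no longer return None, the honest signature drops the Optional. A tiny illustrative example (hypothetical names, not the actual Tron functions):

```python
import json
from typing import Any


# Before: declared as -> Optional[str] / str | None, but None was never
# actually a meaningful return value. After: just str, which is what
# mypy can then hold callers to.
def serialize(state: dict[str, Any]) -> str:
    return json.dumps(state)
```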

@KaspariK KaspariK marked this pull request as ready for review February 2, 2026 19:57
@KaspariK KaspariK requested a review from a team as a code owner February 2, 2026 19:57
@KaspariK KaspariK changed the title [WIPWIPWIPWIPWIPWIP] U/kkasp/tron 2452 compress json Compress json Feb 2, 2026
Member

@nemacysts nemacysts left a comment


(i think this is the mypy error we were talking about earlier - i can ship once that's resolved)

```python
vals = store.restore([key])
assert key not in vals
vals = store.restore(keys)
assert vals == {keys[1]: small_object}
```

i would probably just inline "job_run_state" rather than keys[1] - but that's just me :D

(i did spend longer than i'd like to admit trying to figure out if i was missing something though :p)

@KaspariK
Member Author

KaspariK commented Feb 4, 2026

(i think this is the mypy error we were talking about earlier - i can ship once that's resolved)

I opted to remove Persistable. I'm not strongly opinionated on this since I think it's pretty funky either way. If you prefer returning "{}" I can go that route. I could also flat out remove NoActionRunner in another branch 🤷

Contributor

@jfongatyelp jfongatyelp left a comment


(that's like 99% of jobs, but 0% of problematic jobs :p)

I do think it might be worth getting one of the silly spark jobs to run on infrastage, since that's where most of the jobrun state bloat is coming from -- i'm kind of betting even the simplest of executor: spark jobs still needs partitioning, and so would be a good test?

Note: I should probably add a script here for compressing already saved items. Similar to de-pickling work that would allow us to speed through validating things and get everything to a compressed state quickly.

Is the plan that we aren't going to roll out until we have a tool to update existing data, or are we fine just rolling it out and letting it write compressed, restore from either+fallback to pickles for a little bit?

@KaspariK
Member Author

KaspariK commented Feb 5, 2026

(that's like 99% of jobs, but 0% of problematic jobs :p)

I do think it might be worth getting one of the silly spark jobs to run on infrastage, since that's where most of the jobrun state bloat is coming from -- i'm kind of betting even the simplest of executor: spark jobs still needs partitioning, and so would be a good test?

I did test with a large job locally by creating a dag with excessive actions (~50 actions of this); it just seemed especially gross to commit to yelpsoa or here. I can take a peek at what actually gets partitioned these days, but I suspect you're right.

Note: I should probably add a script here for compressing already saved items. Similar to de-pickling work that would allow us to speed through validating things and get everything to a compressed state quickly.

Is the plan that we aren't going to roll out until we have a tool to update existing data, or are we fine just rolling it out and letting it write compressed, restore from either+fallback to pickles for a little bit?

My vague plan was to try and finish it off during my on-call since the previous slow rollouts are partly why this is still in a Roman riding state. Technically this should be ok as-is, but the sooner we convert everything the sooner we can move on :p. I meant for the script to already be in here, but ran out of time to test it yesterday. Coming soon!

@KaspariK KaspariK merged commit b4cdf8b into master Feb 9, 2026
5 of 6 checks passed
