Compress json #1090
Conversation
nemacysts
left a comment
(i think this is the mypy error we were talking about earlier - i can ship once that's resolved)
vals = store.restore([key])
assert key not in vals
vals = store.restore(keys)
assert vals == {keys[1]: small_object}
i would probably just inline "job_run_state" rather than keys[1] - but that's just me :D
(i did spend longer than i'd like to admit trying to figure out if i was missing something though :p)
I opted to remove Persistable. I'm not strongly opinionated on this since I think it's pretty funky either way. If you prefer returning "{}" I can go that route. I could also flat out remove NoActionRunner in another branch 🤷
jfongatyelp
left a comment
(that's like 99% of jobs, but 0% of problematic jobs :p)
I do think it might be worth getting one of the silly spark jobs to run on infrastage, since that's where most of the jobrun state bloat is coming from -- i'm kind of betting even the simplest executor: spark jobs still need partitioning, and so would be a good test?
Note: I should probably add a script here for compressing already-saved items, similar to the de-pickling work; that would let us speed through validating things and get everything to a compressed state quickly.
Is the plan that we aren't going to roll out until we have a tool to update existing data, or are we fine just rolling it out and letting it write compressed, restore from either, and fall back to pickles for a little bit?
I did test with a large job locally by creating a dag with excessive actions (~50 actions); it just seemed especially gross to commit to yelpsoa or here. I can take a peek at what actually gets partitioned these days, but I suspect you're right.
My vague plan was to try and finish it off during my on-call, since the previous slow rollouts are partly why this is still in a Roman riding state. Technically this should be ok as-is, but the sooner we convert everything the sooner we can move on :p. I meant for the script to already be in here, but ran out of time to test it yesterday. Coming soon!
Mix of human and Claude, fyi.
The crux of the work here is adding compression. This is made somewhat more painful by supporting both S (string) and B (binary) JSON attributes (for now). Manually tested restores against a few tables:
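At a high level, the save/restore path described above can be sketched like this (function names are hypothetical; assuming JSON + zlib, with compressed payloads landing in a B attribute and legacy JSON staying as an S string):

```python
import json
import zlib


def dump_state(state: dict) -> bytes:
    # Serialize to JSON, then zlib-compress; the bytes would go into a B attribute.
    return zlib.compress(json.dumps(state).encode("utf-8"))


def load_state(raw) -> dict:
    # Accept either a compressed B payload (bytes) or legacy uncompressed S JSON (str).
    if isinstance(raw, (bytes, bytearray)):
        raw = zlib.decompress(raw).decode("utf-8")
    return json.loads(raw)
```

Both directions round-trip, and restores transparently handle rows written before the change.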
This has also been running in tron-infrastage for a few days now, though the jobs there are only exercising the non-partitioned saves (that's like 99% of jobs, but 0% of problematic jobs :p). You can see some mixed state here.
In terms of revert safety, I'm just counting on pickles in the worst case (i.e. read_json false). The alternatives aren't great (multiple stages of read + write rollouts, or adding a third json_c attribute or something) and are part of why I haven't rolled this out in prior unhacks.
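As a sketch of that revert-safety story (field names here are made up; assuming each row keeps both a pickle copy and a JSON payload during the transition):

```python
import json
import pickle
import zlib


def restore_item(item: dict, read_json: bool = True):
    # Prefer the JSON payload (possibly zlib-compressed) when read_json is on;
    # fall back to the pickle copy so a revert (read_json=False) or an
    # un-migrated row still restores cleanly.
    if read_json and "json_val" in item:
        raw = item["json_val"]
        try:
            if isinstance(raw, (bytes, bytearray)):
                raw = zlib.decompress(raw).decode("utf-8")
            return json.loads(raw)
        except (ValueError, zlib.error):
            pass  # bad/unexpected JSON payload: fall through to pickle
    return pickle.loads(item["val"])
```

Flipping read_json off reverts every restore to the pickle path without any data migration.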
For impact on partitions, see a sample large job's state.
Note: I should probably add a script here for compressing already-saved items, similar to the de-pickling work; that would let us speed through validating things and get everything to a compressed state quickly.
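A backfill along those lines might look like this minimal sketch (the store interface here is entirely hypothetical; assuming plain-string S payloads get re-saved as compressed B payloads, and already-compressed rows are skipped):

```python
import zlib


def compress_existing(store) -> int:
    # One-off backfill: re-save any item whose JSON payload is still an
    # uncompressed string, so restores stop exercising the legacy path.
    converted = 0
    for key, raw in store.scan_json_items():  # hypothetical (key, payload) iterator
        if isinstance(raw, str):  # legacy S attribute -> compress into a B attribute
            store.save_json(key, zlib.compress(raw.encode("utf-8")))
            converted += 1
    return converted
```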
Extra things outside of compression
Metrics
Added a counter and gauge for save loop errors. Imo this is the first step towards maybe crashing (not-"safely") after some arbitrary number of errors. I still think in most cases we can't really do anything to help, but at least we'd alert on save issues instead of waiting for some other alert to fire (like staleness or api availability).
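Not the actual implementation, but the shape of it is roughly this (class and metric names invented for illustration): a monotonic counter of total save failures plus a gauge of consecutive failures that resets on success, emitted from inside the save loop.

```python
import logging

log = logging.getLogger(__name__)


class SaveLoopMetrics:
    def __init__(self):
        self.error_count = 0          # counter: total save failures
        self.consecutive_errors = 0   # gauge: failures since the last success

    def record_success(self) -> None:
        self.consecutive_errors = 0

    def record_error(self) -> None:
        self.error_count += 1
        self.consecutive_errors += 1


def save_all(items, save_fn, metrics: SaveLoopMetrics) -> None:
    # Keep saving on error, but emit metrics so save failures alert directly
    # instead of surfacing later as staleness or api-availability alarms.
    for key, val in items:
        try:
            save_fn(key, val)
            metrics.record_success()
        except Exception:
            log.exception("failed to save %s", key)
            metrics.record_error()
```

The consecutive-errors gauge is the hook for the "crash after N errors" idea: once it exceeds a threshold, we could choose to stop swallowing.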
Logging
I can't recall the ticket number, but most of us have run into this verbose logging, so I tried to minimize it a bit. I moved some logs up to the statemanager so they're not emitted for every restored object, and got rid of some unhelpful logs elsewhere.
Tests
This is mostly addressing personal pet peeves. I organized things a bit (delete, restore, and save tests together + removed unused things, etc.). I think these could still use love, but the end is nigh :meow-pensive-pray:.
I did add a test for "legacy" JSON + renamed and split some of the tests around saving and restoring for consistency reasons.
Types
Fixed some mypy issues. Way back when, we set return types on some of the serialization funcs as Optional[str] (and eventually str | None); I don't think this was correct. I'm dumb? (Only a little)
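For example (a hypothetical serialization func, not the actual one): json.dumps always returns a str, so an Optional[str] / str | None annotation just forces callers to None-check a value that can never be None, which is exactly the kind of mypy noise being cleaned up here.

```python
import json
from typing import Any


def serialize(state: dict[str, Any]) -> str:
    # json.dumps never returns None, so the return type is plain str;
    # the old Optional[str] annotation was never right.
    return json.dumps(state)
```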