This repository was archived by the owner on Jan 8, 2024. It is now read-only.
Not interpreting server NotFound error as server down#4854
Draft
izaaklauer wants to merge 1 commit intomainfrom
Draft
Not interpreting server NotFound error as server down#4854izaaklauer wants to merge 1 commit intomainfrom
izaaklauer wants to merge 1 commit intomainfrom
Conversation
izaaklauer
commented
Jul 21, 2023
Comment on lines
+12
to
+15
|
|
||
| projConfig "github.com/hashicorp/waypoint/internal/config" | ||
| "github.com/hashicorp/waypoint/internal/core" | ||
| pb "github.com/hashicorp/waypoint/pkg/server/gen" |
Contributor
Author
There was a problem hiding this comment.
Unrelated auto-formatting fix
0a9baf8 to
915362f
Compare
Co-authored-by: Martina Santangelo <[email protected]>
915362f to
bb671e8
Compare
martisah
approved these changes
Jul 25, 2023
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #4460
How does the runner get stuck?
Here's the failure scenario:
1: The server and runner are both connected and fine.
2: A new job arrives, and the server decides to assign it to our runner.
3: Our first problem - the server has an issue assigning the job to a runner, and it's a NotFound-type error. We've seen it happen here, while getting config-sourcers.
waypoint/pkg/server/singleprocess/service_runner.go
Line 661 in 918df22
4: The runner receives this error, and because it's a NotFound error, it calls
goto RESTART_JOB_STREAM(here), which we should only do if we've actually disconnected from the server. This is the bug. Relevant chunk:waypoint/internal/runner/accept.go
Lines 278 to 287 in 918df22
5: The runner follows the goto, and then blocks here forever:
waypoint/internal/runner/accept.go
Line 224 in 918df22
Why is the runner blocking there, you ask? Here's my best understanding:
If this were a true disconnect, there is a second goroutine that would also be attempting to reconnect - the recvConfig loop (here). It has sophisticated reconnect logic, and if it were to also have seen the disconnect, it would reconnect, and then increment the runner-wide state version here, which would unblock the accept goroutine. @martisah and I verified this through observation.
That isn't happening here, because the recvConfig goroutine hasn't seen an error case, so it never has a reason to modify the runner state.
How does this fix it?
The primary cause is that the runner's accept goroutine is interpreting any NotFound error from the server as a disconnect event, which it expects all other goroutines will see. That's a bad assumption.
This PR modifies the accept goroutine to exit if it sees a NotFound error, rather than attempt a server reconnect. This raised a new bug: the top-level
AcceptManyfunction was interpreting the server sendingNotFoundon the runner job stream as a deregistration event:waypoint/internal/runner/accept.go
Lines 76 to 80 in 918df22
That's also a bad assumption. The server in fact sends FailedPrecondition (here).
This fixes that second bug by again interpreting NotFound as a generic error. The AcceptMany NotFound check was introduced to prevent a runner DDOS event (here), and this continues to protect against that by moving the sleep before reconnect to the default case.
tl;dr: with this change, if the server responds to the runner with a NotFound error, the Accept loop will log the error, exit, and then be restarted successfully by AcceptMany, ready to accept another job.
How can I reproduce this bug?
waypoint up's, destroying the project, recreating the project, and repeating.server down before accepting a job, will reconnect, and then refuse to accept additional jobs. You will notice continued logs fromwaypoint.runner.agent.runner.config.watcher, but none from the accept loop.Future considerations