Rolling upgrade test (BATS, Docker) #3706
Conversation
Idea: Soon, Solr will have Testcontainers as a dependency (#3670), and Testcontainers is ideal for orchestrating and managing containers like this. I suppose you could write a JUnit test that sets up a Docker network, adds 3 Solr containers in cloud mode, then loops, swapping them one after the other with a newer version and inspecting logs or whatever you need. It makes sense for such a test to be disabled by default or included only in nightly runs. Here's Claude's draft stab at a rolling test: https://claude.ai/public/artifacts/c4d09d7b-df54-419a-b660-169e22191a26
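To make the idea concrete, here is a rough, untested Java sketch of that loop under the assumptions of the draft (the image tags, the cloud-mode command, and the health-check step are all placeholders; a real multi-node cluster would also need a shared ZooKeeper container, omitted here). It requires a running Docker daemon.

```java
// Rough, UNTESTED sketch of the rolling-upgrade idea with Testcontainers.
// Image tags, commands, and checks are placeholders, not Solr's real test code.
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.Network;
import org.testcontainers.utility.DockerImageName;

import java.util.ArrayList;
import java.util.List;

public class RollingUpgradeSketch {
    public static void main(String[] args) {
        try (Network net = Network.newNetwork()) {
            List<GenericContainer<?>> nodes = new ArrayList<>();
            // Start three old-version Solr nodes on one Docker network.
            for (int i = 0; i < 3; i++) {
                GenericContainer<?> solr =
                        new GenericContainer<>(DockerImageName.parse("solr:9.7"))
                                .withNetwork(net)
                                .withNetworkAliases("solr" + i)
                                .withExposedPorts(8983)
                                .withCommand("solr-foreground -c"); // placeholder
                solr.start();
                nodes.add(solr);
            }
            // Swap each node, one at a time, for the newer version.
            for (int i = 0; i < nodes.size(); i++) {
                nodes.get(i).stop();
                GenericContainer<?> upgraded =
                        new GenericContainer<>(DockerImageName.parse("solr:9.10"))
                                .withNetwork(net)
                                .withNetworkAliases("solr" + i)
                                .withExposedPorts(8983)
                                .withCommand("solr-foreground -c"); // placeholder
                upgraded.start();
                nodes.set(i, upgraded);
                // Here: inspect logs / query a collection to verify cluster health.
            }
            nodes.forEach(GenericContainer::stop);
        }
    }
}
```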
Thank you for #3670, but you say having a dependency (on Testcontainers) as if we've achieved something as a project. It's a few lines of configuration to depend on it. An achievement would be https://issues.apache.org/jira/browse/SOLR-17653, albeit that particular issue wouldn't be helpful for the upgrade test.
I am a huge fan of Testcontainers; I filed SOLR-17653 after all. I would have personally enjoyed writing/working with Java code to manipulate containers instead of Bash/BATS, and I could have used my desire for SOLR-17653 as an excuse here to do it. I was awfully tempted; I contemplated doing that very thing and had a 1:1 call with Eric Pugh about this rolling upgrade test to discuss how to go about it. I chose Bash/BATS in spite of disliking the framework. I am no good at this framework; GitHub Copilot did the critical 1st draft and most improvements afterwards.

The rolling upgrade test scenario doesn't speak to the strengths of Testcontainers specifically (a JUnit wrapper for Docker). It plays to the strengths of Docker, yes, but not specifically Testcontainers/JUnit. This is just orchestration here (not SolrJ integration, which definitely requires SOLR-17653), which is basically what our pile of Bash/BATS scripts do. We have a pile of similar Docker tests in Bash too, albeit not using BATS.

Put differently, if this specific test were to exist as a JUnit/Testcontainers based test, then it really calls into question why we have any Bash/BATS scripts at all -- why wouldn't they also be Testcontainers based? "When in Rome, do as the Romans do" -- I respected our status quo because it works, even though I'm not productive in it.
I thought we changed it back to having curl in it?
```shell
# which means you don't get an error message for passing a start arg, like --jvm-opts, to a stop command.
```
```shell
# Pre-check
timeout || skip "timeout utility is not available"
```
I tried to use the new wait_for method but didn't have any luck. So instead, let's just have better handling for when timeout isn't available.
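One thing worth noting: GNU `timeout` invoked with no arguments typically exits non-zero even when it is installed (usage error), so a `command -v` probe may be a safer availability check. A minimal sketch of that alternative pattern (the `have` helper is hypothetical, not part of the PR):

```shell
#!/bin/sh
# Sketch: probe tool availability with POSIX `command -v`.
# `have` succeeds only if the named command is found on PATH,
# without actually running the tool.
have() { command -v "$1" >/dev/null 2>&1; }

have sh && echo "sh: available"
have no-such-tool-xyz || echo "no-such-tool-xyz: missing"
```

In a BATS pre-check this would read something like `have timeout || skip "timeout utility is not available"`.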
```shell
local resp code body
sleep 5
resp=$(curl -s -S -w "\n%{http_code}" -X POST -H 'Content-type:application/json' -d "$json" "$url")
code="${resp##*$'\n'}"
```
I wonder if this should be using `run` and `assert_success` type BATS calls?
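For reference, the `-w "\n%{http_code}"` trick appends the HTTP status as a final line, and the `##*$'\n'` expansion peels it back off. A standalone bash sketch of that parsing, with a simulated response instead of a real curl call:

```shell
#!/usr/bin/env bash
# Simulate curl -w "\n%{http_code}" output: body, newline, status code.
resp=$(printf '{"status":"ok"}\n200')

code="${resp##*$'\n'}"   # drop everything through the last newline -> the status code
body="${resp%$'\n'*}"    # drop the last newline and what follows -> the body

echo "code=$code"   # code=200
echo "body=$body"   # body={"status":"ok"}
```

The same expansions work whether the body is one line or many, since `##` and `%` anchor on the *last* newline.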
I've been doing some testing of Solr 9.7 to 9.10 with changing … What would we want it to have to be considered worth merging? SSL setup? OAuth-based security? Other complex things?
Gosh; nothing complex is needed to get this useful test merged! I think the leading concern I have is that this test is slow and just different... I would like it to somehow get run separately from a normal integration test run. Like, this test should perhaps be opt-in, and we do the opt-in on Jenkins CI but not elsewhere.
@dsmiley I think the one change I want to make is to rename it. I'd love to get this in to use in testing some other scenarios.
This reverts commit 4bebd3c.
Tried to make it much closer to the test_docker_solrcloud style, but the combination of setup and setup_file plus the teardowns defeated me.
dsmiley left a comment
test_rolling_upgrade.bats is a much better name.
What keeps this heavy test from the normal rotation?
why change this test in this PR?
I was hoping to get both of them using better Docker patterns; I'll back it out...
Will rename. As far as keeping this heavy test from the normal rotation, it honestly isn't that much heavier than some other tests that run. I think in a separate PR we should come up with criteria for what is considered heavy, and deal with it there...
@dsmiley how do you feel about me merging this? I'd like to build on it for another scenario...
Go for it Eric! Very happy to have team-work where I start something and others take it further. |
This integration test has been failing every day since March 20th -- https://ci-builds.apache.org/job/Solr/job/Solr-TestIntegration-10.x/ I wish our CI failures would notify the hell out of us individually. Basically nobody monitors the build list unless it's like an act of charity (and I'm in the mood to be charitable at the moment).
Looks like a "wait_for" command error? Though I thought that was a bash function we defined???
Okay, looks like we time out after 30 seconds, but from the log that appears to be nowhere near enough time due to a slow machine downloading Docker images... That is how I read it. Before I bump it to 120 seconds or larger, I would love it if you thought the same:
Actually, that doesn't make sense... the wait_for on line 118 is way after we have pulled the images, so I don't think it's waiting for the Docker image to download in order to run the command...
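The project's actual `wait_for` helper isn't shown in this thread, but the general pattern under discussion, retry until success or a deadline passes, looks something like this (a hypothetical minimal sketch, where the first argument is the timeout in seconds):

```shell
#!/bin/sh
# Hypothetical wait_for: retry a command until it succeeds or the deadline passes.
wait_for() {
  deadline=$(( $(date +%s) + $1 ))
  shift
  until "$@"; do
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep 1
  done
}

wait_for 5 true  && echo "condition met"
wait_for 2 false || echo "timed out"
```

Under this shape, bumping the 30-second budget to 120 is just a change to the first argument at the call site.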
From the description of this PR:
I still think that. I doubt we'll ever write a more costly test. |
I spent a bunch of time with a GitHub Copilot Agent to develop a new BATS test to test a rolling upgrade of SolrCloud.
It ignores the current project/build since it runs Docker containers of Solr.
I think this test should not be run by default; it's a slow test. Not sure yet on the best way to segment this test and maybe other slow ones. At least another directory.
It could be improved to build the local project in Docker and use that image as the upgrade destination target.
Separately, I want to use this as a basis to check for Overseer disablement settings, but that's not present here. Maybe I'll add it in this PR.
Disclaimer: Bash is not my comfort zone, but AI churned this out. IntelliJ Junie was also super helpful.