test/system: run tests in parallel where possible #23048
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: Luap99. The full list of commands accepted by this bot can be found here; the pull request process is described here.
I don't see this being possible. For instance, any test in …
That is what --no-parallelize-across-files is for: different files are never executed in parallel. Then we just set the BATS_NO_PARALLELIZE_WITHIN_FILE=true option in any file that cannot run in parallel, which is what I do right now. So now I just go over all the files and see where I can make them work in parallel; the potential savings are huge. If it turns out to be not worth it or too complex, we can abandon this effort and look into other solutions.
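For readers unfamiliar with the bats mechanism being discussed, here is a minimal sketch of the per-file opt-out, following the bats-core documentation (the file name is illustrative):

```bash
# 150-something.bats -- illustrative name; the tests in this file touch
# global state, so they must not run concurrently even under `bats --jobs N`
setup_file() {
    # bats-core checks this variable after setup_file() returns and then
    # runs this file's tests serially
    BATS_NO_PARALLELIZE_WITHIN_FILE=true
}
```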
Just as a baseline of where we are right now. I will push some more changes to see if there is a noticeable speed-up compared to that.
Looks better on average, I would say, although not by much; the run-to-run variance is still very high. I think I need to cut into the slower tests more to make a noticeable impact, and maybe bump the CPU cores so we can actually take advantage of the parallel runs.
Grrr, I deserve that for trying to do a quick sloppy review on my way out the door. I will be more careful today.
Current timings: the time is going down, but not as much as I would have hoped. Many slow tests cannot be run in parallel without major changes, and our VMs currently have only two cores. I will try a 4-core VM, like the int tests use, next.
Also, one selinux test seems to have flaked, which I need to figure out.
Force-pushed from 7423739 to 3b44c98 (compare)
So the one thing I am trying to understand: when running locally (12 threads) I see a huge delta from run to run, i.e. running the selinux file takes ~13s most of the time, but there are outliers where it takes 45-50s. I'd like to understand this before I continue pushing more changes into this PR.
Are you running with …?
Yes, always with …
Note I am using remote here, as I have been using it all day to reproduce the selinux remote flake I saw, but I am pretty sure I have seen this with local and with other files as well.
It is always a different test that is slow, so I am not sure what the pattern is.
#22886 enabled …
Weird. My first assumption was a lock, but that seems unlikely if only one test is hiccuping. I've skimmed the source file and see nothing obvious.
Force-pushed from 3b44c98 to 8c27e91 (compare)
Cherry-picked commits from #22831. Given that it runs in parallel here, maybe IO is a bigger bottleneck, so I'd like to try it out.
Well, the good news is we see good speed improvements.
The sad news is weird flakes...
^^^ This one is not even part of a parallel test file, so this is extremely weird, and it failed in two different runs, so it is likely very common. There is also the selinux remote failure I looked at yesterday; I found the root cause for that, but it might take a bit before I can fix it.
Force-pushed from e0c07d2 to 3542056 (compare)
Force-pushed from 3542056 to ee3fab4 (compare)
OK, finally captured the issue: something in auto-update seems to leak a container (stuck in the removing state for some reason). Auto-update runs containers in systemd units, so I guess systemd is killing the podman process at the wrong moment for whatever reason. That will be fun to debug :(
Force-pushed from 2a41ccc to 1c25285 (compare)
This is part 1 of major work to parallelize as many system tests as we can to speed up overall test time. The problem is that we can no longer perform any leak check or removal after each run, which could be an issue. However, as of commit 81c90f5 we no longer do the leak check in CI anyway. There will be tests that cannot be parallelized because they touch global state, e.g. image removal. This commit uses the bats -j option to run tests in parallel but also sets --no-parallelize-across-files to make sure we never run in parallel across files. This allows us to disable parallelism on a per-file basis via the BATS_NO_PARALLELIZE_WITHIN_FILE=true option. Right now only 001 and 005 are set up to run in parallel, and this alone gives me a 30s total-time improvement locally when running both files. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
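A minimal sketch of the invocation this commit describes; the job count and test path are illustrative, not necessarily the exact values wired into the CI scripts:

```bash
# Run files one after another (--no-parallelize-across-files), but let the
# tests inside each file run concurrently across the available jobs.
bats --jobs "$(nproc)" --no-parallelize-across-files test/system/
```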
They can run in parallel, as they only create containers with different names, so there is no conflict between them. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
I changed a bunch of tests to ensure they all have unique names and don't conflict. This allows them to be run in parallel. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
No changes needed; one container name was changed to make it more future-proof. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
No changes needed as they only inspect a single image. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
No changes needed. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
No changes needed. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Force-pushed from 21e7057 to 7736a7a (compare)
Dropped "cirrus: use fastvm for system tests" and "test/system: enable parallel selinux tests" from the series; I am pretty sure this should pass.
Well, or maybe not. I guess the free-port logic is not race-free, as we only check once, so it is causing flakes in the pasta tests. For the two healthcheck flakes I need to take a deeper look; there are sleeps in the tests to wait for healthchecks in the background, so I expect they are racy.
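To illustrate the check-once race described above (schematic; random_free_port is the suite's helper, the rest is invented for the example):

```bash
port=$(random_free_port)      # availability is verified only at this instant
# ...window: a test running in parallel can pick and bind the same port...
run_podman run -d -p "$port:80" $IMAGE top   # may now fail: address in use
```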
The container name change is not required but makes it more robust for the future. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
The container name change is not required but makes it more robust for the future. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Merge 190-run-ipcns into 195-run-namespaces so that these tests can run in parallel with the slow "podman test all namespaces" test. In order to do so, I had to assign them different container names to make sure they do not conflict. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Assign unique container names to prevent conflicts. As many of these tests sleep, this should benefit them a lot. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
The cp tests overall seem to be among the slowest of all system tests; a local run of this file takes over 4 minutes. As such, running them in parallel will help us a lot, and there is no technical reason why we cannot do that. The only issue is the reuse of the same container names across all tests, so this commit fixes all the names; yes, it is pretty bad here, so this is a very big diff. Additionally, I gave each test its own unique cX number to make debugging leaks easier: with that, I know exactly which test "leaked", as opposed to only having random names. Local test time (12 threads) is now 90s, compared to 4m before. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
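As an illustration of the naming scheme this commit describes (helper names follow the suite's conventions; the exact test bodies in the diff differ):

```bash
@test "podman cp - file from host into running container" {
    # "c5" is this test's own fixed number: if a container named c5-* leaks,
    # we know exactly which test it came from; the random suffix keeps
    # parallel invocations from colliding
    local cname=c5-$(random_string 10)
    echo hello > $PODMAN_TMPDIR/hello.txt
    run_podman run -d --name $cname $IMAGE top
    run_podman cp $PODMAN_TMPDIR/hello.txt $cname:/tmp/
    run_podman rm -f -t0 $cname
}
```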
Check external containers as well, and to improve debugging, show the full lines that include the name and image used, not just the ID. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Assign unique container names to prevent future conflicts. Not strictly required here. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Assign unique container names to prevent future conflicts. Not strictly required here. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
On a single-core system, bats errors with `--no-parallelize-across-files requires at least --jobs 2`. To fix this, make sure we always use at least two jobs. Having more jobs than CPUs is not a problem and doesn't seem to really affect the runtime. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
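A minimal sketch of the fix described in this commit, assuming the job count is derived from the CPU count (variable names are illustrative):

```bash
jobs=$(nproc)
# --no-parallelize-across-files refuses to run with fewer than two jobs,
# so clamp the value upward; extra jobs on a small machine are harmless
if [[ $jobs -lt 2 ]]; then
    jobs=2
fi
bats --jobs "$jobs" --no-parallelize-across-files test/system/
```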
Using sleeps is by definition not robust. Unfortunately, we actually want to test that the healthcheck timer fires in the background, so we have to wait; there is no way around waiting. However, there is one trick we can use: instead of sleeping and doing nothing, we can (ab)use podman events --until, which will list and wait for events until the given time. With that, we can read and then directly check events without having to sleep and run events separately. Hopefully this is enough to make the test stable. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
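A hedged illustration of the trick this commit describes; the filter values and the 10-second window are assumptions for the example, not the exact test code:

```bash
# Instead of `sleep 10` followed by a separate `podman events` call, let
# events itself do the waiting: it streams matching events and exits once
# the --until deadline passes, so waiting and reading happen in one step.
run_podman events --filter container=$cname --filter event=health_status --until 10s
# $output now holds every health_status event fired during the window
```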
Force-pushed from 7736a7a to 0904d66 (compare)
Dropped "parallel pasta tests" for now. I think we need to teach the random-port logic to assign non-conflicting ports in the parallel case, similar to what you did in e2e, but I am not sure if bats exposes the "node" number. Also pushed another commit to hopefully fix the healthcheck flake.
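Purely hypothetical sketch of the port-partitioning idea: BATS_JOB_SLOT is an invented variable standing in for the per-job "node" number that bats may not expose, which is exactly the open question above.

```bash
slot=${BATS_JOB_SLOT:-0}                       # invented; bats may not provide this
port=$((10000 + slot * 1000 + RANDOM % 1000))  # each job draws from its own range
```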
Also, one thing to consider: this makes it very hard for you to track flakes. Currently, if a test fails we call cleanup, which removes all containers, etc., so everything running in parallel will likely fail as well because of that; see for example https://api.cirrus-ci.com/v1/artifact/task/5497890087370752/html/sys-podman-fedora-39-root-host-boltdb.log.html. I am not sure if there is a way around that. We could try to defer cleanup until the end of the file, but it would make the teardown code even more complicated.
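A hedged sketch of the "defer cleanup" idea mentioned above, not what the PR implements: the destructive wipe moves from per-test teardown to a once-per-file hook, so one failure cannot knock out tests still running in parallel.

```bash
teardown_file() {
    # runs once, after every test in this file (parallel or not) has finished
    run_podman rm -f -t0 -a
}
```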
I've been reviewing all day as time allows, and I'm feeling a growing concern. This is a lot of complexity, not just for reviewing right now but for all future system-test maintenance. I see new PRs merging with hidden bombs that will cause hard-to-debug flakes: the mere addition of … You've invested enormous effort into this. I respect that, but I need more time to keep looking at this and see if I can justify it.
Yes, that is true, but keep in mind that only a small subset actually runs in parallel. Many slow tests can still be converted, but I didn't want to force all of that into one mega PR. And for now I decided to keep the slow 2-core VMs here, as there is not much benefit in a higher core count until I convert more tests. Once a large part is converted, bumping the VMs up to 4 cores, as done in e2e, will result in a nice speedup. That said, I totally agree that this will make ongoing maintenance harder: flake tracking will be more difficult, and reviewing tests that have side effects (--all, --latest, ps/ls commands, etc.) will be hard. The main reason I like this, besides the speedup, is that we have never done actual parallel testing before, and I have already found several real podman issues here. Anyhow, that doesn't mean we have to do parallel testing in the system tests.