
test/system: run tests in parallel where possible #23048

Closed
wants to merge 18 commits

Conversation

Luap99
Member

@Luap99 Luap99 commented Jun 19, 2024

This is part 1 of major work to parallelize as many system tests as we
can, to speed up overall test time.

The problem is that we can no longer perform any leak check or removal
after each run, which could be an issue. However, as of commit 81c90f5
we no longer do the leak check in CI anyway.

There will be tests that cannot be parallelized as they touch global
state, e.g. image removal.

This commit uses the bats -j option to run tests in parallel, but also
sets --no-parallelize-across-files to make sure we never run tests from
different files in parallel. This allows us to disable parallelism on a
per-file basis via the BATS_NO_PARALLELIZE_WITHIN_FILE=true option.

Right now only 001 and 005 are set up to run in parallel, and this
alone gives me a 30s total-time improvement locally when running both
files.
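
For illustration, the resulting invocation looks roughly like this (job
count and path are illustrative, not the exact Makefile wording):

bats --jobs $(nproc) --no-parallelize-across-files test/system/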

Does this PR introduce a user-facing change?

None

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none labels Jun 19, 2024
Contributor

openshift-ci bot commented Jun 19, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 19, 2024
@edsantiago
Member

I don't see this being possible. For instance, any test in 120-load.bats is going to conflict with anything in build or systemd.

@Luap99
Member Author

Luap99 commented Jun 19, 2024

I don't see this being possible. For instance, any test in 120-load.bats is going to conflict with anything in build or systemd.

That is what --no-parallelize-across-files is for: different files are never executed in parallel. Then we just set the BATS_NO_PARALLELIZE_WITHIN_FILE=true option in any file that cannot run in parallel, which is what I do right now.
This is really just following your suggestion in #22698 (comment); let's see where we go from here.

So now I just go over all the files and see where I can make them work in parallel; the potential savings are huge. If it turns out not to be worth it or too complex, we can abandon this effort and look into other solutions.
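
For reference, a minimal sketch of the per-file opt-out inside an
affected .bats file (the file contents are illustrative; the mechanism
itself is bats-core's):

setup_file() {
    # any value disables within-file parallelization for this file only
    BATS_NO_PARALLELIZE_WITHIN_FILE=true
}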

@Luap99
Member Author

Luap99 commented Jun 19, 2024

type | distro            | user     | DB     | local | remote | container
sys  | rawhide           | root     |        | 29:26 | 20:56  |
sys  | rawhide           | rootless |        | 32:39 |        |
sys  | fedora-40-aarch64 | root     |        | 25:45 | 17:31  |
sys  | fedora-40         | root     |        | 30:20 | 19:31  |
sys  | fedora-40         | rootless |        | 28:34 | 20:55  |
sys  | fedora-39         | root     | boltdb | 31:30 | 21:15  |
sys  | fedora-39         | rootless | boltdb | 33:24 |        |
sys  | debian-13         | root     |        | 30:50 | 22:23  |
sys  | debian-13         | rootless |        | 34:11 |        |

Just as a baseline of where we are right now. I will push some more changes to see if there is a noticeable speed-up compared to that.

@Luap99
Member Author

Luap99 commented Jun 19, 2024

type | distro            | user     | DB     | local | remote | container
sys  | rawhide           | root     |        | 29:19 | 18:37  |
sys  | rawhide           | rootless |        | 31:04 |        |
sys  | fedora-40-aarch64 | root     |        | 22:39 | 16:40  |
sys  | fedora-40         | root     |        | 28:58 | 20:04  |
sys  | fedora-40         | rootless |        | 28:59 | 19:04  |
sys  | fedora-39         | root     | boltdb | 31:06 | 22:30  |
sys  | fedora-39         | rootless | boltdb | 29:56 |        |
sys  | debian-13         | root     |        | 29:07 | 21:13  |
sys  | debian-13         | rootless |        | 34:25 |        |

Looks better on average, I would say, although not by much. The run-to-run variance is still very high; I think I need to cut into the slower tests more to make a noticeable impact, and maybe bump up the CPU cores so we can actually take advantage of the parallel runs.

@edsantiago
Member

That is what --no-parallelize-across-files is for

Grrr, I deserve that for trying to do a quick sloppy review on my way out the door. I will be more careful today.

@Luap99
Member Author

Luap99 commented Jun 20, 2024

type | distro            | user     | DB     | local  | remote | container
sys  | rawhide           | root     |        | 28:20  | 17:40  |
sys  | rawhide           | rootless |        | 27:47  |        |
sys  | fedora-40-aarch64 | root     |        | 24:37  | 15:53  |
sys  | fedora-40         | root     |        | 27:54  | 17:53  |
sys  | fedora-40         | rootless |        | 29:02  | 16:28  |
sys  | fedora-39         | root     | boltdb | 30:52  | !19:47 |
sys  | fedora-39         | rootless | boltdb | 28:38  |        |
sys  | debian-13         | root     |        | 28:31  | 20:24  |
sys  | debian-13         | rootless |        | 29:59  |        |

Current timings: the time is going down, but not as much as I would have hoped. Many slow tests cannot be run in parallel without major changes, and our VMs currently only have two cores. I will try using a 4-core VM like the int tests next.

@Luap99
Member Author

Luap99 commented Jun 20, 2024

Also, one selinux test seems to have flaked, which I need to figure out.

@Luap99 Luap99 force-pushed the fast-system-test-5 branch from 7423739 to 3b44c98 on June 20, 2024 12:11
@Luap99
Member Author

Luap99 commented Jun 20, 2024

So the one thing I am trying to understand is that when running locally (12 threads) I see a huge delta from run to run, i.e. running the selinux file takes around ~13s most of the time, but there are outliers where it takes 45-50s.

I'd like to understand this before I continue pushing more changes into this PR.

@edsantiago
Member

Are you running with -T? Can you post sample results?

@Luap99
Member Author

Luap99 commented Jun 20, 2024

Yes always with -T (I wonder if we should have this by default in hack/bats?)

$ hack/bats -T --rootless --remote  410
--------------------------------------------------
$ bats  test/system/410-selinux.bats
410-selinux.bats
 ✓ [410] podman selinux: confined container [1680]
 ✓ [410] podman selinux: container with label=disable [2259]
 ✓ [410] podman selinux: privileged container [2432]
 ✓ [410] podman selinux: privileged --userns=host container [2120]
 ✓ [410] podman selinux: --ipc=host container [2808]
 ✓ [410] podman selinux: init container [2241]
 ✓ [410] podman selinux: init container with --security-opt type [2794]
 ✓ [410] podman selinux: init container with --security-opt level&type [3211]
 ✓ [410] podman selinux: init container with --security-opt level [2869]
 ✓ [410] podman selinux: pid=host [3577]
 ✓ [410] podman selinux: container with overridden range [2342]
 - [410] podman selinux: inspect kvm labels (skipped: runtime flag is not passed over remote) [1076]
 ✓ [410] podman selinux: inspect multiple labels [2645]
 ✓ [410] podman selinux: shared context in (some) namespaces [6962]
 ✓ [410] podman selinux: containers in pods share full context [4999]
 ✓ [410] podman selinux: containers in --no-infra pods do not share context [3368]
 ✓ [410] podman with nonexistent labels [2159]
 ✓ [410] podman selinux: check relabel [5928]
 ✓ [410] podman selinux nested [3569]
 ✓ [410] podman EnableLabeledUsers [3553]
 - [410] podman selinux: check unsupported relabel (skipped: not applicable under rootless podman) [959]

21 tests, 0 failures, 2 skipped in 12 seconds

$ hack/bats -T --rootless --remote  410
--------------------------------------------------
$ bats  test/system/410-selinux.bats
410-selinux.bats
 ✓ [410] podman selinux: confined container [1791]
 ✓ [410] podman selinux: container with label=disable [1867]
 ✓ [410] podman selinux: privileged container [40455]
 ✓ [410] podman selinux: privileged --userns=host container [2177]
 ✓ [410] podman selinux: --ipc=host container [2532]
 ✓ [410] podman selinux: init container [2242]
 ✓ [410] podman selinux: init container with --security-opt type [1877]
 ✓ [410] podman selinux: init container with --security-opt level&type [2923]
 ✓ [410] podman selinux: init container with --security-opt level [2824]
 ✓ [410] podman selinux: pid=host [3192]
 ✓ [410] podman selinux: container with overridden range [2505]
 - [410] podman selinux: inspect kvm labels (skipped: runtime flag is not passed over remote) [1205]
 ✓ [410] podman selinux: inspect multiple labels [3443]
 ✓ [410] podman selinux: shared context in (some) namespaces [6226]
 ✓ [410] podman selinux: containers in pods share full context [5480]
 ✓ [410] podman selinux: containers in --no-infra pods do not share context [4180]
 ✓ [410] podman with nonexistent labels [2511]
 ✓ [410] podman selinux: check relabel [6003]
 ✓ [410] podman selinux nested [3538]
 ✓ [410] podman EnableLabeledUsers [3821]
 - [410] podman selinux: check unsupported relabel (skipped: not applicable under rootless podman) [723]

21 tests, 0 failures, 2 skipped in 44 seconds

Note I am using remote here as I have been using it all day to reproduce a selinux remote flake I saw, but I am pretty sure I have seen this with local and other files as well.

@Luap99
Member Author

Luap99 commented Jun 20, 2024

✓ [410] podman selinux: privileged container [40455]

It is always a different test that is slow, so I am not sure what the pattern is.

@edsantiago
Member

#22886 enabled -T in Makefile, but I didn't include hack/bats because I couldn't (quickly) think of a way to add a disable-T option. That no longer seems as important now. If you feel like enabling -T in hack/bats here, I'm fine with that.

It is always a different test that is slow, so I am not sure what the pattern is.

Weird. My first assumption was a lock, but that seems unlikely if only one test is hiccuping. I've skimmed the source file and see nothing obvious.

@Luap99 Luap99 force-pushed the fast-system-test-5 branch from 3b44c98 to 8c27e91 on June 21, 2024 12:41
@Luap99
Member Author

Luap99 commented Jun 21, 2024

Cherry-picked commits from #22831. Given that tests run in parallel here, maybe IO is a bigger bottleneck, so I'd like to try it out.

@Luap99
Member Author

Luap99 commented Jun 21, 2024

Well, the good news is that we see good speed improvements:

type | distro            | user     | DB     | local  | remote | container
sys  | rawhide           | root     |        | 22:48  | !15:18 |
sys  | rawhide           | rootless |        | 24:39  |        |
sys  | fedora-40-aarch64 | root     |        | !20:52 | 13:23  |
sys  | fedora-40         | root     |        | !25:23 | 14:54  |
sys  | fedora-40         | rootless |        | 23:35  | 13:46  |
sys  | fedora-39         | root     | boltdb | 25:11  | 16:00  |
sys  | fedora-39         | rootless | boltdb | 25:11  |        |
sys  | debian-13         | root     |        | 26:43  | 20:04  |
sys  | debian-13         | rootless |        | 27:02  |        |

The sad news is weird flakes...

<+-737873ns> # # podman run --cgroups=disabled --cgroupns=host --rm quay.io/libpod/testimage:20240123 cat /proc/self/cgroup
<+192ms> # time="2024-06-21T08:22:55-05:00" level=error msg="invalid internal status, try resetting the pause process with \"/var/tmp/go/src/github.com/containers/podman/bin/podman system migrate\": cannot read \"/run/containers/storage/overlay-containers/ff81f51b209a791e480cd6668e4e9494cb98d751356359a427a23efd80ba0d22/userdata/conmon.pid\": EOF"
<+008ms> # [ rc=1 (** EXPECTED 0 **) ]
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: exit code is 1; expected 0
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
<+202ms> # # podman system df
<+060ms> # time="2024-06-21T08:22:07-05:00" level=error msg="failed to get root file system size of container 5c2f1c54772b6e0a11447fa19722d2de9a9563a0f6f1fa6464f2395d95d2d4e9: container not known"
         # time="2024-06-21T08:22:07-05:00" level=error msg="failed to get read/write size of container 5c2f1c54772b6e0a11447fa19722d2de9a9563a0f6f1fa6464f2395d95d2d4e9: container not known"
         # TYPE           TOTAL       ACTIVE      SIZE        RECLAIMABLE
         # Images         1           0           11.76MB     11.76MB (100%)
         # Containers     1           0           0B          0B (0%)
         # Local Volumes  0           0           0B          0B (0%)
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: Command succeeded, but issued unexpected warnings
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

^^^ This one is not even part of a parallel test file, so this is extremely weird, and it failed in two different runs so it is likely very common.

There is also the selinux remote failure thing I looked at yesterday. I found the root cause for that but it might take a bit before I can fix it.

@Luap99 Luap99 force-pushed the fast-system-test-5 branch 3 times, most recently from e0c07d2 to 3542056 on June 24, 2024 12:01
@Luap99 Luap99 force-pushed the fast-system-test-5 branch from 3542056 to ee3fab4 on July 1, 2024 10:11
@Luap99
Member Author

Luap99 commented Jul 1, 2024

<+     > # # podman rmi -f quay.io/libpod/systemd-image:20240124
<+115ms> # Untagged: quay.io/libpod/systemd-image:20240124
         # Deleted: 77def021e71fdab2c91a22b49cba046d8b1e62842583d2509305c70bec48e36f
         #
<+159ms> # # podman ps -a --external
<+043ms> # CONTAINER ID  IMAGE                         COMMAND     CREATED             STATUS      PORTS       NAMES
         # 65fb253e2a2f  quay.io/libpod/alpine:latest  top -d 120  About a minute ago  Removing                c__B1jcHj6XA0
         #
<+009ms> # # podman system df
<+043ms> # time="2024-07-01T10:41:39Z" level=error msg="failed to get root file system size of container 65fb253e2a2fc055c2eaec94241c9a60f02a2381ee1b16783ac8768c0405a47f: container not known"
         # time="2024-07-01T10:41:39Z" level=error msg="failed to get read/write size of container 65fb253e2a2fc055c2eaec94241c9a60f02a2381ee1b16783ac8768c0405a47f: container not known"
         # TYPE           TOTAL       ACTIVE      SIZE        RECLAIMABLE
         # Images         1           0           13.14MB     13.14MB (100%)
         # Containers     1           0           0B          0B (0%)
         # Local Volumes  0           0           0B          0B (0%)
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: Command succeeded, but issued unexpected warnings
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
         # # [teardown]

Ok, finally captured the issue: something in auto-update seems to leak a container (stuck in the Removing state for some reason). Auto-update runs the containers in systemd units, so I guess systemd is killing the podman process at the wrong moment for whatever reason; that will be fun to debug :(

@Luap99 Luap99 force-pushed the fast-system-test-5 branch 4 times, most recently from 2a41ccc to 1c25285 on July 4, 2024 13:28
Luap99 added 7 commits July 10, 2024 14:12
This is part 1 of major work to parallelize as many system tests as we
can, to speed up overall test time.

The problem is that we can no longer perform any leak check or removal
after each run, which could be an issue. However, as of commit 81c90f5
we no longer do the leak check in CI anyway.

There will be tests that cannot be parallelized as they touch global
state, e.g. image removal.

This commit uses the bats -j option to run tests in parallel, but also
sets --no-parallelize-across-files to make sure we never run tests from
different files in parallel. This allows us to disable parallelism on a
per-file basis via the BATS_NO_PARALLELIZE_WITHIN_FILE=true option.

Right now only 001 and 005 are set up to run in parallel, and this
alone gives me a 30s total-time improvement locally when running both
files.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
They can run in parallel as they only create containers with different
names so there is no conflict between them.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
I changed a bunch of tests to ensure they all have unique names and
don't conflict. This allows them to be run in parallel.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
No changes needed; one container name was changed to make it more
future-proof.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
No changes needed as they only inspect a single image.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
No changes needed.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
No changes needed.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
@Luap99 Luap99 force-pushed the fast-system-test-5 branch from 21e7057 to 7736a7a on July 10, 2024 12:13
@Luap99
Member Author

Luap99 commented Jul 10, 2024

Dropped "cirrus: use fastvm for system tests" and "test/system: enable parallel selinux tests" from the series, I am pretty sure this should pass

@Luap99 Luap99 marked this pull request as ready for review July 10, 2024 12:18
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 10, 2024
@Luap99
Member Author

Luap99 commented Jul 10, 2024

Well, or maybe not. I guess the free-port logic is not race-free, as we only check once, so it is causing flakes in the pasta tests.

For the two healthcheck flakes I need to take a deeper look; there are sleeps in the test to wait for healthchecks in the background, so I expect it is racy.

Luap99 added 10 commits July 10, 2024 16:20
The container name change is not required but makes it more robust for
the future.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
The container name change is not required but makes it more robust for
the future.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Merge 190-run-ipcns into the 195-run-namespaces so that they can run in
parallel with the slow "podman test all namespaces" test.
In order to do so I had to assign them different container names to make
sure they do not conflict.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Assign unique container names to prevent conflicts. As many of them do
sleeps, this should benefit them a lot.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
The cp tests overall seem to be among the slowest of all system tests.
A local run of this file takes over 4 minutes. As such, running them in
parallel will help us a lot, and there is no technical reason why we
cannot do that. The only issue is the reuse of the same container names
across all tests.

Thus this commit fixes all names; yes, it is pretty bad here, so this is
a very big diff. Additionally, I gave each test its own unique cX number
to make debugging leaks easier; with that I know exactly which tests
"leaked", compared to only random names.

Local test time (12 threads) is now 90s compared to 4m before.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
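To illustrate the naming scheme, a hypothetical test body in the style
of the suite (run_podman, random_string and $IMAGE are the suite's
existing helpers; the test itself and the c5 number are made up):

@test "podman cp file into container" {
    # the c5- prefix identifies the test; the random suffix avoids
    # collisions between parallel runs
    local cname="c5-$(random_string)"
    run_podman run -d --name $cname $IMAGE top
    run_podman cp /etc/hostname $cname:/tmp/
    run_podman rm -f -t0 $cname
}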
Check external containers as well, and to improve debugging show the
full lines that include the name and image used, not just the ID.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Assign unique container names to prevent future conflicts. Not strictly
required here.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Assign unique container names to prevent future conflicts. Not strictly
required here.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
On a single-core system bats errors with
`--no-parallelize-across-files requires at least --jobs 2`

To fix this, make sure we always use at least two jobs. Having more
jobs than CPUs is not a problem and doesn't seem to really affect the
runtime.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
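A minimal sketch of that guard (variable name and path are
illustrative, not the exact CI code):

# bats refuses --no-parallelize-across-files unless --jobs >= 2
jobs=$(nproc)
if [[ $jobs -lt 2 ]]; then
    jobs=2
fi
bats --jobs $jobs --no-parallelize-across-files test/system/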
Using sleeps is by definition not robust. Unfortunately, we want to
actually test that the healthcheck timer fires in the background, so
there is no way around waiting. However, there is one trick we can use:
instead of sleeping and doing nothing, we can (ab)use podman events
with --until, which will list and wait for events until the given time.
With that we can read and then directly check events without having to
do both sleep and events separately.

Hopefully this is enough to make the test stable.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
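Roughly, the trick looks like this (the filters, timestamp handling and
assertion are illustrative, not the test's exact code; $ctrname stands
for the container under test):

# wait for the healthcheck event instead of a blind sleep: a --until
# in the future makes podman events collect events until that time
deadline=$(date --iso-8601=seconds -d "+10 seconds")
run_podman events --filter container=$ctrname \
    --filter event=health_status --until "$deadline"
assert "$output" =~ "health_status" "healthcheck fired before deadline"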
@Luap99 Luap99 force-pushed the fast-system-test-5 branch from 7736a7a to 0904d66 on July 10, 2024 14:21
@Luap99
Member Author

Luap99 commented Jul 10, 2024

dropped "parallel pasta tests" for now. I think we need to teach the random port logic to assign no conflicting ports in parallel case similar to what you did in e2e but I am not sure if bats exposes the "node" number

Also pushed another commit to hopefully fix the healthcheck flake.

@Luap99
Member Author

Luap99 commented Jul 10, 2024

Also, one thing to consider: this makes it very hard for you to track flakes. Currently, if a test fails we call cleanup, which removes all containers, etc., so everything running in parallel will likely fail as well because of that; e.g. look at https://api.cirrus-ci.com/v1/artifact/task/5497890087370752/html/sys-podman-fedora-39-root-host-boltdb.log.html

I am not sure if there is a way around that; we could try to defer cleanup until the end of the file, but that would make the teardown code even more complicated.
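
One possible shape of deferred cleanup, purely illustrative and not
something this PR implements:

# per-test teardown only collects diagnostics and leaves shared state
# alone, so parallel tests are not torn down mid-run when one fails
teardown() {
    run_podman ps -a --external
}

# destructive cleanup happens once, after all tests in this file finished
teardown_file() {
    run_podman rm -f -a -t0
    run_podman pod rm -f -a
}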

@edsantiago
Member

I've been reviewing all day as time allows and I'm feeling a growing concern. This is a lot of complexity, not just for reviewing right now but for all future system-test maintenance. I see new PRs merging with hidden bombs that will cause hard-to-debug flakes: the mere addition of podman images to any parallel test will very likely blow up one day. The performance gain, as of this writing, seems to be 30-60s.

You've invested enormous effort into this. I respect that but I need more time to keep looking at this and see if I can justify it.

@Luap99
Member Author

Luap99 commented Jul 11, 2024

The performance gain, as of this writing, seems to be 30-60s.

Yes, that is true, but keep in mind that only a small subset is actually run in parallel. Many slow tests can still be converted, but I didn't want to force that all into one mega PR. And right now I decided to keep the slow 2-core VMs here as there is not much benefit in a higher core count until I convert more tests. Once we have a large part converted, bumping the VMs to 4 cores like done in e2e will result in a nice speedup.

That said, I totally agree that this will make ongoing maintenance harder: flake tracking will be more difficult, and reviewing tests that have side effects (--all, --latest, ps/ls commands, etc.) will be hard.

The main reason I like this, besides the speedup, is that we have never done actual parallel testing, and I have found several real podman issues here so far. Anyhow, that doesn't mean we have to do parallel testing in system tests.

@Luap99 Luap99 closed this Aug 13, 2024
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Nov 12, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Nov 12, 2024