[DNM] Debug freeze on CentOS 7 CI #2939

kolyshkin · 2021-05-05T18:57:07Z

GHA CI almost always fails on CentOS 7 (#2907):

=== RUN   TestFreeze
    utils_test.go:85: exec_test.go:539: unexpected error: unable to freeze
        
--- FAIL: TestFreeze (0.71s)

Trying to find out what to do about it.

This is complicated because the kind of macOS host GHA allocates for the test is a lottery. In most cases it is fine, but sometimes it is slow and buggy.

kolyshkin · 2021-05-05T19:52:59Z

--- PASS: TestAdditionalGroups (0.30s)
=== RUN   TestFreeze
    utils_test.go:85: exec_test.go:539: unexpected error: unable to freeze (1000 retries, 20 thaws, last state: FREEZING)
        
--- FAIL: TestFreeze (0.63s)

Not sure what to do here. Add more iterations? Increase the timeout?

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

1. These tests can't be run in parallel since they check a global variable (mbaScEnabled).

2. findIntelRdtMountpointDir() relies on mbaScEnabled being initially set to its default value (false), and thus the test fails if run more than once:

> go test -count 2
> ...
> intelrdt_test.go:243: expected mbaScEnabled=false, got true
> --- FAIL: TestFindIntelRdtMountpointDir/Valid_mountinfo_with_MBA_Software_Controller_disabled (0.00s)

Fixes: 2c70d23

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
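To illustrate the second point, here is a minimal sketch of why a set-only global breaks repeated runs, and how resetting it before each case avoids that. Only mbaScEnabled comes from the commit message; parseMountOpts and runCase are hypothetical stand-ins (not runc code), and this is not necessarily how the actual fix was done.

```go
package main

import (
	"fmt"
	"strings"
)

// mbaScEnabled mirrors the package-level flag named in the commit message.
var mbaScEnabled bool

// parseMountOpts is a hypothetical stand-in for the code under test.
// It only ever sets the flag and never clears it, so its result
// depends on the value the flag had before the call.
func parseMountOpts(opts string) {
	for _, o := range strings.Split(opts, ",") {
		if o == "mba_MBps" {
			mbaScEnabled = true
		}
	}
}

// runCase resets the global to its default before exercising the code,
// so every case (and every repeated run) starts from the same state.
func runCase(opts string) bool {
	mbaScEnabled = false
	parseMountOpts(opts)
	return mbaScEnabled
}

func main() {
	// Two rounds, similar in spirit to `go test -count 2`.
	for round := 0; round < 2; round++ {
		fmt.Println(runCase("rw,mba_MBps"), runCase("rw"))
	}
}
```

Without the reset in runCase, the second case would observe the value left behind by the first one, which is the same class of failure as in the output quoted above. Because the flag is shared, such cases also cannot safely run in parallel.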

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

500x each test (with and without systemd). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

I hate to keep adding those kludges, but lately TestFreeze (and TestSystemdFreeze) from libcontainer/integration fails a lot. The failure comes and goes, and is probably caused by a slow host allocated for the test, and a slow VM on top of it.

To remediate, add a small sleep on every 25th iteration in between asking the kernel to freeze and checking its status. In the worst case scenario (failure to freeze) this adds 0.4 ms to the duration of the call (nothing compared to that sleep after the temporary thaw).

It is hard to measure how this affects CI, but (with added debug prints) on a histogram of the number of retries I saw peaks at and after numbers 25, 50, 75 etc., meaning this works.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

kolyshkin · 2021-05-06T18:02:56Z

OK, the conclusion is that adding an occasional short delay between writing "frozen" and reading the status back helps in this case (a very slow system).

kolyshkin · 2021-05-06T18:03:22Z

The fix is #2941

kolyshkin force-pushed the debug-freeze branch 2 times, most recently from 8683ae4 to a59d45a Compare May 5, 2021 22:22

kolyshkin added 3 commits May 5, 2021 15:34

ci/gha: only leave CentOS 7

b0af842

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

freezer: log success with Info level

6e07505

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

kolyshkin force-pushed the debug-freeze branch 5 times, most recently from ca1340e to 92c8cb1 Compare May 6, 2021 01:05

kolyshkin added 2 commits May 6, 2021 09:50

localunittest: only run Freeze 1000 times

8b29a73

500x each test (with and without systemd). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

kolyshkin force-pushed the debug-freeze branch from 92c8cb1 to 3573c5c Compare May 6, 2021 16:50

kolyshkin closed this May 6, 2021
