Disable test retries #3303

Open
apostasie opened this issue Aug 13, 2024 · 0 comments

What is the problem you're trying to solve

Background:

Flaky tests, as seen in nerdctl over the past few months, stem from the following causes:

  1. CI itself goes belly-up and breaks or cancels a build
  2. a third-party service (Docker Hub, for example) is temporarily belly-up (or has temporarily blacklisted us), and tests fail because they cannot retrieve a resource from said service
  3. tests are ill-written and manipulate the same fixtures concurrently - the app itself works fine, but a test fails because it expects a certain file to be exclusively its own while another test is actually manipulating it at the same time
  4. app code is racy/buggy, or otherwise exhibits random behavior in certain circumstances, most likely when concurrency is involved

About 1., which happens rarely, there is obviously nothing we can do (except complain to the provider).

About 2., which happens occasionally, we can - and should - certainly be more defensive: curl has neat retry options. For the images we depend on, we could/should have an early provisioning step that at least fails fast when it fails, and that all tests depend on (or even a local registry cache to insulate us).
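
A rough sketch of what that provisioning step could look like - assuming a Go TestMain that shells out to the nerdctl binary; the image list, retry count, and backoff are purely illustrative, not our actual test setup:

```go
// Hypothetical sketch: provision every image the suite depends on up front,
// retrying transient registry failures, and fail fast so individual tests
// never have to hit a third-party service themselves.
package integration

import (
	"fmt"
	"os"
	"os/exec"
	"testing"
	"time"
)

// requiredImages is illustrative; the real list would be whatever the suite pulls.
var requiredImages = []string{
	"docker.io/library/alpine:latest",
}

func pullWithRetry(image string, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = exec.Command("nerdctl", "pull", image).Run(); err == nil {
			return nil
		}
		time.Sleep(time.Duration(i+1) * 5 * time.Second) // naive backoff between attempts
	}
	return fmt.Errorf("pulling %s failed after %d attempts: %w", image, attempts, err)
}

func TestMain(m *testing.M) {
	for _, img := range requiredImages {
		if err := pullWithRetry(img, 3); err != nil {
			fmt.Fprintln(os.Stderr, err) // fail fast, before any test runs
			os.Exit(1)
		}
	}
	os.Exit(m.Run())
}
```

Combined with a local registry cache, something along these lines would keep third-party flakiness out of the tests themselves.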

About 3., that is the shameful kind - there is absolutely no reason to tolerate these. They should be fixed, period. Tests must be "isolated". In the rare cases where they cannot be, we should mark them as such, make sure they are not parallelized, and give them clear setup/teardown procedures.
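
For illustration, a minimal sketch of what "isolated" means in practice - the binary invocation and image are hypothetical, but the point is that each test owns a uniquely named fixture, tears it down itself via t.Cleanup, and only opts into t.Parallel() because it shares nothing:

```go
package integration

import (
	"fmt"
	"os"
	"os/exec"
	"testing"
)

func TestRunIsolated(t *testing.T) {
	t.Parallel() // safe only because the fixture below is unique to this test

	// Fixture name derived from the test name and pid: no other test can collide with it.
	name := fmt.Sprintf("%s-%d", t.Name(), os.Getpid())
	t.Cleanup(func() {
		// Teardown always runs, even when the test fails.
		_ = exec.Command("nerdctl", "rm", "-f", name).Run()
	})

	if out, err := exec.Command("nerdctl", "run", "-d", "--name", name, "alpine", "sleep", "10").CombinedOutput(); err != nil {
		t.Fatalf("run failed: %v\n%s", err, out)
	}
}
```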

About 4., which accounts for the vast majority of transient failures on the CI: these are obviously the hardest to diagnose (#3228 is a heinous example which took me weeks of staring at before I figured it out). They are also the only ones that truly matter to the quality of the app, because they expose real-life problems that must be fixed.

Since nerdctl introduced retries for tests (at two different levels), most or all of these issues are just... hidden away.

The consequence is that a test can fail multiple times, succeed only once, and still be considered "green".

While this approach does make life better for casual contributors hit by category 1. (broken CI) or category 2. (hosed third party), it objectively hurts frequent contributors by making builds longer.
It also lowers our coding standards for tests by allowing category 3. (broken tests) to creep in.
And most importantly, it actively degrades the quality of the product: we are more likely to accept or ignore code that fails 3 times out of 4, just because our rig retries until it passes, giving us a false sense of confidence that "things work"…
Furthermore, since all this does is lower quality expectations, quality keeps degrading until it catches up with us and we are brought back to the very situation retries were supposed to remedy - only in a much worse version.

I absolutely appreciate how painful it is to "pay the bills for past sins", and by turning off retries we might have some tough weeks ahead - but then, the question really is: do we want an always-green CI, or do we want a robust product?

A lot of work has recently gone into fixing flaky tests and deep-seated concurrency issues, so I do not think the situation is that dramatic, but there are still deep problems:

  • TestIPFSComposeUp (and most things touching IPFS) is clearly broken, and has been forever (flaky test: Error establishing a database connection in CI #2690) - maybe the IPFS implementation is just wrong, or the compose implementation - or maybe it is just the test
  • TestRunCustomRootfs was "fixed" by me by basically hiding the problem away (a side-effect of isolating tests…) - Workaround flaky save #3179. Clearly the underlying issue is still there, and my hunch is that it is systemic and takes different forms: we regularly have issues about save or commit failing because the image has not been expanded and some layers are not available - these might be related
  • logs are still problematic overall - test: increase logs max-size #3259 has hidden some problems but clearly not fixed the underlying issue (see the TestTailFollowRotateLogs history), and there are still a bunch of other regularly failing tests (TestLogsWithForegroundContainers)
  • we generally assume that because a certain "list" method returned some results, those results will still magically be available later for further inspection, and we hard-fail when they are not (see Successive calls to client.Containers(ctx) then c.Labels may fail with ErrNotFound (possibly other methods as well) #3167 for example) - and this pattern is creeping all over the codebase
  • we probably still have some issues with anything that touches the filesystem to implement a form of "store" - while Volumes should be fixed now, the NameStore is clearly dysfunctional (nerdctl sometimes fail to check if a container name already exists and ends-up creating duplicates #2992) - also because it is used in ocihooks, whose design is problematic to say the least…
  • the two problems above, compounded, make "networks" unsafe to manipulate concurrently (networks related code is racy #3086 and many others) - besides better abstractions, we need a much more rigorous approach to locking: locking to create a volume is fine, as long as you do not expect the volume to still be available after you release the lock - most of the time the lock should cover not a single atomic operation but a complex, multi-step operation (see the sketch after this list)
  • TestCopyToContainer has been failing for at least two years as well (Flaky: TestCopyToContainer #1257)
  • and others…
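
To illustrate the locking point above: here is a rough sketch (not our actual store code - the flock mechanism, paths, and function names are assumptions) of a lock that covers the whole multi-step operation, so no other process can observe or race the intermediate state:

```go
package store

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// withLock runs fn while holding an exclusive flock on lockPath.
func withLock(lockPath string, fn func() error) error {
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}
	defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
	return fn()
}

// acquireName reserves a container name: the existence check and the write
// happen under one lock, as a single multi-step operation.
func acquireName(nameStoreDir, name, id string) error {
	return withLock(filepath.Join(nameStoreDir, "lock"), func() error {
		p := filepath.Join(nameStoreDir, name)
		if _, err := os.Stat(p); err == nil {
			return fmt.Errorf("name %q is already in use", name)
		}
		return os.WriteFile(p, []byte(id), 0o600)
	})
}
```

The existence check and the write only mean something as one unit; releasing the lock between them is exactly what lets duplicate names slip through.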

Describe the solution you'd like

I think we should disable retries ASAP: merge #3189
If it does not hurt, nobody is incentivized to fix things.

We should also stop restarting builds.
Retries are just the automated version of the same idea.

I would love to hear (differing) opinions from the good folks here.

Additional context

No response
