Resolve test flakiness #864
**Lucas's crazy tests**

So far I've tested a few things:

**Running tests on a dedicated server**
The idea was to isolate any CircleCI-related issue. Eventually, we had failures due to timeout (30s).

**Running the tests one at a time**
The idea was to isolate any resource bottleneck. But eventually, we had tests failing due to timeout (30s).

**Running tests with an increased timeout (240 seconds)**
The idea was to check if the problem was related to the slow start of the nodes. Still, we had tests failing due to timeout.

**Running tests on threads instead of independent Java processes**
The idea was to check if the problem was related to how we are running the nodes as Java processes. So I'm running them using the flag

**Conclusion (pending)**
So whatever we have here doesn't seem to be related to hardware power or a timeout period that isn't long enough. I suspect it might have something to do with the way we are starting the nodes or using the resources. But so far I haven't managed to isolate a single variable that causes the tests to fail.
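The flag name is cut off in the comment above, so purely as illustration, here is a minimal sketch of how a thread-vs-process toggle like that might look. All names here (`acctests.runNodesAsThreads`, the runner classes) are hypothetical, not Besu's actual test harness:

```java
// Hypothetical sketch: pick the node-runner strategy with a system
// property so a single flag flips the whole suite between running
// nodes on threads in the test JVM and forking separate java processes.
interface NodeRunner {
  void start(String nodeName);
  void stop(String nodeName);
}

final class ThreadNodeRunner implements NodeRunner {
  @Override public void start(String nodeName) { /* spin the node up on a thread in this JVM */ }
  @Override public void stop(String nodeName)  { /* interrupt and join the node thread */ }
}

final class ProcessNodeRunner implements NodeRunner {
  @Override public void start(String nodeName) { /* fork a java process via ProcessBuilder */ }
  @Override public void stop(String nodeName)  { /* destroy the forked process */ }
}

final class NodeRunnerFactory {
  // e.g. ./gradlew acceptanceTest -Dacctests.runNodesAsThreads=true
  static NodeRunner create() {
    return Boolean.getBoolean("acctests.runNodesAsThreads")
        ? new ThreadNodeRunner()
        : new ProcessNodeRunner();
  }
}
```

The point of such a toggle is that everything else in the suite stays identical, so a change in failure rate can be attributed to process spawning alone.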
I am checking if the problem is related to EIP-1559 in #865.

**Results in pull request 865**

**Results in current master branch**
@abdelhamidbakhta - there was a RocksDB unit test failure earlier today as well - https://circleci.com/gh/hyperledger/besu/15791
@lucassaldanha @abdelhamidbakhta (cc @MadelineMurray) - you might try
If you aren't familiar with
@lucassaldanha, just saw this so it might be too late, but did you check for OOM messages in the kernel logs when you tested on your own server?
Do we need to explicitly turn off NAT? I see the tests are auto-detecting Docker NAT, but unless we are testing NAT that seems like an unneeded variable.
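As a sketch, explicitly disabling NAT for a test node might look like the following, assuming Besu's `--nat-method` CLI option (the builder class here is hypothetical, not part of the test framework):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: force NAT handling off for acceptance-test nodes
// so Docker NAT auto-detection stops being a variable in the runs.
final class TestNodeCommandLine {
  static List<String> build(final String dataPath) {
    final List<String> args = new ArrayList<>();
    args.add("--data-path=" + dataPath);
    // Besu's --nat-method option; NONE disables NAT auto-detection.
    args.add("--nat-method=NONE");
    return args;
  }
}
```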
No OOM errors.
It might be worth trying it.
#878 but we should see how it behaves around the clock.
I got debug turned on for a couple of tests - https://app.circleci.com/pipelines/github/hyperledger/besu/3215/workflows/d178944a-b08c-4e4e-8728-38ff5906c259/jobs/16231/artifacts My take is that peer discovery is failing. It looks like we give bootnodes only one chance. We can either juice the test cluster by adding bootnodes as static nodes, or add logic to always retry all bootnodes when the peer list is empty.
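A rough sketch of that second option, under the assumption of a simple peer-table interface (all names hypothetical, not Besu's actual discovery code):

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of "always retry all bootnodes when peers are empty":
// a periodic check that re-sends discovery pings to every bootnode instead
// of giving each one a single chance at startup.
final class BootnodeRetrier {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  void start(final PeerTable peers, final List<String> bootnodeEnodes) {
    scheduler.scheduleAtFixedRate(
        () -> {
          if (peers.isEmpty()) {
            // The initial UDP discovery packets may have been dropped;
            // ping every bootnode again rather than trusting the first try.
            bootnodeEnodes.forEach(peers::ping);
          }
        },
        10, 10, TimeUnit.SECONDS);
  }

  void stop() {
    scheduler.shutdownNow();
  }
}

interface PeerTable {
  boolean isEmpty();
  void ping(String enodeUrl);
}
```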
Don't forget, peer discovery runs over UDP, and dropping UDP packets arbitrarily is 100% fair game. So the fact that the Docker container is bad at keeping UDP packets alive doesn't mean it's the Docker container's fault.
From @shemnon: For the CircleCI issues, what if instead of running acceptance tests in a fleet of Docker instances we ran them on one or two bare-metal boxes? https://circleci.com/docs/2.0/executor-types/#using-machine Not a 10-minute change, I believe, so budget time accordingly if we do try this (ed
After running 24h of straight acceptance tests on 1.4.2 and master (2020-05-15), we see a huge difference in failure rate. I'm now bisecting to find potential culprits and will update when I have that data.
Regarding the dropping of UDP packets, Docker seems to consider such a thing a bug (and there are some workarounds). If there is evidence that this is (part of?) the problem, I wouldn't mind digging into it myself.
There is definitely reason to believe that's part of it, thank you for offering! Please do dig into that. @shemnon can probably answer any questions you run into, because I haven't looked into that yet.
I reported this in proddev but am repeating it here to keep the info together:

There's been a significant increase in the number and frequency of ATs failing. This looks to have started around April 22, and a couple of PRs have been made to try to fix this: