CI Pipeline is heavily unstable #3921
Comments
The first two linked issues are caused by
|
https://drone.owncloud.com/owncloud/ocis/12313/1/2
In that log output, I don't see what the actual problem was. It "just exits" with Error 1??? |
For the node issues, can we implement a retry? |
Just ran into https://drone.cernbox.cern.ch/cs3org/reva/7398/11/6, which is the same test suite as https://drone.cernbox.cern.ch/cs3org/reva/7398/10/5 but against S3 ... the s3ng suite should never fail if the ocis suite passes ... |
https://drone.cernbox.cern.ch/cs3org/reva/7398/11/6 I guess that "something happened" (TM) to the set of reva services. Maybe the log output has clues - https://drone.cernbox.cern.ch/cs3org/reva/7398/11/4 - but there is a lot of it. For instance, I noticed:
"some service(s) generally falling over and not communicating" might be a flaky thing, and so it could happen with ocis and/or S3NG storage, and in any of the test pipelines. |
https://drone.owncloud.com/owncloud/ocis/12331/53/9
hm, just retry this step / the |
Upstream yarn servers/mirrors (wherever the drone agent goes looking for this stuff) have been giving intermittent 501 errors. I guess that we can put in some scripting to retry it. And I wonder whether a retry from the same drone agent will end up hitting the same upstream server and getting the same 501? It's all a pest - reliable upstream would be the best thing, if we can work out how to achieve that! |
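A minimal sketch of what such retry scripting could look like, assuming we wrap the flaky install step in a small helper run by the CI step; the attempt count, backoff, and the bare `yarn install` invocation are illustrative, not the actual pipeline configuration:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

// runWithRetry executes the given command and retries it on failure with a
// linear backoff, so a transient upstream error does not fail the whole step.
func runWithRetry(attempts int, delay time.Duration, name string, args ...string) error {
	var err error
	for i := 1; i <= attempts; i++ {
		cmd := exec.Command(name, args...)
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr
		if err = cmd.Run(); err == nil {
			return nil
		}
		fmt.Printf("attempt %d/%d failed: %v\n", i, attempts, err)
		time.Sleep(time.Duration(i) * delay)
	}
	return fmt.Errorf("all %d attempts failed: %w", attempts, err)
}

func main() {
	// Hypothetical wrapper around the flaky dependency download.
	if err := runWithRetry(3, 10*time.Second, "yarn", "install"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Whether a retry from the same agent hits a different mirror is outside such a wrapper's control, which is the open question above.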
https://drone.owncloud.com/owncloud/ocis/12334/55/2
And the log output gives no clue about why it exits with "Error 1" |
https://drone.owncloud.com/owncloud/ocis/12333/72/12
Hmmm, @dragonchaser thinks it might be caused by running out of file descriptors 🤔 |
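If the file descriptor hypothesis needs checking, a small diagnostic along these lines could be run inside the affected container; this is a sketch assuming Linux and `/proc`, not part of any existing tooling:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// Print the soft/hard limit for open file descriptors and how many the
// current process has open, to see how close it is to the limit.
func main() {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		fmt.Fprintln(os.Stderr, "getrlimit:", err)
		os.Exit(1)
	}
	fds, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, "readdir:", err)
		os.Exit(1)
	}
	fmt.Printf("open fds: %d, soft limit: %d, hard limit: %d\n", len(fds), lim.Cur, lim.Max)
}
```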
Here is another build where that happened: https://drone.cernbox.cern.ch/cs3org/reva/7347/11/6 |
https://drone.owncloud.com/owncloud/ocis/12206/39/3 reported by @phil-davis as a dedicated issue: #3900 |
cs3ApiTests-ocis: https://drone.owncloud.com/owncloud/ocis/12381/35/6
with this in the server log https://drone.owncloud.com/owncloud/ocis/12381/35/4
Seems to be related to events and the search service. @aduffeck any idea? A restart made the test pass: https://drone.owncloud.com/owncloud/ocis/12383/35/6. A similar failure occurred when updating web: https://drone.owncloud.com/owncloud/ocis/12380/35/5 ( |
@phil-davis can we log the request ID when a test fails? Does the test suite send a request ID with each request? |
The test suite remembers the line it is in in each scenario as it executes the scenario, and sends that in the 'X-Request-ID' header, so the ID looks something like the scenario file plus the line number. At the end of a test run we have a list of the scenarios that failed, so we know which request IDs belong to the failures. Do we try to automate this (and output the filtered log entries somewhere)? Or is it enough that I document how the request ID can be used to filter the log? |
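A rough sketch of what automating that filtering could look like, assuming the services write JSON log lines and carry the test's ID in a `request-id` field; the field name, log file, and overall shape are assumptions, not the actual ocis log format:

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// filterLogByRequestID prints every JSON log line whose request-id field
// matches one of the IDs collected from the failed scenarios.
func filterLogByRequestID(path string, ids []string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	wanted := make(map[string]bool, len(ids))
	for _, id := range ids {
		wanted[id] = true
	}

	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // some log lines are long
	for scanner.Scan() {
		line := scanner.Text()
		var entry map[string]any
		if err := json.Unmarshal([]byte(line), &entry); err != nil {
			continue // skip lines that are not JSON
		}
		if id, ok := entry["request-id"].(string); ok && wanted[id] {
			fmt.Println(line)
		}
	}
	return scanner.Err()
}

func main() {
	// Hypothetical usage: the IDs would come from the list of failed scenarios.
	if err := filterLogByRequestID("ocis.log", os.Args[1:]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```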
@dragonchaser failed in
|
@individual-it I suspect an issue with the underlying storage, not a bug on our side. |
The first one seems to be related to the issue discussed above with @dragonchaser: etag propagation seems to have an issue on copy/move. |
I've created a new issue #3962 about rerunning pipelines that fail because some dependencies could not be downloaded. |
Etags not updating is a bug. If the storage fails, the decomposedfs should retry the propagation, but implementing that would amount to adding journaling to decomposedfs. We currently rely on the filesystem to work: if an error occurs we fail, so the admin gets a log message and the user / client can retry. We could retry the propagation in process; how often should be configurable... |
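A minimal sketch of that in-process retry, with a configurable attempt count and delay; the package and function names here are hypothetical, not the actual decomposedfs API:

```go
// Package retrysketch illustrates retrying etag propagation in process.
package retrysketch

import (
	"context"
	"fmt"
	"time"
)

// PropagateFunc stands in for the actual etag/size propagation step.
type PropagateFunc func(ctx context.Context, nodeID string) error

// propagateWithRetry retries the propagation a configurable number of times
// before giving up, instead of failing on the first filesystem hiccup.
// maxAttempts and delay would come from configuration.
func propagateWithRetry(ctx context.Context, propagate PropagateFunc, nodeID string, maxAttempts int, delay time.Duration) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = propagate(ctx, nodeID); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
			// wait before the next attempt
		}
	}
	return fmt.Errorf("etag propagation for node %s failed after %d attempts: %w", nodeID, maxAttempts, err)
}
```

An in-process retry only covers transient storage errors; surviving a crash would still need the journaling mentioned above.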
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 10 days if no further activity occurs. Thank you for your contributions. |
Describe the bug
We see CI pipelines failing randomly. The reason for this is unknown; the failures are not related to flaky tests.
Maybe some infrastructure-related issue?
Please link all such problems to this ticket so we can get an overview of how often this happens.
Steps to reproduce
Run a CI pipeline
Expected behavior
The pipeline goes green, or a test fails
Actual behavior
The pipeline crashes because of some unknown problem