
CI Pipeline is heavily unstable #3921

Closed
kobergj opened this issue Jun 3, 2022 · 26 comments
Labels
QA Status:Completed Type:CI Related to our Continouus Integration Solution Type:Orga

Comments

@kobergj (Collaborator) commented Jun 3, 2022

Describe the bug

We see CI pipelines failing randomly. The reason for this is unknown; the failures are not related to flaky tests.
Maybe it is some infrastructure-related issue?

Please link all such problems to this ticket so we can get an overview of how often this happens.

Steps to reproduce

Run a CI pipeline

Expected behavior

The pipeline goes green, or a test fails.

Actual behavior

The pipeline crashes because of some unknown problem

@kobergj (Collaborator, Author) commented Jun 3, 2022

@kobergj (Collaborator, Author) commented Jun 3, 2022

@C0rby (Contributor) commented Jun 3, 2022

@butonic (Member) commented Jun 3, 2022

The first two linked issues are caused by make ci-node-check-licenses:

% check-license extensions/idp:
make[1]: *** [Makefile:61: node_modules] Error 1
make: *** [Makefile:204: ci-node-check-licenses] Error 1

@phil-davis (Contributor)

https://drone.owncloud.com/owncloud/ocis/12313/1/2

+ make ci-node-generate
make[1]: Nothing to be done for 'ci-node-generate'.
make[1]: Nothing to be done for 'ci-node-generate'.
make[1]: Nothing to be done for 'ci-node-generate'.
make[1]: Nothing to be done for 'ci-node-generate'.
make[1]: Nothing to be done for 'ci-node-generate'.
make[1]: Nothing to be done for 'ci-node-generate'.
make[1]: Nothing to be done for 'ci-node-generate'.
make[1]: Nothing to be done for 'ci-node-generate'.
make[1]: Nothing to be done for 'ci-node-generate'.
dist/
dist/explorer.js.map
dist/explorer_v1.7.10.js
dist/explorer.js
index.html
node_modules/hellojs/dist/hello.all.js
src/custom.css
node_modules/core-js/client/shim.min.js
node_modules/zone.js/dist/zone.js
node_modules/systemjs/dist/system.src.js
node_modules/moment/min/moment-with-locales.min.js
make[1]: Nothing to be done for 'ci-node-generate'.
make[1]: Nothing to be done for 'ci-node-generate'.
make[1]: *** [Makefile:61: node_modules] Error 1
make: *** [Makefile:158: ci-node-generate] Error 1

In that log output, I don't see what the actual problem was. It "just exits" with Error 1 ???
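
One way to at least see what the failing recipe was doing would be to re-run the target with make's debug output and shell tracing turned on - a rough sketch only, assuming the target names from the log above (the actual recipes live in the Makefile and are not shown here):

```sh
# Untested sketch: surface more detail from a make target that only reports "Error 1".
make --debug=b node_modules                 # print which targets make decides to rebuild and why
make SHELL='sh -x' ci-node-check-licenses   # echo every recipe command as it runs
```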

@individual-it (Member)

For the node issues, can we implement a retry?

@butonic (Member) commented Jun 7, 2022

Just ran into https://drone.cernbox.cern.ch/cs3org/reva/7398/11/6, which is the same test suite as https://drone.cernbox.cern.ch/cs3org/reva/7398/10/5 but run against S3 ... the s3ng suite should never fail if the ocis suite passes ...

@phil-davis (Contributor)

https://drone.cernbox.cern.ch/cs3org/reva/7398/11/6
Error: PUT request to datagateway failed
Error: expected status code 200 for the file upload, but got 500

I guess that "something happened" (TM) to the set of reva services. Maybe the log output has clues - https://drone.cernbox.cern.ch/cs3org/reva/7398/11/4 - but there is a lot of it. For instance, I noticed:

2022-06-07 10:13:42.209 ERR ../../../pkg/rhttp/datatx/manager/simple/simple.go:107 > error uploading file error="failed to upload file to blostore: could not store object 'ddc2004c-0977-11eb-9d3f-a793888cd0f8/f6/da/24/42/-c306-4370-8925-0570124774e3' into bucket 'test': The specified bucket does not exist." datatx=simple pid=12 pkg=rhttp traceid=a05d2910262671b6f415a6f936c794e2

"some service(s) generally falling over and not communicating" might be a flaky thing, and so it could happen with ocis and/or S3NG storage, and in any of the test pipelines.

@butonic (Member) commented Jun 7, 2022

https://drone.owncloud.com/owncloud/ocis/12331/53/9

HEAD is now at 95fe2030d Automated changelog update [skip ci]
+ cd tests/acceptance/
+ yarn install --immutable
➤ YN0000: ┌ Resolution step
➤ YN0000: └ Completed
➤ YN0000: ┌ Fetch step
➤ YN0000: │ Setting the NODE_TLS_REJECT_UNAUTHORIZED environment variable to '0' makes TLS connections and HTTPS requests insecure by disabling certificate verification.
➤ YN0035: │ mime-types@npm:2.1.31: The remote server failed to provide the requested resource
➤ YN0035: │   Response Code: 501 (Not Implemented)
➤ YN0035: │   Request Method: GET
➤ YN0035: │   Request URL: https://registry.yarnpkg.com/mime-types/-/mime-types-2.1.31.tgz
➤ YN0013: │  - one package was already cached, 616 had to be fetched
➤ YN0000: └ Completed in 2s 333ms
➤ YN0000: Failed with errors in 2s 483ms

hm, just retry this step / the yarn install --immutable?

@phil-davis (Contributor)

Upstream yarn servers/mirrors (wherever the drone agent goes looking for this stuff) have been giving intermittent Response Code: 501 (Not Implemented) in multiple repos for the last few weeks - not limited to oCIS! I suspect that some upstream mirror has been having problems, but I have no way to prove that.

I guess that we can put in some scripting to retry it. I do wonder, though, whether a retry from the same drone agent will end up hitting the same upstream server and getting the same 501.

It's all a pest - a reliable upstream would be the best thing, if we can work out how to achieve that!
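
The kind of retry scripting meant here could look roughly like this - a sketch only; the attempt count, the sleep, and the wrapped command are placeholders, and a retry may well land on the same mirror:

```sh
#!/bin/sh
# Sketch of a retry wrapper for flaky dependency downloads in CI.
retry() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    echo "attempt $i/$attempts failed: $*" >&2
    i=$((i + 1))
    sleep 5   # small pause; the next attempt may still hit the same upstream mirror
  done
  return 1
}

retry 3 yarn install --immutable
```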

@phil-davis (Contributor)

https://drone.owncloud.com/owncloud/ocis/12334/55/2
another example of a failure of:

make ci-node-check-licenses
...
% check-license extensions/idp:
make[1]: *** [Makefile:61: node_modules] Error 1
make: *** [Makefile:204: ci-node-check-licenses] Error 1

And the log output gives no clue about why it exits with "Error 1".

@butonic (Member) commented Jun 7, 2022

https://drone.owncloud.com/owncloud/ocis/12333/72/12

Status: Image is up to date for owncloudci/alpine:latest
+ ocis/bin/ocis init --insecure true
/bin/sh: ocis/bin/ocis: Text file busy

hmmm @dragonchaser thinks it might be caused by running out of file descriptors 🤔
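
To check the file-descriptor theory, a quick look at the limits inside the CI container could help - a diagnostic sketch, assuming a standard Linux /proc layout in the owncloudci/alpine image:

```sh
# Untested diagnostic sketch for the "running out of file descriptors" theory.
ulimit -n                    # per-process soft limit on open file descriptors
cat /proc/sys/fs/file-nr     # system-wide: allocated, unused, maximum
ls /proc/self/fd | wc -l     # descriptors currently open in this shell
```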

@C0rby (Contributor) commented Jun 7, 2022

@butonic (Member) commented Jun 7, 2022

https://drone.owncloud.com/owncloud/ocis/12206/39/3 reported by @phil-davis as a dedicated issue: #3900

@butonic (Member) commented Jun 8, 2022

cs3ApiTests-ocis: https://drone.owncloud.com/owncloud/ocis/12381/35/6

  Scenario: Change treesize of personal home and for only one subtree # /var/lib/cs3api-validator/etag-propagation.feature:60
    When user "admin" has uploaded a file "a-folder/a-sub-folder/testfile.txt" with content "text" in the home directory with the alias "testfile.txt" # /var/lib/cs3api-validator/etag-propagation.feature:62
      Error: error statting new file

with this in the server log https://drone.owncloud.com/owncloud/ocis/12381/35/4

{"level":"error","service":"search","statRes":{"status":{"code":6,"message":"stat: error: not found: b0920831-e209-4ae3-9838-a495e07b7c99/testfile.txt","trace":"00000000000000000000000000000000"}},"time":"2022-06-08T09:50:07Z","message":"failed to stat the changed resource"}
{"level":"error","service":"search","error":"entity not found","Id":{"storage_id":"1284d238-aa92-42ce-bdc4-0b0000009157$f6ce24d8-7bf1-4a4b-9c3b-4c7d6dc1b9ef","opaque_id":"4971a84a-d2f3-4e00-9375-1298ff18bead"},"time":"2022-06-08T09:50:07Z","message":"failed to remove item from index"}

Seems to be related to events and the search service. @aduffeck any idea?

A restart made the test pass: https://drone.owncloud.com/owncloud/ocis/12383/35/6

A similar failure occurred when updating web: https://drone.owncloud.com/owncloud/ocis/12380/35/5 (etag-propagation.feature:70 vs etag-propagation.feature:60)

@butonic (Member) commented Jun 8, 2022

@phil-davis can we log the request ID when a test fails? Does the test suite send an X-Request-ID header? We could use it to grep the server log for that request ...

@phil-davis (Contributor)

> @phil-davis can we log the request ID when a test fails? Does the test suite send an X-Request-ID header? We could use it to grep the server log for that request ...

The test suite remembers which line of the scenario it is executing as it runs:

$this->scenarioString = $suiteName . '/' . $featureFileName . ':' . $scenarioLine;

$this->stepLineRef = $this->scenarioString . '-' . $scope->getStep()->getLine();

and sends that in the 'X-Request-ID' header.

So it looks something like apiComments/comments.feature:25-31 - the scenario that starts at line 25, executing the step at line 31.

At the end of a test run we have a list of the scenarios that failed - we know apiComments/comments.feature:25 and so we can search the log for apiComments/comments.feature:25-* to find any log output that mentions an X-Request-ID like that.

Do we try to automate this (and output the filtered log entries somewhere)? Or is it enough that I document how the X-Request-ID string is constructed, so developers can easily use the failing test scenario apiComments/comments.feature:25 to search the log?
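
For the manual variant, the lookup would be a plain grep over the archived server log - a sketch, with ocis-server.log standing in for wherever the CI step stores the server output:

```sh
# Sketch: find all server log lines whose X-Request-ID belongs to a failed scenario.
# "ocis-server.log" is a placeholder for the archived server log of the pipeline run.
FAILED='apiComments/comments.feature:25'
grep -F "$FAILED-" ocis-server.log
```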

@dragonchaser (Contributor)

@individual-it (Member)

@dragonchaser that run failed in apiWebdavEtagPropagation1/moveFileFolder.feature:184:
the etag of a folder that a file was moved into did not change. Bug?
Log output:

{"level":"error","service":"notifications","error":"could not send mail: dial tcp 127.0.0.1:1025: connect: connection refused","event":"ShareCreated","time":"2022-06-08T15:19:19Z","message":"failed to send a message"}
{"level":"error","service":"search","statRes":{"status":{"code":6,"message":"stat: error: not found: f5952ac8-40ed-46e2-83f1-426f52fa88f6/file.txt","trace":"00000000000000000000000000000000"}},"time":"2022-06-08T15:19:20Z","message":"failed to stat the changed resource"}
{"level":"error","service":"search","error":"entity not found","Id":{"storage_id":"1284d238-aa92-42ce-bdc4-0b0000009157$7ca21fba-5aaa-49b7-9efd-934eef532cad","opaque_id":"385b64ee-aac4-4925-8ed8-d8779f9a5bd4"},"time":"2022-06-08T15:19:21Z","message":"failed to remove item from index"}
{"level":"error","service":"notifications","error":"could not send mail: dial tcp 127.0.0.1:1025: connect: connection refused","event":"ShareCreated","time":"2022-06-08T15:19:22Z","message":"failed to send a message"}
{"level":"error","service":"notifications","error":"could not send mail: dial tcp 127.0.0.1:1025: connect: connection refused","event":"ShareCreated","time":"2022-06-08T15:19:24Z","message":"failed to send a message"}
{"level":"error","service":"search","error":"error: not found: unknown client id","authRes":{"status":{"code":6,"message":"unknown client id","trace":"00000000000000000000000000000000"}},"time":"2022-06-08T15:19:25Z","message":"error using machine auth"}

@dragonchaser (Contributor)

@individual-it I suspect an issue with the underlying storage, not a bug on our side.

@C0rby (Contributor) commented Jun 13, 2022

@individual-it (Member)

The first one seems to be related to the issue discussed above with @dragonchaser: etag propagation seems to have an issue on copy/move.
I understand that the root of the issue might not be in oCIS, but what is the expectation in that case? Is it acceptable that the etags are not propagated? If so, we might want to adjust the tests.

@individual-it (Member)

I've created a new issue #3962 to rerun pipelines that fail because some dependencies could not be downloaded.

@butonic (Member) commented Jun 17, 2022

Etags not updating is a bug. If the storage fails, decomposedfs should retry the propagation, but implementing that would amount to adding journaling to decomposedfs. We currently rely on the filesystem to work. If an error occurs we fail, so the admin gets a log message and the user / client can retry.

We could retry the propagation in process. How often should be configurable...

@individual-it (Member)

The etag propagation issue is tracked in #3988. @butonic please comment.

@micbar micbar added Type:Orga and removed Type:Bug labels Jul 18, 2022
@micbar micbar added QA Type:CI Related to our Continouus Integration Solution labels Jul 18, 2022
@stale bot commented Sep 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 10 days if no further activity occurs. Thank you for your contributions.
