
CI Pipeline is heavily unstable #3921

Closed
kobergj opened this issue Jun 3, 2022 · 26 comments
Labels
QA Status:Completed Type:CI Related to our Continouus Integration Solution Type:Orga

Comments

@kobergj (Collaborator) commented Jun 3, 2022

Describe the bug

We see CI pipelines failing randomly. The reason for this is unknown; the failures are not related to flaky tests.
Maybe it is some infrastructure-related issue?

Please link all such problems to this ticket so we can get an overview of how often this happens.

Steps to reproduce

Run a CI pipeline

Expected behavior

The pipeline goes green, or a test fails.

Actual behavior

The pipeline crashes because of some unknown problem

@kobergj (Collaborator, Author) commented Jun 3, 2022

@kobergj (Collaborator, Author) commented Jun 3, 2022

@C0rby (Contributor) commented Jun 3, 2022

@butonic (Member) commented Jun 3, 2022

The first two linked issues are caused by make ci-node-check-licenses:

% check-license extensions/idp:
make[1]: *** [Makefile:61: node_modules] Error 1
make: *** [Makefile:204: ci-node-check-licenses] Error 1

@phil-davis (Contributor)

https://drone.owncloud.com/owncloud/ocis/12313/1/2

+ make ci-node-generate
make[1]: Nothing to be done for 'ci-node-generate'.
make[1]: Nothing to be done for 'ci-node-generate'.
make[1]: Nothing to be done for 'ci-node-generate'.
make[1]: Nothing to be done for 'ci-node-generate'.
make[1]: Nothing to be done for 'ci-node-generate'.
make[1]: Nothing to be done for 'ci-node-generate'.
make[1]: Nothing to be done for 'ci-node-generate'.
make[1]: Nothing to be done for 'ci-node-generate'.
make[1]: Nothing to be done for 'ci-node-generate'.
dist/
dist/explorer.js.map
dist/explorer_v1.7.10.js
dist/explorer.js
index.html
node_modules/hellojs/dist/hello.all.js
src/custom.css
node_modules/core-js/client/shim.min.js
node_modules/zone.js/dist/zone.js
node_modules/systemjs/dist/system.src.js
node_modules/moment/min/moment-with-locales.min.js
make[1]: Nothing to be done for 'ci-node-generate'.
make[1]: Nothing to be done for 'ci-node-generate'.
make[1]: *** [Makefile:61: node_modules] Error 1
make: *** [Makefile:158: ci-node-generate] Error 1

In that log output, I don't see what the actual problem was. It "just exits" with Error 1 ???
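
One way to at least see what the failing recipe was doing would be to re-run the target with make's debug output and shell tracing turned on - a rough sketch only, assuming the target names from the log above (the actual recipes live in the Makefile and are not shown here):

```sh
# Untested sketch: surface more detail from a make target that only reports "Error 1".
make --debug=b node_modules                 # print which targets make decides to rebuild and why
make SHELL='sh -x' ci-node-check-licenses   # echo every recipe command as it runs
```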

@individual-it (Member)

For the node issues, can we implement a retry?

@butonic (Member) commented Jun 7, 2022

Just ran into https://drone.cernbox.cern.ch/cs3org/reva/7398/11/6, which is the same test suite as https://drone.cernbox.cern.ch/cs3org/reva/7398/10/5 but run against S3 ... the s3ng suite should never fail if the ocis suite passes ...

@phil-davis (Contributor)

https://drone.cernbox.cern.ch/cs3org/reva/7398/11/6
Error: PUT request to datagateway failed
Error: expected status code 200 for the file upload, but got 500

I guess that "something happened" (TM) to the set of reva services. Maybe the log output has clues - https://drone.cernbox.cern.ch/cs3org/reva/7398/11/4 - but there is a lot of it. For instance, I noticed:

2022-06-07 10:13:42.209 ERR ../../../pkg/rhttp/datatx/manager/simple/simple.go:107 > error uploading file error="failed to upload file to blostore: could not store object 'ddc2004c-0977-11eb-9d3f-a793888cd0f8/f6/da/24/42/-c306-4370-8925-0570124774e3' into bucket 'test': The specified bucket does not exist." datatx=simple pid=12 pkg=rhttp traceid=a05d2910262671b6f415a6f936c794e2

"some service(s) generally falling over and not communicating" might be a flaky thing, and so it could happen with ocis and/or S3NG storage, and in any of the test pipelines.

@butonic (Member) commented Jun 7, 2022

https://drone.owncloud.com/owncloud/ocis/12331/53/9

HEAD is now at 95fe2030d Automated changelog update [skip ci]
+ cd tests/acceptance/
+ yarn install --immutable
➤ YN0000: ┌ Resolution step
➤ YN0000: └ Completed
➤ YN0000: ┌ Fetch step
➤ YN0000: │ Setting the NODE_TLS_REJECT_UNAUTHORIZED environment variable to '0' makes TLS connections and HTTPS requests insecure by disabling certificate verification.
➤ YN0035: │ mime-types@npm:2.1.31: The remote server failed to provide the requested resource
➤ YN0035: │   Response Code: 501 (Not Implemented)
➤ YN0035: │   Request Method: GET
➤ YN0035: │   Request URL: https://registry.yarnpkg.com/mime-types/-/mime-types-2.1.31.tgz
➤ YN0013: │  - one package was already cached, 616 had to be fetched
➤ YN0000: └ Completed in 2s 333ms
➤ YN0000: Failed with errors in 2s 483ms

hm, just retry this step / the yarn install --immutable?

@phil-davis (Contributor)

Upstream yarn servers/mirrors (wherever the drone agent goes looking for this stuff) have been giving intermittent Response Code: 501 (Not Implemented) in multiple repos for the last few weeks - not limited to oCIS! I suspect that some upstream mirror has been having problems, but I have no way to prove that.

I guess that we can put in some scripting to retry it. I do wonder, though, whether a retry from the same drone agent will end up hitting the same upstream server and getting the same 501.

It's all a pest - a reliable upstream would be the best thing, if we can work out how to achieve that!
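
The kind of retry scripting meant here could look roughly like this - a sketch only; the attempt count, the sleep, and the wrapped command are placeholders, and a retry may well land on the same mirror:

```sh
#!/bin/sh
# Sketch of a retry wrapper for flaky dependency downloads in CI.
retry() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    echo "attempt $i/$attempts failed: $*" >&2
    i=$((i + 1))
    sleep 5   # small pause; the next attempt may still hit the same upstream mirror
  done
  return 1
}

retry 3 yarn install --immutable
```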

@phil-davis (Contributor)

https://drone.owncloud.com/owncloud/ocis/12334/55/2
another example of a failure of:

make ci-node-check-licenses
...
% check-license extensions/idp:
make[1]: *** [Makefile:61: node_modules] Error 1
make: *** [Makefile:204: ci-node-check-licenses] Error 1

And the log output gives no clue about why it exits with "Error 1".

@butonic (Member) commented Jun 7, 2022

https://drone.owncloud.com/owncloud/ocis/12333/72/12

Status: Image is up to date for owncloudci/alpine:latest
+ ocis/bin/ocis init --insecure true
/bin/sh: ocis/bin/ocis: Text file busy

hmmm @dragonchaser thinks it might be caused by running out of file descriptors 🤔
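
To check the file-descriptor theory, a quick look at the limits inside the CI container could help - a diagnostic sketch, assuming a standard Linux /proc layout in the owncloudci/alpine image:

```sh
# Untested diagnostic sketch for the "running out of file descriptors" theory.
ulimit -n                    # per-process soft limit on open file descriptors
cat /proc/sys/fs/file-nr     # system-wide: allocated, unused, maximum
ls /proc/self/fd | wc -l     # descriptors currently open in this shell
```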

@C0rby (Contributor) commented Jun 7, 2022

@butonic (Member) commented Jun 7, 2022

https://drone.owncloud.com/owncloud/ocis/12206/39/3 reported by @phil-davis as a dedicated issue: #3900

@butonic (Member) commented Jun 8, 2022

cs3ApiTests-ocis: https://drone.owncloud.com/owncloud/ocis/12381/35/6

  Scenario: Change treesize of personal home and for only one subtree # /var/lib/cs3api-validator/etag-propagation.feature:60
    When user "admin" has uploaded a file "a-folder/a-sub-folder/testfile.txt" with content "text" in the home directory with the alias "testfile.txt" # /var/lib/cs3api-validator/etag-propagation.feature:62
      Error: error statting new file

with this in the server log https://drone.owncloud.com/owncloud/ocis/12381/35/4

{"level":"error","service":"search","statRes":{"status":{"code":6,"message":"stat: error: not found: b0920831-e209-4ae3-9838-a495e07b7c99/testfile.txt","trace":"00000000000000000000000000000000"}},"time":"2022-06-08T09:50:07Z","message":"failed to stat the changed resource"}
{"level":"error","service":"search","error":"entity not found","Id":{"storage_id":"1284d238-aa92-42ce-bdc4-0b0000009157$f6ce24d8-7bf1-4a4b-9c3b-4c7d6dc1b9ef","opaque_id":"4971a84a-d2f3-4e00-9375-1298ff18bead"},"time":"2022-06-08T09:50:07Z","message":"failed to remove item from index"}

Seems to be related to events and the search service. @aduffeck any idea?

A restart made the test pass: https://drone.owncloud.com/owncloud/ocis/12383/35/6

A similar failure occurred when updating web: https://drone.owncloud.com/owncloud/ocis/12380/35/5 (etag-propagation.feature:70 vs etag-propagation.feature:60)

@butonic (Member) commented Jun 8, 2022

@phil-davis can we log the request ID when a test fails? Does the test suite send an X-Request-ID header? We could use it to grep the server log for that request ...

@phil-davis (Contributor)

> @phil-davis can we log the request ID when a test fails? Does the test suite send an X-Request-ID header? We could use it to grep the server log for that request ...

The test suite remembers which line of the scenario it is executing as it runs:

$this->scenarioString = $suiteName . '/' . $featureFileName . ':' . $scenarioLine;

$this->stepLineRef = $this->scenarioString . '-' . $scope->getStep()->getLine();

and sends that in the 'X-Request-ID' header.

So it looks something like apiComments/comments.feature:25-31 - the scenario that starts at line 25, executing the step at line 31.

At the end of a test run we have a list of the scenarios that failed - we know apiComments/comments.feature:25 and so we can search the log for apiComments/comments.feature:25-* to find any log output that mentions an X-Request-ID like that.

Do we try to automate this (and output the filtered log entries somewhere)? Or is it enough that I document how the X-Request-ID string is constructed, so developers can easily use the failing test scenario apiComments/comments.feature:25 to search the log?
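
For the manual variant, the lookup would be a plain grep over the archived server log - a sketch, with ocis-server.log standing in for wherever the CI step stores the server output:

```sh
# Sketch: find all server log lines whose X-Request-ID belongs to a failed scenario.
# "ocis-server.log" is a placeholder for the archived server log of the pipeline run.
FAILED='apiComments/comments.feature:25'
grep -F "$FAILED-" ocis-server.log
```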

@dragonchaser (Contributor)

@individual-it (Member)

@dragonchaser that run failed in apiWebdavEtagPropagation1/moveFileFolder.feature:184:
the etag of a folder that a file was moved into did not change. Bug?
Log output:

{"level":"error","service":"notifications","error":"could not send mail: dial tcp 127.0.0.1:1025: connect: connection refused","event":"ShareCreated","time":"2022-06-08T15:19:19Z","message":"failed to send a message"}
{"level":"error","service":"search","statRes":{"status":{"code":6,"message":"stat: error: not found: f5952ac8-40ed-46e2-83f1-426f52fa88f6/file.txt","trace":"00000000000000000000000000000000"}},"time":"2022-06-08T15:19:20Z","message":"failed to stat the changed resource"}
{"level":"error","service":"search","error":"entity not found","Id":{"storage_id":"1284d238-aa92-42ce-bdc4-0b0000009157$7ca21fba-5aaa-49b7-9efd-934eef532cad","opaque_id":"385b64ee-aac4-4925-8ed8-d8779f9a5bd4"},"time":"2022-06-08T15:19:21Z","message":"failed to remove item from index"}
{"level":"error","service":"notifications","error":"could not send mail: dial tcp 127.0.0.1:1025: connect: connection refused","event":"ShareCreated","time":"2022-06-08T15:19:22Z","message":"failed to send a message"}
{"level":"error","service":"notifications","error":"could not send mail: dial tcp 127.0.0.1:1025: connect: connection refused","event":"ShareCreated","time":"2022-06-08T15:19:24Z","message":"failed to send a message"}
{"level":"error","service":"search","error":"error: not found: unknown client id","authRes":{"status":{"code":6,"message":"unknown client id","trace":"00000000000000000000000000000000"}},"time":"2022-06-08T15:19:25Z","message":"error using machine auth"}

@dragonchaser (Contributor)

@individual-it I suspect an issue with the underlying storage, not a bug on our side.

@C0rby (Contributor) commented Jun 13, 2022

@individual-it (Member)

The first one seems to be related to the issue discussed above with @dragonchaser: etag propagation seems to have an issue on copy/move.
I understand that the root of the issue might not be in oCIS, but what is the expectation in that case? Is it acceptable that the etags are not propagated? If so, we might want to adjust the tests.

@individual-it (Member)

I've created a new issue #3962 to rerun pipelines that fail because some dependencies could not be downloaded.

@butonic (Member) commented Jun 17, 2022

Etags not updating is a bug. If the storage fails, decomposedfs should retry the propagation, but implementing that would amount to adding journaling to decomposedfs. We currently rely on the filesystem to work. If an error occurs we fail, so the admin gets a log message and the user / client can retry.

We could retry the propagation in process. How often should be configurable...

@individual-it (Member)

The etag propagation issue is tracked in #3988. @butonic please comment.

@micbar micbar added Type:Orga and removed Type:Bug labels Jul 18, 2022
@micbar micbar added QA Type:CI Related to our Continouus Integration Solution labels Jul 18, 2022
@stale bot commented Sep 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 10 days if no further activity occurs. Thank you for your contributions.
