
flaky test-process-exit-recursive #18323

Closed
BridgeAR opened this issue Jan 23, 2018 · 18 comments
Labels
flaky-test: Issues and PRs related to the tests with unstable failures on the CI.
test: Issues and PRs related to the tests.

Comments

@BridgeAR
Member

https://ci.nodejs.org/job/node-test-commit-linux/15696/nodes=alpine34-container-x64/console

not ok 1180 parallel/test-process-exit-recursive
  ---
  duration_ms: 0.250
  severity: crashed
  stack: |-
    oh no!
    exit code: CRASHED (Signal: 11)
@BridgeAR added the test and flaky-test labels on Jan 23, 2018
@gireeshpunathil
Member

  • Ran it 1000 times on a normal Linux x64 box; could not reproduce (a reproduction loop is sketched below this list).
  • The test is trivial; it just makes sure that the process exit handler does not recurse.
  • I don't have an alpine container to try this. I can debug in CI if provided access.
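
For reference, a loop along the following lines can be used to hammer the single test. It is only a sketch; it assumes a built node binary at the root of a node checkout, so adjust paths as needed:

# Sketch: run the single test repeatedly against a locally built node binary.
# Assumes the working directory is the root of a node checkout with ./node built.
for i in $(seq 1 1000); do
  ./node test/parallel/test-process-exit-recursive.js || { echo "failed on run $i"; break; }
done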

/cc @gibfahn

@gibfahn
Member

gibfahn commented Jan 24, 2018

I don't have an alpine container to try this. I can debug in CI if provided access.

I think our Alpine containers are just Docker images. We can get you access, but it might be easier for you to spin up an Alpine image yourself and test in there.
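
For example, a throwaway container can be started with the following (the alpine:3.4 tag is an assumption chosen to match the CI's alpine34 image; build tools and a node checkout still need to be added inside it):

# throwaway Alpine 3.4 container matching the CI image's release line
docker run -it --rm alpine:3.4 /bin/sh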

@gireeshpunathil
Member

Thanks @gibfahn. Would you mind setting one up for me on one of our (IBM) dev boxes? (I am sure it would take me much more time than it would take you.)

@bnoordhuis
Member

That might be because of alpine's ridiculously low default stack size. Can you reproduce when you do ulimit -s 80 first?
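
For concreteness, something like the following, run in the same shell (ulimit only affects the current shell and its children; paths are illustrative and assume a built node checkout):

ulimit -s 80          # stack size in kbytes
./node test/parallel/test-process-exit-recursive.js
echo "exit code: $?"  # a SIGSEGV would show up here as 139 (128 + 11)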

@gireeshpunathil
Member

Created an Alpine image with @gibfahn's help and attempted it 2K times, but no luck.
Also tried reducing the ulimit as @bnoordhuis suggested, without success.

These are the image attributes, FWIW:

bash-4.3# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 7864
max locked memory       (kbytes, -l) 82000
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 80
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
bash-4.3#  
bash-4.3# cat /etc/alpine-release 
3.4.6
bash-4.3#  uname -a
Linux c1cfc92d5933 4.9.60-linuxkit-aufs #1 SMP Mon Nov 6 16:00:12 UTC 2017 x86_64 Linux
bash-4.3# 

@gireeshpunathil
Member

At this point I will stop working on this and wait to see if it shows up again in the CI runs.

@apapirovski
Contributor

@gireeshpunathil @bnoordhuis any further thoughts here? This has happened on a few other tests, all of them very short-running tests that don't do much. For example: #18585

It makes me think this must be something related to bootstrap / teardown, where the code is clearly overreaching. For example, could we be manipulating V8 objects after Dispose, or something of that nature? I first considered problematic atexit handlers, but we have only one, I believe, and it's safe.

@apapirovski
Contributor

The shortest-running example was #18505, which literally exits after requiring common and checking that it's not a platform the test supports.

@gireeshpunathil
Member

I stopped working on this, but thanks for the collated information; it shows a pattern for the failure and gives a further opportunity to investigate.

Let me look further into these.

One challenge is the intermittent nature, even in the CI. If you ever spot a consistent failure in CI, the way forward would be to set ulimit -c unlimited to get a core dump, or to run the test under strace.
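
For concreteness, a sketch (paths are illustrative and assume a built node checkout; gdb and strace are assumed to be available in the container, and the core file location depends on the kernel's core_pattern setting):

# allow core dumps in this shell, then run the failing test directly
ulimit -c unlimited
./node test/parallel/test-process-exit-recursive.js
# inspect the dump (file name/location depends on /proc/sys/kernel/core_pattern)
gdb ./node core
# or trace system calls instead, following forked children
strace -f -o trace.log ./node test/parallel/test-process-exit-recursive.js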

@gireeshpunathil
Member

The shortest-running example was #18505

@apapirovski - that provides a sufficient proof point that the issue can lie outside of Node itself.

The practical code that is run in #18505 on Alpine is:

'use strict';
const common = require('../common');
if (!(common.isOSX || common.isWindows))
  common.skip('recursive option is darwin/windows specific');

@gibfahn :

  • Are these images persistent (created once and reused), or built on the fly from the Dockerfile on a per-CI-run basis? If they are persistent, I would guess the issue is peculiar to the images and may not be reproducible locally.
  • What is the physical host on which the Alpine images are deployed? I would like to try the same combination (all my attempts were with alpine34 on macOS).

@gibfahn
Member

gibfahn commented Feb 14, 2018

Are these images persistent (created once and reused), or built on the fly from the Dockerfile on a per-CI-run basis? If they are persistent, I would guess the issue is peculiar to the images and may not be reproducible locally.

I think the containers are persistent, but I'm not familiar with the specifics. That's more a question for @rvagg.

@rvagg
Member

rvagg commented Feb 14, 2018

Dockerfiles are here: https://github.com/nodejs/build/tree/master/ansible/roles/docker/templates
They are persistent but get updated fairly regularly as I tweak them (mostly the sharedlibs one).

Also, Alpine 3.4 is kind of old now; what do people here think about just removing it? I'd like to do it at least by the time 3.8 is released.

@rvagg
Member

rvagg commented Feb 14, 2018

Pull requests to the Dockerfiles are welcome, of course! Especially from people who actually use Alpine and have more of a clue than I do about how it's generally configured.

@gireeshpunathil
Member

Thanks @rvagg. In that case, is it possible to export the alpine34-container-x64 and alpine35-container-x64 images and send them over for debugging? The fresh images I created using the Dockerfile template never cause the noted failures. If we are able to recreate the failure with the CI images, we can conclude that these are issues with the images themselves, and refresh them.

docker save -o <filename> <image>
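
On the receiving end, the saved image could then be loaded and entered with something like this (placeholders as above):

docker load -i <filename>
docker run -it <image> /bin/sh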

@bnoordhuis
Member

Also, Alpine 3.4 is kind of old now; what do people here think about just removing it?

I think that would fix a lot of issues people have reported, both test failures and build issues. I'm testing against 3.7 and it's been pretty smooth.

It's a stock install in a KVM virtual machine, not a Docker image; don't know if that makes a difference.

@gibfahn
Member

gibfahn commented Feb 16, 2018

Also, Alpine 3.4 is kind of old now; what do people here think about just removing it? I'd like to do it at least by the time 3.8 is released.

Seems reasonable to me. I assume 99% of people running Node in Alpine are doing it through Docker, so they're unlikely to be pulling new versions of Node into existing images.

@apapirovski
Contributor

I'm leaning towards closing this because we've fixed some segfaults in the last 3 months and this test hasn't failed in a long time. In addition, the V8 team helped us run Node.js with debug parameters that uncover segfaults, and we've fixed everything those runs found.

@BridgeAR
Member Author

BridgeAR commented Jun 3, 2018

I am fine with that

@BridgeAR closed this as completed on Jun 3, 2018