
flaky test-process-exit-recursive #18323

Closed
BridgeAR opened this issue Jan 23, 2018 · 18 comments
Labels
flaky-test: Issues and PRs related to the tests with unstable failures on the CI.
test: Issues and PRs related to the tests.

Comments

@BridgeAR
Member

https://ci.nodejs.org/job/node-test-commit-linux/15696/nodes=alpine34-container-x64/console

not ok 1180 parallel/test-process-exit-recursive
  ---
  duration_ms: 0.250
  severity: crashed
  stack: |-
    oh no!
    exit code: CRASHED (Signal: 11)
@BridgeAR added the test and flaky-test labels on Jan 23, 2018
@gireeshpunathil
Member

  • Ran it 1000 times on a normal Linux x64 box; could not reproduce (a reproduction loop is sketched below this list).
  • The test is trivial; it just makes sure that the process exit handler does not recurse.
  • I don't have an alpine container to try this. I can debug in CI if provided access.
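
For reference, a loop along the following lines can be used to hammer the single test. It is only a sketch; it assumes a built node binary at the root of a node checkout, so adjust paths as needed:

# Sketch: run the single test repeatedly against a locally built node binary.
# Assumes the working directory is the root of a node checkout with ./node built.
for i in $(seq 1 1000); do
  ./node test/parallel/test-process-exit-recursive.js || { echo "failed on run $i"; break; }
done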

/cc @gibfahn

@gibfahn
Member

gibfahn commented Jan 24, 2018

I don't have an alpine container to try this. I can debug in CI if provided access.

I think our Alpine containers are just Docker images. We can get you access, but it might be easier for you to spin up an Alpine image yourself and test in there.
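
For example, a throwaway container can be started with the following (the alpine:3.4 tag is an assumption chosen to match the CI's alpine34 image; build tools and a node checkout still need to be added inside it):

# throwaway Alpine 3.4 container matching the CI image's release line
docker run -it --rm alpine:3.4 /bin/sh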

@gireeshpunathil
Member

Thanks @gibfahn. Would you mind setting one up for me on one of our (IBM) dev boxes? (I am sure it would take me much more time than it would take you.)

@bnoordhuis
Member

That might be because of alpine's ridiculously low default stack size. Can you reproduce when you do ulimit -s 80 first?
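
For concreteness, something like the following, run in the same shell (ulimit only affects the current shell and its children; paths are illustrative and assume a built node checkout):

ulimit -s 80          # stack size in kbytes
./node test/parallel/test-process-exit-recursive.js
echo "exit code: $?"  # a SIGSEGV would show up here as 139 (128 + 11)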

@gireeshpunathil
Member

Created an Alpine image with @gibfahn's help and attempted it 2K times, but no luck.
Also tried reducing the ulimit as @bnoordhuis suggested, without success.

These are the image attributes, FWIW:

bash-4.3# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 7864
max locked memory       (kbytes, -l) 82000
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 80
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
bash-4.3#  
bash-4.3# cat /etc/alpine-release 
3.4.6
bash-4.3#  uname -a
Linux c1cfc92d5933 4.9.60-linuxkit-aufs #1 SMP Mon Nov 6 16:00:12 UTC 2017 x86_64 Linux
bash-4.3# 

@gireeshpunathil
Member

At this point I will stop working on this and wait to see if it shows up again in the CI runs.

@apapirovski
Contributor

@gireeshpunathil @bnoordhuis any further thoughts here? This has happened on a few other tests, all of them very short-running tests that don't do much. For example: #18585

It makes me think this must be something related to bootstrap / teardown, where the code is clearly overreaching. For example, could we be manipulating V8 objects after Dispose, or something of that nature? I first considered problematic atexit handlers, but we have only one, I believe, and it's safe.

@apapirovski
Contributor

The shortest-running example was #18505, which literally exits after requiring common and checking that it's not a platform the test supports.

@gireeshpunathil
Member

I stopped working on this, but thanks for the collated information; it shows a pattern for the failure and gives a further opportunity to investigate.

Let me look further into these.

One challenge is the intermittent nature, even in the CI. If you ever spot a consistent failure in CI, the way forward would be to set ulimit -c unlimited to get a core dump, or to run the test under strace.
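
For concreteness, a sketch (paths are illustrative and assume a built node checkout; gdb and strace are assumed to be available in the container, and the core file location depends on the kernel's core_pattern setting):

# allow core dumps in this shell, then run the failing test directly
ulimit -c unlimited
./node test/parallel/test-process-exit-recursive.js
# inspect the dump (file name/location depends on /proc/sys/kernel/core_pattern)
gdb ./node core
# or trace system calls instead, following forked children
strace -f -o trace.log ./node test/parallel/test-process-exit-recursive.js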

@gireeshpunathil
Member

The shortest-running example was #18505

@apapirovski - that provides a sufficient proof point that the issue can lie outside of Node itself.

The practical code that is run in #18505 on Alpine is:

'use strict';
const common = require('../common');
if (!(common.isOSX || common.isWindows))
  common.skip('recursive option is darwin/windows specific');

@gibfahn :

  • Are these images persistent (created once and reused), or built on the fly from the Dockerfile on a per-CI-run basis? If they are persistent, I would guess the issue is peculiar to the images and may not be reproducible locally.
  • What is the physical host on which the Alpine images are deployed? I would like to try the same combination (all my attempts were with alpine34 on macOS).

@gibfahn
Member

gibfahn commented Feb 14, 2018

Are these images persistent (created once and reused), or built on the fly from the Dockerfile on a per-CI-run basis? If they are persistent, I would guess the issue is peculiar to the images and may not be reproducible locally.

I think the containers are persistent, but I'm not familiar with the specifics. That's more a question for @rvagg.

@rvagg
Member

rvagg commented Feb 14, 2018

Dockerfiles are here: https://github.com/nodejs/build/tree/master/ansible/roles/docker/templates
They are persistent but get updated fairly regularly as I tweak them (mostly the sharedlibs one).

Also, Alpine 3.4 is kind of old now; what do people here think about just removing it? I'd like to do it at least by the time 3.8 is released.

@rvagg
Member

rvagg commented Feb 14, 2018

Pull requests to the Dockerfiles are welcome, of course! Especially from people who actually use Alpine and have more of a clue than I do about how it's generally configured.

@gireeshpunathil
Member

Thanks @rvagg. In that case, is it possible to export the alpine34-container-x64 and alpine35-container-x64 images and send them over for debugging? The fresh images I created using the Dockerfile template never cause the noted failures. If we are able to recreate the failure with the CI images, we can conclude that these are issues with the images themselves, and refresh them.

docker save -o <filename> <image>
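
On the receiving end, the saved image could then be loaded and entered with something like this (placeholders as above):

docker load -i <filename>
docker run -it <image> /bin/sh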

@bnoordhuis
Member

Also, Alpine 3.4 is kind of old now; what do people here think about just removing it?

I think that would fix a lot of issues people have reported, both test failures and build issues. I'm testing against 3.7 and it's been pretty smooth.

It's a stock install in a KVM virtual machine, not a Docker image; don't know if that makes a difference.

@gibfahn
Member

gibfahn commented Feb 16, 2018

Also, Alpine 3.4 is kind of old now; what do people here think about just removing it? I'd like to do it at least by the time 3.8 is released.

Seems reasonable to me. I assume 99% of people running Node in Alpine are doing it through Docker, so they're unlikely to be pulling new versions of Node into existing images.

@apapirovski
Contributor

I'm leaning towards closing this because we've fixed some segfaults in the last 3 months and this test hasn't failed in a long time. In addition, the V8 team helped us run Node.js with debug parameters that uncover segfaults, and we've fixed everything those runs found.

@BridgeAR
Member Author

BridgeAR commented Jun 3, 2018

I am fine with that

@BridgeAR closed this as completed on Jun 3, 2018