
Investigate flaky parallel/test-http-pipeline-regr-3332 on smartos15-64 #15418

Closed
mscdex opened this issue Sep 14, 2017 · 11 comments
Labels: flaky-test, http, smartos, test

Comments

@mscdex
Contributor

mscdex commented Sep 14, 2017

  • Version: master
  • Platform: smartos
  • Subsystem: test
not ok 695 parallel/test-http-pipeline-regr-3332
  ---
  duration_ms: 3.348
  severity: crashed
  stack: |-
    oh no!
    exit code: CRASHED (Signal: 4)
@mscdex added the http, smartos, and test labels on Sep 14, 2017
@misterdjules

There should be a core file on the machine where that test ran. @nodejs/build: Is it possible to grab that core file and upload it somewhere people can download it?
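For reference, a minimal sketch of how one might locate the dump on the test host, assuming default SmartOS core settings (the paths are guesses, not confirmed for this CI machine):

# show where core files are written on this host (global and per-process patterns)
coreadm

# list recent core files in the likely locations
ls -lt /var/tmp/core.* /var/cores/core.* 2>/dev/null | head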

@misterdjules

/cc @nodejs/platform-smartos

@gibfahn
Member

gibfahn commented Sep 18, 2017

Do you have a link to the job that failed? I couldn't find it with a quick check.

@misterdjules what's the easiest way to get you a core file? Is there any sensitive data on there, or can it just go on ci.nodejs.org/downloads (see nodejs/build#871 (comment)) ?

@mscdex
Contributor Author

mscdex commented Sep 18, 2017

@gibfahn
Member

gibfahn commented Sep 18, 2017

@misterdjules or we could just put your key on the machine and you could scp it off.
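Roughly (the host name and file names below are placeholders, not the actual CI host):

# on the test host, as the CI user: authorize the key
cat misterdjules.pub >> ~/.ssh/authorized_keys

# then, from your machine: copy the core file off
scp iojs@<smartos-test-host>:/var/tmp/core.node.<pid> .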

@misterdjules

@gibfahn There should be no sensitive data in the core file. ci.nodejs.org/downloads is fine for now, but it'd be good to think about a cleaner long term approach for storing and cleaning up those files.

@refack
Copy link
Contributor

refack commented Sep 18, 2017

@misterdjules I think it's this one: https://expirebox.com/download/7c459d040bf2d407ddd80908463fc665.html, but that's an educated guess.

@mscdex
Contributor Author

mscdex commented Sep 18, 2017

The test started at 5:17pm and took 22 minutes, so it's hard to say which core file it is.

Perhaps we should either start including the child process pid in the error output (that wouldn't work in some situations, e.g. cluster tests), or change the core dump handler to include the pid and the test filename in the core file name. I know the latter is pretty easy on Linux by setting kernel.core_pattern appropriately, but I don't know about SmartOS.
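Something along these lines, for reference (the patterns and output directory are illustrative only, not a concrete proposal for the CI config):

# Linux: include executable name, pid and timestamp in the core file name
sysctl -w kernel.core_pattern='/var/tmp/core.%e.%p.%t'

# SmartOS/illumos: coreadm(1M) supports similar expansions
# (%f = executable name, %p = pid, %t = time)
coreadm -g /var/tmp/core.%f.%p.%t -e global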

@refack
Contributor

refack commented Sep 18, 2017

Times are in UTC, so 5:17pm => 21:40UTC, but that's when the tests ended (the rest of the suite continued)
Follow-up issues:

  1. Try not to generate cores for abort tests - test: run abort tests #14013 (comment)
  2. Add pid to crash message in TAP report
  3. Add timestamps to job/console output

@misterdjules

misterdjules commented Sep 19, 2017

@refack Thank you very much for uploading the core file!

After downloading it, I was able to verify that it was generated from a node process running the test mentioned in this issue with:

[root@headnode (coal) ~]# pargs /var/tmp/core.node.69705 
core '/var/tmp/core.node.69705' of 69705:	out/Release/node /home/iojs/build/workspace/node-test-commit-smartos/nodes/smar
argv[0]: out/Release/node
argv[1]: /home/iojs/build/workspace/node-test-commit-smartos/nodes/smartos15-64/test/parallel/test-http-pipeline-regr-3332.js
[root@headnode (coal) ~]#

Looking at the stacks:

[root@headnode (coal) ~]# pstack /var/tmp/core.node.69705 
core '/var/tmp/core.node.69705' of 69705:	out/Release/node /home/iojs/build/workspace/node-test-commit-smartos/n
-----------------  lwp# 1 / thread# 1  --------------------
 ffffbf7fff2113fe mutex_unlock_queue (2a3e110, 0) + 6e
 ffffbf7fff212b92 mutex_unlock (2a3e110) + 172
 ffffbf7ffec2cdcd umem_slab_alloc (2a3e028, 0) + cd
 ffffbf7ffec2de63 umem_cache_alloc (2a3e028, 0) + 143
 ffffbf7ffec2e214 umem_alloc (20, 0) + 64
 ffffbf7ffec2a9bf umem_malloc (10) + 3f
 ffffbf7fee5d6780 _Znwm () + 20
 00000000015ab6fc _ZN2v88internal4Heap22RegisterNewArrayBufferEPNS0_13JSArrayBufferE () + ac
 00000000016aac73 _ZN2v88internal13JSArrayBuffer5SetupENS0_6HandleIS1_EEPNS0_7IsolateEbPvmNS0_10SharedFlagE () + 13
 00000000011bb19b _ZN2v811ArrayBuffer3NewEPNS_7IsolateEPvmNS_23ArrayBufferCreationModeE () + ab
 0000000000c8f5bd _ZN4node6Buffer3NewEPNS_11EnvironmentEPcm () + 4d
 0000000000cf3ad6 _ZN4node10StreamWrap10OnReadImplElPK8uv_buf_t14uv_handle_typePv () + d6
 0000000000cf3f9b _ZN4node10StreamWrap6OnReadEP11uv_stream_slPK8uv_buf_t () + 7b
 00000000010d209f uv__read () + 1ff
 00000000010d2768 uv__stream_io () + 2e8
 00000000010d7b99 uv__io_poll () + 249
 00000000010c6f8a uv_run () + 15a
 0000000000c7c275 _ZN4node5StartEP9uv_loop_siPKPKciS5_ () + 975
 0000000000c74416 _ZN4node5StartEiPPc () + 156
 0000000000c3ff6c _start () + 6c
-----------------  lwp# 2 / thread# 2  --------------------
 ffffbf7fff21a617 __lwp_park () + 17
 ffffbf7fff213b6b cond_wait_queue (2ad8c18, 2ad8c00, 0) + 5b
 ffffbf7fff214011 cond_wait_common (2ad8c18, 2ad8c00, 0) + 1d1
 ffffbf7fff2141c2 __cond_wait (2ad8c18, 2ad8c00) + 52
 ffffbf7fff21425a cond_wait (2ad8c18, 2ad8c00) + 2a
 ffffbf7fff214295 pthread_cond_wait (2ad8c18, 2ad8c00) + 15
 00000000010d4d19 uv_cond_wait () + 9
 0000000000cc4ffb _ZN4nodeL16BackgroundRunnerEPv () + ab
 ffffbf7fff21a2ba _thrp_setup (ffffbf7ffe7e0240) + 8a
 ffffbf7fff21a5d0 _lwp_start ()
-----------------  lwp# 3 / thread# 3  --------------------
 ffffbf7fff21a617 __lwp_park () + 17
 ffffbf7fff213b6b cond_wait_queue (2ad8c18, 2ad8c00, 0) + 5b
 ffffbf7fff214011 cond_wait_common (2ad8c18, 2ad8c00, 0) + 1d1
 ffffbf7fff2141c2 __cond_wait (2ad8c18, 2ad8c00) + 52
 ffffbf7fff21425a cond_wait (2ad8c18, 2ad8c00) + 2a
 ffffbf7fff214295 pthread_cond_wait (2ad8c18, 2ad8c00) + 15
 00000000010d4d19 uv_cond_wait () + 9
 0000000000cc4ffb _ZN4nodeL16BackgroundRunnerEPv () + ab
 ffffbf7fff21a2ba _thrp_setup (ffffbf7ffe7e0a40) + 8a
 ffffbf7fff21a5d0 _lwp_start ()
-----------------  lwp# 4 / thread# 4  --------------------
 0000000001991949 _ZN2v84base2OS5AbortEv () + 9
 000000000119dcbc _ZN2v88internal2V823FatalProcessOutOfMemoryEPKcb () + 1fc
 000000000190cd57 _ZN2v88internal4Zone3NewEm () + a7
 00000000013e3536 _ZNSt6vectorIPKN2v88internal8compiler15LoadElimination13AbstractStateENS1_13ZoneAllocatorIS6_EEE14_M_fill_insertEN9__gnu_cxx17__normal_iteratorIPS6_S9_EEmRKS6_ () + 336
 00000000013dd9ce _ZN2v88internal8compiler15LoadElimination11UpdateStateEPNS1_4NodeEPKNS2_13AbstractStateE () + ce
 00000000013dffdf _ZN2v88internal8compiler15LoadElimination15ReduceLoadFieldEPNS1_4NodeE () + 2df
 00000000013e24c5 _ZN2v88internal8compiler15LoadElimination6ReduceEPNS1_4NodeE () + 2c5
 000000000136b18e _ZN2v88internal8compiler12GraphReducer6ReduceEPNS1_4NodeE () + 3e
 000000000136b57b _ZN2v88internal8compiler12GraphReducer9ReduceTopEv () + 13b
 000000000136ba61 _ZN2v88internal8compiler12GraphReducer10ReduceNodeEPNS1_4NodeE () + 81
 000000000142d3a2 _ZN2v88internal8compiler20LoadEliminationPhase3RunEPNS1_12PipelineDataEPNS0_4ZoneE () + 1e2
 00000000014289d1 _ZN2v88internal8compiler12PipelineImpl13OptimizeGraphEPNS1_7LinkageE () + 921
 0000000001428a50 _ZN2v88internal8compiler22PipelineCompilationJob14ExecuteJobImplEv () + 20
 00000000014bf3b4 _ZN2v88internal14CompilationJob10ExecuteJobEv () + 44
 00000000014be4ec _ZN2v88internal27OptimizingCompileDispatcher11CompileTask3RunEv () + 12c
 0000000000cc4fcd _ZN4nodeL16BackgroundRunnerEPv () + 7d
 ffffbf7fff21a2ba _thrp_setup (ffffbf7ffe7e1240) + 8a
 ffffbf7fff21a5d0 _lwp_start ()
-----------------  lwp# 5 / thread# 5  --------------------
 ffffbf7fff21a617 __lwp_park () + 17
 ffffbf7fff213b6b cond_wait_queue (2ad8c18, 2ad8c00, 0) + 5b
 ffffbf7fff214011 cond_wait_common (2ad8c18, 2ad8c00, 0) + 1d1
 ffffbf7fff2141c2 __cond_wait (2ad8c18, 2ad8c00) + 52
 ffffbf7fff21425a cond_wait (2ad8c18, 2ad8c00) + 2a
 ffffbf7fff214295 pthread_cond_wait (2ad8c18, 2ad8c00) + 15
 00000000010d4d19 uv_cond_wait () + 9
 0000000000cc4ffb _ZN4nodeL16BackgroundRunnerEPv () + ab
 ffffbf7fff21a2ba _thrp_setup (ffffbf7ffe7e1a40) + 8a
 ffffbf7fff21a5d0 _lwp_start ()
-----------------  lwp# 6 / thread# 6  --------------------
 ffffbf7fff21a617 __lwp_park () + 17
 ffffbf7fff20cb63 sema_wait (29f2300) + 13
 ffffbf7fff200a78 sem_wait (29f2300) + 38
 00000000010d4bb4 uv_sem_wait () + 24
 0000000000d0d81a _ZN4node9inspector12_GLOBAL__N_117StartIoThreadMainEPv () + 1a
 ffffbf7fff21a2ba _thrp_setup (ffffbf7ffe7e2240) + 8a
 ffffbf7fff21a5d0 _lwp_start ()
-----------------  lwp# 7 / thread# 7  --------------------
 ffffbf7fff21a617 __lwp_park () + 17
 ffffbf7fff213b6b cond_wait_queue (29f5a50, 29f5a30, 0) + 5b
 ffffbf7fff214011 cond_wait_common (29f5a50, 29f5a30, 0) + 1d1
 ffffbf7fff2141c2 __cond_wait (29f5a50, 29f5a30) + 52
 ffffbf7fff21425a cond_wait (29f5a50, 29f5a30) + 2a
 ffffbf7fff214295 pthread_cond_wait (29f5a50, 29f5a30) + 15
 00000000010d4d19 uv_cond_wait () + 9
 00000000010c4428 worker () + 48
 ffffbf7fff21a2ba _thrp_setup (ffffbf7ffe7e2a40) + 8a
 ffffbf7fff21a5d0 _lwp_start ()
-----------------  lwp# 8 / thread# 8  --------------------
 ffffbf7fff21a617 __lwp_park () + 17
 ffffbf7fff213b6b cond_wait_queue (29f5a50, 29f5a30, 0) + 5b
 ffffbf7fff214011 cond_wait_common (29f5a50, 29f5a30, 0) + 1d1
 ffffbf7fff2141c2 __cond_wait (29f5a50, 29f5a30) + 52
 ffffbf7fff21425a cond_wait (29f5a50, 29f5a30) + 2a
 ffffbf7fff214295 pthread_cond_wait (29f5a50, 29f5a30) + 15
 00000000010d4d19 uv_cond_wait () + 9
 00000000010c4428 worker () + 48
 ffffbf7fff21a2ba _thrp_setup (ffffbf7ffe7e3240) + 8a
 ffffbf7fff21a5d0 _lwp_start ()
-----------------  lwp# 9 / thread# 9  --------------------
 ffffbf7fff21a617 __lwp_park () + 17
 ffffbf7fff213b6b cond_wait_queue (29f5a50, 29f5a30, 0) + 5b
 ffffbf7fff214011 cond_wait_common (29f5a50, 29f5a30, 0) + 1d1
 ffffbf7fff2141c2 __cond_wait (29f5a50, 29f5a30) + 52
 ffffbf7fff21425a cond_wait (29f5a50, 29f5a30) + 2a
 ffffbf7fff214295 pthread_cond_wait (29f5a50, 29f5a30) + 15
 00000000010d4d19 uv_cond_wait () + 9
 00000000010c4428 worker () + 48
 ffffbf7fff21a2ba _thrp_setup (ffffbf7ffe7e3a40) + 8a
 ffffbf7fff21a5d0 _lwp_start ()
-----------------  lwp# 10 / thread# 10  --------------------
 ffffbf7fff21a617 __lwp_park () + 17
 ffffbf7fff213b6b cond_wait_queue (29f5a50, 29f5a30, 0) + 5b
 ffffbf7fff214011 cond_wait_common (29f5a50, 29f5a30, 0) + 1d1
 ffffbf7fff2141c2 __cond_wait (29f5a50, 29f5a30) + 52
 ffffbf7fff21425a cond_wait (29f5a50, 29f5a30) + 2a
 ffffbf7fff214295 pthread_cond_wait (29f5a50, 29f5a30) + 15
 00000000010d4d19 uv_cond_wait () + 9
 00000000010c4428 worker () + 48
 ffffbf7fff21a2ba _thrp_setup (ffffbf7ffe7e4240) + 8a
 ffffbf7fff21a5d0 _lwp_start ()
-----------------  lwp# 11 / thread# 11  --------------------
 ffffbf7fff21a617 __lwp_park () + 17
 ffffbf7fff213b6b cond_wait_queue (ffffbf7ffec84840, ffffbf7ffec84860, ffffbf7ffd5b1e70) + 5b
 ffffbf7fff214011 cond_wait_common (ffffbf7ffec84840, ffffbf7ffec84860, ffffbf7ffd5b1e70) + 1d1
 ffffbf7fff21431d __cond_timedwait (ffffbf7ffec84840, ffffbf7ffec84860, ffffbf7ffd5b1f40) + 6d
 ffffbf7fff2143e6 cond_timedwait (ffffbf7ffec84840, ffffbf7ffec84860, ffffbf7ffd5b1f40) + 36
 ffffbf7ffec287cc umem_update_thread (0) + 1cc
 ffffbf7fff21a2ba _thrp_setup (ffffbf7ffe7e4a40) + 8a
 ffffbf7fff21a5d0 _lwp_start ()
You have new mail in /var/mail/root
[root@headnode (coal) ~]#

It seems that the optimizing compiler (the background compilation task on lwp# 4) tried to allocate memory and failed, which is what triggered V8's FatalProcessOutOfMemory abort. At the same time, the main thread that runs the libuv event loop was also allocating memory.

Looking at the cumulative size of the anonymous segments (the JS heap), we find:

[root@headnode (coal) ~]# pmap /var/tmp/core.node.69705 | grep anon | tr -s ' ' | cut -d ' ' -f 2 | tr -d 'K' | awk '{s+=$1} END {print s}'
51552
[root@headnode (coal) ~]#

So the process was using ~51 MB of memory for the JS heap, and about the same for the native heap:

[root@headnode (coal) ~]# pmap /var/tmp/core.node.69705 | grep heap | tr -s ' ' | cut -d ' ' -f 2 | tr -d 'K' | awk '{s+=$1} END {print s}'
50184
[root@headnode (coal) ~]#

That seems consistent with the small size (144 MB) of the core file itself.

This looks a lot like the processes running at the same time as this test were using all of the memory available to the VM.
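If it reproduces, a hedged sketch of how the zone's memory pressure could be confirmed (assuming the standard illumos tooling is available in the zone):

# per-zone memory usage summary, refreshed every second
prstat -Z 1

# the physical memory cap of the current process's zone, if one is set
prctl -n zone.max-physical-memory $$

# RSS vs. cap counters, if the memory_cap kstats are exposed in the zone
kstat -m memory_cap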

@maclover7 added the flaky-test label on Dec 25, 2017
@cjihrig
Contributor

cjihrig commented Feb 1, 2018

I'm going to close this as it hasn't shown up recently, and it seems like a lack of available memory was the underlying cause.

@cjihrig closed this as completed on Feb 1, 2018