Alpine 3.8 cluster failures #22308
Quick thing to try: Does moving the tests to sequential cause them to pass in CI on this platform? (I would create a branch and try myself but I've got non-computer things to focus on for the next few hours.)
(Also: Our code now prints stdout and stderr on timeouts, so someone can try adding a bunch of logging to the test.)
Testing a move to sequential: https://ci.nodejs.org/job/node-test-commit-linux/20716/nodes=alpine-latest-x64/console
Running with some logging: https://ci.nodejs.org/job/node-test-commit-linux/20725/nodes=alpine-latest-x64/console
The logging shows that the test succeeds except for that final check.
@nodejs/docker
Also, @Trott (or anybody else with the necessary access), could you take a look? Since Alpine runs with musl rather than glibc, there may be a libc difference at play here.
Reporting from within the danger zone:
More like the twilight zone.
Is the failure specific to running the test in a docker container?
We test Alpine only running as a Docker container, so I guess we don't know...
I would love to help but it looks like a pretty steep learning curve figuring out all the build stuff.
Btw, where is the test suite to test?
The test suite is driven by `make test`, which in turn runs `tools/test.py` (it assumes the node binary to test is at `out/Release/node`).
Is there a way to specify the binary path to test?
AFAIR, no. When needed I copy/symlink a binary to that location.
New observation: after running a CI job there seem to be multiple leftover `node` processes. That might make the polling fail...
OK, I created the following Dockerfile to run the test suite:

```dockerfile
FROM node:10-alpine

RUN apk add --no-cache --update \
    curl \
    python \
    && curl -L --compressed https://api.github.com/repos/nodejs/node/tarball -o node.tar.gz

RUN mkdir -p /node/out/Release \
    && tar -xf node.tar.gz --strip 1 -C /node \
    && ln -s /usr/local/bin/node /node/out/Release/node

RUN cd /node \
    && python tools/test.py -j 4 -p tap --logfile test.tap \
       --mode=release --flaky-tests=run \
       default addons addons-napi doctool
```
FYI this runs on top of Ubuntu 16.04 in our infra. These containers are fresh as I've just reprovisioned all of our Docker infra over the last couple of days, so this isn't about the process table filling up. I can't reproduce locally on 18.04 using the same container config. One other difference is that we run from within Jenkins, so there's an additional layer to the process tree, although I'm not sure why that would matter.
OK, can repro locally; it's because of the process hierarchy inside the container. You need to remove the intermediate layers from the process tree: inside the container, run `./configure`, then `make`, then run the two tests directly, which yields this output after waiting for timeouts:
This reinforces the need to fix this as launching your application with minimal layers inside your minimal container is a thing that folks do with Docker / Alpine. (Aside from the fact that you shouldn't be using cluster).
Running the tests directly works too btw, you just need to kill it manually.
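Not part of the thread, just an illustration for anyone reproducing outside the test runner: a rough sketch (hypothetical file name, loosely modeled on what the thread describes rather than on the actual test source) that forks a cluster worker, kills it, and then polls its PID the way the `isAlive()` helper quoted below does. On a healthy setup the loop ends almost immediately; if the dead worker lingers in the process table, it spins until you kill it.

```js
// cluster-poll-sketch.js (hypothetical): fork a worker, kill it, then poll its
// PID with a signal probe until the OS reports it gone. If the dead worker is
// never reaped, the probe keeps succeeding and this never terminates.
const cluster = require('cluster');

if (cluster.isMaster) {
  const worker = cluster.fork();
  worker.on('online', () => {
    const pid = worker.process.pid;
    worker.process.kill('SIGKILL');
    const timer = setInterval(() => {
      try {
        process.kill(pid, 'SIGCONT'); // succeeds while the PID is still in the process table
        console.log(`worker ${pid} still present...`);
      } catch {
        console.log(`worker ${pid} gone`);
        clearInterval(timer);
      }
    }, 100);
  });
} else {
  // Worker: stay alive until killed.
  setInterval(() => {}, 1000);
}
```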
(Unassigning refack. Don't want to discourage others from jumping in on this.)
Both tests use an `isAlive()` helper to check whether the worker process is still around. Here's the source for `isAlive()`:

```js
function isAlive(pid) {
  try {
    process.kill(pid, 'SIGCONT');
    return true;
  } catch {
    return false;
  }
}
```

Any chance `'SIGCONT'` behaves unexpectedly on Alpine 3.8?
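For context, and not from the thread itself: `process.kill(pid, 0)` performs the same existence check without actually delivering a signal, which is what the commit referenced below switches to. A minimal sketch (the function name is mine):

```js
// Hypothetical variant of the helper above using signal 0: it only tests for
// the existence of the PID and does not deliver a signal.
function isAliveSignalZero(pid) {
  try {
    process.kill(pid, 0); // throws ESRCH if no such process, EPERM if we may not signal it
    return true;
  } catch (err) {
    return err.code === 'EPERM'; // the process exists even though we can't signal it
  }
}
```

Note that either probe still reports a zombie (defunct) process as alive, because its PID stays in the process table until the parent waits on it; that caveat turns out to matter later in the thread.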
EDIT: Changing the check to use signal `0` rather than `'SIGCONT'`; see the commit referenced below.
Use signal value `0` rather than `'SIGCONT'` as the latter appears to behave unexpectedly on Alpine 3.8. Fixes: nodejs#22308
Probably too optimistic to think that #24756 will fix it without introducing other issues, but let's see... EDIT: Indeed, too optimistic... didn't work...
Whee! I have Docker installed and set up and I can replicate this. It's like I'm living in the FUTURE or something. Or at least the RECENT PAST. Can't do it right now, but will investigate more in a little bit if no one beats me to it (which: please do!).
As far as I can tell, the worker really does exit, but in the master process it lingers as defunct:

```
iojs 19 3.7 0.0 0 0 pts/0 Z+ 23:00 0:00 [node] <defunct>
```

This is, I guess, basically what refack noted on August 15.

```
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
4 S 1000 1 0 4 80 0 - 64478 - pts/0 00:00:00 node
0 Z 1000 19 1 1 80 0 - 0 - pts/0 00:00:00 node <defunct>
```
I think that's why `isAlive()` keeps returning `true`: signalling a defunct (zombie) process still succeeds, because its PID is still in the process table.
I'm not sure if this is a bug in Node.js or an artifact of the way the test is run as rvagg described above. Is there some system call Node.js should be making to let the OS know to remove the pid from the process table? (If so, why does it only matter in this one edge case?) Or is this just what happens when you kinda sorta bypass some normal operating system stuff? @nodejs/cluster @nodejs/child_process
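Some background, not from the thread: on POSIX systems the process-table entry goes away when the parent calls `wait()`/`waitpid()` on the exited child, and libuv does that for children Node itself spawns. A small sketch (hypothetical file name) that can be run both inside and outside the container to compare behaviour:

```js
// reap-check.js (hypothetical): by the time the 'exit' event fires, libuv has
// normally already reaped the child via waitpid(), so a 0-signal probe should
// fail with ESRCH. If it keeps succeeding instead, the PID is lingering in the
// process table (e.g. as a zombie), which is what the ps output above shows.
const { spawn } = require('child_process');

const child = spawn(process.execPath, ['-e', 'process.exit(0)']);

child.on('exit', (code) => {
  try {
    process.kill(child.pid, 0);
    console.log(`pid ${child.pid} still present after exit (code ${code})`);
  } catch (err) {
    console.log(`pid ${child.pid} gone after exit: ${err.code}`); // expect ESRCH
  }
});
```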
I guess probably not a terrible time to re-ping @nodejs/docker too...
I bet some clever libuv folks might have a clue here too since we're getting pretty low-level here. @cjihrig @bnoordhuis
@nodejs/libuv
So I think this is all about the fact that the killed child processes aren't reaped properly because we don't have init or a similar reaping parent process in the chain. I believe I can fix this by putting an init-style reaper in front of the command inside the container.
Here's the bit I don't quite know the answer to: is this bypassing something that should be Node's responsibility? Why are we not experiencing this on any of our other Docker containers where we do the same thing but execute Jenkins in the same way? I can't find anything special about Alpine 3.9 that would lead to different behaviour. I don't want to be putting a bandaid on something that's a genuine problem on our end.
Actually, I solved a similar problem of non-reaping on the ARM machines running Docker by using an init process inside the containers.
I wouldn't mind thoughts from folks more expert in system-level concerns on why this might be a unique problem on a specific distro+version, and whether it suggests problems on our end. Otherwise, we can probably close this and wait to see if we get issues reported about it.
ARM Docker use already has this; this expands it to the rest of our Docker usage. It helps with reaping defunct processes when running bare commands (i.e. Jenkins). Ref: nodejs/node#22308
Docker containers not having a functional init is a common source of problems, at least when the container runs code that creates sub-sub-processes that don't get waited on by the sub-process. I recall reviewing C code for a mini-reaper of @rmg's that was used as a runner. Hypothetically, if the order of process termination varies, and the sub-process is allowed to run just marginally past the sub-sub-process's death, it will get the chance to wait on its child processes and they won't become orphaned. If the sub-process terminates before the sub-sub-process exit statuses are available, then they become orphaned and re-parented to init; maybe process run/scheduling differences are what we are seeing.
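To make that scenario concrete (my own sketch, hypothetical file name, not from the thread): an intermediate process spawns a grandchild and exits without waiting for it, so the grandchild gets re-parented; whether it is ever reaped then depends on what inherited it. A real init reaps it; a bare PID 1 inside a container that never calls `wait()` may not:

```js
// orphan-demo.js (hypothetical): the intermediate process exits before its own
// child does, so the grandchild is re-parented. On a normal host, init/systemd
// reaps it once it exits; in a container whose PID 1 never waits on children,
// it can linger as a zombie and signal probes against its PID keep succeeding.
const { spawn } = require('child_process');

// Intermediate child: spawns a detached grandchild that sleeps ~2s, prints the
// grandchild's PID, and exits without waiting for it.
const intermediate = spawn(process.execPath, ['-e', `
  const { spawn } = require('child_process');
  const gc = spawn(process.execPath, ['-e', 'setTimeout(() => {}, 2000)'],
                   { detached: true, stdio: 'ignore' });
  gc.unref();
  console.log(gc.pid);
`]);

let out = '';
intermediate.stdout.on('data', (d) => (out += d));
intermediate.on('close', () => {
  const gcPid = Number(out.trim());
  console.log(`grandchild ${gcPid} is now orphaned`);
  // Give the grandchild time to exit, then probe whether its PID is still there.
  setTimeout(() => {
    try {
      process.kill(gcPid, 0);
      console.log('grandchild PID still in the process table (possibly a zombie)');
    } catch {
      console.log('grandchild fully reaped by whoever inherited it');
    }
  }, 3000);
});
```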
I can buy that as an explanation, I suppose. It's just strange that we're only seeing this on one of the dockerised platforms. Granted, Ubuntu 16.04 is used for the majority, but we've also had Alpine in there for a while now without seeing this. Maybe it's a minor musl change that's impacted timing in some subtle but reliable way.
I just added Alpine 3.8 to CI and removed 3.6 in the process, shifting `alpine-last-latest-x64` to Alpine 3.7 and giving `alpine-latest-x64` to Alpine 3.8. It tested well on my local machine across our major branches, but now that it's enabled in CI we're getting consistent failures in the cluster tests on all test runs.

What should we do with this? Do I remove it from CI for now or can someone spend some time investigating?

The Dockerfile is here minus the template strings, which you can easily stub out to set this up locally if you want to give that a go. I'm not sure what the difference is with my local machine, but perhaps I wasn't testing it on the latest master and there's something new, or perhaps there's a Docker difference.