Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: fix fork-dgram on AIX #8549

Closed
wants to merge 1 commit into from
Closed

Conversation

gibfahn
Copy link
Member

@gibfahn gibfahn commented Sep 15, 2016

Checklist
  • make -j4 test (UNIX), or vcbuild test nosign (Windows) passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines
Affected core subsystem(s)

test

Description of change

Closes the server as soon as the child receives a message

I'm not sure that this is the correct fix, but I've tried this with some debugging information (see this gist) and both parent and child do seem to be receiving a message on AIX.

Fixes: #8271

Refs:
Previous PR (to fix on SmartOS): nodejs/node-v0.x-archive#8045
Previous Issue (ditto): nodejs/node-v0.x-archive#8046
Original Commit: bdf7ac2
Original PR: nodejs/node-v0.x-archive#4864

@nodejs-github-bot nodejs-github-bot added the test Issues and PRs related to the tests. label Sep 15, 2016
@gibfahn
Copy link
Member Author

gibfahn commented Sep 15, 2016

Failure on AIX (before)

 /home/gib/node $ tools/test.py --repeat=10 --timeout=5 parallel/test-dgram-debug
=== release test-dgram-debug ===                    
Path: parallel/test-dgram-debug
||| Child server
Parent Listening
child msg 1 = server
parentGotMessage: 0 childGotMessage: 0
=== Sending msg 1
### Parent server:
{ address: '127.0.0.1', family: 'IPv4', port: 42024 }
Child msg   2 = 1
parent: gotMessage from child
parentGotMessage: 0 childGotMessage: 1
=== Sending msg 2
Child msg   3 = 2
parentGotMessage: 0 childGotMessage: 1
=== Sending msg 3
Child msg   4 = 3
parentGotMessage: 0 childGotMessage: 1
=== Sending msg 4
Child msg   5 = 4
parentGotMessage: 0 childGotMessage: 1
=== Sending msg 5
Child msg   6 = 5
parentGotMessage: 0 childGotMessage: 1
=== Sending msg 6
Child msg   7 = 6
parentGotMessage: 0 childGotMessage: 1
=== Sending msg 7
Child msg   8 = 7
parentGotMessage: 0 childGotMessage: 1
=== Sending msg 8
Child msg   9 = 8
parentGotMessage: 0 childGotMessage: 1
=== Sending msg 9
Child msg   10 = 9
parentGotMessage: 0 childGotMessage: 1
=== Sending msg 10
Child msg   1000 = 999
parentGotMessage: 0 childGotMessage: 1
=== Sending msg 1000
Child msg   2000 = 1999
parentGotMessage: 0 childGotMessage: 1
=== Sending msg 2000
Child msg   3000 = 2999
parentGotMessage: 0 childGotMessage: 1
=== Sending msg 3000
Child msg   4000 = 3999
parentGotMessage: 0 childGotMessage: 1
=== Sending msg 4000
Command: out/Release/node /home/gib/node/test/parallel/test-dgram-debug.js
--- TIMEOUT ---

Passing on AIX (after)

=== release test-dgram-debug [negative] ===         
Path: parallel/test-dgram-debug
||| Child server
Parent Listening
parentGotMessage: 0 childGotMessage: 0
=== Sending msg 1
### Parent server:
child msg 1 = server
{ address: '127.0.0.1', family: 'IPv4', port: 42142 }
Child msg   2 = 1
parent: gotMessage from child
parentGotMessage: 0 childGotMessage: 1
=== Sending msg 2
parent msg  1 = 2
parentGotMessage: 1 childGotMessage: 1
=== Sending msg 3
Message received by child && parent
Shutting down
child msg 3 = stop
Child: stop message received
Command: out/Release/node /home/gib/node/test/parallel/test-dgram-debug.js

@mscdex mscdex added the dgram Issues and PRs related to the dgram subsystem / UDP. label Sep 15, 2016
@santigimeno
Copy link
Member

Not too sure. If the intention of the test was that messages reach both the parent and the child while both are listening, then probably the fix is not correct. If not, probably it is.

@santigimeno santigimeno added the child_process Issues and PRs related to the child_process subsystem. label Sep 15, 2016
@gibfahn
Copy link
Member Author

gibfahn commented Sep 15, 2016

@santigimeno Right, I agree. The strange thing is that I haven't managed to reproduce the issue I'm seeing on any other machines (except the community AIX ones). The test runs fine on all our AIX boxes.

Looking at nodejs/node-v0.x-archive#8045 (comment), it seems that this change shouldn't make any difference.

cc. @AndreasMadsen @misterdjules

@AndreasMadsen
Copy link
Member

Not too sure. If the intention of the test was that messages reach both the parent and the child while both are listening, then probably the fix is not correct. If not, probably it is.

Agreed.

@misterdjules
Copy link

Testing that UDP messages are distributed among all processes from a given set of processes whose members do not change over time and that share a UDP socket seems inherently brittle. We might be able to decrease the likeliness of such failures by tweaking various settings (number of messages, number of child processes, etc.) but it seems the results will always depend on the underlying platform.

Changing the state (preventing them from being able to receive messages) of the processes as soon as they receive one message like this PR does changes the nature of this test, so it would not be acceptable if we wanted the test to keep its original purpose.

I see two possible ways forward:

  1. Mark this test as flaky on AIX, and watch for flaky tests that fail all the time to identify actual problems.
  2. Change the nature of this test to just test that we can share a UDP socket among two related processes, and not test that the load is distributed. We could thus just check that UDP messages sent to a single endpoint are eventually received by both processes and make it reliable by making the first process that received a message stop receiving messages.

I think I have a preference for 2), and it seems in line with the current caveats of using the cluster module without using the round robin scheduler, where one cannot rely on the load being distributed evenly (or in any way) among worker processes.

@gibfahn
Copy link
Member Author

gibfahn commented Sep 16, 2016

What I'm not sure about is whether we can expect the load to be at all distributed between the two processes.

I'd suggest that there are two possibilities (mirroring those in #8549 (comment)):

  1. It's possible that there is a subtle problem on AIX (possibly due to a race condition) which means that the parent isn't properly connecting sometimes, which is why all the messages (~60,000) go to the child not the parent. If this is true then the test is useful as is (and the AIX failures are a symptom of a real condition).
  2. It's also possible that we can't expect the load to be distributed, so it's fine for one process to receive all the messages. In this case the test needs to be changed to reflect the underlying behaviour (as suggested by @misterdjules above).

@AndreasMadsen I wasn't sure from #8549 (comment) whether we can expect messages to be distributed to both child and parent. Could you clarify? I noticed that you said in nodejs/node-v0.x-archive#8045 (comment) that:

The purpose of this test is to ensure at that there is some level of distributed behavior.

I also had a look at what server.close() does, and (assuming I'm looking in the right place), server.close() will close the socket directly. If this is true then I'm not sure how the modified version of this test in the PR actually works (the parent hasn't received any messages, the child closes the server, then the parent receives a message).

I also remember that calling server.close() will close the entire server across all workers.

@AndreasMadsen
Copy link
Member

whether we can expect messages to be distributed to both child and parent

We can definitely expect that, that is what is being tested. But I guess it could be platform dependent.

If this is true then I'm not sure how the modified version of this test in the PR actually works (the parent hasn't received any messages, the child closes the server, then the parent receives a message).

I wouldn't trust the documentation for how cluster UDP acts, since it's a rarely used feature.


To be honest I don't remember much, I worked with this stuff 4 years ago, but haven't since. Sorry :/

@mhdawson
Copy link
Member

Somehow I missed this PR so I created #8271. Looking at the discussion here I think it does option 2) as suggested by @misterdjules which would also be my preference.

It also fixes the issue of the test leaving processes hanging around.

@mhdawson
Copy link
Member

@gibfahn I don't think "which means that the parent isn't properly connecting sometimes, " as which my version we do see both getting at least one message.

@mhdawson
Copy link
Member

In terms of: "server.close() does, and (assuming I'm looking in the right place), server.close() will close the socket directly."

I think that after a fork each process has its own handle to the share socket and can close their handle independently which would explain the behavior seen.

@gibfahn
Copy link
Member Author

gibfahn commented Sep 21, 2016

Closing in favour of #8697 which does the same thing but closes the parent on received message as well (and stops processes being left around if the test fails).

@gibfahn gibfahn closed this Sep 21, 2016
@gibfahn gibfahn deleted the test-dgram-pr branch December 16, 2016 11:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
child_process Issues and PRs related to the child_process subsystem. dgram Issues and PRs related to the dgram subsystem / UDP. test Issues and PRs related to the tests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

test-child-process-fork-dgram Intermittent timeout on AIX
7 participants