Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ECONNRESET problem with http.Server and https.Server (node 6.3, 6.2.2 and poss earlier) #7776

Closed
fuzing opened this issue Jul 17, 2016 · 15 comments
Labels
http Issues or PRs related to the http subsystem. question Issues that look for answers.

Comments

@fuzing
Copy link

fuzing commented Jul 17, 2016

Node: 6.3 (and I believe all 6.x)
AWS instance (zone us-west-2) - t2.small, m4.large - Centos 7

receiving the following error:
Error: read ECONNRESET
at exports._errnoException (util.js:1008:11)
at TCP.onread (net.js:563:26)

Similar results with different instance variants, with different loads (from extremely light to extremely heavy).

I believe these may be idle sockets, and therefore the error should be handled by node.

The errors come through on both http and https (tls). They can be trapped in http using .on('clientError'), or on https/tls using 'clientError' and 'tlsClientError' (comes through on both variants).

It's unclear how a user application can safely recover (so am process.exit()'ing for now - about twice a minute :(.

For context: here is the code from the node net.js module: (see my annotation re: line 563 of the code below)....... I haven't traced this back through libuv.

Thanks!

Peter B.

// if we didn't get any bytes, that doesn't necessarily mean EOF.
// wait for the next one.
if (nread === 0) {
debug('not any data, keep waiting');
return;
}

// Error, possibly EOF.
if (nread !== uv.UV_EOF) {
--------> next is line 563 which results in throwing exception ECONNRESET (peter b)
return self._destroy(errnoException(nread, 'read'));
}

debug('EOF');

if (self._readableState.length === 0) {
self.readable = false;
maybeDestroy(self);
}

// push a null to signal the end of data.
self.push(null);

// internal end event so that we know that the actual socket
// is no longer readable, and we can start the shutdown
// procedure. No need to wait for all the data to be consumed.
self.emit('_socketEnd');
}

@addaleax addaleax added question Issues that look for answers. http Issues or PRs related to the http subsystem. labels Jul 17, 2016
@fuzing
Copy link
Author

fuzing commented Jul 20, 2016

I'm not sure about the classification of this as a 'question'.

There is definitely a problem with node's http and https server handling of ECONNRESET's.

With node 6.2.2 and 6.3 (maybe earlier) - On linux, unless I trap ECONNRESET's in my client app using on(clientError) and on(tlsClientError), and then process.exit(), the app will eventually stop accepting requests (i.e. it is hung up as a server) - (haven't troubleshot to figure out what's going on - don't know if the listen socket is failing to accept(), or what else could be going on - unclear whether this is a node or libuv issue).

Also, in attempting to display the remote ip address and port of the connected socket that receives the ECONNRESET, sometimes a valid ip/port is available - while other times 'remoteAddress' resolves to 'undefined'.

On AWS I appear to be getting several early ECONNRESET's, and these look to be the culprits in terms of their lack of ip/port info (is there a time race condition with respect to annotating the js socket?).

There seems to be a lot of debate about how to handle ECONNRESET's - and lots of finger pointing. The fact is that while it is true that ECONNRESET's were often triggered in the past by badly behaved software/hardware (i.e. interrupted or no FIN_WAIT_X, TIME_WAIT sequence)...... today it should be seen as "normal" (think about mobile devices abruptly powered off).

Thank you.

@addaleax
Copy link
Member

I'm not sure about the classification of this as a 'question'.

The reason I tagged this as a question is that I couldn’t make out any specific behaviour here that you would like to see changed on Node’s side, so by exclusion I understood this as a question about how to handle ECONNRESET errors; feel free to correct me if I’m wrong.

@bnoordhuis
Copy link
Member

I believe these may be idle sockets

It's not entirely clear from your description whether you are referring to agent-created sockets with http.request() or remote-initiated sockets delivered to your http.createServer() callback.

is there a time race condition with respect to annotating the js socket?

The address and port are queried from the operating system, and cached thereafter. If you call socket.remoteAddress before you get a ECONNRESET, it will be available after the connection error.

@fuzing
Copy link
Author

fuzing commented Jul 21, 2016

Anna - sorry for the confusion - and thanks for forwarding.

Ben - to clarify:

I'm referring to servers (http/https). By 'idle' I was talking about sockets that are open but idle for some time, as opposed to idle sockets in an agent pool (sorry for the confusion).

My app is a web server, using http.server (and the https variant) - so these are incoming connections. The reason I spoke of time-race is due to the fact that I'm querying the remoteAddress of the socket 'after' the ECONNRESET exception is thrown, but very often this comes back 'undefined' (about 2/3 of the time).

Here's the exception handler code:
httpServer.on('clientError', function(e, socket) {
console.log(${new Date()} httpServer:clientError(${process.pid}) from ${socket.remoteAddress}, e, socket);
process.exit(1); // we will bail out of here!
});

ditto for the handler I'm using for the https 'clientError' and 'tlsClientError' (the exception is sometimes thrown on 'clientError', sometimes on 'tlsClientError'.

(I've attached a trace with the output of one such instance).

This exception appears to be thrown in net.js at line 492 inside the _destroy() method, whereas anything that includes a valid IP address seems to be thrown at net.js line 563. Maybe the lack of IP address makes sense in the former case, given that the socket/connection is being destroyed??

As I said in my last note - if I don't 'handle' the clientError/tlsClientError and do the process.exit(), then the node server eventually stops accepting connections and we're dead in the water.

For further information, I'll selectively do the process.exit() on each variant (i.e. IP address and no IP address), and I'll see if I can narrow down if it's one or both variants causing the node server/s to ultimately cease accepting connections.

Stand by for further.....

debug.txt

@fuzing
Copy link
Author

fuzing commented Jul 21, 2016

It appears that there are two paths through the net.js code that originate the ECONNRESET error.

line 492, which indicates a "socket hangup" error (with a code of ECONNRESET).

and line 563, which indicates an 'ECONNRESET' error.

traversing the path where the exception originates from line 492 seems to work fine.

traversing the path where the exception originates from line 563 hangs the server (i.e. the path within the node socket onread() handler results in the node http/https.server not accepting any more connections).

Interestingly, both the line 492 traversal and the line 563 traversal often have 'undefined' remoteAddress on the socket that is passed up to my clientError/tlsClientError handler.

I got confused by references to 'self' in the onread() handler which seem to refer to the 'handle' (not the 'this' of the current object)...... the _destroy() call on line 563 would thus be being called on the 'handle' (Maybe this is correct, I haven't looked further and don't want to create any confusion if this is not an issue).

This is replicating frequently (every few minutes) on my server, so should be pretty easy to figure out what's going on - please let me know what additional information you'd like me to collect.

In the meantime, my workaround is to http/https.setTimeout(15000) - to make sure that node is initiating most socket close()'s instead of waiting for far end to disconnect. Then I'm still process.exit()'ing the remaining ECONNRESET's (my cluster code takes care of respawning a new server to take over).

(Despite using clustering, my testing to collect the above data was done with a single process).

Many thanks.

@fuzing fuzing closed this as completed Jul 21, 2016
@fuzing
Copy link
Author

fuzing commented Jul 21, 2016

Hi Anna:

Re: node issue #7776

I cannot change the labels - I’m guessing I do not have permission to do
that.

I annotated the issue with more information, so hopefully this will be
forwarded to ben,

Thank you,

Peter B.

On Wed, Jul 20, 2016 at 5:48 PM, Anna Henningsen notifications@github.com
wrote:

I'm not sure about the classification of this as a 'question'.

The reason I tagged this as a question is that I couldn’t make out any
specific behaviour here that you would like to see changed on Node’s side,
so by exclusion I understood this as a question about how to handle
ECONNRESET errors; feel free to correct me if I’m wrong.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#7776 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/ALHBFYPC200YftBHwf-Svmki-FF1wyqeks5qXrPZgaJpZM4JOUzR
.

@fuzing fuzing reopened this Jul 21, 2016
@addaleax
Copy link
Member

@fuzing I think you didn’t mean to send that last message to this thread (replying to github mails does that)?

Anyway, it’s still not quite clear to me what you are actually trying to say. I know it may be hard to come up with that, but it sounds to me like a full reproduction of the behaviour you are seeing and a description of the behaviour you want/expect to see would be very helpful.

@fuzing
Copy link
Author

fuzing commented Jul 21, 2016

Hi Anna:

I answered Ben's questions with two pretty complete descriptions of the problem - see above.

If you need an expectations statement hopefully this sums things up:

When one instantiates a node http or https server, one would hope that receiving an ECONNRESET would not hang the server. As I tried to explain above, this is not the case. The server hangs when one of the code traversals out of the net.js code is triggered.

I would call this a bug.

I'm happy to provide whatever additional information you need in order to resolve the issue.

@fuzing
Copy link
Author

fuzing commented Jul 22, 2016

Further update on ECONNRESET bug.

While both http.Server and https.Server instances are generating the 'clientError' exception, there is no problem when 'clientError' is thrown in either of these instances.

The problem appears to be with the https.Server variant when a 'tlsClientError' is thrown.

I investigated the node _tls_wrap.js module where the 'tlsClientError' event originates from, and it's clear that 'tlsClientError' can be thrown under two circumstances:
a) on a close of the socket (see _tls_wrap.js line 812), and
b) on receipt of a '_tlsError' event (see _tls_wrap.js line 826).

the https.Server will hang EVERY TIME that the 'tlsClientError' is thrown from the on.close() function starting at _tls_wrap.js line 812 (the actual tlsClientError is emitted on line 822).

I will continue to investigate.

@bnoordhuis
Copy link
Member

I'm referring to servers (http/https). By 'idle' I was talking about sockets that are open but idle for some time, as opposed to idle sockets in an agent pool

Just so we're on the same page, the phases for incoming HTTP connections w.r.t. error handling are as follows:

  1. Before the request headers have been fully received, node takes care of parse errors and connection errors. When they happen, they are emitted as 'clientError' events on the server object.
  2. Afterwards, i.e., on the 'connection' event, you are responsible for dealing with errors.
  3. Between requests (when HTTP keep-alive is enabled), responsibility moves back to node. This is what I call 'idle' but I infer that you were referring to phase 2.

Now the catch: when you install a 'clientError' event listener, you become responsible for closing the socket; it is passed as the second argument to the event listener.

From your description, it sounds like you aren't closing it, making the server run out of file descriptors. The listener should look something like this:

server.on('clientError', function(err, socket) {
  // ...
  socket.destroy();
});

Aside: HTTPS servers simply re-emit 'tlsClientError' events as 'clientError' events, you don't have to install a listener for the former.

@fuzing
Copy link
Author

fuzing commented Jul 22, 2016

Thanks for your response.

Makes sense, however as mentioned above, when a tlsClientError is emitted from the on.close() method (from _tls_wrap.js line 822) the https.Server stops accepting additional connections 'immediately' (almost as if the listen socket were closed/destroyed) - i.e. this does not appear to be descriptor starvation. I would expect descriptor starvation to occur some time later, after a number of such instances.

In earlier versions of my handlers I was doing the socket.destroy() [it didn't help], but for posterity I will try to prove up what I'm seeing and will update further.

@fuzing
Copy link
Author

fuzing commented Jul 22, 2016

I take back what I said in my preceeding post about the https.Server stopping accepts of new connections immediately (I am now unsure). However, descriptor starvation is not an issue. Examining /proc//fd reveals only 60 - 100 file descriptors in use when the Server discontinues accepting connections. I'll continue investigating.

@fuzing fuzing closed this as completed Jul 25, 2016
@fuzing
Copy link
Author

fuzing commented Jul 26, 2016

SOLVED:
Issue turns out to be a problem with the latest native mongodb driver (v2.2.4) - sometimes never returning when the server is under load. Once in this mode, no further calls to the driver make it back from C++ land to node, and therefore clients (bored with no response) start hanging up en-masse. The solution for now is to regress to v2.1.21 of the mongodb driver (and report the bug to the folks over there) - totally cleans up the problem.

Thank you for all your help, and sorry for wasting your time!

@bnoordhuis
Copy link
Member

No problem. Thanks for following up.

@andiris29
Copy link

@fuzing You comments really helped me, great. Could you share the link of the issue in mongodb driver?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
http Issues or PRs related to the http subsystem. question Issues that look for answers.
Projects
None yet
Development

No branches or pull requests

4 participants