
Some proposed fixes for "Poor connection never attaches and reports connection to be connected #195" #218

Merged
6 commits merged into master on Feb 9, 2016

Conversation

SimonWoolf
Member

Addresses #195.

The questions from that issue, with discussion & follow-up questions:

(1) we should change the Error to an ErrorInfo and verify we get a meaningful error message.

Already addressed by #197

(2) does it correctly handle the timeout of the attach()? how come the channel remains in the attaching state?

Because the previous strategy on timeout was just to retry. 96e65de changes that to fail channel ops (leaving the channel in the state it was in beforehand) if the request fails.

I'm not certain this is the correct thing to do (in particular, since realtimeRequestTimeout is 10s and httpRequestTimeout is 15s, this leaves a period of 5s after which the channel considers the request failed, but during which it could still succeed, which is kinda weird). Opinions?
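For illustration, a rough sketch of the new fail-on-timeout behaviour (assuming an EventEmitter-style channel; names other than realtimeRequestTimeout are hypothetical, not the actual ably-js internals):

```js
// Sketch only: fail the pending attach if no response arrives within
// realtimeRequestTimeout, reverting the channel to its previous state.
function attachWithTimeout(channel, realtimeRequestTimeout) {
  var previousState = channel.state;
  channel.state = 'attaching';

  var timer = setTimeout(function() {
    // No ATTACHED (or error) response received in time: fail the op
    // and revert, rather than retrying indefinitely as before.
    channel.state = previousState;
    channel.emit('error', new Error('Attach request timed out'));
  }, realtimeRequestTimeout);

  channel.once('attached', function() {
    clearTimeout(timer);
  });
}
```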

(3) what should be done about this ws transport that's orphaned, associated with an upgrade that's gone bad. Can we just arrange for this to become the active transport? Is the behaviour here actually ok, and the problem is that we keep going with the second xhr connection even though we're now connected?

96f83a6 changes the deactivateTransport logic such that if there is a pending transport which is connected and scheduled for activation -- just waiting for the active transport to finish what it's doing -- then, if the current active transport fails, it just lets the pending one take over, avoiding going into the disconnected state. If that's not the case, we continue as we do at the moment.

(This technically leaves a period in which the state is connected but activeProtocol is null -- I think it should be at most one cycle of the event loop -- not sure if this is a problem?)
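Roughly, the new deactivateTransport behaviour is this (a sketch with made-up helper names, not the actual implementation):

```js
// Sketch: if a pending transport is already connected and scheduled for
// activation, let it take over when the active transport fails, rather
// than dropping the connection to 'disconnected'.
function deactivateTransport(connectionManager, failedTransport) {
  var pending = connectionManager.pendingTransports.filter(function(t) {
    return t.isConnected && t.scheduledForActivation;
  })[0];

  if (failedTransport === connectionManager.activeTransport && pending) {
    // Pending transport takes over seamlessly; state stays 'connected'.
    connectionManager.activateTransport(pending);
    return;
  }

  // Otherwise, fall back to the existing behaviour.
  connectionManager.notifyState('disconnected');
}
```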

(4) why this error - is this reproducible?

Because transport#abort was calling requestClose(true). I don't understand why that was like that -- AFAIK the only reason to do a /close (rather than a /disconnect) is if we want realtime to discard the connection state, which surely isn't something we should be doing unless the user requests a close. @paddybyers am I missing something?

7a593e2 changes it to requestClose(false).
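In other words, roughly (illustrative only; the helper names here are hypothetical):

```js
// On transport abort, request a /disconnect (realtime keeps the connection
// state) rather than a /close (realtime discards it).
function abortTransport(transport, error) {
  transport.requestClose(false); // was requestClose(true), i.e. a full /close
  transport.emit('failed', error); // hypothetical: surface the abort reason
}
```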

(5) is it right that setSuspended() is called as a result of a non-fatal upgrade error?

In that case I think it was right -- realtime lost the connection state (due to the previous point), so Matt was no longer attached.

@mattheworiordan
Member

I'm not certain this is the correct thing to do (in particular, since realtimeRequestTimeout is 10s and httpRequestTimeout is 15s, this leaves a period of 5s after which the channel considers the request failed, but during which it could still succeed, which is kinda weird). Opinions?

ok, but I assumed HTTP request timeout applies to normal REST HTTP requests as opposed to Comet based HTTP requests. Is that not the case?

Looks good 👍

@mattheworiordan
Member

Is there no way to simulate the problem we had with two transports by adding a test to prevent a regression? I would prefer to see this if possible.

@SimonWoolf
Member Author

I assumed HTTP request timeout applies to normal REST HTTP requests as opposed to Comet based HTTP requests. Is that not the case?

Currently, no. At the moment in ably-js, the two timeouts work at different levels of the lib:

The httpRequestTimeout determines how long an http request in SEND mode lasts before giving up. At that point the request object doesn't know or care if it's been created as part of a comet transport, a REST request, an authUrl call to a client server, or anything else; it just knows its url and request mode.

The realtimeRequestTimeout is at a much higher level, and determines how long the Connection stays in the connecting state before going to disconnected, how long the Channel stays in the attaching state, etc.
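To make the split concrete, something like this (values from the discussion above; the option names here are just for illustration, not the public ably-js config surface):

```js
var timeouts = {
  // Low level: any single HTTP request in SEND mode -- comet send, REST
  // call, authUrl fetch -- is abandoned after this long.
  httpRequestTimeout: 15000,

  // High level: how long the Connection stays 'connecting' before going to
  // 'disconnected', how long a Channel stays 'attaching', etc.
  realtimeRequestTimeout: 10000
};
```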

There's an open issue which discusses changing how the timeouts interact in the case of opening a realtime connection specifically. We should probably discuss how timeouts should work more generally.

Is there no way to simulate the problem we had with two transports by adding a test to prevent a regression? I would prefer to see this if possible.

I had a play around with trying to simulate the problem, but couldn't see any nice way of doing it from within ably-js (the failure is an http request that never responds, so we'd need to figure out the http transport being used, then shim XHRRequest#exec with a version which figures out if a request is the right one and replaces the xhr.onreadystatechange handler with a noop if so; and be able to do an equivalent thing for all the other comet transports so the test works everywhere).
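A rough sketch of that shim, to give an idea (the class and method names are taken from the description above; treat the rest as pseudocode rather than real ably-js internals):

```js
// Wrap XHRRequest#exec so that one targeted request silently never responds.
var originalExec = XHRRequest.prototype.exec;
XHRRequest.prototype.exec = function() {
  originalExec.apply(this, arguments);
  if (isTheRequestWeWantToBreak(this.uri)) { // hypothetical predicate
    // Swallow the response so the request appears to vanish into the ether.
    this.xhr.onreadystatechange = function() {};
  }
};
```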

So, yes, it's possible. But I figured not much point if we're shortly going to remove the upgrade mechanism anyway...

(Also: if we do want general realtime request failure tests, then per Paddy's suggestion here, might it be better (more flexible, not so tied to the implementation of each transport) to do those by making realtime fail to respond, rather than faking it with complicated shims? We could even maybe have a set of bad-realtime-endpoints that fail in fixed ways simulating network conditions, and test all the client libraries against them.)

@mattheworiordan
Member

So, yes, it's possible. But I figured not much point if we're shortly going to remove the upgrade mechanism anyway...

Didn't think we had decided on that, I thought we would see the impact of the new router changes first.

Could even maybe have a set of bad-realtime-endpoints that fails in fixed ways simulating network conditions

Sure, that would be useful -- perhaps even a header that can be passed in to sandbox instances that triggers this. That said, I wouldn't want to add too much of that sort of thing into the realtime system, adding more complexity unnecessarily.

An alternative would instead be to simply start up a local WS & HTTP server that proxies to realtime and achieves the same thing, thus keeping this complexity in the test suite itself, where it belongs.
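Something along these lines, say (a sketch only -- the upstream host is assumed for illustration, and WS proxying is omitted):

```js
// A local proxy in the test suite that forwards to realtime but can be told
// to drop responses, simulating the "request vanishes" failure mode.
var http = require('http');

var dropResponses = false;

http.createServer(function(req, res) {
  var upstream = http.request({
    host: 'sandbox-rest.ably.io', // assumed sandbox endpoint, for illustration
    path: req.url,
    method: req.method,
    headers: req.headers
  }, function(upstreamRes) {
    if (dropResponses) return; // never respond: simulate a bad connection
    res.writeHead(upstreamRes.statusCode, upstreamRes.headers);
    upstreamRes.pipe(res);
  });
  req.pipe(upstream);
}).listen(8080);
```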

@SimonWoolf
Member Author

@paddybyers can you confirm whether you agree with the other changes in this? In particular, whether you think changing the attaching timeout to revert the channel to detached, rather than retrying indefinitely, is correct. (I implemented that change per the discussion in #195, but I'm not at all sure that the retry-indefinitely behaviour (in the case of requests vanishing into the ether) wasn't better -- it's not like the user suddenly isn't going to want to be attached to the channel after all. (Also, the weirdness caused by the difference between the two timeouts discussed above.))

@SimonWoolf
Member Author

@paddybyers see prev comment

@paddybyers
Member

(in particular, since realtimeRequestTimeout is 10s and httpRequestTimeout is 15s, this leaves a period of 5s after which the channel considers the request failed, but during which it could still succeed, which is kinda weird). Opinions?

There's no real reason for them to be different timeouts. However, it doesn't really introduce any new case that needs to be handled, because of course the operation might have succeeded in the first 10s but the indication of that failed to reach us. So the 10s elapses and we indicate failure and fail the pending ops; then in the next few seconds we find out that the operation did in fact succeed and we are attached after all. That's just too bad; the server response always takes precedence in determining our state, so we are attached, and it's just a shame that we failed those messages.

@paddybyers
Member

96f83a6 changes the deactivateTransport logic such that if there is a pending transport which is connected and scheduled for activation -- just waiting for the active transport to finish what it's doing -- then, if the current active transport fails, it just lets the pending one take over, avoiding going into the disconnected state. If that's not the case, we continue as we do at the moment.

This looks ok but it would be even better if we could test it :)

(This technically leaves a period in which the state is connected but activeProtocol is null -- I think it should be at most one cycle of the event loop -- not sure if this is a problem?)

Is there no way to avoid this? We're synchronously deciding what the state should be, so why can't we make it so? I've not understood the problem here, I think.

Because transport#abort was calling requestClose(true). I don't understand why that was like that -- AFAIK the only reason to do a /close (rather than a /disconnect) is if we want realtime to discard the connection state, which surely isn't something we should be doing unless the user requests a close.

Correct.

Otherwise LGTM, thanks.

To match xhrrequest, and so that connectionManager isFatalErr helper
responds appropriately
… state change

If there is a pending transport which is connected and scheduled for
activation, just waiting for the active transport to finish what it's
doing, then when that transport fails, we can just let the pending one
seamlessly take over, rather than going into the disconnected state
Only difference between a disconnect and a close is that a close tells
realtime to discard the connection state, which should be avoided unless
the user requests it
@mattheworiordan mattheworiordan merged commit f82f559 into master Feb 9, 2016
@mattheworiordan mattheworiordan deleted the attachfail2 branch February 9, 2016 10:46
@SimonWoolf
Member Author

We're synchronously deciding what the state should be, so why can't we make it so?

So: when this happens, this closure is registered as a listener for the nopending event. We could unregister that and run it manually (extracting it out from scheduleTransportActivation). The reason I didn't do that is that I didn't want to bypass your nopending mechanism, as I wasn't sure what the consequences would be.

Would there be any bad consequences? Once the new transport is active, Channels#onTransportActive calls checkPendingState on all in-progress channels, which refreshes pending channel operations and presence syncs. If that's sufficient for continuity of pending channel ops and presence syncs across transport changes, then TBH I don't understand why the nopending mechanism is needed at all...?
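For concreteness, the alternative under discussion is roughly this (the names come from the comments above; the details are hypothetical):

```js
// Instead of waiting for the 'nopending' event when the active transport
// has already failed, run the activation step directly.
function onActiveTransportFailed(connectionManager, pendingTransport) {
  var activate = function() {
    connectionManager.activateTransport(pendingTransport);
    // Channels#onTransportActive then calls checkPendingState on all
    // in-progress channels, refreshing pending ops and presence syncs.
    connectionManager.channels.onTransportActive();
  };

  // Current behaviour: wait for the failed transport to report no pending sends.
  // connectionManager.once('nopending', activate);

  // Alternative: the old transport is gone, so there's nothing to wait for.
  activate();
}
```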

@paddybyers
Member

So: when this happens, this closure is registered as a listener for the nopending event. We could unregister that and run it manually (extracting it out from scheduleTransportActivation). The reason I didn't do that is that I didn't want to bypass your nopending mechanism, as I wasn't sure what the consequences would be.

What's asynchronous though?

@SimonWoolf
Member Author

@paddybyers (reiterating slack discussion): Sorry, you're right, we could do it just by emitting nopending there and then.

But, again: that would bypass the nopending mechanism. Which I don't object to, except that I don't understand why it exists at all, given that Channels#onTransportActive deals with pending channel operations across transport changes (I'm obviously a bit wary of bypassing something whose purpose I don't understand...)
