
Some proposed fixes for "Poor connection never attaches and reports connection to be connected #195" #218

Merged
6 commits merged into master on Feb 9, 2016

Conversation

SimonWoolf
Member

Addresses #195.

The questions from that issue, with discussion & follow-up questions:

(1) we should change the Error to an ErrorInfo and verify we get a meaningful error message.

Already addressed by #197

(2) does it correctly handle the timeout of the attach()? how come the channel remains in the attaching state?

Because the previous strategy on timeout was just to retry. 96e65de changes that to fail channel ops (leaving the channel in the state it was in beforehand) if the request fails.

I'm not certain this is the correct thing to do (in particular, since realtimeRequestTimeout is 10s and httpRequestTimeout is 15s, this leaves a period of 5s after which the channel considers the request failed, but during which it could still succeed, which is kinda weird). Opinions?
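For illustration, a rough sketch of the new fail-on-timeout behaviour (assuming an EventEmitter-style channel; names other than realtimeRequestTimeout are hypothetical, not the actual ably-js internals):

```js
// Sketch only: fail the pending attach if no response arrives within
// realtimeRequestTimeout, reverting the channel to its previous state.
function attachWithTimeout(channel, realtimeRequestTimeout) {
  var previousState = channel.state;
  channel.state = 'attaching';

  var timer = setTimeout(function() {
    // No ATTACHED (or error) response received in time: fail the op
    // and revert, rather than retrying indefinitely as before.
    channel.state = previousState;
    channel.emit('error', new Error('Attach request timed out'));
  }, realtimeRequestTimeout);

  channel.once('attached', function() {
    clearTimeout(timer);
  });
}
```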

(3) what should be done about this ws transport that's orphaned, associated with an upgrade that's gone bad. Can we just arrange for this to become the active transport? Is the behaviour here actually ok, and the problem is that we keep going with the second xhr connection even though we're now connected?

96f83a6 changes the deactivateTransport logic such that if there is a pending transport which is connected and scheduled for activation -- just waiting for the active transport to finish what it's doing -- then, if the current active transport fails, it just lets the pending one take over, avoiding going into the disconnected state. If that's not the case, we continue as we do at the moment.

(This technically leaves a period in which the state is connected but activeProtocol is null -- I think it should be at most one cycle of the event loop -- not sure if this is a problem?)
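Roughly, the new deactivateTransport behaviour is this (a sketch with made-up helper names, not the actual implementation):

```js
// Sketch: if a pending transport is already connected and scheduled for
// activation, let it take over when the active transport fails, rather
// than dropping the connection to 'disconnected'.
function deactivateTransport(connectionManager, failedTransport) {
  var pending = connectionManager.pendingTransports.filter(function(t) {
    return t.isConnected && t.scheduledForActivation;
  })[0];

  if (failedTransport === connectionManager.activeTransport && pending) {
    // Pending transport takes over seamlessly; state stays 'connected'.
    connectionManager.activateTransport(pending);
    return;
  }

  // Otherwise, fall back to the existing behaviour.
  connectionManager.notifyState('disconnected');
}
```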

(4) why this error - is this reproducible?

Because transport#abort was calling requestClose(true). I don't understand why that was like that -- AFAIK the only reason to do a /close (rather than a /disconnect) is if we want realtime to discard the connection state, which surely isn't something we should be doing unless the user requests a close. @paddybyers am I missing something?

7a593e2 changes it to requestClose(false).
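In other words, roughly (illustrative only; the helper names here are hypothetical):

```js
// On transport abort, request a /disconnect (realtime keeps the connection
// state) rather than a /close (realtime discards it).
function abortTransport(transport, error) {
  transport.requestClose(false); // was requestClose(true), i.e. a full /close
  transport.emit('failed', error); // hypothetical: surface the abort reason
}
```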

(5) is it right that setSuspended() is called as a result of a non-fatal upgrade error?

In that case I think it was right -- realtime lost the connection state (due to the previous point), so Matt was no longer attached.

@mattheworiordan
Member

I'm not certain this is the correct thing to do (in particular, since realtimeRequestTimeout is 10s and httpRequestTimeout is 15s, this leaves a period of 5s after which the channel considers the request failed, but during which it could still succeed, which is kinda weird). Opinions?

ok, but I assumed HTTP request timeout applies to normal REST HTTP requests as opposed to Comet based HTTP requests. Is that not the case?

Looks good 👍

@mattheworiordan
Member

Is there no way to simulate the problem we had with two transports by adding a test to prevent a regression? I would prefer to see this if possible.

@SimonWoolf
Member Author

I assumed HTTP request timeout applies to normal REST HTTP requests as opposed to Comet based HTTP requests. Is that not the case?

Currently, no. At the moment in ably-js, the two timeouts work at different levels of the lib:

The httpRequestTimeout determines how long an http request in SEND mode lasts before giving up. At that point the request object doesn't know or care if it's been created as part of a comet transport, a REST request, an authUrl call to a client server, or anything else; it just knows its url and request mode.

The realtimeRequestTimeout is at a much higher level, and determines how long the Connection stays in the connecting state before going to disconnected, how long the Channel stays in the attaching state, etc.
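To make the split concrete, something like this (values from the discussion above; the option names here are just for illustration, not the public ably-js config surface):

```js
var timeouts = {
  // Low level: any single HTTP request in SEND mode -- comet send, REST
  // call, authUrl fetch -- is abandoned after this long.
  httpRequestTimeout: 15000,

  // High level: how long the Connection stays 'connecting' before going to
  // 'disconnected', how long a Channel stays 'attaching', etc.
  realtimeRequestTimeout: 10000
};
```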

There's an open issue which discusses changing how the timeouts interact in the case of opening a realtime connection specifically. We should probably discuss how timeouts should work more generally.

Is there no way to simulate the problem we had with two transports by adding a test to prevent a regression? I would prefer to see this if possible.

I had a play around with trying to simulate the problem, but couldn't see any nice way of doing it from within ably-js (the failure is an http request that never responds, so we'd need to figure out the http transport being used, then shim XHRRequest#exec with a version which figures out if a request is the right one and replaces the xhr.onreadystatechange handler with a noop if so; and be able to do an equivalent thing for all the other comet transports so the test works everywhere).
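A rough sketch of that shim, to give an idea (the class and method names are taken from the description above; treat the rest as pseudocode rather than real ably-js internals):

```js
// Wrap XHRRequest#exec so that one targeted request silently never responds.
var originalExec = XHRRequest.prototype.exec;
XHRRequest.prototype.exec = function() {
  originalExec.apply(this, arguments);
  if (isTheRequestWeWantToBreak(this.uri)) { // hypothetical predicate
    // Swallow the response so the request appears to vanish into the ether.
    this.xhr.onreadystatechange = function() {};
  }
};
```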

So, yes, it's possible. But I figured not much point if we're shortly going to remove the upgrade mechanism anyway...

(Also: if we do want general realtime request failure tests, then per Paddy's suggestion here, might it be better (more flexible, not so tied to the implementation of each transport) to do those by making realtime fail to respond, rather than faking it with complicated shims? We could even maybe have a set of bad-realtime-endpoints that fail in fixed ways simulating network conditions, and test all the client libraries against them.)

@mattheworiordan
Member

So, yes, it's possible. But I figured not much point if we're shortly going to remove the upgrade mechanism anyway...

Didn't think we had decided on that, I thought we would see the impact of the new router changes first.

Could even maybe have a set of bad-realtime-endpoints that fails in fixed ways simulating network conditions

Sure, that would be useful -- perhaps even a header that can be passed in to sandbox instances that triggers this. That said, I wouldn't want to add too much of that sort of thing into the realtime system, adding more complexity unnecessarily.

An alternative would instead be to simply start up a local WS & HTTP server that proxies to realtime and achieves the same thing, thus keeping this complexity in the test suite itself, where it belongs.
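Something along these lines, say (a sketch only -- the upstream host is assumed for illustration, and WS proxying is omitted):

```js
// A local proxy in the test suite that forwards to realtime but can be told
// to drop responses, simulating the "request vanishes" failure mode.
var http = require('http');

var dropResponses = false;

http.createServer(function(req, res) {
  var upstream = http.request({
    host: 'sandbox-rest.ably.io', // assumed sandbox endpoint, for illustration
    path: req.url,
    method: req.method,
    headers: req.headers
  }, function(upstreamRes) {
    if (dropResponses) return; // never respond: simulate a bad connection
    res.writeHead(upstreamRes.statusCode, upstreamRes.headers);
    upstreamRes.pipe(res);
  });
  req.pipe(upstream);
}).listen(8080);
```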

@SimonWoolf
Member Author

@paddybyers can you confirm whether you agree with the other changes in this? In particular, whether you think changing the attaching timeout to revert the channel to detached, rather than retrying indefinitely, is correct. (I implemented that change per the discussion in #195, but I'm not at all sure that the retry-indefinitely behaviour (in the case of requests vanishing into the ether) wasn't better -- it's not like the user suddenly isn't going to want to be attached to the channel after all. (Also, the weirdness caused by the difference between the two timeouts discussed above.))

@SimonWoolf
Member Author

@paddybyers see prev comment

@paddybyers
Member

(in particular, since realtimeRequestTimeout is 10s and httpRequestTimeout is 15s, this leaves a period of 5s after which the channel considers the request failed, but during which it could still succeed, which is kinda weird). Opinions?

There's no real reason for them to be different timeouts. However, it doesn't really introduce any new case that needs to be handled, because of course the operation might have succeeded in the first 10s but the indication of that failed to reach us. So the 10s elapses and we indicate failure and fail the pending ops; then in the next few seconds we find out that the operation did in fact succeed and we are attached after all. That's just too bad; the server response always takes precedence in determining our state, so we are attached, and it's just a shame that we failed those messages.

@paddybyers
Member

96f83a6 changes the deactivateTransport logic such that if there is a pending transport which is connected and scheduled for activation -- just waiting for the active transport to finish what it's doing -- then, if the current active transport fails, it just lets the pending one take over, avoiding going into the disconnected state. If that's not the case, we continue as we do at the moment.

This looks ok but it would be even better if we could test it :)

(This technically leaves a period in which the state is connected but activeProtocol is null -- I think it should be at most one cycle of the event loop -- not sure if this is a problem?)

Is there no way to avoid this? We're synchronously deciding what the state should be, so why can't we make it so? I've not understood the problem here, I think.

Because transport#abort was calling requestClose(true). I don't understand why that was like that -- AFAIK the only reason to do a /close (rather than a /disconnect) is if we want realtime to discard the connection state, which surely isn't something we should be doing unless the user requests a close.

Correct.

Otherwise LGTM, thanks.

To match xhrrequest, and so that connectionManager isFatalErr helper
responds appropriately
… state change

If there is a pending transport which is connected and scheduled for
activation, just waiting for the active transport to finish what it's
doing, then when that transport fails, we can just let the pending one
seamlessly take over, rather than going into the disconnected state
Only difference between a disconnect and a close is that a close tells
realtime to discard the connection state, which should be avoided unless
the user requests it
@mattheworiordan mattheworiordan merged commit f82f559 into master Feb 9, 2016
@mattheworiordan mattheworiordan deleted the attachfail2 branch February 9, 2016 10:46
@SimonWoolf
Member Author

We're synchronously deciding what the state should be, so why can't we make it so?

So: when this happens, this closure is registered as a listener for the nopending event. We could unregister that and run it manually (extracting it out from scheduleTransportActivation). The reason I didn't do that is that I didn't want to bypass your nopending mechanism, as I wasn't sure what the consequences would be.

Would there be any bad consequences? Once the new transport is active, Channels#onTransportActive calls checkPendingState on all in-progress channels, which refreshes pending channel operations and presence syncs. If that's sufficient for continuity of pending channel ops and presence syncs across transport changes, then TBH I don't understand why the nopending mechanism is needed at all...?
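For concreteness, the alternative under discussion is roughly this (the names come from the comments above; the details are hypothetical):

```js
// Instead of waiting for the 'nopending' event when the active transport
// has already failed, run the activation step directly.
function onActiveTransportFailed(connectionManager, pendingTransport) {
  var activate = function() {
    connectionManager.activateTransport(pendingTransport);
    // Channels#onTransportActive then calls checkPendingState on all
    // in-progress channels, refreshing pending ops and presence syncs.
    connectionManager.channels.onTransportActive();
  };

  // Current behaviour: wait for the failed transport to report no pending sends.
  // connectionManager.once('nopending', activate);

  // Alternative: the old transport is gone, so there's nothing to wait for.
  activate();
}
```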

@paddybyers
Member

So: when this happens, this closure is registered as a listener for the nopending event. We could unregister that and run it manually (extracting it out from scheduleTransportActivation). The reason I didn't do that is that I didn't want to bypass your nopending mechanism, as I wasn't sure what the consequences would be.

What's asynchronous though?

@SimonWoolf
Member Author

@paddybyers (reiterating slack discussion): Sorry, you're right, we could do it just by emitting nopending there and then.

But, again: that would bypass the nopending mechanism. Which I don't object to, except that I don't understand why it exists at all, given that Channels#onTransportActive deals with pending channel operations across transport changes (I'm obviously a bit wary of bypassing something whose purpose I don't understand...)
