Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock when stopping and starting connection [reconnect] #86

Closed
moozzyk opened this issue Jan 18, 2020 · 20 comments
Closed

Deadlock when stopping and starting connection [reconnect] #86

moozzyk opened this issue Jan 18, 2020 · 20 comments

Comments

@moozzyk
Copy link
Owner

moozzyk commented Jan 18, 2020

As per #71 (comment) there it is possible to trigger a deadlock when stopping and starting connection.

I just faced a problem, when I am stopping and starting connection 10 times in a small period of time app freezes( for example when app moves to background and moves back to foreground), can you suggest anything related to this?
Thanks in advance
UPD 1
Problem is with
self.startDispatchGroup.wait(), if I got it correct, when I am stopping and moving app to background and quickly reopen it, doing this few times, self.startDispatchGroup.wait() just freezes UI and endlessly waits and leave() doesn't invoke in any situation.
What if I comment this line self.startDispatchGroup.wait(), should we really need anything to be finished to be able to close the connection?
UPD 2
If I got it correct, problem is if during negotiation request we move app to background because it doesn't use URLSessionConfiguration.background, so we don't get result and it never leaves DispatchGroup.

@Narek1994
Copy link

Basically, to support reconnection you will need to create a new connection each time when its disconnected(for example, network connection change, moving app to background and move back to foreground).
So if you start new connection each time when app moves to foreground and stop it when it moves to background, doing this few times and quickly, freezes UI, if I understood it correctly when we start a new connection we enter a DispatchGroup, make a negotiation request, after that leave the group, and in stop method it states to wait to that DispatchGroup, so if after the start you move app to background negotiation completion will not be invoked(because it doesn't use background fetch) and DispatchGroup.wait() will freeze the app( DispatchGroup.leave()) will not be invoked.
Should we really need that DispatchGroup? what will happen if during negotiation we stop the connection, for example because of network connection lost?
Thanks.

@moozzyk
Copy link
Owner Author

moozzyk commented Jan 21, 2020

@Narek1994 - Could you capture and attach logs?

@Narek1994
Copy link

I will try, but could you please say do we need that DispatchGroup? I have removed it and seems, everything works well, now I am afraid that maybe there is any special need to put that DispatchGroup there.

@moozzyk
Copy link
Owner Author

moozzyk commented Jan 21, 2020 via email

@Narek1994
Copy link

Narek1994 commented Jan 21, 2020

here are the logs
2020-01-21T15:52:38.804Z debug: Changing state to: 'connecting' succeeded
2020-01-21T15:52:39.764Z debug: Changing state to: 'connected' succeeded
2020-01-21T15:53:18.496Z debug: Changing state to: 'stopped' succeeded
2020-01-21T15:53:18.507Z debug: Changing state to: 'stopped' succeeded
2020-01-21T15:53:19.459Z debug: Changing state to: 'connecting' succeeded
2020-01-21T15:53:19.862Z debug: Changing state to: 'stopped' succeeded
2020-01-21T15:53:20.688Z debug: Changing state to: 'connecting' succeeded
2020-01-21T15:53:21.085Z debug: Changing state to: 'stopped' succeeded
as you can see from logs, at the some point it moves to connecting state and enters the DispatchGroup, and if after that I switch the app to background or connection lost, it state changes to stopped, and because DispatchGroup doesn't leave at any point(it didn't get error, it didn't connected successfully, and also because your HttpClient doesn't support background configuration result of the negotiation request is lost) app freezes.
if previousState == .stopped {
logger.log(logLevel: .info, message: "Connection already stopped")
return
}
if previousState == .initial {
logger.log(logLevel: .warning, message: "Connection not yet started")
return
}
in the stop method you are checking for these states and return, maybe we could do the same for .connecting state, and also while making negotiation request, you should capture self weakly, because at the time of getting the result self can be nil.(in the case of when we are creating and invalidating connections, to support reconnection).

@datwelk
Copy link

datwelk commented Feb 13, 2020

We are experiencing the same issue. We do not reuse HubConnection instances, but we do stop a previous HubConnection when we want to reconnect with a new one. The deadlock arises in this case as well. Exactly how @Narek1994 mentions, it occurs when there you call stop on the main thread, before the HttpConnection has had the chance to leave the startDispatchGroup.

@datwelk
Copy link

datwelk commented Feb 13, 2020

These are the logs until the deadlock:

2020-02-13T15:23:59.431Z info: Starting hub connection
2020-02-13T15:23:59.432Z info: Starting connection
2020-02-13T15:23:59.432Z debug: Attempting to chage state from: 'initial' to: 'connecting'
2020-02-13T15:23:59.432Z debug: Changing state to: 'connecting' succeeded
2020-02-13T15:23:59.448Z debug: Negotiate completed with OK status code
2020-02-13T15:23:59.448Z debug: Negotiate response: {"url":<Redacted>,"availableTransports":[]}
2020-02-13T15:23:59.448Z debug: Negotiate redirects to <Redacted>
2020-02-13T15:23:59.449Z debug: Overriding accessToken
2020-02-13T15:23:59.466Z debug: Negotiate completed with OK status code
2020-02-13T15:23:59.467Z debug: Negotiate response: {"connectionId":"abc","availableTransports":[{"transport":"WebSockets","transferFormats":["Text","Binary"]},{"transport":"ServerSentEvents","transferFormats":["Text"]},{"transport":"LongPolling","transferFormats":["Text","Binary"]}]}
2020-02-13T15:23:59.467Z debug: Negotation response received
2020-02-13T15:23:59.468Z info: Starting WebSocket transport
2020-02-13T15:23:59.540Z info: Stopping hub connection
2020-02-13T15:23:59.541Z info: Stopping connection
2020-02-13T15:23:59.541Z debug: Attempting to chage state from: '(nil)' to: 'stopped'
2020-02-13T15:23:59.541Z debug: Changing state to: 'stopped' succeeded
2020-02-13T15:23:59.545Z debug: HubConnection deinit
2020-02-13T15:23:59.545Z debug: HttpConnection deinit
2020-02-13T15:24:00.617Z info: Stopping hub connection
2020-02-13T15:24:00.618Z info: Stopping connection
2020-02-13T15:24:00.620Z debug: Attempting to chage state from: '(nil)' to: 'stopped'
2020-02-13T15:24:00.620Z debug: Changing state to: 'stopped' succeeded

@datwelk
Copy link

datwelk commented Feb 13, 2020

Would a temporary solution be to call stop on a different thread than the main thread?

@moozzyk
Copy link
Owner Author

moozzyk commented Feb 13, 2020 via email

@Narek1994
Copy link

Narek1994 commented Feb 13, 2020

Hi @datwelk, for about month I am testing the version where I have removed DispatchGroup wait and it seems everything works well. It is a little bit hard to find out why we need that DispatchGroup(enter, leave).

@GuidoHendriks
Copy link

I think it will not help.

Looks like it does and it makes sense also. It currently makes you wait on the main thread, which blocks the main thread and causes it to never finish.

@datwelk
Copy link

datwelk commented Feb 14, 2020

Hi @Narek1994 & @moozzyk, we have tested our work-around (calling HubConnection.stop() on a global background thread, and this works well. It is unclear to us what the dispatchGroup is needed for (perhaps paving the way to reusing an existing connection).

@moozzyk
Copy link
Owner Author

moozzyk commented Feb 29, 2020

The startDispatchGroup is used to prevent a race where a connection that was stopped is still running. If stop is called when the connection is being started then without this synchronization stop can finish before start finishes (very likely since start makes requests) and then the client can be connected to the server or left in undefined state.

@datwelk - is my understanding is that you are not on the reconnect branch and you just stop a connection, create a new instance and start? If so, this is unexpected - separate connection instances should not be interfering.

@moozzyk
Copy link
Owner Author

moozzyk commented Mar 1, 2020

For @datwelk's issue after looking at his logs I am wondering if the client is not running into a problem similar to //github.com/tidwall/SwiftWebSocket/issues/114.

Looking at the logs there is only one log entry from the webSocket transport:

2020-02-13T15:23:59.468Z info: Starting WebSocket transport
2020-02-13T15:23:59.540Z info: Stopping hub connection

There is no other log entries even though open, close and error event handlers do log. My hypothesis is that hubConnection.stop is invoked at an "unfortunate" time (when the transport initiated starting the webSocket but before any of the start related events were handled) and waits for the transport to do something. The transport, however, is not able to invoke any of the event handlers because the stop invocation blocked the thread. This might be the reason why invoking stop on a background thread works around this issue.

@moozzyk
Copy link
Owner Author

moozzyk commented Mar 1, 2020

@datwelk - I created #91 for your issue because I think the root cause is different than what @Narek1994 reported
@Narek1994 - I think I have a test that reproes what you are seeing. I have not investigated it yet but I think it's different from what @datwelk reported.

@moozzyk
Copy link
Owner Author

moozzyk commented Mar 2, 2020

#91 is now fixed in master. As expected, it did not fix the hang happening when restarting the connection.

@moozzyk
Copy link
Owner Author

moozzyk commented Mar 9, 2020

@Narek1994 - I finally had some time to investigate the restart issue. Looking at the bugs I found I am not quite sure how you were even able to restart the connection once. I think the bugs made it impossible to restart the connection at all. This makes me wonder how your restart code actually looks like. If you care to share it would be helpful - maybe there are more bugs I did not find. Just in case here is a test case where I stop and restart the same connection 5 time that would constantly fail before my fixes:

func testThatConnectionCanBeRestarted() {
let connectionDidCloseExpectation = expectation(description: "connection closed")
connectionDidCloseExpectation.expectedFulfillmentCount = 5
let hubConnectionDelegate = TestHubConnectionDelegate()
let hubConnection = HubConnectionBuilder(url: URL(string: "\(BASE_URL)/testhub")!)
.withLogging(minLogLevel: .debug)
.withHubConnectionDelegate(delegate: hubConnectionDelegate)
.build()
hubConnectionDelegate.connectionDidOpenHandler = { hubConnection in
hubConnection.stop()
}
var numRestarts = 4 // initial start + 4 restarts = 5 (expectedFulfillmentCount)
hubConnectionDelegate.connectionDidCloseHandler = { error in
XCTAssertNil(error)
if (numRestarts > 0) {
numRestarts -= 1
hubConnection.start()
}
connectionDidCloseExpectation.fulfill()
}
hubConnection.start()
waitForExpectations(timeout: 5 /*seconds*/)
}

@moozzyk
Copy link
Owner Author

moozzyk commented Apr 12, 2020

Reconnect is now released in 0.7.0. Any new issues should be tracking separately.

@moozzyk moozzyk closed this as completed Apr 12, 2020
@ibrahimyilmaz7
Copy link

ibrahimyilmaz7 commented May 15, 2020

@datwelk - is my understanding is that you are not on the reconnect branch and you just stop a connection, create a new instance and start? If so, this is unexpected - separate connection instances should not be interfering.

Hi, is it unexpected to create multiple hubConnection instance? I have multiple widgets which creates it's own hub, and start/stop connection related to appearance on screen.

@moozzyk
Copy link
Owner Author

moozzyk commented May 15, 2020

It depends. Do you have multiple hubs or just one? Do you run multiple connections at the same time or just one? Is there any reason why you don't want to/can't maintain the connection open all the time?

As a side note - please do not hijack unrelated issues. Create a new one if there isn't one directly related to the problem you are seeing. This particular issue has been closed as fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants