Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reconnect policy #266

Merged
merged 4 commits into from
Jul 13, 2022
Merged

Reconnect policy #266

merged 4 commits into from
Jul 13, 2022

Conversation

taskbit
Copy link
Contributor

@taskbit taskbit commented Jun 12, 2022

My motivation was that the existing reconnection logic gave up reconnecting after the LiveKit node was forcefully restarted.
Also in response to #261 I revisited the reconnection handling.

I run into an issue where the websocket connection is being closed after trying to resume with ?reconnect=1
probably initiated on the server side.

I added a workaround here

@changeset-bot
Copy link

changeset-bot bot commented Jun 12, 2022

⚠️ No Changeset found

Latest commit: 607039f

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@CLAassistant
Copy link

CLAassistant commented Jun 12, 2022

CLA assistant check
All committers have signed the CLA.

Copy link
Member

@davidzhao davidzhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution. I have a few suggestions and questions about parts of it.

nextRetryDelayInMs(context: IReconnectContext): number | null;
}

export interface IReconnectContext {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would remove this layer of indirection. If the class is supposed to be a storage class. instead, it would be great to make ReconnectContext an interface. Interfaces in Typescript are more like a "data class" in other languages.

7 * 7 * 300,
8 * 8 * 300,
9 * 9 * 300,
null,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is a bit confusing to have nulls in arrays. would be preferable to perform a length check and exit if out of attempts.

src/room/RTCEngine.ts Show resolved Hide resolved
src/room/RTCEngine.ts Show resolved Hide resolved
if (this.reconnectAttempts >= maxReconnectRetries || duration > maxReconnectDuration) {
recoverable = false;
}
duration = Date.now() - this.reconnectStart;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logic is now repeated from the top. can you think of a way to avoid this duplication?

if (reconnectRequired && !this.fullReconnectOnNext) {
this.fullReconnectOnNext = true;
this.reconnectAttempts = 0;
let requireSignalEvents = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this logic is correct. we should only emit a reconnecting/resuming event on the very first attempt. I think this alters the behavior quite a bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously code set reconnectAttempts to 0 to force re-emiting the events if fullReconnectOnNext is false and reconnectRequired is true.
I was wondering about that as well as this can emit the events twice. First & second reconnect attempt.

I decided to keep that behavior as I'm not aware of the original reason.
But instead of setting reconnectAttempts to 0 after the first attempt I found it cleaner to signal the events in a different way because it's the 2nd reconnect attempt we should not reset the counter as this alters the logic in the reconnect policy by re-evaluating retryCount with "0" another time.

I don't see an alteration of the previous behavior regarding events emitted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked the listeners and can confirm that we want to fire the signal again going from resume to reconnect. So your implementation is right.


this.engine.client.signalLatency = this.options.expSignalLatency;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is moving this required by this PR? It would be good to keep PRs focused.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is related to the PR but not a functional requirement. I passed the RoomOptions to the RTCEngine constructor to get hold of the reconnectPolicy.

As RTCEngine is now aware of the RoomOptions it makes more sense in my opinion to apply the configuration setting of the SignalClient where it's created.

@davidzhao
Copy link
Member

thanks for your clarifications, I'll give it a try tomorrow and will keep you posted!

Copy link
Member

@davidzhao davidzhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great progress here, added a few notes.

src/room/IReconnectPolicy.ts Outdated Show resolved Hide resolved
src/room/IReconnectPolicy.ts Outdated Show resolved Hide resolved
if (reconnectRequired && !this.fullReconnectOnNext) {
this.fullReconnectOnNext = true;
this.reconnectAttempts = 0;
let requireSignalEvents = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked the listeners and can confirm that we want to fire the signal again going from resume to reconnect. So your implementation is right.

.then(() => {
waitForPcCompleted = true;
})
.catch(() => Error('failed to wait for peer connection'));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this intending to throw? I'm unfamiliar with this syntax.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks good catch, there is throw missing.

.catch(() => Error('failed to wait for peer connection'));

while (!waitForPcCompleted) {
// TODO: this is a workaround to check if the socket was closed while we are reconnecting
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you describe a repro step to trigger a socket closure while reconnecting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to revert as it's not related to the changes in the PR, I would classify it as a separate bug.

// TODO: this is a workaround to check if the socket was closed while we are reconnecting
// if signal client reconnects with ?reconnect=1 ws.onclose is almost immediately called after ws.onopen
if (!this.client.isConnected) {
this.fullReconnectOnNext = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it need to be set here again? throwing should be sufficient to move this into the right state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct no need to set here again.

@davidzhao
Copy link
Member

@taskbit just checking in. do you think you have cycles to get this one across the finish line? I'd love to get this merged.

Copy link
Member

@davidzhao davidzhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fantastic work here! this is a meaty PR and really improves the flexibility of reconnection. Thank you for making LiveKit better!

@davidzhao davidzhao merged commit 10dcbef into livekit:main Jul 13, 2022
@taskbit
Copy link
Contributor Author

taskbit commented Jul 13, 2022

@davidzhao thank you for working on this awesome project and the detailed code review, appreciate it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants