No VPN connectivity after domain/api restart #6648
Comments
If anyone else is facing a similar issue: for the moment I have mitigated it somewhat by increasing the number of replicas of the API and Domain components and applying podAntiAffinities so that the replicas are guaranteed to run on different nodes and availability zones. This way it is less likely that the Gateway Pods lose connectivity completely.
The connection to the portal is designed to not affect the data plane, and we have integration tests to check that. Can you post some logs of the client and gateways? Ideally with
It took me some time to sanitize the logs of personal information (IPs, passwords, etc.), but I finally made it. Here are the gateway logs: Hope this helps!
Unfortunately, those seem to be the logs of the GUI, not the IPC service.
This seems to be where the issue starts. For some reason, the relay credentials we have locally don't match anymore. Are you restarting the relay too as part of restarting the API? They should be restarted in sequence, not in parallel; otherwise the API can't detect that the relay restarted and push new credentials to the clients.
I think what might be happening here is that we skip the TURN server because it still has the same ID, but the credentials might have changed. We might need an additional check here. I am surprised this doesn't come up in our production setup though. The relevant code is here: firezone/rust/connlib/snownet/src/node.rs Lines 519 to 522 in b657c18
Instead of just checking for the ID being present, we should also check whether we have the same credentials. Shouldn't be too difficult! Are you willing to send a PR? :)
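For illustration, here is a minimal Rust sketch of such a check, using hypothetical types and field names rather than the actual connlib API:

```rust
// Hypothetical sketch: when the portal (re-)announces a relay, compare the
// credentials we already hold instead of only checking that the ID is known.
use std::collections::HashMap;

#[derive(PartialEq)]
struct RelayCredentials {
    username: String,
    password: String,
}

struct Allocation {
    credentials: RelayCredentials,
    // ... sockets, permissions, channel bindings, etc.
}

fn upsert_relay(
    allocations: &mut HashMap<u64, Allocation>,
    id: u64,
    new_credentials: RelayCredentials,
) {
    // Only keep the existing allocation if both the ID *and* the credentials
    // match; different credentials mean the relay rebooted while we were
    // partitioned from the portal, so the local state for it is stale.
    let unchanged = allocations
        .get(&id)
        .is_some_and(|existing| existing.credentials == new_credentials);

    if !unchanged {
        allocations.insert(id, Allocation { credentials: new_credentials });
    }
}
```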
Hello @thomaseizinger, thanks for the quick response! My bad, when you said client, I thought you meant the GUI client. Regarding the relay, I did not restart it; I only restarted the api and domain Pods. But I think the relay restarted on its own: I see the Pod has 1 restart and the old logs show this (the time is consistent with when I did the experiment): Regarding the PR, sorry, but I have zero experience with Rust. Best regards, Sebastian
That looks like a stack trace. Can you upload the entire relay log, please?
Here is the whole log (which ends with the stack trace I showed earlier):
Thanks for uploading that! First, I would highly recommend lowering the log level you are running the relay with. I'd also recommend checking your setup to find out why the portal was unreachable for 15 minutes. We only tolerate a certain amount of network partitioning.
Thanks for the feedback! I do not think the portal was unreachable for 15 minutes though; the Pods were up and running within a couple of minutes at most. I deleted them manually and was monitoring the EKS cluster the whole time. I have not configured the re-connection backoff; is there a way to increase it, maybe with an environment variable?
The logs suggest otherwise, although it is hard to say because they only cover about 3 minutes. If you can reproduce the problem with
It is currently not configurable, and I wouldn't recommend setting it to anything longer than 15 minutes. The portal is what authorizes connection flows, so too long a network partition is a risk.
Our relays are essential for connectivity because they also perform STUN for us, through which we learn our server-reflexive address. Thus, we must at all times have at least one reachable relay in order to establish a connection. The portal tracks connectivity to the relays for us and, in case any of them go down, sends us a `relays_presence` message, meaning we can stop using that relay and migrate any relayed connections to a new one. This works well as long as we are connected to the portal while the relay is rebooting / going down.

If we are not currently connected to the portal and a relay we are using reboots, we don't learn about it. If we are actively using it, the connection will fail, further attempted communication with the relay will time out, and we will stop using it. If we aren't currently using the relay, this gets a bit trickier: if it rebooted while we were partitioned from the portal, logging in again might return the same relay to us in the `init` message, but this time with different credentials.

The first bug we are fixing in this PR is that we previously ignored those credentials because we already knew about the relay, assuming we could still use our existing credentials. The fix here is to also compare the credentials and ditch the local state if they differ.

The second bug, identified while fixing the first one, is that we need to pro-actively probe whether all other relays we know about are actually still responsive. For that, we issue a `REFRESH` message to them. If that times out or fails otherwise, we remove that relay from our list of `Allocation`s too.

To fix the second bug, several changes were necessary:

1. We lower the log level of `Disconnecting from relay` from ERROR to WARN. Any ERROR emitted during a test run fails our test suite, which is what partially motivated this. The test suite builds on the assumption that ERRORs are fatal and thus should never happen during our tests. This change surfaces that disconnecting from a relay can indeed happen during normal operation, which justifies lowering this to WARN. Users should at a minimum monitor on WARN to be alerted about problems.
2. We reduce the total backoff duration for requests to relays from 60s to 10s. The current 60s results in a total of 8 retries. UDP is unreliable, but it isn't so unreliable as to justify retrying everything for 60s. We also use a 10s timeout for ICE, so the two are now aligned to better match each other. We had to change the max backoff duration because we only idle-spin for at most 10s in the tests, and thus the current 60s was too long to detect that a relay actually disappeared.
3. We had to shuffle around some function calls to make sure all intermediary event buffers are emptied at the right point in time to make the test deterministic.

Fixes: #6648.
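As a rough sketch of the second part of that fix, again with hypothetical names and structure rather than the actual snownet code: after reconnecting to the portal, we probe every relay we still know about with a `REFRESH` and drop the ones whose probe is not answered within the shortened 10s backoff window.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Total retry window for requests to a relay, reduced from 60s to 10s so it
/// lines up with the 10s ICE timeout.
const REQUEST_BACKOFF: Duration = Duration::from_secs(10);

struct Allocation {
    refresh_sent_at: Option<Instant>,
    // ... credentials, sockets, channel bindings, etc.
}

impl Allocation {
    /// Record that a REFRESH was sent; the real code would serialize a STUN
    /// REFRESH request and hand it to the socket.
    fn send_refresh(&mut self, now: Instant) {
        self.refresh_sent_at = Some(now);
    }

    /// True if a REFRESH was sent and never answered within the backoff window.
    fn refresh_timed_out(&self, now: Instant) -> bool {
        matches!(self.refresh_sent_at, Some(sent) if now.duration_since(sent) > REQUEST_BACKOFF)
    }
}

/// On reconnecting to the portal, proactively probe every relay we still know about.
fn probe_relays(allocations: &mut HashMap<u64, Allocation>, now: Instant) {
    for allocation in allocations.values_mut() {
        allocation.send_refresh(now);
    }
}

/// Called from the timer loop: drop relays whose REFRESH was never answered.
/// (The real code also emits a WARN-level "Disconnecting from relay" log here,
/// lowered from ERROR.)
fn prune_unresponsive_relays(allocations: &mut HashMap<u64, Allocation>, now: Instant) {
    allocations.retain(|_, allocation| !allocation.refresh_timed_out(now));
}
```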
Hello @thomaseizinger, I saw you fixed this issue; is it released yet? Thanks for the support!
Yes, it is part of the latest client releases. Let me know if it is working for you! :)
I upgraded the gateway to v1.3.2 and the GUI client to 1.3.7. For the other components I am running the tag "081c4471136b5104dc79cda8e64a3f51c9ed37fa", the latest available version, which came out ~10 days ago. The issue is still happening, and today I had to restart the gateways a couple of times. By client, which component do you mean exactly?
Maybe that didn't make it into the release. Can you try the latest draft releases (both gateway and gui-client)? https://github.com/firezone/firezone/releases No changes to the portal should be necessary. If it still happens, can you post logs with RUST_LOG=debug? Thanks! You might be hitting a different issue! :)
Sorry, could you clarify what you mean by a draft release?
Specifically this one if you are using the gui-client: https://github.com/firezone/firezone/releases/tag/untagged-5991635f7729556e5a11 Not sure what it is in Spanish, el borrador? :)
I think those are only visible to members of the organisation; I don't have permission to see them.
Oh really? That is weird, and kind of useless lol. Let me see if the direct links work. Are you on Windows or Linux? You can download a build from
Hello @thomaseizinger, I checked the source code in the latest gateway release (1.3.2) and your PR was included.
Here is also the full gateway log: I would like to help with this as much as I can; let me know how else I can help.
Was the API online and the gateway connected when you did that? How many relays are you deploying? I think you may need more than one so that the portal can perform a failover where it switches the gateway from one relay to the other.
@AndrewDryga I think what may be happening here is that if we only have a single relay in the portal, we don't get a
The API was online and the gateway connected when I restarted the relay. If I increase the replicas of the relay, each replica should have its own static public IP address, right? Or can they be replicated like the gateway Pods?
By the way, restarting the gateway fixes the issue.
They need to have their own public IP unless you split the port range for the allocations so they don't overlap. It doesn't need to be static; it just shouldn't change throughout the uptime of the relay and needs to be routable.
Yeah, that is expected. Essentially, I think the problem here is that without at least a 2nd relay, we have nothing to fail over to when it goes down. The restart fixes it because then the portal assigns new relays to the gateway out of the available pool (of only 1).
Thanks for the clarification! Is this behaviour a "bug" which you plan to fix, or should I just change the way I manage the relays?
I think it is a bug. I am not sure what a good fix for it is or when we can fix it. It is an edge case that we haven't designed for, because our production setup has zero downtime with multiple relays during deployments. If you don't want downtime, I'd suggest adding more relays. If downtime is okay, I'd always restart the gateway together with the relay.
Final question: can I use the same token for multiple relay instances (like with gateways), or does each relay need its own token?
I think you should be able to use the same token.
Describe the bug
I am self-hosting Firezone (latest version) on Kubernetes using the community Helm chart, and I noticed that, if the domain/api Pods are restarted, VPN connectivity is lost. The DNS names are resolved correctly (I checked with nslookup from a terminal), but the actual requests to the resources time out. Connectivity is restored once I manually restart the gateway Pods.
To Reproduce
In Kubernetes, kill the domain and api Pods; connections to the resources behind the VPN will then time out.
Expected behavior
The gateway Pod should have a health check that detects that the domain/api components have restarted and forces the gateway Pod to restart, or the gateway should have a more resilient re-connection mechanism.
Screenshots / Logs
I did not see any error logs in the gateway pod.
Platform (please complete the following information)