No VPN connectivity after domain/api restart #6648
Comments
If anyone else is facing a similar issue: for the moment I have mitigated it somewhat by increasing the number of replicas of the API and Domain components and applying podAntiAffinities so that the replicas are guaranteed to run on different nodes and availability zones. This way it is less likely that the Gateway Pods lose connectivity completely.
The connection to the portal is designed to not affect the data plane, and we have integration tests to check that. Can you post some logs of the client and gateways? Ideally with
It took me some time to sanitize the logs of personal information (IPs, passwords, etc.), but I finally made it. Here are the gateway logs: Hope this helps!
Unfortunately, those seem to be the logs of the GUI, not the IPC service.
This seems to be where the issue starts. For some reason, the relay credentials we have locally don't match anymore. Are you restarting the relay too as part of restarting the API? They should be restarted in sequence, not in parallel; otherwise the API can't detect that the relay restarted and push new credentials to the clients.
I think what might be happening here is that we skip the TURN server because it still has the same ID, but the credentials might have changed. We might need an additional check here. I am surprised this doesn't come up in our production setup though. The relevant code is here: firezone/rust/connlib/snownet/src/node.rs Lines 519 to 522 in b657c18
Instead of just checking for the ID being present, we should also check whether we have the same credentials. Shouldn't be too difficult! Are you willing to send a PR? :)
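For illustration, here is a minimal Rust sketch of such a check, using hypothetical types and field names rather than the actual connlib API:

```rust
// Hypothetical sketch: when the portal (re-)announces a relay, compare the
// credentials we already hold instead of only checking that the ID is known.
use std::collections::HashMap;

#[derive(PartialEq)]
struct RelayCredentials {
    username: String,
    password: String,
}

struct Allocation {
    credentials: RelayCredentials,
    // ... sockets, permissions, channel bindings, etc.
}

fn upsert_relay(
    allocations: &mut HashMap<u64, Allocation>,
    id: u64,
    new_credentials: RelayCredentials,
) {
    // Only keep the existing allocation if both the ID *and* the credentials
    // match; different credentials mean the relay rebooted while we were
    // partitioned from the portal, so the local state for it is stale.
    let unchanged = allocations
        .get(&id)
        .is_some_and(|existing| existing.credentials == new_credentials);

    if !unchanged {
        allocations.insert(id, Allocation { credentials: new_credentials });
    }
}
```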
Hello @thomaseizinger, thanks for the quick response! My bad, when you said client, I thought you meant the GUI client. Regarding the relay, I did not restart it; I only restarted the api and domain Pods. But I think the relay restarted on its own: I see the Pod has 1 restart and the old logs show this (the time is consistent with when I did the experiment): Regarding the PR, sorry, but I have zero experience with Rust. Best regards, Sebastian
That looks like a stack trace. Can you upload the entire relay log, please?
Here is the whole log (which ends with the stack trace I showed earlier):
Thanks for uploading that! First, I would highly recommend lowering the log level you are running the relay with. I'd also recommend checking your setup to find out why the portal was unreachable for 15 minutes. We only tolerate a certain amount of network partitioning.
Thanks for the feedback! I do not think the portal was unreachable for 15 minutes though; the Pods were up and running within a couple of minutes at most. I deleted them manually and was monitoring the EKS cluster the whole time. I have not configured the re-connection backoff; is there a way to increase it, maybe with an environment variable?
The logs suggest otherwise, although it is hard to say because they only cover about 3 minutes. If you can reproduce the problem with
It is currently not configurable, and I wouldn't recommend setting it to anything longer than 15 minutes. The portal is what authorizes connection flows, so too long a network partition is a risk.
Our relays are essential for connectivity because they also perform STUN for us, through which we learn our server-reflexive address. Thus, we must at all times have at least one reachable relay in order to establish a connection. The portal tracks connectivity to the relays for us and, in case any of them go down, sends us a `relays_presence` message, meaning we can stop using that relay and migrate any relayed connections to a new one. This works well as long as we are connected to the portal while the relay is rebooting / going down.

If we are not currently connected to the portal and a relay we are using reboots, we don't learn about it. If we are actively using it, the connection will fail, further attempted communication with the relay will time out, and we will stop using it. If we aren't currently using the relay, this gets a bit trickier: if it rebooted while we were partitioned from the portal, logging in again might return the same relay to us in the `init` message, but this time with different credentials.

The first bug we are fixing in this PR is that we previously ignored those credentials because we already knew about the relay, assuming we could still use our existing credentials. The fix here is to also compare the credentials and ditch the local state if they differ.

The second bug, identified while fixing the first one, is that we need to pro-actively probe whether all other relays we know about are actually still responsive. For that, we issue a `REFRESH` message to them. If that times out or fails otherwise, we remove that relay from our list of `Allocation`s too.

To fix the second bug, several changes were necessary:

1. We lower the log level of `Disconnecting from relay` from ERROR to WARN. Any ERROR emitted during a test run fails our test suite, which is what partially motivated this. The test suite builds on the assumption that ERRORs are fatal and thus should never happen during our tests. This change surfaces that disconnecting from a relay can indeed happen during normal operation, which justifies lowering this to WARN. Users should at a minimum monitor on WARN to be alerted about problems.
2. We reduce the total backoff duration for requests to relays from 60s to 10s. The current 60s results in a total of 8 retries. UDP is unreliable, but it isn't so unreliable as to justify retrying everything for 60s. We also use a 10s timeout for ICE, so the two are now aligned to better match each other. We had to change the max backoff duration because we only idle-spin for at most 10s in the tests, and thus the current 60s was too long to detect that a relay actually disappeared.
3. We had to shuffle around some function calls to make sure all intermediary event buffers are emptied at the right point in time to make the test deterministic.

Fixes: #6648.
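As a rough sketch of the second part of that fix, again with hypothetical names and structure rather than the actual snownet code: after reconnecting to the portal, we probe every relay we still know about with a `REFRESH` and drop the ones whose probe is not answered within the shortened 10s backoff window.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Total retry window for requests to a relay, reduced from 60s to 10s so it
/// lines up with the 10s ICE timeout.
const REQUEST_BACKOFF: Duration = Duration::from_secs(10);

struct Allocation {
    refresh_sent_at: Option<Instant>,
    // ... credentials, sockets, channel bindings, etc.
}

impl Allocation {
    /// Record that a REFRESH was sent; the real code would serialize a STUN
    /// REFRESH request and hand it to the socket.
    fn send_refresh(&mut self, now: Instant) {
        self.refresh_sent_at = Some(now);
    }

    /// True if a REFRESH was sent and never answered within the backoff window.
    fn refresh_timed_out(&self, now: Instant) -> bool {
        matches!(self.refresh_sent_at, Some(sent) if now.duration_since(sent) > REQUEST_BACKOFF)
    }
}

/// On reconnecting to the portal, proactively probe every relay we still know about.
fn probe_relays(allocations: &mut HashMap<u64, Allocation>, now: Instant) {
    for allocation in allocations.values_mut() {
        allocation.send_refresh(now);
    }
}

/// Called from the timer loop: drop relays whose REFRESH was never answered.
/// (The real code also emits a WARN-level "Disconnecting from relay" log here,
/// lowered from ERROR.)
fn prune_unresponsive_relays(allocations: &mut HashMap<u64, Allocation>, now: Instant) {
    allocations.retain(|_, allocation| !allocation.refresh_timed_out(now));
}
```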
Hello @thomaseizinger, I saw you fixed this issue; is it released yet? Thanks for the support!
Yes, it is part of the latest client releases. Let me know if it is working for you! :)
I upgraded the gateway to v1.3.2 and the GUI client to 1.3.7. For the other components I am running the tag "081c4471136b5104dc79cda8e64a3f51c9ed37fa", the latest available version, which came out ~10 days ago. The issue is still happening, and today I had to restart the gateways a couple of times. By client, which component do you mean exactly?
Maybe that didn't make it into the release. Can you try the latest draft releases (both gateway and gui-client)? https://github.com/firezone/firezone/releases No changes to the portal should be necessary. If it still happens, can you post logs with RUST_LOG=debug? Thanks! You might be hitting a different issue! :)
Sorry, could you clarify what you mean by a draft release?
Specifically this one if you are using the gui-client: https://github.com/firezone/firezone/releases/tag/untagged-5991635f7729556e5a11 Not sure what it is in Spanish, el borrador? :)
I think those are only visible to members of the organisation; I don't have permission to see them.
Oh really? That is weird, and kind of useless lol. Let me see if the direct links work. Are you on Windows or Linux? You can download a build from
Hello @thomaseizinger, I checked the source code in the latest gateway release (1.3.2) and your PR was included.
Here is also the full gateway log: I would like to help with this as much as I can; let me know how else I can help.
Was the API online and the gateway connected when you did that? How many relays are you deploying? I think you may need more than one so that the portal can perform a failover where it switches the gateway from one relay to the other.
@AndrewDryga I think what may be happening here is that if we only have a single relay in the portal, we don't get a
The API was online and the gateway connected when I restarted the relay. If I increase the replicas of the relay, each replica should have its own static public IP address, right? Or can they be replicated like the gateway Pods?
By the way, restarting the gateway fixes the issue.
They need to have their own public IP unless you split the port range for the allocations so they don't overlap. It doesn't need to be static; it just shouldn't change throughout the uptime of the relay and needs to be routable.
Yeah, that is expected. Essentially, I think the problem here is that without at least a 2nd relay, we have nothing to fail over to when it goes down. The restart fixes it because then the portal assigns new relays to the gateway out of the available pool (of only 1).
Thanks for the clarification! Is this behaviour a "bug" which you plan to fix, or should I just change the way I manage the relays?
I think it is a bug. I am not sure what a good fix for it is or when we can fix it. It is an edge case that we haven't designed for, because our production setup has zero downtime with multiple relays during deployments. If you don't want downtime, I'd suggest adding more relays. If downtime is okay, I'd always restart the gateway together with the relay.
Final question: can I use the same token for multiple relay instances (like with gateways), or does each relay need its own token?
I think you should be able to use the same token.
Describe the bug
I am self-hosting Firezone (latest version) on Kubernetes using the community Helm chart, and I noticed that, if the domain/api Pods are restarted, VPN connectivity is lost. The DNS names are resolved correctly (I checked with nslookup from a terminal), but the actual requests to the resources time out. Connectivity is restored once I manually restart the gateway Pods.
To Reproduce
In Kubernetes, kill the domain and api Pods; connections to the resources behind the VPN will then time out.
Expected behavior
The gateway Pod should have a health check that detects that the domain/api components have restarted and forces the gateway Pod to restart, or the gateway should have a more resilient re-connection mechanism.
Screenshots / Logs
I did not see any error logs in the gateway pod.
Platform (please complete the following information)