Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicit public address for authority discovery DHT record and Identify protocol #3657

Closed
wants to merge 14 commits into from

Conversation

dmitry-markin
Copy link
Contributor

@dmitry-markin dmitry-markin commented Mar 12, 2024

As reported in #3519 (comment), DHT is flooded by rogue automatically discovered addresses (both in regular DHT records and authority discovery DHT records) when a node is run on a spot instance.

This PR adds flag --public-addr-only that forces publishing of only explicitly set public addresses.

If --public-addr-only flag is passed:

  1. The node publishes only explicitly set --public-addr addresses to the authority discovery DHT record instead af the discovered external addresses (applies to validators).
  2. Only --public_addr addresses are advertised via Identify protocol instead of all listen addresses and discovered external addresses, so that remote nodes do not add them to the node addresses in DHT (applies to all nodes).

This PR also fixes a subtle bug where expired external addresses were never removed from the list in PeerInfo when handling FromSwarm::ExpiredExternalAddr.

@dmitry-markin dmitry-markin requested review from altonen and lexnv March 12, 2024 10:39
Comment on lines 417 to 418
self.identify.on_swarm_event(FromSwarm::NewListenAddr(e));
// Do not report listen addresses to `Identify`, because remote nodes should only
// know about our external addresses.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is a good idea. Probably needs a burn-in in a real network to see if it's safe to remove because AFAIU many of our nodes rely on external address discovery via Identify and the issues we've experienced are limited Rococo. If this is necessary to fix it, then it probably should be behind a CLI flag to enable this behavior explicitly.

Copy link
Contributor Author

@dmitry-markin dmitry-markin Mar 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's do a burn-in. This should not affect the external address discovery via Identify, as discovery is handled separately via observed_addr field of Identify message.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was mixing up listen and external addresses

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The confusion is no surprise, as there is only one field listen_addrs in Identify message, where both listen and external addresses are added.

@dmitry-markin dmitry-markin added A0-needs_burnin Pull request needs to be tested on a live validator node before merge. DevOps is notified via matrix T0-node This PR/Issue is related to the topic “node”. labels Mar 12, 2024
@dmitry-markin dmitry-markin requested a review from altonen March 12, 2024 13:52
@dmitry-markin
Copy link
Contributor Author

dmitry-markin commented Mar 13, 2024

Putting more thought into it, I think this is not enough to clean DHT from temporary spot instance addresses, as they will be discovered via Identify and published as external addresses anyway. Unless connections from temporary addresses are blocked/forwarded by a firewall.

I think we need to explicitly filter all addresses except the explicitly set --public-addr from being added to DHT, even as external addresses. I would not introduce a new cli flag for this and enable this behavior always when --public-addr flag is passed, which seems reasonable as this is the intended dial address set by the operator.

@lexnv @altonen what do you think?

@dmitry-markin
Copy link
Contributor Author

The same applies to authority discovery, as a spot instance temporary addresses will end up in the authority discovery DHT records as external addresses.

@dmitry-markin
Copy link
Contributor Author

dmitry-markin commented Mar 13, 2024

I would not introduce a new cli flag for this and enable this behavior always when --public-addr flag is passed, which seems reasonable as this is the intended dial address set by the operator.

Another option is to add one more flag --no-external-addr-discovery, but the operator would need to pass all three: --public-addr --hide-listen-addr --no-external-addr-discovery while this behavior seems intended when --public-addr is passed.

EDIT: another option is --public-addr xxx --public-addr-only 😄

I'm going to update the PR to cover both Identify and authority discovery once we decide on the flag usage.

@altonen
Copy link
Contributor

altonen commented Mar 13, 2024

Out of caution, I'd go with an extra flag instead of modifying behavior of the current flags, simply because this is trying to address an issue we've encountered in Rococo, not on all networks, and because we don't know if there is someone relying on the current behavior of the flags. Your call though

@lexnv
Copy link
Contributor

lexnv commented Mar 13, 2024

I do like the --public-addr xxx --public-addr-only, this keeps things simple for us and node operations (as opposed of having 2 extra CLI options to hide the listen address and no external address).
Then, Rococo maintainers could add the extra flag to enable this behavior :D

@dmitry-markin dmitry-markin marked this pull request as draft March 13, 2024 15:58
@dmitry-markin dmitry-markin changed the title Do not report listen addresses via Identify protocol Explicit public address for authority discovery DHT record and Identify protocol Mar 14, 2024
@dmitry-markin
Copy link
Contributor Author

@lexnv could you have a look again, please? The logic with Identify was reworked + the action of the flag was extended on authority discovery DHT records.

@dmitry-markin dmitry-markin marked this pull request as ready for review March 15, 2024 11:34
@dmitry-markin dmitry-markin removed the request for review from altonen March 15, 2024 11:36

debug!(target: LOG_TARGET, "Authority DHT record peer_id='{:?}' addresses='{:?}'", peer_id, addresses.clone().collect::<Vec<_>>());
debug!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dq:
I think moving the debug log here changes a bit the details we expose.
I was curious to know the addresses we publish in DHT prior to adding the /p2p/local_peer_id.

Having the log line here might hint us about /p2p/other_peer entries (because they will remain untouched).
Could you see any value in distinguishing between rogue_ip_address and rogue_ip_address/p2p/local_peer_id?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the log here to see what addresses exactly are published to DHT. In case we want pretty addresses without /p2p/... part, we would also need to drop the /p2p/... part if it was provided by the operator.

Copy link
Contributor

@lexnv lexnv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice one 👍

@bkchr
Copy link
Member

bkchr commented Mar 18, 2024

I'm not convinced by this pull request. You are trying to add some duct tape. This change will require manual intervention of node operators. Aka this will not fix anything. We have almost all the data that we need. Other nodes report us out addresses. We just need to take the data and put it together in a better way.

This PR also fixes a subtle bug where expired external addresses were never removed from the list in PeerInfo when handling FromSwarm::ExpiredExternalAddr.

This is a correct fix. However, I'm not sure if you went down to libp2p checking when this is being generated. (I have done this and I think it is generated almost never because they have a address cache for 200 automatically).

As I already said in the issue or somewhere else, we need to start track the "lifeness" of these addresses manually in our networking code and prune them if they are too old.

Authority discovery should be changed to report the addresses in the following order:

  • Our own listen addresses/public addresses (This should already be good enough for proper validators as they are reachable from the outside)
  • The rest you fill up with addresses reported from other nodes.

@dmitry-markin
Copy link
Contributor Author

dmitry-markin commented Mar 18, 2024

As I already said in the issue or somewhere else, we need to start track the "lifeness" of these addresses manually in our networking code and prune them if they are too old.

For this, we'd need to introduce the address probing protocol, something like AutoNAT v2 proposed in libp2p. Otherwise we don't have any source of truth for what addresses are real and what are incorrectly reported by remote peers with faulty NAT / port forwarding configurations or incorrectly translated by libp2p (it assigns a local listen port to an observed external address and breaks things when the external address has a different port than the listen address).

Authority discovery should be changed to report the addresses in the following order:

* Our own listen addresses/public addresses (This should already be good enough for proper validators as they are reachable from the outside)

* The rest you fill up with addresses reported from other nodes.

Currently, upon fetching an authority-discovery DHT record, we just insert the addresses into Discovery behavior via different forms of add_known_address, which all boil down to calling Discovery::add_know_address, effectively just adding the addresses to the DHT routing table. When we dial a peer, we lookup addresses in libp2p Kad behavior's handle_pending_outbound_connection() where they are fetched from a k-bucket's entry in the order they were inserted there. As the validator have already participated in DHT, there are chances there are already addresses in the routing table reported via Identify protocol, including all the listen addresses and incorrectly discovered external addresses.

TLDR; we can order addresses to make the address provided by the operator go first, but we'd need to maintain the order in both (1) authority-discovery records and (2) DHT routing table (via providing the needed order in Identify).

@bkchr does this make sense?

@bkchr
Copy link
Member

bkchr commented Mar 18, 2024

For this, we'd need to introduce the address probing protocol, something like AutoNAT v2 proposed in libp2p. Otherwise we don't have any source of truth for what addresses are real and what are incorrectly reported by remote peers with faulty NAT / port forwarding configurations or incorrectly translated by libp2p (it assigns a local listen port to an observed external address and breaks things when the external address has a different port than the listen address).

Generally we are not supporting authorities behind NATs. Authorities should be reachable directly. Otherwise we would need support for hole punching etc.

Currently, upon fetching an authority-discovery DHT record, we just insert the addresses into Discovery behavior via different forms of add_known_address, which all boil down to calling Discovery::add_know_address, effectively just adding the addresses to the DHT routing table.

Generally I was talking about the sending site. AKA what we put into the DHT for our node.

When we dial a peer, we lookup addresses in libp2p Kad behavior's handle_pending_outbound_connection() where they are fetched from a k-bucket's entry in the order they were inserted there. As the validator have already participated in DHT, there are chances there are already addresses in the routing table reported via Identify protocol, including all the listen addresses and incorrectly discovered external addresses.

But we call this code in Kademlia, from here. And before we already use this ephemeral list. So, we could give addresses in the ephemeral list a higher "weight" when they come from authority discovery.

@dmitry-markin
Copy link
Contributor Author

Generally we are not supporting authorities behind NATs. Authorities should be reachable directly. Otherwise we would need support for hole punching etc.

Unfortunately, this is exactly what we are doing at Parity — running rococo validators on spot instances with hole punching.

Generally I was talking about the sending site. AKA what we put into the DHT for our node.

👍

But we call this code in Kademlia, from here. And before we already use this ephemeral list. So, we could give addresses in the ephemeral list a higher "weight" when they come from authority discovery.

Yes, makes sense, missed this. Should be already working as needed if we publish the authority-discovery DHT records with the right order of addresses.

@bkchr
Copy link
Member

bkchr commented Mar 18, 2024

Unfortunately, this is exactly what we are doing at Parity — running rococo validators on spot instances with hole punching.

Then we need to change the infrastructure and not the code ;)

@bkchr
Copy link
Member

bkchr commented Mar 18, 2024

Should be already working as needed if we publish the authority-discovery DHT records with the right order of addresses.

Yes, but we could also optimize the function to not return duplicated addresses.

@dmitry-markin
Copy link
Contributor Author

Closing in favor of #3757.

github-merge-queue bot pushed a commit that referenced this pull request Mar 22, 2024
)

Make sure explicitly set by the operator public addresses go first in
the authority discovery DHT records.

Also update `Discovery` behavior to eliminate duplicates in the returned
addresses.

This PR should improve situation with
#3519.

Obsoletes #3657.
dharjeezy pushed a commit to dharjeezy/polkadot-sdk that referenced this pull request Mar 24, 2024
…ritytech#3757)

Make sure explicitly set by the operator public addresses go first in
the authority discovery DHT records.

Also update `Discovery` behavior to eliminate duplicates in the returned
addresses.

This PR should improve situation with
paritytech#3519.

Obsoletes paritytech#3657.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A0-needs_burnin Pull request needs to be tested on a live validator node before merge. DevOps is notified via matrix T0-node This PR/Issue is related to the topic “node”.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants