Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QUIC connections are culled... sometimes #9061

Closed
3 tasks done
mrd0ll4r opened this issue Jun 26, 2022 · 12 comments
Closed
3 tasks done

QUIC connections are culled... sometimes #9061

mrd0ll4r opened this issue Jun 26, 2022 · 12 comments
Labels
kind/bug A bug in existing code (including security flaws) P1 High: Likely tackled by core team if no one steps up status/blocked Unable to be worked further until needs are met

Comments

@mrd0ll4r
Copy link
Contributor

Checklist

Installation method

built from source

Version

go-ipfs version: 0.13.0
Repo version: 12
System version: amd64/linux
Golang version: go1.18

Compiled the v13.0 tag with go1.18.3 on x86_64 Linux

Config

{
  "API": {
    "HTTPHeaders": {}
  },
  "Addresses": {
    "API": [
      "/ip4/127.0.0.1/tcp/5001",
      "/ip4/10.0.1.2/tcp/5001"
    ],
    "Announce": [],
    "AppendAnnounce": [],
    "Gateway": "/ip4/127.0.0.1/tcp/8080",
    "NoAnnounce": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4",
      "/ip6/100::/ipcidr/64",
      "/ip6/2001:2::/ipcidr/48",
      "/ip6/2001:db8::/ipcidr/32",
      "/ip6/fc00::/ipcidr/7",
      "/ip6/fe80::/ipcidr/10"
    ],
    "Swarm": [
      "/ip4/0.0.0.0/tcp/4001",
      "/ip6/::/tcp/4001",
      "/ip4/0.0.0.0/udp/4001/quic",
      "/ip6/::/udp/4001/quic"
    ]
  },
  "AutoNAT": {},
  "Bootstrap": [
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt",
    "/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/ip4/104.131.131.82/udp/4001/quic/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ"
  ],
  "DNS": {
    "Resolvers": {}
  },
  "Datastore": {
    "BloomFilterSize": 0,
    "GCPeriod": "1h",
    "HashOnRead": false,
    "Spec": {
      "mounts": [
        {
          "child": {
            "path": "blocks",
            "shardFunc": "/repo/flatfs/shard/v1/next-to-last/2",
            "sync": true,
            "type": "flatfs"
          },
          "mountpoint": "/blocks",
          "prefix": "flatfs.datastore",
          "type": "measure"
        },
        {
          "child": {
            "compression": "none",
            "path": "datastore",
            "type": "levelds"
          },
          "mountpoint": "/",
          "prefix": "leveldb.datastore",
          "type": "measure"
        }
      ],
      "type": "mount"
    },
    "StorageGCWatermark": 90,
    "StorageMax": "10GB"
  },
  "Discovery": {
    "MDNS": {
      "Enabled": false,
      "Interval": 10
    }
  },
  "Experimental": {
    "AcceleratedDHTClient": false,
    "FilestoreEnabled": false,
    "GraphsyncEnabled": false,
    "Libp2pStreamMounting": false,
    "P2pHttpProxy": false,
    "StrategicProviding": false,
    "UrlstoreEnabled": false
  },
  "Gateway": {
    "APICommands": [],
    "HTTPHeaders": {
      "Access-Control-Allow-Headers": [
        "X-Requested-With",
        "Range",
        "User-Agent"
      ],
      "Access-Control-Allow-Methods": [
        "GET"
      ],
      "Access-Control-Allow-Origin": [
        "*"
      ]
    },
    "NoDNSLink": false,
    "NoFetch": false,
    "PathPrefixes": [],
    "PublicGateways": null,
    "RootRedirect": "",
    "Writable": false
  },
  "Identity": {
    "PeerID": "12D3KooWBkcbgU2GLxVPnYmg9esWnXYwJzj8FC5TsDJQKJibPNZL"
  },
  "Internal": {},
  "Ipns": {
    "RecordLifetime": "",
    "RepublishPeriod": "",
    "ResolveCacheSize": 128
  },
  "Migration": {
    "DownloadSources": [],
    "Keep": ""
  },
  "Mounts": {
    "FuseAllowOther": false,
    "IPFS": "/ipfs",
    "IPNS": "/ipns"
  },
  "Peering": {
    "Peers": null
  },
  "Pinning": {
    "RemoteServices": {}
  },
  "Plugins": {
    "Plugins": {
      "metric-export-plugin": {
        "Config": {
          "AgentVersionCutOff": 20,
          "HTTPServerConfig": {
            "ListenAddresses": [
              "10.0.1.2:8432",
              "127.0.0.1:8432"
            ]
          },
          "PopulatePrometheusInterval": 10,
          "TCPServerConfig": {
            "ListenAddresses": [
              "10.0.1.2:8181",
              "127.0.0.1:8181"
            ]
          }
        }
      }
    }
  },
  "Provider": {
    "Strategy": ""
  },
  "Pubsub": {
    "DisableSigning": false,
    "Router": ""
  },
  "Reprovider": {
    "Interval": "12h",
    "Strategy": "all"
  },
  "Routing": {
    "Type": "dht"
  },
  "Swarm": {
    "AddrFilters": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4",
      "/ip6/100::/ipcidr/64",
      "/ip6/2001:2::/ipcidr/48",
      "/ip6/2001:db8::/ipcidr/32",
      "/ip6/fc00::/ipcidr/7",
      "/ip6/fe80::/ipcidr/10"
    ],
    "ConnMgr": {
      "GracePeriod": "0s",
      "HighWater": 100000,
      "LowWater": 0,
      "Type": "basic"
    },
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": true,
    "EnableHolePunching": false,
    "RelayClient": {},
    "RelayService": {
      "Enabled": false
    },
    "ResourceMgr": {
      "Enabled": false
    },
    "Transports": {
      "Multiplexers": {},
      "Network": {},
      "Security": {}
    }
  }
}

Description

Attaching prometheus metrics for ipfs_p2p_peers_total over some time:
image

I have the ConnMgr configured as described in #9041, which generally seems to work.
In almost all the cases, when QUIC connections are culled, they are replaced with connections from other transports.
(The IPv6 case above had both culled, but that might have been networking-related.)

Also, as can be seen, this happens independently for IPv4 and IPv6.

In many (maybe all) of the cases, I see some other things at the same time:

  • Prometheus (running on the same host, so probably not network-related) fails to get a few scrapes:
    image

  • The memory usage (go_memstats_alloc_bytes) of the node spikes, but the VM has 16 GiB of memory, and there's not much else running, so I don't think that's the root cause, maybe more of a symptom

  • CPU usage spikes (and this might actually hit the limit on the VM, which as 4 cores)

  • A large GC pause occurs (this is irate(go_gc_duration_seconds_sum[1m]))

  • Number of FDs jumps, which indicates that new TCP connections are opened (since QUIC probably uses one FD for all connections, and TCP uses one per connection, iirc)
    image

I know our usage is not exactly standard, but it would still be nice to figure out what's going on here.
We're running with IPFS_FD_MAX=100000 LIBP2P_SWARM_FD_LIMIT=100000 and warn logging.
I think I didn't see anything suspicious in the logs at that level, but we have quite a lot of logging go on, so that's not 100% certain.

Let me know how I can further help diagnose this issue!

@mrd0ll4r mrd0ll4r added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels Jun 26, 2022
@Jorropo
Copy link
Contributor

Jorropo commented Jun 26, 2022

There is no transport prioritization logic in libp2p, it just fire connections concurrently and the first one to work is used.

We don't expect any particular pattern here, so it doing whatever it do is expected.

@mrd0ll4r
Copy link
Contributor Author

That's okay, and I don't really care whether I hold QUIC or TCP connections. My problem is rather about why the node decides to close a large number of connections sometimes, and what I could do to avoid this.

@BigLep
Copy link
Contributor

BigLep commented Jul 1, 2022

We're going to prioritize looking at #9041 first, and then see where that leaves this.

@BigLep BigLep added status/blocked Unable to be worked further until needs are met and removed need/triage Needs initial labeling and prioritization labels Jul 1, 2022
@guseggert guseggert added the P1 High: Likely tackled by core team if no one steps up label Sep 22, 2022
@mrd0ll4r
Copy link
Contributor Author

mrd0ll4r commented Jan 9, 2023

Well, we moved to a different provider and gave it all some more CPU, which seems to have fixed this. It's unclear if it was the networking at our old provider, or if it was CPU time...
We're still running v0.16, so I can't guarantee this is not an issue anymore. But it's fixed for us, for now.

EDIT: scratch that, we're running v0.17, but with:

# This disables mDNS discovery, which is generally useful in datacenters.
ipfs config profile apply server

# Disable the resource and connection managers, we want all the connections.
ipfs config --bool 'Swarm.ResourceMgr.Enabled' false

# Configure the ConnMgr to still manage connections, but without an upper limit.
ipfs config --json 'Swarm.ConnMgr' '{
  "GracePeriod": "0s",
  "HighWater": 100000,
  "LowWater": 0,
  "Type": "basic"
}'

EDIT2: scratch that again, sorry... we were running v0.16, but with above settings.

@Jorropo
Copy link
Contributor

Jorropo commented Jan 19, 2023

@mrd0ll4r I've somehow missed:

    "ConnMgr": {
      "GracePeriod": "0s",
      "HighWater": 100000,
      "LowWater": 0,
      "Type": "basic"
    },

This tell the connection manager to instantly close all connections when you are above 100K.
What you seems to be trying to do is:

    "ConnMgr": {
      "Type": "none"
    },

@mrd0ll4r
Copy link
Contributor Author

At the time this came up that didn't work and I don't know if it does now. I think I tried a few configurations with the connection manager, and this one behave the most like I wanted, as in, take all the connections it can get.

So far, I haven't reached 100k yet, but I'll let you know what happens if I do :)

@mrd0ll4r
Copy link
Contributor Author

mrd0ll4r commented Jan 22, 2023

Another update: The connection drops are back, on new hardware with new networking, so I don't think it's related to that anymore.

You can have a look at one of them here: https://grafana.monitoring.ipfs.trudi.group/?orgId=1&from=1674369895849&to=1674377132096

I have access to the nodes, so I could get pprof info and whatnot, but I don't know when these things happen, so... not really.
FYI: We're running v0.17, with this exact docker-compose setup.
The nodes are configured using this script.
Our plugin does not mess with connections. It opens long-lived Bitswap streams (sorry Max), and exposes some functionality to sample the peer store, but that's all read-only, infrequent, and non-blocking.

EDIT: More info: They reappeared when we upgraded from 0.16 to 0.17

@guseggert
Copy link
Contributor

This seems like a libp2p issue, so looping in @marten-seemann @MarcoPolo

@mrd0ll4r
Copy link
Contributor Author

Thanks for not giving up on this :)

Update: I upgraded to v0.18, which made the entire situation much worse. I then changed our plugin to not keep long-living Bitswap streams to everyone, which has reduced CPU time considerably. I suppose I'm hitting (other peers') resource manager limits, which may be why this has become more expensive recently. Interestingly however: while I was running 0.17, CPU usage was fine.

Anyway, now we're really just passively sitting there and listening. Connection drops have become less frequent.

So, my current theory is that it's somehow CPU related, not sure exactly how though. In general, the machine running this is never overloaded, looking at CPU time at least...

Dashboard to look at stuff: https://grafana.monitoring.ipfs.trudi.group/?orgId=1&from=now-30d&to=now&refresh=5m
( Scroll down to the Runtime & System Metrics section, which also shows which version of kubo we were running)

@marten-seemann
Copy link
Member

Not sure I understand what the issue is here. Is it connections being killed by your node, or by remote nodes. If they're being killed by your node, the answer probably lies in the connection manager, as it's the only place in libp2p where we kill connections unless explicitly asked to by the application.

If it's by the remote node, this is hard to debug, as we haven't gotten around to implementing libp2p/specs#479 yet.

@mrd0ll4r
Copy link
Contributor Author

mrd0ll4r commented Feb 1, 2023

As it's a large number of connections at the same time, I'd be very surprised if it were caused by the (unsynchronized) rest of the network, so I'm guessing it's my node. Interestingly, however, is that it happens on both of our nodes at the same time. They run on the same host, which leads me to think it's either load or networking related, on that node.
I used to run with ConnectionManager: none, but that stopped working properly at some point (see linked issue). So now I'm running with:

ipfs config --json 'Swarm.ConnMgr' '{
  "GracePeriod": "0s",
  "HighWater": 100000,
  "LowWater": 0,
  "Type": "basic"
}'

With the idea being that the network is some 40 to 50k peers at the moment, so 100k should never be hit. And indeed, I never hit 100k, but connections are still culled.

On the other hand, my usage is not exactly normal, so... yeah. I don't want to take up your resources with this. It's weird and messes with our size estimates, but it's not terrible or anything.

@lidel
Copy link
Member

lidel commented Feb 13, 2023

@mrd0ll4r a lot changed since 0.18.0, please upgrade to 0.18.1 and try (in order):

  1. Try ConnMgr.Type: "none" with 0.18.1, and if it does not work, fill a new bug
  2. Clean up your connection limits: https://github.com/ipfs/kubo/blob/master/docs/changelogs/v0.18.md#improving-libp2p-resource-management-integration and then adjust whatever you feel is necessary. If things don't work with the new managers from 0.18.1, fill a separate issue.

This will help us with keeping debugging/discussion on track and scoped to specific bug.

@lidel lidel closed this as completed Feb 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug A bug in existing code (including security flaws) P1 High: Likely tackled by core team if no one steps up status/blocked Unable to be worked further until needs are met
Projects
None yet
Development

No branches or pull requests

6 participants