QUIC connections are culled... sometimes #9061
Comments
There is no transport prioritization logic in libp2p; it just fires connections concurrently and the first one to work is used. We don't expect any particular pattern here, so whatever it ends up doing is expected.
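For readers unfamiliar with that dialing behaviour, here is the general "dial every transport concurrently, keep the first success" pattern written out as a plain Go sketch. It only illustrates the idea described in the comment above, not go-libp2p's actual dialer; the per-transport dial functions are placeholders supplied by the caller.

```go
package dialsketch

import (
	"context"
	"errors"
	"net"
)

// dialFirst dials over every given transport concurrently and returns the
// first connection that succeeds; later successes are closed. This mirrors
// the "no prioritization, first transport to work wins" behaviour described
// above (a simplified sketch, not go-libp2p's dialer).
func dialFirst(ctx context.Context, dials ...func(context.Context) (net.Conn, error)) (net.Conn, error) {
	if len(dials) == 0 {
		return nil, errors.New("no transports to dial")
	}

	ctx, cancel := context.WithCancel(ctx)

	type result struct {
		conn net.Conn
		err  error
	}
	results := make(chan result, len(dials))
	for _, dial := range dials {
		dial := dial
		go func() {
			c, err := dial(ctx)
			results <- result{c, err}
		}()
	}

	var firstErr error
	for i := 0; i < len(dials); i++ {
		r := <-results
		if r.err == nil {
			cancel() // abort the remaining dials
			// Close any connections that still complete after the winner.
			go func(remaining int) {
				for j := 0; j < remaining; j++ {
					if late := <-results; late.err == nil {
						late.conn.Close()
					}
				}
			}(len(dials) - 1 - i)
			return r.conn, nil
		}
		if firstErr == nil {
			firstErr = r.err
		}
	}
	cancel()
	return nil, firstErr
}
```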
That's okay, and I don't really care whether I hold QUIC or TCP connections. My problem is rather about why the node decides to close a large number of connections sometimes, and what I could do to avoid this.
We're going to prioritize looking at #9041 first, and then see where that leaves this.
Well, we moved to a different provider and gave it all some more CPU, which seems to have fixed this. It's unclear if it was the networking at our old provider, or if it was CPU time...

EDIT: scratch that, we're running v0.17, but with:

EDIT2: scratch that again, sorry... we were running v0.16, but with the above settings.
@mrd0ll4r I've somehow missed:

```json
"ConnMgr": {
  "GracePeriod": "0s",
  "HighWater": 100000,
  "LowWater": 0,
  "Type": "basic"
},
```

This tells the connection manager to instantly close all connections once you are above 100K. If you don't want the connection manager to close connections at all, you can disable it:

```json
"ConnMgr": {
  "Type": "none"
},
```
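To make the watermark semantics concrete, here is a simplified Go sketch of the trimming rule the basic connection manager applies (my own illustration of the documented LowWater/HighWater/GracePeriod behaviour, not go-libp2p's actual implementation): nothing happens until the connection count exceeds HighWater; above that, unprotected connections older than GracePeriod are closed until only LowWater remain.

```go
package connmgrsketch

import "time"

// conn is a stand-in for an open libp2p connection.
type conn struct {
	opened    time.Time
	protected bool
	close     func()
}

// trim sketches the basic connection manager's rule: below HighWater nothing
// happens; above it, unprotected connections older than GracePeriod are
// closed (lowest-value first in the real implementation) until only LowWater
// connections remain.
func trim(conns []conn, lowWater, highWater int, gracePeriod time.Duration) {
	if len(conns) <= highWater {
		return
	}
	toClose := len(conns) - lowWater
	now := time.Now()
	for _, c := range conns {
		if toClose == 0 {
			return
		}
		if c.protected || now.Sub(c.opened) < gracePeriod {
			continue // grace period and protection exempt a connection
		}
		c.close()
		toClose--
	}
}
```

With LowWater 0 and GracePeriod "0s", crossing the 100K HighWater therefore closes essentially every unprotected connection at once, which matches the behaviour described above; Type "none" disables the trimming entirely.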
At the time this came up, that didn't work, and I don't know if it does now. I think I tried a few configurations with the connection manager, and this one behaved the most like I wanted, as in, it takes all the connections it can get. So far, I haven't reached 100k yet, but I'll let you know what happens if I do :)
Another update: The connection drops are back, on new hardware with new networking, so I don't think it's related to that anymore. You can have a look at one of them here: https://grafana.monitoring.ipfs.trudi.group/?orgId=1&from=1674369895849&to=1674377132096

I have access to the nodes, so I could get pprof info and whatnot, but I don't know when these things happen, so... not really.

EDIT: More info: They reappeared when we upgraded from 0.16 to 0.17.
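Since the drops are hard to catch in the act, one option is a small watcher that polls the node's metrics and saves pprof profiles the moment the peer count falls sharply. Below is a rough Go sketch of that idea; it assumes the default kubo API address 127.0.0.1:5001, the /debug/metrics/prometheus and /debug/pprof/ endpoints served there, and the ipfs_p2p_peers_total metric from the graphs in this issue. The 15-second interval and 20% drop threshold are arbitrary placeholders.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"strconv"
	"strings"
	"time"
)

const api = "http://127.0.0.1:5001" // default kubo API address (assumption)

// peerCount sums all ipfs_p2p_peers_total samples (one per label set)
// from the node's Prometheus endpoint.
func peerCount() (float64, error) {
	resp, err := http.Get(api + "/debug/metrics/prometheus")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var total float64
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "ipfs_p2p_peers_total") {
			continue
		}
		fields := strings.Fields(line)
		if v, err := strconv.ParseFloat(fields[len(fields)-1], 64); err == nil {
			total += v
		}
	}
	return total, sc.Err()
}

// saveProfile writes one pprof profile (e.g. "goroutine?debug=2", "heap") to disk.
func saveProfile(name, path string) error {
	resp, err := http.Get(api + "/debug/pprof/" + name)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	var last float64
	for range time.Tick(15 * time.Second) {
		cur, err := peerCount()
		if err != nil {
			log.Println("scrape failed:", err)
			continue
		}
		// Capture profiles when the peer count drops by more than 20%.
		if last > 0 && cur < 0.8*last {
			ts := time.Now().Format("20060102-150405")
			log.Printf("peer count dropped %.0f -> %.0f, capturing profiles", last, cur)
			_ = saveProfile("goroutine?debug=2", fmt.Sprintf("goroutines-%s.txt", ts))
			_ = saveProfile("heap", fmt.Sprintf("heap-%s.pprof", ts))
		}
		last = cur
	}
}
```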
This seems like a libp2p issue, so looping in @marten-seemann @MarcoPolo
Thanks for not giving up on this :)

Update: I upgraded to v0.18, which made the entire situation much worse. I then changed our plugin to not keep long-living Bitswap streams to everyone, which has reduced CPU time considerably. I suppose I'm hitting (other peers') resource manager limits, which may be why this has become more expensive recently. Interestingly, however: while I was running 0.17, CPU usage was fine.

Anyway, now we're really just passively sitting there and listening. Connection drops have become less frequent. So, my current theory is that it's somehow CPU related, not sure exactly how though. In general, the machine running this is never overloaded, looking at CPU time at least...

Dashboard to look at stuff: https://grafana.monitoring.ipfs.trudi.group/?orgId=1&from=now-30d&to=now&refresh=5m
Not sure I understand what the issue is here. Is it connections being killed by your node, or by remote nodes? If they're being killed by your node, the answer probably lies in the connection manager, as it's the only place in libp2p where we kill connections unless explicitly asked to by the application. If it's by the remote node, this is hard to debug, as we haven't gotten around to implementing libp2p/specs#479 yet.
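If the culling does come from the local connection manager, one mitigation on the application side is to mark the connections a measurement plugin actually depends on as protected, which exempts them from trimming. A minimal sketch against go-libp2p's ConnManager interface; peersToKeep and the "measurement" tag are made-up placeholders:

```go
package protectsketch

import (
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
)

// protectPeers exempts the given peers from connection-manager trimming.
// Protected connections are skipped when the manager culls down to LowWater.
func protectPeers(h host.Host, peersToKeep []peer.ID) {
	for _, p := range peersToKeep {
		h.ConnManager().Protect(p, "measurement") // tag name is arbitrary
	}
}

// unprotectPeers releases the protection again once it is no longer needed.
func unprotectPeers(h host.Host, peersToKeep []peer.ID) {
	for _, p := range peersToKeep {
		h.ConnManager().Unprotect(p, "measurement")
	}
}
```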
As it's a large number of connections at the same time, I'd be very surprised if it were caused by the (unsynchronized) rest of the network, so I'm guessing it's my node. Interestingly, however, it happens on both of our nodes at the same time. They run on the same host, which leads me to think it's either load or networking related, on that node.

The idea being that the network is some 40 to 50k peers at the moment, so 100k should never be hit. And indeed, I never hit 100k, but connections are still culled. On the other hand, my usage is not exactly normal, so... yeah. I don't want to take up your resources with this. It's weird and messes with our size estimates, but it's not terrible or anything.
@mrd0ll4r a lot has changed since 0.18.0, please upgrade to 0.18.1 and try (in order):

This will help us with keeping debugging/discussion on track and scoped to a specific bug.
Checklist
Installation method
built from source
Version
Config
Description
Attaching prometheus metrics for ipfs_p2p_peers_total over some time:

I have the ConnMgr configured as described in #9041, which generally seems to work.
In almost all the cases, when QUIC connections are culled, they are replaced with connections from other transports.
(The IPv6 case above had both culled, but that might have been networking-related.)
Also, as can be seen, this happens independently for IPv4 and IPv6.
In many (maybe all) of the cases, I see some other things at the same time:

- Prometheus (running on the same host, so probably not network-related) fails to get a few scrapes.
- The memory usage (go_memstats_alloc_bytes) of the node spikes, but the VM has 16 GiB of memory, and there's not much else running, so I don't think that's the root cause, maybe more of a symptom.
- CPU usage spikes (and this might actually hit the limit on the VM, which has 4 cores).
- A large GC pause occurs (this is irate(go_gc_duration_seconds_sum[1m])).
- The number of FDs jumps, which indicates that new TCP connections are opened (since QUIC probably uses one FD for all connections, and TCP uses one per connection, iirc).
I know our usage is not exactly standard, but it would still be nice to figure out what's going on here.
We're running with IPFS_FD_MAX=100000 LIBP2P_SWARM_FD_LIMIT=100000 and warn logging. I think I didn't see anything suspicious in the logs at that level, but we have quite a lot of logging going on, so that's not 100% certain.
Let me know how I can further help diagnose this issue!