Memory leak #10461

Closed
3 tasks done
Tracked by #10436
RubenKelevra opened this issue Jul 23, 2024 · 19 comments
Labels
kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization

Comments

@RubenKelevra
Contributor

Checklist

Installation method

built from source

Version

0.29.0

Config

{
  "API": {
    "HTTPHeaders": {}
  },
  "Addresses": {
    "API": "/ip4/127.0.0.1/tcp/5001",
    "Announce": null,
    "AppendAnnounce": null,
    "Gateway": "/ip4/127.0.0.1/tcp/8081",
    "NoAnnounce": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.0.0/ipcidr/29",
      "/ip4/192.0.0.8/ipcidr/32",
      "/ip4/192.0.0.170/ipcidr/32",
      "/ip4/192.0.0.171/ipcidr/32",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4",
      "/ip6/100::/ipcidr/64",
      "/ip6/2001:2::/ipcidr/48",
      "/ip6/2001:db8::/ipcidr/32",
      "/ip6/fc00::/ipcidr/7",
      "/ip6/fe80::/ipcidr/10"
    ],
    "Swarm": [
      "/ip4/0.0.0.0/tcp/443",
      "/ip6/::/tcp/443",
      "/ip4/0.0.0.0/udp/443/quic-v1",
      "/ip4/0.0.0.0/udp/443/quic-v1/webtransport",
      "/ip6/::/udp/443/quic-v1",
      "/ip6/::/udp/443/quic-v1/webtransport"
    ]
  },
  "AutoNAT": {},
  "Bootstrap": [
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt",
    "/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/ip4/104.131.131.82/udp/4001/quic-v1/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ"
  ],
  "DNS": {
    "Resolvers": null
  },
  "Datastore": {
    "BloomFilterSize": 0,
    "GCPeriod": "48h",
    "HashOnRead": false,
    "Spec": {
      "mounts": [
        {
          "child": {
            "path": "blocks",
            "shardFunc": "/repo/flatfs/shard/v1/next-to-last/2",
            "sync": false,
            "type": "flatfs"
          },
          "mountpoint": "/blocks",
          "prefix": "flatfs.datastore",
          "type": "measure"
        },
        {
          "child": {
            "compression": "none",
            "path": "datastore",
            "type": "levelds"
          },
          "mountpoint": "/",
          "prefix": "leveldb.datastore",
          "type": "measure"
        }
      ],
      "type": "mount"
    },
    "StorageGCWatermark": 90,
    "StorageMax": "500GB"
  },
  "Discovery": {
    "MDNS": {
      "Enabled": false,
      "Interval": 10
    }
  },
  "Experimental": {
    "FilestoreEnabled": true,
    "GraphsyncEnabled": false,
    "Libp2pStreamMounting": false,
    "P2pHttpProxy": false,
    "StrategicProviding": false,
    "UrlstoreEnabled": false
  },
  "Gateway": {
    "APICommands": [],
    "HTTPHeaders": {},
    "NoDNSLink": false,
    "NoFetch": false,
    "PathPrefixes": [],
    "PublicGateways": null,
    "RootRedirect": "",
    "Writable": false
  },
  "Identity": {
    "PeerID": "x"
  },
  "Internal": {
    "Bitswap": {
      "EngineBlockstoreWorkerCount": 32,
      "EngineTaskWorkerCount": 128,
      "MaxOutstandingBytesPerPeer": null,
      "ProviderSearchDelay": null,
      "TaskWorkerCount": 128
    }
  },
  "Ipns": {
    "RecordLifetime": "4h",
    "RepublishPeriod": "1h",
    "ResolveCacheSize": 2048,
    "UsePubsub": true
  },
  "Migration": {
    "DownloadSources": null,
    "Keep": ""
  },
  "Mounts": {
    "FuseAllowOther": false,
    "IPFS": "/ipfs",
    "IPNS": "/ipns"
  },
  "Peering": {
    "Peers": null
  },
  "Pinning": {},
  "Plugins": {
    "Plugins": null
  },
  "Provider": {
    "Strategy": ""
  },
  "Pubsub": {
    "DisableSigning": false,
    "Enabled": true,
    "Router": "gossipsub"
  },
  "Reprovider": {},
  "Routing": {
    "AcceleratedDHTClient": false,
    "Methods": null,
    "Routers": null
  },
  "Swarm": {
    "AddrFilters": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.0.0/ipcidr/29",
      "/ip4/192.0.0.8/ipcidr/32",
      "/ip4/192.0.0.170/ipcidr/32",
      "/ip4/192.0.0.171/ipcidr/32",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4",
      "/ip6/100::/ipcidr/64",
      "/ip6/2001:2::/ipcidr/48",
      "/ip6/2001:db8::/ipcidr/32",
      "/ip6/fc00::/ipcidr/7",
      "/ip6/fe80::/ipcidr/10"
    ],
    "ConnMgr": {
      "GracePeriod": "3m0s",
      "HighWater": 600,
      "LowWater": 500,
      "Type": "basic"
    },
    "DisableBandwidthMetrics": true,
    "DisableNatPortMap": true,
    "RelayClient": {},
    "RelayService": {},
    "ResourceMgr": {
      "Limits": {}
    },
    "Transports": {
      "Multiplexers": {},
      "Security": {}
    }
  }
}

Description

ipfs's memory usage increased over the uptime of the server (11 days, 16 hours, 40 minutes) until it reached 69% of my 32 GB memory:

Screenshot_20240723_102728

@RubenKelevra RubenKelevra added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels Jul 23, 2024
@RubenKelevra
Contributor Author

Two minutes after restarting the service, ipfs uses 1% of the 32 GB of memory in my use case.

@Rashkae2

It's really bad. I increased my IPFS VM from 8 GB to 12 GB of RAM, but with AcceleratedDHT on, it can't even make it past 24 hours.

@RubenKelevra
Contributor Author

Thanks for confirming @Rashkae2

@aschmahmann
Contributor

@RubenKelevra can you provide a pprof dump (ipfs diag profile), or at least post the heap from the profile? I'm wondering if this is libp2p/go-libp2p#2841, which is fixed and will be in the next release (which should have an RC this week).

@RubenKelevra
Contributor Author

@aschmahmann sure, do I need to censor anything in the dump to protect my private key or the private keys of ipns?

@mercxry

mercxry commented Jul 30, 2024

I'm also seeing the same memory leak on the latest version, 0.29.0. This is my server's memory over the last 15 days:

CleanShot 2024-07-30 at 08 46 46@2x

and then after restarting ipfs/kubo
CleanShot 2024-07-30 at 08 48 24@2x

@RubenKelevra
Contributor Author

RubenKelevra commented Jul 30, 2024

I also got this warning when I shut ipfs down.

128 provides in 22 minutes is an atrocious rate. Wtf?

This server has a 2.5 Gbit/s link and an NVMe drive, and uses 500–600 connections. The number should be a couple of orders of magnitude higher.

Jul 27 12:16:06 odin.pacman.store ipfs[608]: Daemon is ready
Jul 29 22:36:32 odin.pacman.store ipfs[608]: 2024/07/29 22:36:32 websocket: failed to close network connection: close tcp 45.83.104.156:39592->83.173.236.97:443: use of closed network connection
Jul 30 10:50:56 odin.pacman.store ipfs[608]: Received interrupt signal, shutting down...
Jul 30 10:50:56 odin.pacman.store ipfs[608]: (Hit ctrl-c again to force-shutdown the daemon.)
Jul 30 10:50:56 odin.pacman.store systemd[1]: Stopping InterPlanetary File System (IPFS) daemon...
Jul 30 10:50:58 odin.pacman.store ipfs[608]: 2024-07-30T10:50:58.454+0200        ERROR        core:constructor        node/provider.go:92
Jul 30 10:50:58 odin.pacman.store ipfs[608]: 🔔🔔🔔 YOU ARE FALLING BEHIND DHT REPROVIDES! 🔔🔔🔔
Jul 30 10:50:58 odin.pacman.store ipfs[608]: ⚠ Your system is struggling to keep up with DHT reprovides!
Jul 30 10:50:58 odin.pacman.store ipfs[608]: This means your content could partially or completely inaccessible on the network.
Jul 30 10:50:58 odin.pacman.store ipfs[608]: We observed that you recently provided 128 keys at an average rate of 22m46.337422021s per key.
Jul 30 10:50:58 odin.pacman.store ipfs[608]: 💾 Your total CID count is ~792130 which would total at 300643h34m22.10549473s reprovide process.
Jul 30 10:50:58 odin.pacman.store ipfs[608]: ⏰ The total provide time needs to stay under your reprovide interval (22h0m0s) to prevent falling behind!
Jul 30 10:50:58 odin.pacman.store ipfs[608]: 💡 Consider enabling the Accelerated DHT to enhance your reprovide throughput. See:
Jul 30 10:50:58 odin.pacman.store ipfs[608]: https://github.com/ipfs/kubo/blob/master/docs/config.md#routingaccelerateddhtclient
Jul 30 10:50:59 odin.pacman.store systemd[1]: ipfs@ipfs.service: Deactivated successfully.
Jul 30 10:50:59 odin.pacman.store systemd[1]: Stopped InterPlanetary File System (IPFS) daemon.
Jul 30 10:50:59 odin.pacman.store systemd[1]: ipfs@ipfs.service: Consumed 1d 54min 42.440s CPU time, 13.6G memory peak.
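
For anyone wanting to cross-check the numbers in the warning above, here is a minimal arithmetic sketch. The variable names are made up for this illustration; the values are copied from the log lines. It multiplies the reported per-key time by the reported CID count and compares the result with the reprovide interval.

```python
# Figures copied from the "FALLING BEHIND DHT REPROVIDES" warning above.
seconds_per_key = 22 * 60 + 46.337422021  # "22m46.337422021s per key"
total_cids = 792_130                       # "total CID count is ~792130"
reprovide_interval_hours = 22              # "reprovide interval (22h0m0s)"

# Time for one full reprovide pass at the observed rate, in hours.
total_hours = seconds_per_key * total_cids / 3600

print(f"full reprovide pass: about {total_hours:,.0f} h")
print(f"interval exceeded by a factor of about {total_hours / reprovide_interval_hours:,.0f}")
```

This reproduces the roughly 300643 hours the daemon reports, which is far beyond the 22 hour interval.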

@RubenKelevra
Contributor Author

@aschmahmann sure, do I need to censor anything in the dump to protect my private key or the private keys of ipns?

Hey @aschmahmann, don't bother. I've started ipfs now with a fresh key and no keystore in the server to provide a full dump without any concerns.

But it would be nice to know how to do this safely in the future, maybe with a how-to?

@lidel
Member

lidel commented Jul 30, 2024

@RubenKelevra good news: a privacy notice exists under ipfs diag profile --help, and the dumps don't include your private keys:

Privacy Notice:

  The output file includes:

  - A list of running goroutines.
  - A CPU profile.
  - A heap inuse profile.
  - A heap allocation profile.
  - A mutex profile.
  - A block profile.
  - Your copy of go-ipfs.
  - The output of 'ipfs version --all'.
  
It does not include:

  - Any of your IPFS data or metadata.
  - Your config or private key.
  - Your IP address.
  - The contents of your computer's memory, filesystem, etc.

If you could share the profile .zip (here or privately via message to https://discuss.ipfs.tech/u/lidel/), that would be helpful.

FYSA there will be 0.30.0-rc1 next week, which includes some fixes (#10436) which might help, or narrow down the number of existing leaks.
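
If you want to peek at what is inside the profile .zip before sharing it, something like the following could work. It is only a sketch: list_profile_members is a hypothetical helper written for this illustration (not a kubo tool), the archive path is a placeholder, and it relies solely on Python's standard zipfile module.

```python
import zipfile


def list_profile_members(zip_path: str) -> list[tuple[str, int]]:
    """Return (file name, uncompressed size in bytes) for each archive member."""
    with zipfile.ZipFile(zip_path) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


# Example with a placeholder path (replace with the file written by
# 'ipfs diag profile'):
# for name, size in list_profile_members("ipfs-profile.zip"):
#     print(f"{size:>12}  {name}")
```

The listing only shows names and sizes, so it is a quick sanity check against the privacy notice rather than an analysis of the profiles themselves.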

@RubenKelevra
Contributor Author

@lidel thanks for the info! Will do ASAP :)

@lidel
Member

lidel commented Jul 30, 2024

btw: if you want to improve the provide speed without running accelerated dht client, you may also experiment with https://github.com/ipfs/kubo/blob/master/docs/experimental-features.md#optimistic-provide

@RubenKelevra
Contributor Author

RubenKelevra commented Jul 30, 2024

@lidel wrote:

btw: if you want to improve the provide speed without running accelerated dht client, you may also experiment with https://github.com/ipfs/kubo/blob/master/docs/experimental-features.md#optimistic-provid

Thanks, but I think this may be more related to the memory leak issue. 474 sec for a single provide feels a bit too high. ;)

As soon as the issue is gone I'll look into that.

Filelink is out via PM.

@gammazero
Contributor

gammazero commented Aug 1, 2024

The pprof data you provided indicates that the memory consumption is primarily due to quic connections. There was at least one quic resource issue that has been fixed in a later version of libp2p than the one your version of kubo is using. That, and the settings in your config, may be responsible for this memory use. In your config you have

"GracePeriod": "3m0s",
"HighWater": 600,
"LowWater": 500,

The GracePeriod is set to 3 minutes which is much longer than the default 20 seconds. These settings could cause a large number of connections resulting in memory consumption.

It would be informative to see if using values closer to the defaults would help significantly. Results may also improve with the next version of kubo using a newer go-libp2p that has fixes that may affect this.
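
To put the two grace periods side by side, here is a minimal sketch that converts the Go-style duration strings to seconds. parse_duration is a hypothetical helper written for this illustration, not part of kubo, and it assumes only h, m and s units appear.

```python
import re

# Seconds per supported unit. Assumption: ms, us and ns are not needed here.
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0}


def parse_duration(text: str) -> float:
    """Convert a Go-style duration such as "3m0s" or "20s" to seconds."""
    parts = re.findall(r"(\d+(?:\.\d+)?)([hms])", text)
    # Reject anything the simplified pattern did not fully consume.
    if not parts or "".join(value + unit for value, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(float(value) * _UNIT_SECONDS[unit] for value, unit in parts)


configured = parse_duration("3m0s")  # Swarm.ConnMgr.GracePeriod from the posted config
default = parse_duration("20s")      # default grace period mentioned above
ratio = configured / default
print(f"configured grace period is {ratio:g}x the default")
```

With these inputs the configured grace period comes out at nine times the default, which is the gap being pointed at here.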

@RubenKelevra
Contributor Author

RubenKelevra commented Aug 1, 2024

Hey @gammazero,

I've adjusted the default settings, but it seems like they are more suited for a client application, right?

I'm running a server with 2.5 Gbit/s network card, 10 cores, and 32 GB of memory. Its only task is to seed into the IPFS network. Given this setup, the current configuration feels a bit conservative rather than excessive.

Do you know what settings ipfs.io infrastructure uses for their connection manager?

@gammazero wrote:

That, and the settings in your config, may be responsible for this memory use. In your config you have

I don't think that's the issue. I've been using these settings for 3 years without any memory problems until now. It seems unlikely that the settings are the cause, especially since the memory usage increases steadily over 18 days, rather than spiking within an hour.

@gammazero
Contributor

the current configuration feels a bit conservative

I was thinking that the 3 minute grace period was the setting that may have the most effect.

using these settings for 3 years without any memory problems until now

OK, that is a hint that it may be a libp2p/quic issue. Let's keep this issue open and see what it looks like when we have a kubo RC with new libp2p and quic.

@RubenKelevra
Contributor Author

@gammazero the idea behind using 3 minutes was to avoid killing useful long-term connections due to an influx of single-request connections that end up stale afterwards.

Not sure how kubo has improved in the meantime, but I had a lot of "stalls" while downloading from the server in the beginning, if it did other stuff. The switch from 20 seconds to 3 minutes fixed that.

@RubenKelevra
Contributor Author

RubenKelevra commented Aug 2, 2024

@gammazero wrote:

The pprof data you provided indicates that the memory consumption is primarily due to quic connections. There was at least one quic resource issue that has been fixed in a later version of libp2p than the one your version of kubo is using.

@gammazero wrote:

OK, that is a hint that it may be a libp2p/quic issue. Let's keep this issue open and see what it looks like when we have a kubo RC with new libp2p and quic.

Just started 749a61b, I guess this should contain the fix, right? I'll report back after a day or two if this issue persists or not.

If it persists I would be happy to run a bisect to find what broke it. :)

@RubenKelevra
Contributor Author

749a61b has been running for 4 days straight now and still uses just 2% memory. I call this fixed.

Thanks @gammazero @lidel and @aschmahmann!

@lidel
Member

lidel commented Aug 8, 2024

Great news, thank you for reporting and testing @RubenKelevra ❤️
This will ship in Kubo 0.30 (#10436)

@lidel lidel mentioned this issue Aug 8, 2024
32 tasks

6 participants