Bandwhich appears to have a serious memory leak #284
Comments
Caught a picture from bottom. Note the purple line shoots up (that's bandwhich eating RAM) then the yellow line climbs as it eats all the swap, then gets killed by the OOM killer... |
Is this problem happening on Bandwhich itself does not use any unsafe code, so it's likely that the memory leak was caused by a dependency. We had lots of dependency bumps since the last release, so this may have been already solved if indeed that's the version you are seeing problems on. |
I was using whatever was in I found it quite easily reproducible on multiple machines for me. Just leave it running for a while and eventually, it blows up. Might take many minutes or hours, depending on traffic volume, maybe. I built it with rust 1.70.0 if that makes any difference? |
Yes that's true. You don't need to test again =).
Curious. I'm no expert in debugging memory issues, but I'll give it a try.
I doubt it. |
I can't reproduce on my Linux machine. Can you run it in debug mode with flamegraph (https://github.com/flamegraph-rs/flamegraph)? It can give some hints. |
Cannot reproduce on my box ( |
Out of curiosity, is all of your testing done on the same distro, @popey? As unlikely as it is, maybe there's something wrong with the allocator or even the kernel? |
I tested this on a box specifically trying to reproduce this and could:
This is on Ubuntu 22.04.3 LTS with v0.21.0 from the releases tab. I started at around 21:15 on Sep 20, so it ran for ~9 hours before being killed. I'll try and get a flamegraph version today. |
same issue here with v0.21.0 on void. |
I can't seem to zoom in on the individual SVG elements. Anyhow, the big view shows mostly read/iterate/openat, so maybe there is a leak when reading procfs; maybe it needs a large number of connections to show up. |
I tried https://github.com/KDE/heaptrack and it seems very simple and works really well: it shows a graph, the exact functions, and which ones leaked. Note: heaptrack bandwhich didn't work for me; I had to start bandwhich, then use heaptrack -p with the pid of bandwhich. |
It shows 0 leaks for me |
I kept the program running longer, and I do have some small leaks. heaptrack is making me suspect https://github.com/imsnif/bandwhich/blob/main/src/display/ui.rs#L184: it seems like we keep updating the state, which keeps updating the utilization map https://github.com/imsnif/bandwhich/blob/main/src/network/utilization.rs#L27 without ever removing from it. Maybe when there is a large number of connections this causes the leak to show. |
Ah, I see: to make the leaks show up you need to attach heaptrack to bandwhich, and then close heaptrack, not bandwhich. It seems RAII is working correctly and everything gets cleaned up at exit, but if we disconnect early it shows us the places that accumulate memory during the lifetime of the program, like the state updating above. |
We do have this function, which is indeed called: https://github.com/imsnif/bandwhich/blob/main/src/network/utilization.rs#L24 but it's using HashMap::clear, whose docs say that it clears the entries but keeps the memory allocated. Maybe that's what we're looking for. |
Note that this has a lot of false positives, though. |
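To illustrate the HashMap::clear behaviour mentioned above, here is a minimal sketch using only the standard library. It is not bandwhich's code: the function name `capacities` and the `u32`/`u64` key and value types are made up for the example. It records the map's capacity after filling it, after `clear`, and after `shrink_to_fit`.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not from bandwhich): fills a map with `n` entries and
/// returns its capacity when full, after `clear`, and after `shrink_to_fit`.
fn capacities(n: u32) -> (usize, usize, usize) {
    let mut map: HashMap<u32, u64> = HashMap::new();
    for i in 0..n {
        map.insert(i, u64::from(i));
    }
    let full = map.capacity();

    // `clear` removes every entry but keeps the allocation for reuse.
    map.clear();
    let cleared = map.capacity();

    // `shrink_to_fit` asks the map to give back the unused allocation.
    map.shrink_to_fit();
    let shrunk = map.capacity();

    (full, cleared, shrunk)
}

fn main() {
    let (full, cleared, shrunk) = capacities(10_000);
    println!("full: {full}, after clear: {cleared}, after shrink_to_fit: {shrunk}");
}
```

If retained capacity were the problem, a map that once held a large burst of entries would keep that peak allocation after every `clear`. That bounds memory at the peak size, though, so on its own it would not produce unbounded growth.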
So if I understand correctly, what you are saying @sigmaSd is that this is not so much a leak as it is simply keeping a value long past its usefulness? |
@popey @gms8994 @terminaldweller Can you please reproduce the severe memory usage issue again while tracing using heaptrack, and upload the output file? This will help us verify @sigmaSd's hypothesis. sudo heaptrack bandwhich Thanks in advance. |
Yes, but it's more like keeping the memory past its usefulness. Also, maybe someone who can reproduce the issue can try changing https://github.com/imsnif/bandwhich/blob/main/src/network/utilization.rs#L24 to |
Though this alone doesn't explain why it would fail to allocate memory. |
@cyqsimon sure thing. I've left bandwhich running under heaptrack on my desktop. It has 64GB RAM and this OOM usually happens unexpectedly after some hours. So will report back if/when it happens. |
You don't need to wait for the OOM; just stop heaptrack after you see unusually high memory usage (100 MB is probably enough). |
Oh, too late. By the time I saw this reply it had already run for a while.. |
Incorrect diagnosis. See my newer comment. Good good. From the backtrace I was able to track the allocations back to a single line. Line 21 in 8c6be28
It seems like your hypothesis is correct @imsnif, albeit of a different hashmap. I'll take a look at how this stored data is used and how to best implement some sort of cleanup. Edit:Perhaps I shouldn't rule out the possibility that it's caused by file handles not being closed. I'll make sure that isn't the case first. |
I don't think so. bandwhich/src/display/ui_state.rs Lines 116 to 122 in 8c6be28
Any more items than 5 in Re-examining the backtrace more carefully, I'm afraid that my fear in my previous comment was justified: This is the only instance Line 16 in 8c6be28
And if we take a look at Voila. And just to take it one step further, All this is to say that this might be an "upstream's upstream" bug, which is going to be a PITA to deal with. There's one thing we can try immediately though: |
Okay, can all please try the You can either pull the branch and build locally or use the CI build. If this still does not fix this issue, I will go ahead and submit bug reports in |
Actually, speaking of |
Left the MUSL build running for 2.5 hours, and no OOM crash yet. Will leave it overnight. |
It died at 3am.
|
Ok, trying that build didn't work either. I don't tend to sit and watch it, but leave it running and look at the terminal when I remember. I just looked and it's using 70GB after 25 minutes.
|
@sigmaSd What distro are you on? Trying to establish some datapoints for an upstream bug report. |
Hi all, I went hunting for the bug after cyqsimon brought it up in the community discord server. Looks like I have found the issue and it is in rustix's linux_raw backend. As cyqsimon already pointed out the From the heaptrace the allocation is coming from the At https://github.com/bytecodealliance/rustix/blob/main/src/backend/linux_raw/fs/dir.rs#L78 the code in self.buf.resize(self.buf.capacity() + 32 * size_of::<linux_dirent64>(), 0); This causes In normal operation |
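As a rough illustration of why sizing a buffer from `capacity()` rather than `len()` can compound, here is a small sketch. It is not rustix's code: `CHUNK`, `grow_by_capacity`, and `grow_by_len` are hypothetical names, and `CHUNK` is an arbitrary stand-in for `32 * size_of::<linux_dirent64>()`. Each call to `resize` may make `Vec` over-allocate, and the next step then builds on that larger capacity.

```rust
/// Arbitrary stand-in for `32 * size_of::<linux_dirent64>()` (an assumption).
const CHUNK: usize = 768;

/// Grows the buffer the way the quoted line does: new length = capacity + CHUNK.
/// Returns the buffer length after each step.
fn grow_by_capacity(steps: usize) -> Vec<usize> {
    let mut buf: Vec<u8> = Vec::new();
    let mut lens = Vec::with_capacity(steps);
    for _ in 0..steps {
        let new_len = buf.capacity() + CHUNK;
        buf.resize(new_len, 0);
        lens.push(buf.len());
    }
    lens
}

/// Grows the buffer from its length instead: new length = len + CHUNK.
fn grow_by_len(steps: usize) -> Vec<usize> {
    let mut buf: Vec<u8> = Vec::new();
    let mut lens = Vec::with_capacity(steps);
    for _ in 0..steps {
        let new_len = buf.len() + CHUNK;
        buf.resize(new_len, 0);
        lens.push(buf.len());
    }
    lens
}

fn main() {
    let by_cap = grow_by_capacity(12);
    let by_len = grow_by_len(12);
    for (step, (c, l)) in by_cap.iter().zip(&by_len).enumerate() {
        println!("step {:>2}: capacity-based len = {c}, len-based len = {l}", step + 1);
    }
}
```

The length-based variant grows by exactly `CHUNK` per step. The capacity-based variant is never smaller, and it grows much faster whenever the allocator hands back more capacity than was requested. How much extra capacity `Vec` reserves is an implementation detail, so the exact numbers printed may differ between Rust versions.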
I can't reproduce on Arch Linux. @cyqsimon, did you try it on one of the affected distros? This would all be easier if one could reproduce the issue. |
Hi all, quick update. Thanks to the collaborative effort of @sigmaSd, @konnorandrews, and @sunfishcode (rustix maintainer), we were able to diagnose the issue and reproduce the fault condition in rustix directly. They will likely patch it in the next release. Since we currently depend on procfs 0.15.1, which depends on an older, incompatible version of rustix, we will not be able to update our indirect dependency on rustix immediately. I will notify @eminence (procfs maintainer) as soon as the rustix patch is available. |
Nice work everyone! |
- Fixes #284 - See GHSA-c827-hfw6-qwvm
Good news, rustix has backported this fix to all minor versions up to 0.35, which is the first minor version affected. This means we do not block on an update from procfs (I have notified them nonetheless). Since I am pretty confident that we've fixed the exact problem, I will close this issue as completed. Please reopen and/or comment in case you are still seeing it on |
Includes dependency update to fix security issue: imsnif/bandwhich#284 (comment)
I built and ran bandwhich on a few machines. After a while of running, it dies. I see this in my logs
While I've seen this on small remote servers, I have also seen it on my workstation, which has 32GB RAM and 22GB swap.