
[BUG] Major memory leak on 1.2.4/1.2.5 #8307

Closed
denisprovost opened this issue Sep 2, 2021 · 31 comments
Labels
bug Something isn't working

Comments

@denisprovost

Description
After chia synced and started farming, memory usage kept increasing: roughly +10% over 3 hours.

[Screenshot: Capture]

After a few hours:
[Screenshot: Capture3]

After stopping the chia process:
[Screenshot: Capture4]

OS: Ubuntu 21.04 on a Pi 4 (full node + harvester)
RAM: 8 GB
Chia version: 1.2.4/1.2.5

@denisprovost denisprovost added the bug Something isn't working label Sep 2, 2021
@denisprovost
Author

Right now I'm going back to 1.2.3 to check whether the problem is with my system or the chia process. I will know in a few hours.

@denisprovost denisprovost changed the title [BUG] Memory leak? [BUG] Memory leak on 1.24/1.2.5? Sep 2, 2021
@denisprovost denisprovost changed the title [BUG] Memory leak on 1.24/1.2.5? [BUG] Memory leak on 1.2.4/1.2.5? Sep 2, 2021
@ALTracer

ALTracer commented Sep 2, 2021

I noticed a memory leak in chia_full_node, too. One of the four processes allocated a whopping 5350 MiB, as opposed to 600 MiB.

OS: Gentoo 17.1 stable amd64 on custom desktop
RAM: 16 GiB, minus 2 GiB reserved by the AMD 3400G APU, plus zram
Chia version: 1.2.5

Will post debug-enabled existing logs on request.
[Screenshots: munin-memory graph, glances output]
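
For anyone trying to pin down which of the chia processes is actually holding the memory, a minimal Python sketch along these lines can help. It assumes the third-party psutil package is installed; the "chia" name filter is only an illustration, not something from the chia-blockchain code.

# list resident memory (RSS) per chia-related process, largest first
import psutil

def chia_processes():
    # yield (pid, name, rss_bytes) for every process whose name contains "chia"
    for proc in psutil.process_iter(["pid", "name", "memory_info"]):
        name = proc.info["name"] or ""
        mem = proc.info["memory_info"]
        if "chia" in name and mem is not None:
            yield proc.info["pid"], name, mem.rss

if __name__ == "__main__":
    for pid, name, rss in sorted(chia_processes(), key=lambda p: p[2], reverse=True):
        print(f"{pid:>7}  {name:<20}  {rss / 1024 ** 2:8.1f} MiB")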

@denisprovost
Author

OK, glad to see that the problem isn't just on my end.

@denisprovost denisprovost changed the title [BUG] Memory leak on 1.2.4/1.2.5? [BUG] Major memory leak on 1.2.4/1.2.5? Sep 2, 2021
@avsync

avsync commented Sep 2, 2021

I'm seeing this leak too on 1.2.4 and 1.2.5!

@erickoh

erickoh commented Sep 2, 2021

I've been having this memory leak issue on 1.1.7 for the past week too. I'm on Ubuntu 20.04.
Currently I am seeing a chia_full_node process taking up 2.5 GB of resident (RES) memory.
Memory usage continues to creep up until the process eventually crashes and becomes defunct.

I tried upgrading some of my full nodes to 1.2.5, but I'm still getting these memory leaks and the eventual crash.

@Rigidity
Contributor

Rigidity commented Sep 2, 2021

Can confirm this memory leak on Ubuntu on all versions I have used, 1.2.3-1.2.5

@denisprovost
Author

Has it been reported before?

@emlowe
Contributor

emlowe commented Sep 2, 2021

We are actively researching this

@denisprovost
Author

denisprovost commented Sep 2, 2021

I have moved back to 1.2.3 to check: same issue.

[Screenshot: Capture7]

@Jacek-ghub

@emlowe

We are actively researching this

Uhm, what is the ETA?

@Rigidity
Contributor

Rigidity commented Sep 3, 2021

@emlowe

We are actively researching this

Uhm, what is the ETA?

I would imagine an issue like this is a priority, so whenever they can get a hot fix out.

@Jacek-ghub

Jacek-ghub commented Sep 3, 2021

I would imagine an issue like this is a priority

I would also like to imagine that this would be the case.

However, if no ETA is provided (for any issue, not just this one), then it is just BS to get people focused on something else. Sorry, but the way these issues are being handled is not how a software company runs.

@Rigidity
Contributor

Rigidity commented Sep 3, 2021

I would imagine an issue like this is a priority

I would also like to imagine that this would be the case.

However, if no ETA is provided (for any issue, not just this one), then it is just BS to get people focused on something else. Sorry, but the way these issues are being handled is not how a software company runs.

You're welcome to help fix the issue yourself by submitting a pull request to this repo, but otherwise you'll likely have to wait until the next patch version. They have said that they found what the problem is and are actively working on fixing it. If that's not good enough, I don't know what is. Why would they not solve a problem with their blockchain just to keep people "focused on something else," when that would be detrimental to the network? Have patience...

@denisprovost denisprovost changed the title [BUG] Major memory leak on 1.2.4/1.2.5? [BUG] Major memory leak on 1.2.4/1.2.5 Sep 3, 2021
@denisprovost
Author

That this memory leak has existed for a few versions is not reassuring. But yes, let's give them time to find a solution and a patch :)

@Jacek-ghub

Jacek-ghub commented Sep 3, 2021

You're welcome to help fix the issue yourself

Works both ways. Should I say that I am waiting for you to join the pull-request effort and will work with you on that? What would be the point?

We are both customers of the Chia company. We purchased drives and plotters and do everything possible to help the ecosystem. There is no need to point fingers at people to do stuff when the company is not doing its part.

As @denisprovost stated, "That this memory leak has existed for a few versions is not reassuring." Where was the QA that let 1.2.4 go out the door? Where was the QA on the rushed, broken 1.2.5 release? Why are we treated like alpha testers? So, do you really think that 1.2.6 will be "the working one"?

Again, if all that you said is true, then what is the problem with saying "we will hopefully have it by Monday (or whatever makes sense)"? That is how an ETA works: you provide some timeline so people can calm down and schedule their time. I have had other issues where the person was just muddying the water to get past other people's reports.

Again, providing an ETA is not a big deal; it doesn't compromise anything, it just lets people manage their time better. Nothing more than that. Otherwise, it just hurts the ecosystem that we all try to support.

@Jacek-ghub

I am on Windows, and I don't think the problem exists on Windows. I run a full node 24/7 and don't see any crashes. Although, I am still on 1.2.3 (I saw the problems other people had with 1.2.4, then the rushed 1.2.5 that was no better than 1.2.4, and decided to wait).

It looks like the problem is more related to your setup (libs) than to the OS or the Chia version. I would guess there would be more people in this thread if the issue affected a particular Ubuntu version, or some specific Chia version(s).

Is it possible that all of you ran some recent OS/library updates (and got, for instance, new Python libs) that are causing these issues across all Chia releases?

@denisprovost
Author

denisprovost commented Sep 3, 2021

It's a fairly classic answer. You can imagine that before posting I checked, ran tests, etc., and brought results. I sincerely hope the Chia team will not look at the problem from that angle. I am not the only one with problems like this. It may be my system, but it may not be.

A stable system does not drift on its own.

#3366
#3209
#2055

Announced as fixed, but no ;)

@Rigidity
Contributor

Rigidity commented Sep 3, 2021

This memory leak has happened to everyone I know who uses Ubuntu.

@djails

djails commented Sep 3, 2021

This memory leak has happened to everyone I know who uses Ubuntu.

I'll add my +1, it's happening to me as well.

@Jacek-ghub

@denisprovost

you can imagine that before posting I checked

Don't take me wrong, I am on your side. I am not trying to say that you didn't check/test, or to dismiss your results. The issue is real, but unless your setup is in full debug mode, it is really hard to enable, read, and understand the relevant debug logs.

Again, you are the main person pushing this issue, so it was really not my intention to imply that I doubt you or want to dismiss it. That is basically the main reason I asked for the ETA (they can get QA to immediately run regression tests on different platforms, and get the engineer in charge to focus on the most promising one; basically, within a few hours they should know the offending part and be able to provide projected milestones).

One component that is actually not under our control is the UI, which potentially runs updated scripts/libs every other day or so (a common practice with JS code, which Electron uses). Although, if that were the case, it would potentially affect other platforms as well; or maybe not, as those libs may have platform-specific issues. (Also, I haven't rebooted my full node for several weeks, so no network-resident scripts got updated for me.)

@Rigidity

This memory leak has happened to everyone I know who uses Ubuntu.

Well, how many people do you know, since when have they been affected, and can they also chime in? The more info you can provide, the easier it is to narrow the scope.

@ALTracer runs Gentoo, so that means it is not strictly Debian/Ubuntu related.

@erickoh is still running v1.1.7, but based on what he wrote, he started having issues just a couple of weeks ago. That would imply that the issue is newer than his Chia version, or rather independent of the Chia version. This is potentially the strongest evidence pointing to some modified libs (again, coming either through updates or through Electron network scripts).

Anyway, the issue is already well stated, and the Chia engineering team will have a busy weekend. So we should lay off for a while.

@emlowe
Contributor

emlowe commented Sep 3, 2021

We are actively testing what we believe to be a fix

@emlowe
Contributor

emlowe commented Sep 3, 2021

Until we confirm a fix, this is preliminary information:

We believe this affects all versions on all platforms.
The problem started "recently" because it is related to how the node handles "compact VDFs" that are generated by Bluebox Timelords. We recently started generating a large number of compact VDFs on mainnet by aggressively deploying Bluebox Timelords in AWS. These compact VDFs get gossiped around the network, and nodes take them and replace their non-compacted versions with the new ones. The sheer number of such messages was causing this issue.

We are currently duplicating this on testnet7 so we can verify
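
As a toy illustration of that failure mode only (this is not chia-blockchain code, and not the fix in PR #8315; all names below are hypothetical): if gossiped compact-VDF messages are buffered in memory faster than the node can process them and the buffer has no bound, resident memory grows without limit, whereas capping the backlog keeps it flat.

# illustration only: bounded vs. unbounded buffering of gossiped messages
from collections import deque
from typing import Optional

MAX_PENDING = 1000  # hypothetical cap on queued compact-VDF messages

class CompactVdfQueue:
    def __init__(self, max_pending: int = MAX_PENDING) -> None:
        self.pending: deque = deque()
        self.max_pending = max_pending
        self.dropped = 0

    def on_gossip(self, message: bytes) -> None:
        # The leaky variant would append unconditionally, so the deque grows
        # as long as peers gossip faster than messages are processed.
        # The bounded variant stops accepting once the backlog is full.
        if len(self.pending) >= self.max_pending:
            self.dropped += 1
            return
        self.pending.append(message)

    def process_one(self) -> Optional[bytes]:
        # pop the oldest message, or None if the backlog is empty
        return self.pending.popleft() if self.pending else None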

@emlowe
Contributor

emlowe commented Sep 3, 2021

We believe PR #8315 fixes this issue, for those that want (and understand how) to try the patch.
We continue to test on testnet7.
I don't have an ETA for when a build will be available (outside the PR builds).

@erickoh

erickoh commented Sep 4, 2021

Thanks, I have not experienced this memory leak problem at all over the past 24 hours

@emlowe
Contributor

emlowe commented Sep 4, 2021

After running on testnet for about 10 hours, we are pretty confident PR #8315 fixes this issue. I don't think we will rush out a release this weekend, though. Since we have stopped generating the majority of compact VDFs on mainnet, we believe the problem has largely stopped on mainnet as well (it may depend somewhat on which peers you connect to and how many compact VDFs are getting passed around, but not many are being newly generated right now).

@denisprovost
Author

denisprovost commented Sep 4, 2021

Thanks for your quick reaction. I have downgraded to 1.2.3 like many people, and like many people I will wait for a release. Please release only when you are sure the problem is fixed.

@avsync

avsync commented Sep 4, 2021

Downgrading to 1.2.3 offers no benefit over 1.2.5. Read a few posts up: the issue was Blueboxes flooding nodes of all versions. For me, since the Blueboxes were shut down, both 1.2.5 and 1.2.5 with PR #8315 show similar, normal memory usage, but it's hard to say whether the issue is resolved since the cause has been removed. The testnet findings seem to wrap it up, though.

@denisprovost
Author

denisprovost commented Sep 4, 2021

Yes, but with 1.2.3, even if the problem is present (as I indicated above), my Raspberry Pi handles it better and does not crash after 6 hours of farming. Memory usage is high (65%) but stays stable at that level, which is not the case on 1.2.4/1.2.5, where it ends up at 100% and eventually crashes.

Restarting my system on the Raspberry Pi means 40 minutes off the farm (which I find huge; 'SSL context Connect call failed 127.0.0.1' for 15 minutes). I can't afford to reboot too often, so I'll leave it to those with a quick reboot and resync time to validate the patch.

I'll stay on 1.2.3 and wait for a release in which the problem is really fixed; for the moment it is a 'test' patch, which can only be used to confirm that it fixes the problem. If you have noticed that it fixes the problem for you, it is on the right track, but for now, in my case, the safest option is to wait for a real release that fixes the problem, which is what the majority of people are doing.

But once again, thank you to the Chia team for their work and quick reaction!

@denisprovost
Author

denisprovost commented Sep 5, 2021

This morning my harvester no longer wanted to stay connected, even after a restart, so I rebooted the system and took the opportunity to switch to 1.2.5 (again) + PR #8315.

[Screenshot: Capture]

I confirm that on my side this solves the major memory leak: about 40% of memory used, with only minor variations, over the past 8 hours.

[Screenshot: Capture]

@erickoh

erickoh commented Sep 6, 2021

Now my full node is stable, but my farmer process keeps crashing.
Not sure if it is a related problem.
This is on 1.2.5 on Ubuntu 20.04.

dmesg | grep -i memory
[25056.304479] out_of_memory.part.0+0x1df/0x3d0
[25056.304481] out_of_memory+0x6d/0xd0
[25056.304644] Tasks state (memory values in pages):
[25056.304759] Out of memory: Killed process 1746 (chia_farmer) total-vm:1198528kB, anon-rss:922096kB, file-rss:2568kB, shmem-rss:0kB, UID:0 pgtables:2112kB oom_score_adj:0
[35438.326053] out_of_memory.part.0+0x1df/0x3d0
[35438.326054] out_of_memory+0x6d/0xd0
[35438.326162] Tasks state (memory values in pages):
[35438.326251] Out of memory: Killed process 8621 (chia_farmer) total-vm:1101484kB, anon-rss:852012kB, file-rss:2836kB, shmem-rss:0kB, UID:0 pgtables:1892kB oom_score_adj:0

@denisprovost
Author

denisprovost commented Sep 6, 2021

It all depends on whether the memory filling up is linked to a gradual increase in RAM usage. Does this happen after several hours, or very quickly after a restart of the chia process?
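
One rough way to tell is to log the farmer's resident memory over time: steady growth over hours points to a leak, while a quick jump right after restart points elsewhere. A minimal Python sketch, assuming the third-party psutil package is installed and that the process name chia_farmer matches the OOM log above:

# log chia_farmer resident memory once a minute
import time
import psutil

def farmer_rss_mib() -> float:
    # sum resident memory of all processes named "chia_farmer", in MiB
    total = 0
    for proc in psutil.process_iter(["name", "memory_info"]):
        if proc.info["name"] == "chia_farmer" and proc.info["memory_info"] is not None:
            total += proc.info["memory_info"].rss
    return total / 1024 ** 2

if __name__ == "__main__":
    while True:
        print(f"{time.strftime('%H:%M:%S')}  chia_farmer RSS: {farmer_rss_mib():.1f} MiB", flush=True)
        time.sleep(60)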


9 participants