runtime: scavengelist disabled on 64k page size systems (ppc64) #9993
Comments
Linux, right? The runtime currently uses 8K pages, so everything should be a multiple of 8K. I don't see how an madvise call can possibly get a non-multiple of 8K. There are a few places where munmap uses non-8K multiples. You might want to try setting _PageShift to 16 and see if that helps.
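For context, the constants being discussed are internal to the runtime; a minimal sketch of how they relate (the names come from the discussion above, so treat the exact definitions as an assumption rather than the real source):

```go
package main

// Stand-in for the runtime's internal page-size constants of that era.
// Bumping _PageShift from 13 to 16 makes the logical heap page 64K,
// matching the kernel page size on most ppc64 machines.
const (
	_PageShift = 13              // the hack discussed above would set this to 16
	_PageSize  = 1 << _PageShift // 8192 bytes by default
)

func main() {
	println("logical heap page size:", _PageSize)
}
```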
Yes, linux/ppc64le. I'll try the page shift hack and also put some asserts in to check.
The plot thickens. With some asserts in place (and
Instrumenting
what's wrong with it?
_PageSize is only for the heap and is an internal detail.
My understanding is that mmap, munmap, etc. always work on page-sized quantities. The result is that if you madvise DONT_NEED a 4k section on a 64k system, the kernel operates on the whole 64k page and can discard more memory than was requested.

We hit exactly this problem using gccgo on 64kb systems last year. Sorry for the brevity (phone)
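A minimal illustration of the rounding being described (pure arithmetic, not the runtime's code; the page sizes are assumptions for the example):

```go
package main

import "fmt"

// lengthSeenByKernel rounds an madvise length up to the kernel page size,
// which is the behaviour described above: a 4K MADV_DONTNEED request on a
// 64K-page kernel ends up covering a full 64K page, so data adjacent to
// the 4K region can be thrown away.
func lengthSeenByKernel(length, kernelPage uintptr) uintptr {
	return (length + kernelPage - 1) &^ (kernelPage - 1)
}

func main() {
	fmt.Println(lengthSeenByKernel(4096, 64<<10)) // 65536
}
```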
Is the problem only with madvise or with mmap/munmap as well?
We saw the issue with the scavenger madvising unused 4k pages of the heap.
I see.
If we saw that it happens in reality, then I guess we need to fix it.
I believe the problem is here: https://github.com/golang/go/blob/master/src/runtime/mheap.go#L731

Attached is the patch we sent to gccgo last year, but reading over it I am unsure if it is safe to use here. I'm not sure it is safe to widen the range of memory freed, as there is no signal to the GC that we've freed a larger area than the scavenge free list wanted us to do.
As far as I see, you only shrink the region from both ends.

We will need to ensure that _PageSize <= _PhysPageSize on platforms where SysUnused is not a no-op.

Also, I suspect that such a change can make the scavenger less efficient on platforms with _PhysPageSize > _PageSize, because even consecutive unused pages can end up not being madvised (though, of course, it is better than crashing).
Update #9993

If the physical page size of the machine is larger than the logical heap page size, for example 8k logical, 64k physical, then madvise(2) will round up the requested amount to a 64k boundary and may discard pages close to the page being madvised.

This patch disables the scavenger in these situations, which at the moment is only ppc64 and ppc64le systems. NaCl also uses a 64k page size, but it's not clear if it is affected by this problem.

Change-Id: Ib897f8d3df5bd915ddc0b510f2fd90a30ef329ca
Reviewed-on: https://go-review.googlesource.com/6091
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
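A minimal sketch of the shape of that change, with stand-in names (the CL linked above is the real diff):

```go
package main

const (
	logicalPageSize = 8 << 10  // the runtime's heap page size at the time
	physPageSize    = 64 << 10 // e.g. a ppc64 kernel with 64K pages
)

// sysUnusedSketch models the guard the commit describes: when the kernel
// page is larger than the heap page, madvise could discard neighbouring
// in-use memory, so the scavenger simply does nothing.
func sysUnusedSketch(addr, n uintptr) {
	if physPageSize > logicalPageSize {
		return // scavenging disabled on 64K-page systems
	}
	// madvise(addr, n, MADV_DONTNEED) would go here on other systems.
}

func main() {
	sysUnusedSketch(0x100000, 8<<10)
}
```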
Hey all, we're interested in creating a Go Mobile library, but this issue is currently a blocker for us. On arm64 devices it means that the Go GC will increasingly own memory within an iOS (and presumably Android) process and never give it back, which can eventually get to the point where the memory available to the rest of the process is too small for normal operations and the OS will kill the app due to memory pressure.

More discussion here: https://groups.google.com/forum/#!topic/golang-nuts/-7ehtvFzJj0

Thanks!
As a first step, we can certainly make scavengelist or sysUnused round the boundaries inward to _PhysPageSize. That will at least help with scavenging memory that backed large allocations and, as free spans get coalesced, it will eventually release coalesced spans that are sufficiently large (at least most of the time).

I have some ideas for how to make the scavenger work well in environments where the heap's granularity is finer than the system's granularity. Our current heuristic of releasing a free span only once no page in that span has been used for 5 minutes, in addition to being ridiculously conservative, is a good way to turn the heap into Swiss cheese. To a first order, the age of a particular span is irrelevant (there is some weak temporal locality). What matters is not which spans we release, but how many we release, so we could choose how many using a temporal heuristic, but choose which spans to release using a spatial heuristic. If we bias this toward large ranges, we'll free memory much more effectively on platforms with large physical pages. I'm hoping to get to this early in 1.7.
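A minimal sketch of the boundary rounding described in the first paragraph, using stand-in names (the actual change is in the CLs below):

```go
package main

import "fmt"

const physPageSize = 64 << 10 // assume a 64K kernel page for illustration

// trimToPhysPages shrinks [addr, addr+n) inward to physical page boundaries,
// so an madvise over the result can never touch memory outside the span the
// scavenger wanted to release. If the span doesn't contain a whole physical
// page, nothing is released.
func trimToPhysPages(addr, n uintptr) (start, length uintptr) {
	start = (addr + physPageSize - 1) &^ (physPageSize - 1) // round start up
	end := (addr + n) &^ (physPageSize - 1)                 // round end down
	if end <= start {
		return 0, 0 // span too small to cover a full physical page
	}
	return start, end - start
}

func main() {
	start, length := trimToPhysPages(0x10002000, 200<<10)
	fmt.Printf("madvise %#x for %d bytes\n", start, length)
}
```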
CL https://golang.org/cl/22064 mentions this issue.
CL https://golang.org/cl/22065 mentions this issue.
CL https://golang.org/cl/22061 mentions this issue.
CL https://golang.org/cl/22066 mentions this issue.
@davecheney, would you mind stress testing CL 22066 on a PPC or ARM64? Based on your original post, it sounds like |
Thanks Austin. I've asked @mwhudson to take a look as I'm travelling ATM.
I'll test this. To be clear, it should be a system where "getconf PAGESIZE" says 65536? (The arm64 machine I have access to appears to have 4k pages, but the ppc64 one is 64k) |
Thanks! Yes, in that case you should stress test on the ppc64. Though it would be good to make sure it passes all.bash on the arm64 just to test the case where physical pages are smaller than sys.PhysPageSize. |
GODEBUG=scavenge=1 is quite slow :-) |
The tests didn't pass cleanly with scavenge=1 but I sort of get the feeling that's not the fault of that CL: https://gist.github.com/mwhudson/21badf1d67536a9fc3719d19cc5865b6 I'm running the same test on master now. |
Yeah. :) Unfortunately, that also turns the forced GC rate way up, which is probably where a lot of the slowdown is coming from. I agree that those failures don't look likely to be related to this CL. As a baseline, I tried
Also, with scavenge=1, math/big took 297 seconds, so I'm not too surprised it timed out on arm64 and ppc64. The other failures in the gist you posted all looked networking-related. |
The test run on master had the same failures so yeah. Test passed afaict. |
Thanks! I'll push these in a little bit. |
Currently several different Linux architectures have separate copies of the auxv parser. Bring these all together into a single copy of the parser that calls out to a per-arch handler for each tag/value pair. This is in preparation for handling common auxv tags in one place.

For golang#9993.

Change-Id: Iceebc3afad6b4133b70fca7003561ae370445c10
Reviewed-on: https://go-review.googlesource.com/22061
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Michael Hudson-Doyle <michael.hudson@canonical.com>
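A rough sketch of the structure the commit moves toward: one common loop over auxv that hands each tag/value pair to a per-arch hook. The names and the handled tag are illustrative, not the actual runtime code:

```go
package main

const _AT_PAGESZ = 6 // auxv tag carrying the kernel page size

var physPageSize uintptr

// archauxv stands in for the per-arch handler each architecture provides.
func archauxv(tag, val uintptr) {}

// sysauxv is the shared parser: common tags are handled here once,
// everything else is passed through to the per-arch hook.
func sysauxv(auxv []uintptr) {
	for i := 0; i+1 < len(auxv); i += 2 {
		tag, val := auxv[i], auxv[i+1]
		switch tag {
		case _AT_PAGESZ:
			physPageSize = val
		default:
			archauxv(tag, val)
		}
	}
}

func main() {
	sysauxv([]uintptr{_AT_PAGESZ, 64 << 10, 0, 0})
	println("kernel page size:", physPageSize)
}
```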
The runtime hard-codes an assumed physical page size. If this is smaller than the kernel's page size or not a multiple of it, sysUnused may incorrectly release more memory to the system than intended. Add a runtime startup check that the runtime's assumed physical page size is compatible with the kernel's physical page size.

For golang#9993.

Change-Id: Ida9d07f93c00ca9a95dd55fc59bf0d8a607f6728
Reviewed-on: https://go-review.googlesource.com/22064
Reviewed-by: Rick Hudson <rlh@golang.org>
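A minimal sketch of the check that commit describes, with stand-in names (the kernel's page size would come from the auxv value made available by the parser change above; in the runtime this is a throw, not a panic):

```go
package main

// checkPhysPageSize mirrors the startup check described above: the page
// size the runtime assumed at compile time must be at least as large as
// the kernel's page size and a multiple of it, otherwise sysUnused could
// release memory the caller did not intend to give back.
func checkPhysPageSize(assumed, kernel uintptr) {
	if assumed < kernel || assumed%kernel != 0 {
		panic("runtime: assumed physical page size incompatible with kernel page size")
	}
}

func main() {
	checkPhysPageSize(64<<10, 64<<10) // ok on a 64K-page kernel
	// checkPhysPageSize(4<<10, 64<<10) would panic: assumed size too small
}
```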
If sysUnused is passed an address or length that is not aligned to the physical page boundary, the kernel will unmap more memory than the caller wanted. Add a check for this.

For golang#9993.

Change-Id: I68ff03032e7b65cf0a853fe706ce21dc7f2aaaf8
Reviewed-on: https://go-review.googlesource.com/22065
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Dave Cheney <dave@cheney.net>
Reviewed-by: Michael Hudson-Doyle <michael.hudson@canonical.com>
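And a minimal sketch of that alignment check (stand-in names again; again, the runtime would throw rather than panic):

```go
package main

// checkUnusedArgs models the guard described above: sysUnused refuses an
// address or length that is not a multiple of the physical page size,
// since the kernel would extend the range to page boundaries and drop
// memory the caller still needs.
func checkUnusedArgs(addr, n, physPage uintptr) {
	if addr%physPage != 0 || n%physPage != 0 {
		panic("sysUnused: unaligned madvise range")
	}
}

func main() {
	checkUnusedArgs(0x40000, 128<<10, 64<<10) // aligned: fine
	// checkUnusedArgs(0x41000, 4<<10, 64<<10) would panic
}
```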
CL https://golang.org/cl/27403 mentions this issue.
TestFreeOSMemory was disabled on many arches because of issue #9993. Since that's been fixed, enable the test everywhere.

Change-Id: I298c38c3e04128d9c8a1f558980939d5699bea03
Reviewed-on: https://go-review.googlesource.com/27403
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Minux Ma <minux@golang.org>
I am sorry for the imprecise nature of this issue. The symptoms are random SIGSEGVs after processes have been running for several minutes. The most visible effect of this is that `go test std` will segfault half way through the test run, making it impossible to complete a CI build.

Last year @mwhudson and I saw a similar issue with gccgo-4.9 compiled programs, which was eventually traced to the page size of the machine (most ppc64 machines use 64k pages) being out of sync with the page size that the runtime uses.

In tracing some processes today I have found occurrences of `munmap(2)` and `madvise(2)` calls that are not using multiples of 64k. More worrying, I've found examples that don't use a multiple of 4096 bytes!
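For anyone reproducing this, a quick way to see the mismatch from Go itself (the 8K figure is the runtime's heap page size of that era, stated here as an assumption):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// The kernel's page size: 65536 on most ppc64 machines, 4096 on x86.
	fmt.Println("kernel page size:", os.Getpagesize())
	// The runtime's logical heap page size was 8K at the time, so on a
	// 64K-page kernel the two are out of sync, as described above.
	fmt.Println("runtime heap page size (assumed):", 8<<10)
}
```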
/cc @randall77