path | title |
---|---|
/learnings/ops_linux_memory | Learnings: Ops: Linux Memory |
- Ops: Unix Virtual Memory >
- Total Memory Calculations
- Knowing what your total memory actually is (Linux)
- High vs Low Memory (mostly applicable to 32 bit)
- ... and Physical Memory
- Page Sizes
- Slab Allocator
- Large Allocations via vmalloc
- kswapd >
- Swap Management
- Operator tools
- Questions to answer
- Operational case studies
- Tl;DR (reasonably sure this is accurate....)
- Vocab
- And Java >
- See Also
From RHEL7 Install Guide
Swap file systems support virtual memory; data is written to a swap file system when there is not enough RAM to store the data your system is processing. Swap size is a function of system memory workload, not total system memory and therefore is not equal to the total system memory size.
Physical Memory + size in swap partition = Total Memory
RHEL7 Installation Guide
Amount of RAM in the system | Recommended swap space | Recommended swap space if allowing for hibernation |
---|---|---|
less than 2 GB | 2 times the amount of RAM | 3 times the amount of RAM |
2 GB - 8 GB | Equal to the amount of RAM | 2 times the amount of RAM |
8 GB - 64 GB | 4GB to 0.5 times the amount of RAM | 1.5 times the amount of RAM |
more than 64 GB | workload dependent (at least 4GB) | hibernation not recommended |
Amount of RAM in the system | Max total memory (RAM + recommended swap) |
---|---|
less than 2 GB | up to 6GB memory |
2 GB - 8 GB | 4-16GB memory |
8 GB - 64 GB | 12-96GB memory |
more than 64 GB | workload dependent (RAM + at least 4GB swap) |
- Learning_AWS_EC2_Swap
$ free -g
total used free shared buff/cache available
Mem: 31 9 17 1 4 19
Swap: 0 0 0
or
$ cat /proc/meminfo
MemTotal: 32518844 kB
MemFree: 18227276 kB
MemAvailable: 20657056 kB
Buffers: 4136 kB
Cached: 3988240 kB
SwapCached: 0 kB
Active: 11476648 kB
Inactive: 1902564 kB
Active(anon): 10484292 kB
Inactive(anon): 516772 kB
Active(file): 992356 kB
Inactive(file): 1385792 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 136 kB
Writeback: 0 kB
AnonPages: 9386884 kB
Mapped: 111412 kB
Shmem: 1614228 kB
Slab: 544544 kB
SReclaimable: 450428 kB
SUnreclaim: 94116 kB
KernelStack: 15584 kB
PageTables: 31756 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 16259420 kB
Committed_AS: 13372220 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 67916 kB
VmallocChunk: 34359640064 kB
HardwareCorrupted: 0 kB
AnonHugePages: 8396800 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 98304 kB
DirectMap2M: 4096000 kB
DirectMap1G: 29360128 kB
HugePages / AnonHugePages: AnonHugePages is anon memory backed by Transparent Huge Pages (THP, enabled by default on RHEL), which uses 2MB pages for anon use instead of 4K ones; the HugePages_* lines are the explicitly reserved huge page pool.
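To check whether THP is actually on for a given box (the sysfs path is standard; the bracketed value is the active mode):
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never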
In general, memory is split into zones:
- ZONE_DMA <-- low memory (x86: first 16MB)
- ZONE_NORMAL <-- normal memory (x86: 16MB -> 896MB)
- ZONE_HIGHMEM <-- high memory (x86: above 896MB)
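To see which zones a given machine actually has (64-bit x86 boxes typically show DMA, DMA32 and Normal, with no HighMem; output trimmed):
$ grep '^Node.*zone' /proc/zoneinfo
Node 0, zone      DMA
Node 0, zone    DMA32
Node 0, zone   Normal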
On 32-bit systems, when low memory is low, the oom-killer will start killing things regardless of highmem. Why? Because struct page has a cost of about 11MB of memory per 1GB of memory described. Eventually - around 16GB - this will fill up ZONE_NORMAL and thus trigger oom.
Physical Address Extension (PAE) <-- from Intel. Gives extra address bits (36 instead of 32), so the CPU can address up to 64GB of physical memory.
__alloc_pages is called for a specific zone, and checks that zone first. If the zone is not suitable, the allocator may fall back to other zones.
If the number of free pages reaches pages_low, it will wake up kswapd to begin freeing pages in the zone. If memory is super tight, the caller will do the work of kswapd itself.
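The min / low / high watermarks that drive this are visible per zone in /proc/zoneinfo (values are in pages; exact layout varies by kernel version, output trimmed):
$ grep -E '^Node.*zone|^ +(min|low|high) ' /proc/zoneinfo
Node 0, zone   Normal
        min      11001
        low      13751
        high     16501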
Binary buddies: when no free block of the required size is available, a larger block is split in two; the two halves are "buddies". (Gorman, Understanding Linux Virtual Memory, page 106.) (The buddy system goes back to Knuth, who describes it in The Art of Computer Programming, Vol. 1.)
Mappings for allocated pages end up cached in the TLB once they are touched.
When a single page (or block) is freed, Linux checks its buddy - if the buddy is also free, the two are coalesced and moved back up to the next order, repeating as far as possible. (Swapping pages out is kswapd's job; the coalescing itself is done by the buddy allocator.)
External fragmentation (available memory only exists in small blocks) - usually not an issue, because large requests for contiguous blocks are rare and usually go through vmalloc and friends. (See the /proc/buddyinfo sketch below for a way to eyeball it.)
Internal fragmentation (a large block had to be assigned to service a small request). This is:
- a serious failing of the binary buddy system
- fragmentation expected to be around 28% but could be in the region of 60%
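Fragmentation can be eyeballed directly: /proc/buddyinfo lists, per zone, how many free blocks of each order (2^0 through 2^10 pages, i.e. 4KB through 4MB on x86) the buddy allocator currently has - few or no high-order blocks left means externally fragmented. Illustrative output:
$ cat /proc/buddyinfo
Node 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      3
Node 0, zone   Normal    204    189    102     56     19      8      3      1      0      0      0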
Slab allocator sits on top of buddy allocation - carves pages up into small blocks for kernel objects.
- https://unix.stackexchange.com/questions/128213/how-is-page-size-determined-in-virtual-address-space
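To check the base page size (and configured huge page size) on a box:
$ getconf PAGE_SIZE
4096
$ grep Hugepagesize /proc/meminfo
Hugepagesize:       2048 kB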
Pages read from a file or block device are generally added to the page cache to avoid further disk I/O. Types of pages that exist in the cache:
- One is pages that were faulted in as a result of reading a memory mapped file.
- Blocks read from a block device or filesystem are packed into special pages called buffer pages. The number of blocks that may fit depends on the size of the block and the page size of the architecture.
- Anonymous pages exist in a special aspect of the page cache called the swap cache when slots are allocated in the backing storage for page-out, which is discussed further in Chapter 11.
- Pages belonging to shared memory regions are treated in a similar fashion to anonymous pages. The only difference is that shared pages are added to the swap cache and space reserved in backing storage immediately after the first write to the page.
- Gorman, LinuxVirtualMemory, Chapter 10.1, 10.2
Debugging / getting stats about Linux Page Cache Hit Ratio <<Learning_Ops_Unix_File_Cache_Debugging>>
See this blog on measuring page cache hit ratio.
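One concrete way to measure the hit ratio is the cachestat tool from bcc-tools (assuming it is installed; the binary name and path differ by distro, e.g. cachestat-bpfcc on Ubuntu, and output columns vary by version):
$ sudo /usr/share/bcc/tools/cachestat 1
    HITS   MISSES  DIRTIES HITRATIO   BUFFERS_MB  CACHED_MB
   81931       12       45   99.99%            4       3894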
- Gorman Book (Linux Virtual Memory, Chap 10.2)
- DB2 Memory and file cache performance tuning on Linux
- 2017: Future of Page Cache
- Page/slab cache control in a virtualized environment (Singh) - IBM paper
- RedHat Solutions: how to control size of page cache in RHEL (paid content)
- The strange details of std::string at Facebook <-- yes, the page cache is involved!
Consists of a variable number of caches that are linked together. A slab allocator cache is the manager for objects of a particular type. Each cache maintains blocks of contiguous memory (slabs) carved into small chunks.
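To watch the slab caches live (slabtop ships with procps; /proc/slabinfo needs root on modern kernels; output trimmed and illustrative):
$ sudo slabtop -o | head -3
 Active / Total Objects (% used)    : 1352874 / 1400126 (96.6%)
 Active / Total Slabs (% used)      : 38229 / 38229 (100.0%)
 Active / Total Caches (% used)     : 93 / 135 (68.9%)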
vmalloc provides a mechanism whereby noncontiguous physical memory can be used as if it were contiguous.
There is a limit on how much memory can be allocated this way: the kernel virtual range VMALLOC_START - VMALLOC_END.
It finds a virtual address range large enough for the request, then updates the required PGD/PMD/PTE entries to map the (possibly scattered) physical pages.
Gorman, section 10.7
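Current vmalloc allocations can be inspected via /proc/vmallocinfo (root only; the exact fields vary by kernel):
$ sudo head -3 /proc/vmallocinfo
# each line: <virtual range> <size in bytes> <caller> [flags such as vmalloc / ioremap / pages=N]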
kswapd: responsible for reclaiming/freeing pages when memory is low. Checks need_balance in zone_t to see whether it can (not) sleep.
There is a kswapd for every memory node in the system.
Woken only when the physical page allocator needs it.
Calls (eventually) try_to_swap_out, which:
- ensures this page can be swapped
- removes the mapping from the PTE
- flushes the TLB entry
- writes a swap entry into the PTE recording where to find the page in swap
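There is one kswapd thread per NUMA node, and its reclaim activity shows up as counters in /proc/vmstat (counter names vary a bit across kernel versions; output trimmed):
$ ps -eo pid,comm | grep kswapd
   78 kswapd0
$ grep kswapd /proc/vmstat
pgscan_kswapd 1284523
pgsteal_kswapd 1200034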
Two reasons swap is desirable:
- Expands amount of memory a process may use
- Pages may be referenced only during initialization and never used again
The vm.swappiness parameter can be turned down to 0 to tell the VM to avoid swapping anonymous pages unless it absolutely has to (it does not disable swap outright).
On Ubuntu:
$ sudo sysctl vm.swappiness=10
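To make that survive a reboot, drop it into a sysctl config file (the filename below is an arbitrary choice; any file under /etc/sysctl.d/ works on systemd-based distros):
$ echo 'vm.swappiness=10' | sudo tee /etc/sysctl.d/90-swappiness.conf
$ sudo sysctl -p /etc/sysctl.d/90-swappiness.conf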
See also:
- [TODO]: distill https://linuxaria.com/howto/linux-memory-management
Q: When are inactive pages marked for swap?
When you try to allocate memory and the allocation fails.
A: the responsible function is refill_inactive: it moves pages back to the active list if they were recently referenced; if not, it demotes the page from the active list and adds it to the inactive (LRU) list.
Why this matters: if kswapd sends unused pages to disk, but the JVM then has to reload those pages into memory just to garbage collect the objects on them, that would be Bad.
Q2: Can we avoid / reduce this happening? (is this what generational garbage collection helps with?)
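To check whether a specific process (e.g. the JVM) actually has pages sitting in swap, look at VmSwap in its status file (<pid> is a placeholder):
$ grep VmSwap /proc/<pid>/status
VmSwap:        0 kB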
- [LinkedIn turns off zone reclaim and gets better performance for high memory cache systems](https://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases) <-- see also Gorman Chapter 10; sysctl check below
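Checking / turning off zone reclaim as per the LinkedIn write-up (0 = off, which is the default on most modern kernels):
$ sysctl vm.zone_reclaim_mode
vm.zone_reclaim_mode = 0
$ sudo sysctl vm.zone_reclaim_mode=0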
Linux total memory = physical memory + swap. Memory is separated into zones; you'll probably only use low and high (the other is for DMA / the kernel). You may be able to create your own.
Memory for a process is just lookup tables into this pool (PGDs??). Linux uses the TLB to translate addresses from one to the other. (Memory for a process includes several chunks for data segment, bss, etc. - see Stevens on this.)
On memory allocation it hands out binary buddies of what you asked for. If there isn't enough physical memory for this, it will wake kswapd.
kswapd is per memory node and examines the Least Recently Used list to see if anything can be swapped to disk, thus giving enough space to allocate memory. It updates the PTE for that memory spot to record the offset on disk.
TLB/page access bumps the reference bit, which keeps that memory from falling down the LRU list (???).
Memory fragmentation most often happens when you have to use too big a block for your data (internal); it usually doesn't happen the other way (a big allocation that can't fit because of all the small blocks running around, i.e. external).
Large memory on 32-bit systems is Intel magic (PAE), but it can also force out-of-memory situations (when memory really isn't exhausted) because the struct page bookkeeping for that lookup table overflows ZONE_NORMAL past roughly 16GB.
- PGE - Page global enable (CPU level attribute)
- PMD - page middle directory
- PGD - page global directory
- PTE - page table entries
- PFN - Page Frame Number (index within physical memory that is counted in page sized units)
- Page Table Layout - See 54 in pdf (34 printed) of Gorman, Understanding Linux Virtual Memory
- TLB - translation lookaside buffer
- LRU - least recently used
Performance Characteristics of Linux for Java Workloads Oversubscribing Memory (Nakaike, Ueda, Ueda, Ohara) (Goodreader -> Ops) posits that:
- for some applications, Linux's preference for swapping out file cache first hurts (see Kafka)
- Because of infrequent use of long lived objects in the tenure generation, swapping out of these objects first may be less performance impacting
- https://www.infoworld.com/article/2617623/linux/making-sense-of-memory-usage-on-linux.html
- http://www.redhat.com/archives/redhat-list/2007-August/msg00060.html <-- memory implementation of 32bit linux with lots of RAM
- https://lwn.net/Articles/317814/ <-- thoughts etc on the Out Of Memory killer