How Linux cgroups trigger kernel swap
Version History

Date | Description
---|---
Dec 3, 2018 | Initial version
This is a note on how the Linux cgroup memory controller (memcg) triggers swap when usage reaches the user-defined `limit_in_bytes`.

This note assumes you have adequate knowledge of the Linux mm code. For more information about cgroups, please check the documentation from Red Hat. This is NOT a complete walk-through.
There are several cgroup callbacks invoked from `mm/memory.c`. These functions are called to check whether the cgroup can honor a page allocation. All of them are implemented in `mm/memcontrol.c` and are used in a try/commit-or-cancel pattern (see the sketch after this list):

- `mem_cgroup_try_charge()`
- `mem_cgroup_commit_charge()`
- `mem_cgroup_cancel_charge()`
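As a rough illustration of that pattern, here is a condensed sketch loosely following the anonymous-fault path (`do_anonymous_page()` in `mm/memory.c`, circa 5.2). Locking, error paths, and most setup are elided, so treat it as a sketch rather than the exact upstream code:

```c
/* Condensed sketch of the charge protocol in a fault handler
 * (cf. do_anonymous_page(), mm/memory.c, ~Linux 5.2).
 * Error paths and pte locking are elided. */
struct mem_cgroup *memcg;

/* Ask the memcg to account this page; this may trigger reclaim
 * (and therefore swap) or a cgroup-local OOM before failing. */
if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false))
	goto oom;

/* ... install the pte; if a racing fault already did, back out: */
if (!pte_none(*vmf->pte)) {
	mem_cgroup_cancel_charge(page, memcg, false);
	goto release;
}

page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, vma);
```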
Some facts about the implementation (up to Linux 5.2):

- Each memory cgroup has its own LRU list vector.
- All memory cgroups' LRU lists, and even the global LRU lists, share a single LRU lock on a per-node basis (Weird! Why?). See the structure sketch below.
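A rough sketch of the structures involved (field names abridged from my reading of the ~5.2 headers; illustrative, not a verbatim copy):

```c
/* Sketch of the relevant structures (~Linux 5.2, abridged):
 * each memcg keeps one lruvec per node, while the lru_lock lives
 * in the node's pg_data_t and is shared by every lruvec on that node. */
struct mem_cgroup_per_node {
	struct lruvec lruvec;	/* this memcg's LRU lists on one node */
	/* ... */
};

typedef struct pglist_data {
	struct lruvec lruvec;	/* the global LRU lists for this node */
	spinlock_t lru_lock;	/* shared by ALL lruvecs on this node */
	/* ... */
} pg_data_t;
```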
Take a closer look at `mem_cgroup_try_charge()`, whose behavior is actually quite similar to a real OOM situation: check whether memory is still available (here, whether memory usage is smaller than `limit_in_bytes`); if unfortunately we have run out of memory, try to reclaim from the memory cgroup's LRU lists; if that did not work either, the final step is to take OOM actions.
- `mem_cgroup_try_charge()` → `try_charge()` (a condensed sketch follows this list)
  - `page_counter_try_charge()`:
    - Check if we hit the `limit_in_bytes` counter.
    - Hierarchically charge pages; costly.
  - `try_to_free_mem_cgroup_pages()`
    - Calls back into `mm/vmscan.c` to shrink the lists (Bingo!)
    - Also, the reclaimer will establish swap pte entries.
  - `mem_cgroup_oom()`
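Putting that flow together, here is a heavily condensed sketch of `try_charge()` (`mm/memcontrol.c`, ~5.2). Per-cpu batching, the `memsw` counter, retry limits, and the high-water-mark handling are all elided, so the real function is considerably longer:

```c
/* Heavily condensed sketch of try_charge() (mm/memcontrol.c, ~5.2).
 * Batching, memsw accounting, retry limits and OOM details elided. */
static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
		      unsigned int nr_pages)
{
	struct page_counter *counter;

retry:
	/* Hierarchically charge the counters; fails once any ancestor
	 * would exceed its limit_in_bytes. */
	if (page_counter_try_charge(&memcg->memory, nr_pages, &counter))
		return 0;	/* still under the limit: done */

	/* Over the limit: reclaim from this memcg's own LRU lists,
	 * which is where swap pte entries get established. */
	if (try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true))
		goto retry;

	/* Reclaim failed: fall back to a cgroup-local OOM action. */
	mem_cgroup_oom(memcg, gfp_mask, get_order(nr_pages * PAGE_SIZE));
	return -ENOMEM;
}
```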
- `mem_cgroup_lruvec()`
  - Other than the global node-wide LRU list vector, each cgroup has its own LRU list vector. This function chooses the vector that will be passed down to `shrink_page_list()` etc. (see the sketch below).
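The selection logic itself is short; a simplified sketch along the lines of `mem_cgroup_lruvec()` (~5.2):

```c
/* Simplified sketch of lruvec selection (cf. mem_cgroup_lruvec(),
 * ~Linux 5.2): with memcg disabled we fall back to the node-wide
 * lruvec, otherwise we pick this memcg's per-node lruvec. */
struct lruvec *lruvec;

if (mem_cgroup_disabled())
	lruvec = &pgdat->lruvec;	/* the global LRU lists */
else
	lruvec = &memcg->nodeinfo[pgdat->node_id]->lruvec;
```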
LRU Lists Maintenance
Insertion into the LRU lists is performed as follows: a page is first placed into a per-CPU array (`lru_add_pvec`). Once the array is full (15 entries by default), the pages are batch-inserted into the proper LRU lists (which depend on the `mem_cgroup_lruvec()` we mentioned above). Why does Linux do it this way? To scale: batching means the shared per-node LRU lock is taken once per batch instead of once per page.
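The batching itself is compact; a sketch close to `__lru_cache_add()` in `mm/swap.c` (~5.2):

```c
/* Sketch of the batched LRU insertion (cf. __lru_cache_add(),
 * mm/swap.c, ~Linux 5.2). */
static void __lru_cache_add(struct page *page)
{
	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);

	get_page(page);
	/* Stash the page in the per-CPU pagevec; once it is full
	 * (PAGEVEC_SIZE == 15 entries), flush the whole batch into
	 * the proper lruvec under a single acquisition of the
	 * node's lru_lock. */
	if (!pagevec_add(pvec, page) || PageCompound(page))
		__pagevec_lru_add(pvec);
	put_cpu_var(lru_add_pvec);
}
```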