Linux内存从0到1学习笔记(九，内存优化调试之一

写在前面

Linux内核中有一个非常重要的内核线程kswapd，它负责在内存不足时回收页面。kswapd内核线程初始化时会为系统中每个内存结点创建一个名为“kswapd%"内核线程。对于UMA架构下常常是kswapd0线程。

因此我们可以把kswapd0看作是系统的虚拟内存管理程序，如果物理内存不够用，系统就会唤醒 kswapd0 进程。但是需要特别注意的是，由于kswapd0分配磁盘交换空间作缓存，因此会占用大量的CPU资源。

术语 swap

指的是一个交换分区或文件。负责在内存不足时，将部分内存上的数据交换到swap空间上，以便让系统不会因内存不够用而导致oom或者更致命的情况出现。所以，当内存使用存在压力，开始触发内存回收的行为时，就可能会使用swap空间。

在Linux上可以使用swapon -s命令查看当前系统上正在使用的交换空间有哪些，以及相关信息：

kswapd

Linux内核中有一个非常重要的内核线程kswapd，它负责在内存不足时回收页面。

watermark_scale_factor
======================

This factor controls the aggressiveness of kswapd. It defines the
amount of memory left in a node/system before kswapd is woken up and
how much memory needs to be free before kswapd goes back to sleep.

The unit is in fractions of 10,000. The default value of 10 means the
distances between watermarks are 0.1% of the available memory in the
node/system. The maximum value is 3000, or 30% of memory.

A high rate of threads entering direct reclaim (allocstall) or kswapd
going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate
that the number of free pages kswapd maintains for latency reasons is
too small for the allocation bursts occurring in the system. This knob
can then be used to tune kswapd aggressiveness accordingly.

一、页面回收机制调用路径

首先，我们来看下页面回收机制的主要调用路径(图片来自《奔跑吧，Linux内核！》，如下：

每个内存节点通过一个pg_data_t数据结构来描述物理内存的布局。kswapd传递的参数就是pg_data_t数据结构。

二、kswapd内存回收 2.1 内存水位标记 watermark）

在分配路径上，如果低水位（ALLOC_WMARK_LOW)的情况下无法成功分配内存，那么会通过wakeup_kswaped（）函数唤醒kswapd内核线程来回收页面，以便释放一些内存。

2.2 kswapd唤醒路径

alloc_page --> __alloc_pages_nodemask() --> __alloc_pages_slowpath() --> wake_all_kswapds() --> wakeup_kswapd()

在kswapd的分配路径上的唤醒函数wakeup_kswapd()把kswapd_max_order和classzone_idx作为参数传递给kswapd内核线程。在分配路径上，如果低水位（ALLOC_WMARK_LOW)的情况下无法成功分配内存，那么会通过wakeup_kswaped（）函数唤醒kswapd内核线程来回收页面，以便释放一些内存。

linux_mainline-5.17.0/mm/vmscan.c
4533  
4540  void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
4541  		   enum zone_type highest_zoneidx)
4542  {
4543  	pg_data_t *pgdat;
4544  	enum zone_type curr_idx;
4545  
4546  	if (!managed_zone(zone))
4547  		return;
4548  
4549  	if (!cpuset_zone_allowed(zone, gfp_flags))
4550  		return;
4551  
4552  	pgdat = zone->zone_pgdat;
4553  	curr_idx = READ_ONCE(pgdat->kswapd_highest_zoneidx);
4554  
4555  	if (curr_idx == MAX_NR_ZONES || curr_idx < highest_zoneidx)
4556  		WRITE_ONCE(pgdat->kswapd_highest_zoneidx, highest_zoneidx);
4557  
4558  	if (READ_ONCE(pgdat->kswapd_order) < order)
4559  		WRITE_ONCE(pgdat->kswapd_order, order);
4560  
4561  	if (!waitqueue_active(&pgdat->kswapd_wait))
4562  		return;
4563  
4564  	
4565  	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ||
4566  	    (pgdat_balanced(pgdat, order, highest_zoneidx) &&
4567  	     !pgdat_watermark_boosted(pgdat, highest_zoneidx))) {
4568  		
4575  		if (!(gfp_flags & __GFP_DIRECT_RECLAIM))
4576  			wakeup_kcompactd(pgdat, order, highest_zoneidx);
4577  		return;
4578  	}
4579  
4580  	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, highest_zoneidx, order,
4581  				      gfp_flags);
4582  	wake_up_interruptible(&pgdat->kswapd_wait);
4583  }

在系统中可以从/proc/zoneinfo文件中查看当前系统的相关的信息和使用情况。

2.3 /proc/zoneinfo

我们可以通过“cat /proc/zoninfo”来查看每个Node中不同zone区域的统计信息，如下：

bill@bill-VirtualBox:~$ cat /proc/zoneinfo
Node 0, zone DMA
per-node stats
nr_inactive_anon 4694
nr_active_anon 114895
nr_inactive_file 141277
nr_active_file 66066
nr_unevictable 0
nr_slab_reclaimable 11617
nr_slab_unreclaimable 10038
nr_isolated_anon 0
nr_isolated_file 0
workingset_nodes 0
workingset_refault 0
workingset_activate 0
workingset_restore 0
workingset_nodereclaim 0
nr_anon_pages 115306
nr_mapped 60868
nr_file_pages 211629
nr_dirty 234
nr_writeback 0
nr_writeback_temp 0
nr_shmem 4915
nr_shmem_hugepages 0
nr_shmem_pmdmapped 0
nr_file_hugepages 0
nr_file_pmdmapped 0
nr_anon_transparent_hugepages 0
nr_unstable 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_dirtied 18242
nr_written 4311
nr_kernel_misc_reclaimable 0
pages free 3977
min 67
low 83
high 99
spanned 4095
present 3998
managed 3977
protection: (0, 3426, 3867, 3867, 3867)
nr_free_pages 3977
nr_zone_inactive_anon 0
nr_zone_active_anon 0
nr_zone_inactive_file 0
nr_zone_active_file 0
nr_zone_unevictable 0
nr_zone_write_pending 0
nr_mlock 0
nr_page_table_pages 0
nr_kernel_stack 0
nr_bounce 0
nr_zspages 0
nr_free_cma 0
numa_hit 0
numa_miss 0
numa_foreign 0
numa_interleave 0
numa_local 0
numa_other 0
pagesets
cpu: 0
count: 0
high: 0
batch: 1
vm stats threshold: 2
node_unreclaimable: 0
start_pfn: 1
Node 0, zone DMA32
pages free 634266
min 14908
low 18635
high 22362
spanned 1044480
present 913392
managed 889680
protection: (0, 0, 441, 441, 441)
nr_free_pages 634266
nr_zone_inactive_anon 4660
nr_zone_active_anon 100019
nr_zone_inactive_file 100862
nr_zone_active_file 35798
nr_zone_unevictable 0
nr_zone_write_pending 74
nr_mlock 0
nr_page_table_pages 5231
nr_kernel_stack 4208
nr_bounce 0
nr_zspages 0
nr_free_cma 0
numa_hit 468245
numa_miss 0
numa_foreign 0
numa_interleave 0
numa_local 468245
numa_other 0
pagesets
cpu: 0
count: 70
high: 378
batch: 63
vm stats threshold: 12
node_unreclaimable: 0
start_pfn: 4096
Node 0, zone Normal
pages free 2372
min 1919
low 2398
high 2877
spanned 131072
present 131072
managed 112936
protection: (0, 0, 0, 0, 0)
nr_free_pages 2372
nr_zone_inactive_anon 34
nr_zone_active_anon 14876
nr_zone_inactive_file 40415
nr_zone_active_file 30268
nr_zone_unevictable 0
nr_zone_write_pending 160
nr_mlock 0
nr_page_table_pages 1176
nr_kernel_stack 2192
nr_bounce 0
nr_zspages 0
nr_free_cma 0
numa_hit 252639
numa_miss 0
numa_foreign 0
numa_interleave 32495
numa_local 252639
numa_other 0
pagesets
cpu: 0
count: 49
high: 186
batch: 31
vm stats threshold: 6
node_unreclaimable: 0
start_pfn: 1048576
Node 0, zone Movable
pages free 0
min 0
low 0
high 0
spanned 0
present 0
managed 0
protection: (0, 0, 0, 0, 0)
Node 0, zone Device
pages free 0
min 0
low 0
high 0
spanned 0
present 0
managed 0
protection: (0, 0, 0, 0, 0)

以normal为例进行解读，如下：

Node 0, zone   Normal
  pages free     2372
        min      1919
        low      2398
        high     2877
        spanned  131072     ==>  spanned_pages
        present  131072     ==>  present_pages
        managed  112936     ==>  managed_pages
        protection: (0, 0, 0, 0, 0)
......

page_low: 当空闲页面的数量达到page_low所标定的数量的时候，kswapd线程将被唤醒，并开始释放回收页面。这个值默认是page_min的2倍。

page_min: 当空闲页面的数量达到page_min所标定的数量的时候, 分配页面的动作和kswapd线程同步运行

page_high: 当空闲页面的数量达到page_high所标定的数量的时候, kswapd线程将重新休眠，通常这个数值是page_min的3倍

spanned_pages: 代表的是这个zone中所有的页，包含空洞，计算公式是: zone_end_pfn - zone_start_pfn
present_pages：代表的是这个zone中可用的所有物理页，计算公式是：spanned_pages-hole_pages
managed_pages：代表的是通过buddy管理的所有可用的页，计算公式是：present_pages - reserved_pages，三者的关系是spanned_pages > present_pages > managed_pages

如上Normal zone有三个水位：min, low和high（单位为page大小），标志着当前zone中内存分配压力。当系统中可用内存紧张时，内核kswapd将会被被唤醒，并依据对应水位执行相应的回收操作。

kswapd和这3个参数的互动关系如下图：

总结下来就是，当系统剩余内存低于watermark[low]的时候，内核的kswapd开始起作用，进行内存回收。直到剩余内存达到watermark[high]的时候停止。如果内存消耗导致剩余内存达到了或超过了watermark[min]时，就会触发直接回收（direct reclaim）。

三、/proc/sys/vm/swappiness

/proc/sys/vm/swappiness的值如下，

linux_mainline-5.17.0/Documentation/admin-guide/sysctl/vm.rst
swappiness
==========

This control is used to define the rough relative IO cost of swapping
and filesystem paging, as a value between 0 and 200. At 100, the VM
assumes equal IO cost and will thus apply memory pressure to the page
cache and swap-backed pages equally; lower values signify more
expensive swap IO, higher values indicates cheaper.

Keep in mind that filesystem IO patterns under memory pressure tend to
be more efficient than swap's random IO. An optimal value will require
experimentation and will also be workload-dependent.

The default value is 60.

For in-memory swap, like zram or zswap, as well as hybrid setups that
have swap on faster devices than the filesystem, values beyond 100 can
be considered. For example, if the random IO against the swap device
is on average 2x faster than IO from the filesystem, swappiness should
be 133 (x + 2x = 200, 2x = 133.33).

At 0, the kernel will not initiate swap until the amount of free and
file-backed pages is less than the high watermark in a zone.

这个值范围定义在0~200。如果这个值设置为100表示内存发生回收时，从swap交换内存和从cache回收内存的优先级一样。

如果值为0，则kernel不会初始化swap，直到当前的剩余内存和文件映射内存的总和低于high的水位值。

最后

我的理解总结下来就是，当系统剩余内存低于watermark[low]的时候会触发kswapd，kswapd会根据swappiness的值来确认是否优先进行内存页的换出。需要注意的是，zoneinfo中，不同的zone有不同的水位值，我们在从meminfo计算剩余内存的时候，free的值是总的剩余内存，需要减去其他zone的剩余内存后再和当前对应zone的水位进行比较，以确认是否会触发kswapd进行页换出。

在日常工作当中，如果遇到内存低导致kswapd0占用cpu高的问题，应该如何确认是哪里的占用高了呢？以下面一份meminfo举例：

MemTotal:         4955096 kB
MemFree:          816396  kB
MemAvailable:     788180  kB
Buffers:             3280 kB
Cached:             83860 kB
SwapCached:             0 kB
Active:           2729596 kB
Inactive:           41644 kB
Active(anon):     2689336 kB
Inactive(anon):      8240 kB
Active(file):       40260 kB
Inactive(file):     33404 kB
Unevictable:            0 kB
Mlocked:                0 kB
SwapTotal:              0 kB
SwapFree:                0kB
Dirty:                172 kB
Writeback:              0 kB
AnonPages:        2684368 kB
Mapped:             65436 kB
Shmem:              13208 kB
KReclaimable:       46600 kB
Slab:               69556 kB
SReclaimable:       28836 kB
SUnreclaim:        114344 kB
KernelStack:        46600 kB
PageTables:         69556 kB
NFS_Unstable:           0 kB
Bounce:                 0 kB
WritebackTmp:           0 kB
CommitLimit:      2477548 kB
Committed_AS:    29801856 kB
VmallocTotal:135390390112 kB
VmallocUsed:            0 kB
VmallocChunk:           0 kB
Percpu:               360 kB
HardwareCorrupted:      0 kB
AnonHugePages:     498752 kB
ShmemHugePages:         0 kB
ShmemPmdMapped:         0 kB
FileHugePages:          0 kB
FilePmdMapped:          0 kB
CmaTotal:         1048576 kB
CmaFree:           740176 kB
HugePages_Total:        0
HugePages_Free:         0
HugePages_Rsvd:         0
HugePages_Surp:         0
Hugepagesize:        2048 kB
Hugetlb:                0 kB

从上面的数据来看，soc大概分得了5G（MemTotal: 4955096 kB），其中有1G（CmaTotal:1048576 kB）作为预留内存分给了CMA。其中用户态进程约使用了2.9G(Active: 2729596 kB) + (Inactive:41644 kB) + (Slab:69556 kB) + (KernelStack:46600 kB) + (PageTables:69556 kB) + (Mapped: 65436 kB) + (Shmem:13208 kB)，那么也就是说由1个G左右的内存消失不见。而(Active: 2729596 kB)占用内存才2.7G，CMA使用约260M。那么怀疑内核发生了泄漏或有DMA设备没有通过CMA进行内存申请，这种属于内存黑洞，这部分的调试会放到内存调试系列中来讲。

Linux内存从0到1学习笔记(九，内存优化调试之一 - kswapd0）

Linux相关栏目本月热门文章