We recently hit a problem in production: a Java application that had been running fine suddenly disappeared. Its log stopped abruptly at a certain point, as if the process had vanished into thin air.
After some digging, I finally found something fishy in the system log:
```
[8314616.900000] java invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[8314616.901123] java cpuset=docker-4da60884c4c56e1c58f86fcd853ad7510f1f1ed586623a5d07a5b3ce7918e626.scope mems_allowed=0
[8314616.902417] CPU: 0 PID: 5798 Comm: java Kdump: loaded Tainted: G ------------ T 3.10.0-1160.15.2.el7.x86_64 #1
[8314616.904101] Hardware name: OpenStack Foundation OpenStack Nova, BIOS rel-1.10.2-0-g5f4c7b1-20181220_000000-szxrtosci10000 04/01/2014
[8314616.905871] Call Trace:
[8314616.906467] [] dump_stack+0x19/0x1b
[8314616.907292] [] dump_header+0x90/0x229
[8314616.908117] [] ? ep_poll_callback+0xf8/0x220
[8314616.909004] [] ? find_lock_task_mm+0x56/0xc0
[8314616.909873] [] ? try_get_mem_cgroup_from_mm+0x28/0x60
[8314616.910858] [] oom_kill_process+0x2cd/0x490
[8314616.911706] [] mem_cgroup_oom_synchronize+0x55c/0x590
[8314616.912693] [] ? mem_cgroup_charge_common+0xc0/0xc0
[8314616.913650] [] pagefault_out_of_memory+0x14/0x90
[8314616.914536] [] mm_fault_error+0x6a/0x157
[8314616.915361] [] __do_page_fault+0x491/0x500
[8314616.916200] [] trace_do_page_fault+0x56/0x150
[8314616.917072] [] do_async_page_fault+0x22/0xf0
[8314616.917920] [] async_page_fault+0x28/0x30
[8314616.918761] Task in /system.slice/docker-4da60884c4c56e1c58f86fcd853ad7510f1f1ed586623a5d07a5b3ce7918e626.scope killed as a result of limit of /system.slice/docker-4da60884c4c56e1c58f86fcd853ad7510f1f1ed586623a5d07a5b3ce7918e626.scope
[8314616.921191] memory: usage 262144kB, limit 262144kB, failcnt 40450
[8314616.922103] memory+swap: usage 262144kB, limit 524288kB, failcnt 0
[8314616.922984] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[8314616.923858] Memory cgroup stats for /system.slice/docker-4da60884c4c56e1c58f86fcd853ad7510f1f1ed586623a5d07a5b3ce7918e626.scope: cache:1692KB rss:260448KB rss_huge:0KB mapped_file:100KB swap:0KB inactive_anon:0KB active_anon:260448KB inactive_file:880KB active_file:800KB unevictable:0KB
[8314616.929470] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[8314616.930563] [18920]     0 18920     1156       55       8        0             0 sh
[8314616.931594] [18943]     0 18943     1141       45       8        0             0 tail
[8314616.932646] [21759]     0 21759     4660      174      15        0             0 bash
[8314616.933688] [22149]     0 22149     4625      144      15        0             0 bash
[8314616.934702] [22190]     0 22190     9148      128      23        0             0 top
[8314616.935703] [ 4079]     0  4079     4652      167      15        0             0 bash
[8314616.936711] [ 5767]     0  5767     4651      162      14        0             0 bash
[8314616.937711] [ 5797]     0  5797   787447    64256     179        0             0 java
[8314616.938708] Memory cgroup out of memory: Kill process 5808 (Common-Cleaner) score 983 or sacrifice child
[8314616.939859] Killed process 5797 (java), UID 0, total-vm:3149788kB, anon-rss:257012kB, file-rss:12kB, shmem-rss:0kB
```
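Before unpacking the log, note that the symptom itself is easy to reproduce. The sketch below is illustrative only: the class name `OomDemo` is mine, and it assumes a container started with a hard memory limit (for example `docker run -m 256m ...`) and a JVM max heap set above that limit (for example `java -Xmx1g OomDemo`). Under those conditions the kernel, not the JVM, ends the process, so no OutOfMemoryError ever reaches the application log.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative reproduction: grow the heap past the container's
// memory limit. Because -Xmx is larger than the cgroup limit, the
// kernel's oom-killer kills the process before the JVM can throw
// OutOfMemoryError, matching the "log stops abruptly" symptom.
public class OomDemo {
    public static void main(String[] args) throws InterruptedException {
        List<byte[]> hog = new ArrayList<>();
        while (true) {
            byte[] chunk = new byte[10 * 1024 * 1024]; // 10 MB per step
            for (int i = 0; i < chunk.length; i += 4096) {
                chunk[i] = 1; // touch every page so RSS really grows
            }
            hog.add(chunk);
            Thread.sleep(100); // slow the growth so it is observable in top
        }
    }
}
```

Worth knowing in this context: JDKs before 8u191 were not container-aware and sized the default max heap from the host's memory rather than the cgroup limit, which makes exactly this failure mode easy to hit in Docker.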
Analyzing the log, the cause is clear: the container's memory cgroup hit its limit (memory: usage 262144kB, limit 262144kB) and the kernel killed the java process (PID 5797). Two terms in the log were new to me, oom-killer and cgroup, so let's look at each in detail:
oom-killer: a mechanism in the Linux kernel that monitors memory use. When a memory allocation fails in kernel space, or memory runs short (for the system as a whole or, as here, for a memory cgroup), the mechanism scores every process's memory footprint according to an algorithm and kills the process with the highest score.
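The score the kernel computes is visible through procfs, and oom_score_adj (the field that appears in the log above) lets you bias it. A minimal, Linux-only sketch for inspecting the current process (the class name `OomScore` is mine):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Minimal sketch (Linux-only): print the values the oom-killer
// consults for this process. oom_score is the computed "badness";
// oom_score_adj ranges from -1000 to 1000 and biases the score,
// with -1000 exempting the process from OOM kills entirely.
public class OomScore {
    public static void main(String[] args) throws IOException {
        for (String name : new String[] {"oom_score", "oom_score_adj"}) {
            byte[] raw = Files.readAllBytes(Paths.get("/proc/self/" + name));
            System.out.println(name + " = " + new String(raw).trim());
        }
    }
}
```

In the log above every process had oom_score_adj 0, so the ranking came down to memory footprint alone, and the java process with rss 64256 pages was the obvious victim.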
cgroup: the kernel uses cgroups (control groups) to group processes, limit the resources each group may consume, and account for their usage. The kernel exposes the cgroup interface as a virtual filesystem of type cgroupfs.
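Because the interface is just a filesystem, the limit that killed our process can be read like any other file. A minimal sketch for the cgroup v1 layout, which matches the 3.10 kernel in the log (on cgroup v2 systems the equivalent files are memory.max and memory.current; the class name `CgroupMemory` is mine):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Minimal sketch (cgroup v1): read the memory limit, current usage,
// and the number of times the limit was hit (failcnt) for the memory
// cgroup mounted at the conventional path. Inside a container this
// typically reflects the container's own cgroup.
public class CgroupMemory {
    private static String read(String path) throws IOException {
        return new String(Files.readAllBytes(Paths.get(path))).trim();
    }

    public static void main(String[] args) throws IOException {
        String base = "/sys/fs/cgroup/memory/";
        System.out.println("limit_in_bytes = " + read(base + "memory.limit_in_bytes"));
        System.out.println("usage_in_bytes = " + read(base + "memory.usage_in_bytes"));
        System.out.println("failcnt        = " + read(base + "memory.failcnt"));
    }
}
```

These are the same counters the kernel printed when it killed the process: usage 262144kB, limit 262144kB, failcnt 40450.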
References
Linux vm运行参数之(二):OOM相关的参数
Linux进程突然被杀掉(OOM killer)
cgroup & oom-killer 简介
容器学习笔记——OOM Killer与Memory Cgroup
docker cgroup 技术之memory(首篇)
Linux实战-2:内存不足触发Linux OOM-killer机制分析
linux下OOM问题排查