If your Elastic Compute Service (ECS) instance goes down and the "Out of memory and no killable processes" error message appears in an error log, you can use the solution described in this topic to fix the issue.
Problem description
An instance goes down at runtime and a call stack similar to the following one is displayed:
[28663.625353] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[28663.625363] [ 1799] 0 1799 26512 245 56 3 0 -1000 sshd
[28663.625367] [29219] 0 29219 10832 126 26 3 0 -1000 systemd-udevd
[28663.625375] Kernel panic - not syncing: Out of memory and no killable processes...
[28663.634374] CPU: 1 PID: 3578 Comm: kworker/u176:4 Tainted: G OE 3.10.0-1062.9.1.el7.x86_64 #1
[28663.676873] Call Trace:
[28663.679312] [<ffffffff8139f342>] dump_stack+0x63/0x81
[28663.684421] [<ffffffff811b2245>] panic+0xf8/0x244
[28663.689184] [<ffffffff811b98db>] out_of_memory+0x2eb/0x550
[28663.694726] [<ffffffff811be254>] __alloc_pages_may_oom+0x114/0x1c0
[28663.700959] [<ffffffff811bedb3>] __alloc_pages_slowpath+0x7d3/0xa40
[28663.707279] [<ffffffff811bf229>] __alloc_pages_nodemask+0x209/0x260
[28663.713599] [<ffffffff81216535>] alloc_pages_current+0x95/0x140
[28663.719573] [<ffffffff811ba5ee>] __get_free_pages+0xe/0x40
[28663.725113] [<ffffffff81075dae>] pgd_alloc+0x1e/0x160
[28663.730225] [<ffffffff810875e4>] mm_init+0x184/0x240
[28663.735249] [<ffffffff81088102>] mm_alloc+0x52/0x60
[28663.740186] [<ffffffff81257640>] do_execveat_common.isra.37+0x250/0x780
[28663.759839] [<ffffffff81257b9c>] do_execve+0x2c/0x30
[28663.764864] [<ffffffff810a231b>] call_usermodehelper_exec_async+0xfb/0x150
[28663.777246] [<ffffffff81741dd9>] ret_from_fork+0x39/0x50
Cause
The instance goes down when the operating system kernel fails to allocate memory to processes, attempts to kill specific processes to release memory, but finds that no process that runs on the instance can be killed. This issue may occur due to one of the following reasons:
A memory leak occurs in the operating system kernel, which causes insufficient available memory in the system.
The processes whose oom_score_adj value is set to -1000 use excessive memory and cannot be killed. This also causes insufficient available memory in the system.
Note: The value of oom_score_adj is an integer that indicates the likelihood of a process being selected to be killed by the kernel under Out of Memory (OOM) conditions. A lower value indicates that a process is less likely to be selected for OOM killing by the kernel, while a higher value indicates that a process is more likely to be selected.
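For example, the following loop is a minimal sketch that lists the processes whose oom_score_adj value is -1000 and that the kernel therefore never selects for OOM killing. The output format is only illustrative:
# List all processes whose oom_score_adj value is -1000.
for pid in $(ls /proc | grep -E '^[0-9]+$'); do
    adj=$(cat /proc/"$pid"/oom_score_adj 2>/dev/null) || continue
    if [ "$adj" = "-1000" ]; then
        echo "PID $pid ($(cat /proc/"$pid"/comm 2>/dev/null)): oom_score_adj=$adj"
    fi
done
In the sample call stack in the Problem description section, the sshd and systemd-udevd processes both have an oom_score_adj value of -1000 (the second-to-last column), which is why the kernel could not kill them.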
Solution
Before you perform the operations in the solution on a Linux instance on which the issue occurred, we recommend that you create snapshots for the Linux instance to back up data. This prevents data loss caused by accidental operations. For information about how to create a snapshot, see Create a snapshot for a disk.
Check whether a memory leak occurs in the operating system kernel.
For more information, see What do I do if an instance has a high percentage of slab_unreclaimable memory?
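For a quick check, you can inspect the slab memory statistics. The following commands are a minimal sketch; slabtop must be installed on the instance, and how large a SUnreclaim value counts as abnormal depends on your instance type and workload:
# Check the slab statistics. A large and continuously growing SUnreclaim value may indicate a kernel memory leak.
grep -E -i 'slab|sreclaimable|sunreclaim' /proc/meminfo
# Display the kernel slab caches that consume the most memory (one-shot output).
slabtop -o | head -n 20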
Check whether the oom_score_adj value is properly set.
Run the ps, top, or pgrep command to obtain the PID of a specific process. Sample command:
ps aux | grep <Process name>
Replace <Process name> with the name of the process whose PID you want to obtain.
Run the following command to check the oom_score_adj value:
cat /proc/<PID>/oom_score_adj
Replace <PID> with the actual PID that you obtained.
Based on your environment and requirements, evaluate whether the OOM killing settings of the processes are appropriate according to their oom_score_adj values. If the oom_score_adj value of a process is -1000, the kernel never selects the process for OOM killing. If such a process uses excessive memory, the available memory in the system may become insufficient and the instance may go down.