【k8s pod container内存指标说明】

一、问题描述

我司平台研发的devops平台底层采用k8s实现，k8s自带cadvisor进行集群指标收集，根据官网，我们选用了container_memory_working_set_bytes（容器的工作集使用量）作为内存使用量的观察项，但随着后续使用过程中发现该指标上升到一定大小后就会维持不变，并不像应用实际内存使用量，没出现波动；

来自kubernetes对该问题的讨论（讨论了5年多了）：https://github.com/kubernetes/kubernetes/issues/43916

二、原因分析

⚠️以下是建立在关闭swap交换分区的前提下分析

经过一系列分析发现使用container_memory_working_set_bytes不合理，应该使用container_memory_rss来表示应用实际内存使用量；

我最开始受cadvisor git问题栏中描述的影响，里面描述说：k8s kill pod时是根据该指标使用情况来判断的，所以下意识认为该指标表示pod实际内存使用量；官网说的没有问题，container_memory_working_set_bytes指标就是k8s来控制pod是否被kill的依据参考，但不代表container_memory_working_set_bytes接近pod limit后就一定会被kill，这里涉及到下面要详细分析的pod内存分布情况；

获取pod内存记录信息
在cadvisor中，采集应用内存信息具体实现实际是获取docker cgroup中memory.stat中的内容，下面是通过cat /sys/fs/cgroup/memory/memory.stat获取到的pod内存数据：

cache 118784
rss 767553536
rss_huge 0
mapped_file 32768
swap 0
pgpgin 215484
pgpgout 28064
pgfault 388710
pgmajfault 0
inactive_anon 0
active_anon 767496192
inactive_file 32768
active_file 86016
unevictable 0
hierarchical_memory_limit 1073741824
hierarchical_memsw_limit 1073741824
total_cache 118784
total_rss 767553536
total_rss_huge 0
total_mapped_file 32768
total_swap 0
total_pgpgin 215484
total_pgpgout 28064
total_pgfault 388710
total_pgmajfault 0
total_inactive_anon 0
total_active_anon 767496192
total_inactive_file 32768
total_active_file 86016
total_unevictable 0

pod内存记录说明
上面获取内存记录，主要关心total_cache、total_rss、total_inactive_anon、total_active_anon、total_inactive_file、total_active_file

total_cache：表示当前pod缓存内存量
total_rss：表示当前应用进程实际使用内存量
total_inactive_anon：表示匿名不活跃内存使用量
total_active_anon：表示匿名活跃内存使用量，jvm堆内存使用量会被计算在此处
total_inactive_file：表示不活跃文件内存使用量
total_active_file：表示活跃文件内存使用量

container_memory_working_set_bytes、container_memory_rss指标的详细分析

⚠️ 建议先查看最后附录内容，了解cadvisor中每个指标的意义，有助于帮助下面内容理解。

容器当前使用内存量： container_memory_usage_bytes = total_cache + total_rss
容器当前使用缓存内存： total_cache = total_inactive_file + total_active_file
container_memory_working_set_bytes=container_memory_usage_bytes - total_inactive_file为官方源代码提供，带入上面两个公式，容器的工作集的等式可以拆解为：

container_memory_working_set_bytes 
= container_memory_usage_bytes - total_inactive_file
= total_cache + total_rss - total_inactive_file
= total_inactive_file + total_active_file + total_rss - total_inactive_file
= total_active_file + total_rss

即：container_memory_working_set_bytes = total_active_file+ total_rss，其中total_rss为应用真实使用内存量，正常情况下该指标数值稳定，那为何该指标会持续上升而且一直维持很高呢？其实问题就出现在total_active_file上；

Linux系统为了提高文件读取速率，会划分出来一部分缓存内存，即cache内存，这部分内存有个特点，当应用需要进行io操作时，会向Linux申请一部分内存，这部分内存归属于操作系统，当应用io操作完毕后，操作系统不会立即回收；
当操作系统认为系统剩余内存不足时（判断依据未深究），才会主动回收这部分内存。

这部分内存属于操作系统，jvm无任何管理权限，故也不会把这部分内存计算到JVM中。container_memory_working_set_bytes指标升高一部分是应用本身JVM内存使用量增加，另一部分就是进行了io操作，total_active_file升高；该指标异常一般都是应用进行了io相关操作。

模拟io操作

你可以在pod中创建一个大一点的日志文件，通过 cat 文件名 > /dev/null 将其加载到缓存内存中，你会发现total_active_file升高并且不会被释放；

删除加载后的日志文件，模拟系统回收内存操作，你会发现total_active_file降低；
或者执行echo 3 > /proc/sys/vm/drop_caches命令，前提是用户有足够权限

三、合理监控

关心应用使用内存大小用container_memory_rss;
排查pod为何被kill，关注container_memory_working_set_bytes、total_active_file、total_inactive_file

附录 k8s pod container内存指标说明

container_memory_cache表示容器使用的缓存内存。
container_memory_mapped_file表示容器使用的映射文件内存。
container_memory_rss表示容器的常驻内存集（Resident Set Size），即实际使用的物理内存量。
container_memory_swap表示容器使用的交换内存量。
container_memory_usage_bytes表示容器当前使用的内存量，包括常驻内存和缓存，**缓存部分往往会有很多处于空闲**。
container_memory_working_set_bytes表示容器的工作集，即容器当前活跃使用的内存量，不包括缓存。
container_memory_max_usage_bytes表示容器历史上使用过的最大内存量。
container_memory_failcnt表示容器内存失败计数，即无法分配所需内存的次数。
container_memory_failures_total表示容器内存分配失败的总次数。

原文地址：https://blog.csdn.net/qq_34468174/article/details/130643948