Split off from the C++ memory-monitoring write-up.
A memleak demo
https://github.com/CalvinNeo/ebpf-heap-profile
```
export port=5761
```
The output stacks look roughly like this:
```
15728640 bytes in 2 allocations from stack
```
Testing with TiFlash's large transactions
Based on this demo, we ran tests driving TiFlash with large transactions.
combined mode and outstanding mode
There are two modes in memleak: the outstanding mode and the combined outstanding mode (referred to as combined mode below).
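The difference between the two is essentially one of bookkeeping. Below is a conceptual sketch, an assumed simplification rather than memleak's actual code (the real tool keeps these records in BPF maps updated from malloc/free probes): outstanding mode keeps one record per live allocation keyed by address, while combined mode only keeps a running total per stack.

```c
/* toy_modes.c — conceptual sketch of memleak's two bookkeeping modes
 * (assumed simplification; not the tool's actual implementation). */
#include <stdio.h>
#include <stdint.h>

#define MAX_ALLOCS 16
#define MAX_STACKS 4

/* outstanding mode: one record per live allocation, keyed by address. */
struct alloc_info { void *addr; uint64_t size; int stack_id; };
static struct alloc_info outstanding[MAX_ALLOCS];

/* combined mode: one running total per call stack, keyed by stack_id. */
struct combined_info { uint64_t total_size; uint64_t nallocs; };
static struct combined_info combined[MAX_STACKS];

static void on_alloc(void *addr, uint64_t size, int stack_id) {
    for (int i = 0; i < MAX_ALLOCS; i++)
        if (!outstanding[i].addr) { /* insert a per-address entry */
            outstanding[i] = (struct alloc_info){addr, size, stack_id};
            break;
        }
    combined[stack_id].total_size += size; /* bump the per-stack totals */
    combined[stack_id].nallocs += 1;
}

static void on_free(void *addr) {
    for (int i = 0; i < MAX_ALLOCS; i++)
        if (outstanding[i].addr == addr) {
            int sid = outstanding[i].stack_id;
            combined[sid].total_size -= outstanding[i].size;
            combined[sid].nallocs -= 1;
            outstanding[i] = (struct alloc_info){0}; /* drop the entry */
            break;
        }
}

int main(void) {
    on_alloc((void *)0x1000, 512, 0);
    on_alloc((void *)0x2000, 512, 0);
    on_alloc((void *)0x3000, 256, 1);
    on_free((void *)0x2000);
    /* combined mode's view: the same shape as memleak's report lines. */
    for (int s = 0; s < MAX_STACKS; s++)
        if (combined[s].nallocs)
            printf("%llu bytes in %llu allocations from stack %d\n",
                   (unsigned long long)combined[s].total_size,
                   (unsigned long long)combined[s].nallocs, s);
    return 0;
}
```

Combined mode needs only one map entry per distinct stack, which is why it is cheaper but loses per-allocation detail such as the age of each block.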
The outstanding mode returns noticeably less data than the combined mode. For example, in one test we found:
- jeprof reports 2188 MiB.
- combined mode reports 4.40 GB, of which:
  - about 1 GB shows up as "stack information lost";
  - about 836 MB is attributed to read_carllerche_bytes, discussed in detail below.
- outstanding mode reports 189 MB.
The reasons the outstanding mode returns less are:
stack_id returns -17
On Linux, -17 is -EEXIST, which bpf_get_stackid returns when the captured stack collides with an existing hash bucket in the stack-trace map. The related error -EFAULT (-14) means the BPF program failed while trying to read the user-space stack, possibly because:
- the user-space stack is inaccessible (e.g., the target process's virtual memory is unmapped, or permissions are insufficient);
- the stack-trace depth exceeds the maximum limit (the max_stack_depth configuration);
- the process has already exited, invalidating the stack information.
Too many stacks
When using the outstanding mode, memleak generates thousands of stacks, far more than the --top limit. As a result, many of them are dropped when the result is printed.
The strange read_carllerche_bytes function
Still, the read_carllerche_bytes function accounts for 19% of the memory. However, if we hook only the malloc calls, this function disappears, so it may be due to some malfunction when monitoring the different memory-allocation entry points. TiFlash seldom uses mmap to allocate memory, so hooking only malloc calls works fine for us.
```
876616479 bytes in 16369 allocations from stack
```
Comparing memleak, jemalloc, and RSS
- The allocations reported by memleak can exceed RSS.
- However, for a specific function such as DB::Region::insert, the share of memory it accounts for is similar whether measured with memleak or with jeprof.
Why do RSS, jeprof, and ebpf report different numbers?
- According to the detailed test results, there is no strict consistency among the numbers recorded by RSS, ebpf, and jeprof.
- The allocations tracked by ebpf are slightly larger than RSS.
- The allocations tracked by jeprof are larger than both RSS and ebpf.
- In short: jeprof > ebpf >~ RSS.
- Of course, the following test also shows that the memory recorded by RSS can grow larger than what jemalloc reports:
```
rss 31904 allocated 14220 retained 6152 mapped 26616
rss 31904 allocated 14860 retained 5508 mapped 27260 <== rss grow slower
rss 33952 allocated 15500 retained 4864 mapped 27904 <== rss grow faster
```
Jemalloc and RSS can differ substantially
- Mapped memory is not necessarily backed by physical memory, so mapped/allocated > RSS is possible.
- Jemalloc does not mmap from the system exactly the memory requested by every malloc; it requests a chunk from the system in one go and then carves pieces out of that chunk to serve subsequent malloc calls (see jemalloc's implementation for details). Therefore RSS/mapped > allocated is possible.
The results of ebpf and RSS are close to the real payload. ebpf may show slightly more allocations, possibly because:
- The allocation points into memory that jemalloc has already allocated.
- Jemalloc has only allocated the memory via mmap and never accessed it, so the system may allocate the physical pages lazily, only when a page fault comes in.

On Linux, if we want an immediate allocation of physical RAM, we need to call mmap with the MAP_POPULATE flag. Other approaches, such as touching the memory immediately after allocation or calling mlock, can also be considered; see the sketch below.
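A minimal sketch of the lazy-allocation behavior described above, assuming Linux and an arbitrary 64 MiB test size (the VmRSS parsing is just one way to observe it):

```c
/* populate.c — compare a lazy mmap against MAP_POPULATE by watching VmRSS. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Read VmRSS (in KiB) from /proc/self/status. */
static long vm_rss_kib(void) {
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kib = -1;
    while (f && fgets(line, sizeof(line), f))
        if (sscanf(line, "VmRSS: %ld kB", &kib) == 1)
            break;
    if (f)
        fclose(f);
    return kib;
}

int main(void) {
    const size_t len = 64UL << 20; /* 64 MiB, an arbitrary test size */
    printf("baseline RSS: %ld KiB\n", vm_rss_kib());

    /* Lazy mapping: only virtual address space; RSS barely moves. */
    char *lazy = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (lazy == MAP_FAILED) { perror("mmap"); return 1; }
    printf("after lazy mmap: %ld KiB\n", vm_rss_kib());

    /* Touching every page triggers page faults that back it with RAM. */
    memset(lazy, 1, len);
    printf("after touching: %ld KiB\n", vm_rss_kib());

    /* MAP_POPULATE asks the kernel to fault the pages in up front. */
    char *eager = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (eager == MAP_FAILED) { perror("mmap"); return 1; }
    printf("after MAP_POPULATE mmap: %ld KiB\n", vm_rss_kib());

    munmap(lazy, len);
    munmap(eager, len);
    return 0;
}
```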
Further experiments
For a more precise experiment we can't use TiFlash; we need exact control over the memory allocation instead, so we designed the following experiment.
Compile test.cpp from https://github.com/CalvinNeo/jemalloc/tree/debug to get test2. It attempts 5 rounds of allocation, 512 KiB each time.
```
nohup ./test2 > b.log 2>&1 &
```
The fields in this program's output:
- RSS: the VmRSS read from /proc/self/status.
- stats.allocated: the memory actually requested by the application, excluding jemalloc's internal fragmentation and excluding memory that has been freed but not yet returned to the OS by jemalloc.
- stats.mapped: the total amount of virtual address space jemalloc has mapped from the OS via mmap.
- stats.retained: memory jemalloc has already mmapped from the OS but that is not currently used by any arena; it is kept around in case it can be reused later. Retained memory is excluded from mapped statistics such as stats.mapped.
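A minimal sketch of how a program like test2 can read these counters, assuming a jemalloc build that exposes the unprefixed mallctl (writing to "epoch" refreshes the cached statistics; error handling omitted):

```c
/* stats.c — read jemalloc's allocated/mapped/retained counters.
 * Build against jemalloc, e.g.: cc stats.c -ljemalloc */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <jemalloc/jemalloc.h>

static void print_stats(const char *tag) {
    uint64_t epoch = 1;
    size_t esz = sizeof(epoch);
    /* Writing to "epoch" makes jemalloc refresh its statistics. */
    mallctl("epoch", &epoch, &esz, &epoch, esz);

    size_t allocated, mapped, retained, sz = sizeof(size_t);
    mallctl("stats.allocated", &allocated, &sz, NULL, 0);
    mallctl("stats.mapped", &mapped, &sz, NULL, 0);
    mallctl("stats.retained", &retained, &sz, NULL, 0);
    printf("%s: allocated %zu KiB, mapped %zu KiB, retained %zu KiB\n",
           tag, allocated >> 10, mapped >> 10, retained >> 10);
}

int main(void) {
    print_stats("before");
    /* Mimic the experiment: 5 rounds of 512 KiB each. */
    void *p[5];
    for (int i = 0; i < 5; i++) {
        p[i] = malloc(512 << 10);
        print_stats("after alloc");
    }
    for (int i = 0; i < 5; i++)
        free(p[i]);
    print_stats("after free");
    return 0;
}
```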
The RSS is as follows:
```
RSS 2560 KiB, allocated 3372 - 108 = 3264 KiB, mapped 15700 - 12408 = 3292 KiB
```
From the above we can conclude that allocated (3372 - 108 = 3264 KiB) and mapped (15700 - 12408 = 3292 KiB) are close to each other, and both exceed the RSS of 2560 KiB, which happens to equal the 5 × 512 KiB the program allocates.
memleak's output is shown below. Above the ===============*********=============== divider is the combined mode; below it is the outstanding mode.
```
Attaching to pid 37627, Ctrl+C to quit.
```
A brief introduction to memleak
blazesym
It uses blazesym to resolve the symbols:
- If the binary is not compiled with -g, line numbers are unavailable.
- If the binary is stripped, only addresses are available.
stack_id
Memleak uses stack_id to track different call stacks.
- Nature and Storage Mechanism
  - stack_id is an integer assigned by the kernel through interfaces like bpf_get_stackid() operating on a map of type BPF_MAP_TYPE_STACK_TRACE.
  - This map uses the hash of the call stack as a key, stores the compressed stack information, and returns a unique ID.
  - The same call-stack pattern (i.e., an identical sequence of instruction addresses at each level) always receives the same stack_id.
- Core Role
- Avoids repeatedly transmitting complete call stack data by replacing it with a lightweight ID.
- Enables efficient indexing and correlation of stack traces in performance analysis or debugging scenarios.
- Lifecycle
- Created in BPF maps and persists as long as the BPF program is running.
- May be automatically freed when the map is destroyed or manually deleted.
- Usage Constraints
  - The validity of a stack_id is context-dependent: it is meaningful only within the same BPF map, for example one declared as BPF_STACK_TRACE(stack_traces, 10240).
  - User-space tools typically resolve these IDs back to symbolic stack traces using additional metadata (e.g., /proc/kallsyms or debug symbols).
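As a sketch of how a BCC-style tool obtains and uses these IDs (the probe name and map sizes here are illustrative, not memleak's actual source):

```c
// BCC-style BPF C sketch (illustrative; not memleak's actual source).
#include <uapi/linux/ptrace.h>

// Stack storage: hashes each call stack and hands back a stack_id.
BPF_STACK_TRACE(stack_traces, 10240);

// Per-stack byte totals, keyed by stack_id.
BPF_HASH(bytes_by_stack, int, u64);

// Attach as a uprobe on malloc; size is malloc's first argument.
int on_malloc_enter(struct pt_regs *ctx, size_t size) {
    // Capture the user-space stack; a negative id means failure,
    // e.g. -EEXIST (-17) on a hash collision or -EFAULT (-14) when
    // the user stack cannot be read.
    int stack_id = stack_traces.get_stackid(ctx, BPF_F_USER_STACK);
    if (stack_id < 0)
        return 0; // such samples surface as "stack information lost"

    u64 zero = 0;
    u64 *total = bytes_by_stack.lookup_or_try_init(&stack_id, &zero);
    if (total)
        __sync_fetch_and_add(total, size);
    return 0;
}
```

User space then iterates bytes_by_stack, fetches the address list behind each stack_id from the stack map, and symbolizes the addresses (the demo does this with blazesym).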
Environment requirements
The kernel has to be compiled with support for BPF:

```
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_BPF_EVENTS=y
```

The Docker container for TiFlash has to be bootstrapped with the privileges required by BPF:
```yaml
securityContext:
  privileged: true
  capabilities:
    add:
      - SYS_ADMIN
      - SYS_RESOURCE
      - SYS_PTRACE
      - NET_ADMIN
      - SYSLOG
      - IPC_LOCK
  seccompProfile:
    type: Unconfined
```

We need to refer to the configs in /boot. The kernel needs kernel-devel installed, with the correct version:

```
sh-5.1# yum install kernel-devel
```