Monitoring Memory Allocation with eBPF

Split out from the notes on C++ memory monitoring approaches.

A memleak demo

https://github.com/CalvinNeo/ebpf-heap-profile

export port=5761
sudo lsof -i:$port
sudo python memleak.py 20 1 --pid XXX --obj /bin/tiflash/tiflash

The reported stacks look roughly like this:

15728640 bytes in 2 allocations from stack
0x000055d982b5062e os_pages_map.llvm.2428227075819520184+0x4e [tiflash]
0x000055d982b503b6 pages_map+0x26 [tiflash]
0x000055d982b1744a ehooks_default_alloc_impl+0xca [tiflash]
0x000055d982b1b5cd ecache_alloc_grow+0x35d [tiflash]
0x000055d982b4e4d2 pac_alloc_impl.llvm.2573546850293068282+0x142 [tiflash]
0x000055d982b0d48f arena_extent_alloc_large+0x15f [tiflash]
0x000055d982b497c1 large_palloc+0x301 [tiflash]
0x000055d982af2706 malloc_default+0x7a6 [tiflash]
0x000055d97b7a4074 Allocator<false>::alloc(unsigned long, unsigned long)+0xd4 [tiflash]
...
0x000055d9825bf310 PreHandleSnapshot+0x250 [tiflash]
0x00007fd51ab59f48 proxy_ffi::engine_store_helper_impls::_$LT$impl$u20$proxy_ffi..interfaces..root..DB..EngineStoreServerHelper$GT$::pre_handle_snapshot::h9e183a85dcb8c111+0x168 [libtiflash_proxy.so]
16777216 bytes in 1 allocations from stack

Testing with large TiFlash transactions

Based on this demo, we ran tests against TiFlash using large transactions.

combined mode and outstanding mode

memleak has two modes: the outstanding mode and the combined outstanding mode (referred to as combined mode below).

outstanding mode reports noticeably less data than combined mode. For example, in one test we observed:

  • jeprof reported 2188 MiB.
  • combined mode reported 4.40 GB
    • about 1 GB of it was shown as "stack information lost"
    • about 836 MB was attributed to read_carllerche_bytes, which is discussed in detail below.
  • outstanding mode reported 189 MB

The reasons why outstanding mode reports less are:

  • stack_id comes back negative (e.g. -17)
    A stack_id of -17 corresponds to -EEXIST, which bpf_get_stackid() returns when a stack hashes into a bucket of the stack-trace map that is already occupied by a different stack (or the map has no free buckets). A related failure, -EFAULT (-14), means the BPF program could not read the user-space stack, typically because:
    1. the user-space stack is not accessible (e.g. the target process's virtual memory is unmapped or permissions are insufficient);
    2. the stack depth exceeds the configured maximum (max_stack_depth);
    3. the process has already exited, so the stack information is no longer valid.
  • Too many stacks
    In outstanding mode memleak generates thousands of stacks, far more than --top, so many of them are dropped when the result is printed.

The strange read_carllerche_bytes function

Still, the read_carllerche_bytes function accounts for about 19% of the reported memory. However, if we hook only the malloc call, this function disappears, so the number is probably an artifact of monitoring several different allocation methods at once. TiFlash rarely allocates memory with mmap directly, so hooking only malloc calls works fine.

876616479 bytes in 16369 allocations from stack
protobuf::stream::CodedInputStream::read_carllerche_bytes::he73c610ff7dfb55f+0x186 [libtiflash_proxy.so]

Comparing memleak, jemalloc, and RSS

  • The allocations reported by memleak can exceed RSS.
  • However, for a specific function such as DB::Region::insert, the share of memory it accounts for is similar whether measured by memleak or by jeprof.

Why do RSS, jeprof, and ebpf report different numbers?

  • According to the detailed test results, there is no strict consistency between the numbers recorded by RSS, ebpf, and jeprof.
    • The allocations tracked by ebpf are slightly larger than RSS.
    • The allocations tracked by jeprof are larger than both RSS and ebpf.
    • In short: jeprof > ebpf >~ RSS
  • That said, the test below also shows that the memory recorded by RSS can grow larger than what jemalloc reports:
    rss 31904 allocated 14220 retained 6152 mapped 26616
    rss 31904 allocated 14860 retained 5508 mapped 27260 <== rss grow slower
    rss 33952 allocated 15500 retained 4864 mapped 27904 <== rss grow faster
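
For reference, the jeprof numbers above come from jemalloc's built-in heap profiler. As a minimal sketch (not TiFlash's actual code), a heap dump that jeprof can analyze may be triggered from inside a process roughly like this, assuming the binary links against an unprefixed jemalloc and is started with MALLOC_CONF=prof:true; the output path is illustrative:

#include <cstdio>
#include <jemalloc/jemalloc.h>

// Ask jemalloc to write a heap profile for later analysis with jeprof.
// This only works if the process was started with MALLOC_CONF=prof:true.
static void dump_heap_profile() {
    const char *path = "/tmp/tiflash.heap";   // illustrative output path
    if (mallctl("prof.dump", nullptr, nullptr, &path, sizeof(path)) != 0) {
        std::fprintf(stderr, "prof.dump failed (is profiling enabled?)\n");
    }
}

The dump can then be fed to jeprof together with the binary to get per-function totals comparable to the memleak stacks above.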

jemalloc and RSS can differ significantly

  • Mapped memory is not necessarily backed by physical pages, so mapped/allocated > RSS is possible.
  • jemalloc does not mmap from the OS exactly the amount requested by each malloc call; it requests a chunk from the system in one go and then hands out pieces of that chunk to subsequent malloc calls (see jemalloc's implementation for the details). Therefore, RSS/mapped > allocated is also possible.

The results from ebpf and RSS are both close to the real payload. ebpf may show slightly more allocations, possibly because:

  • The tracked allocation points to memory that jemalloc has already obtained.
  • jemalloc only reserved that memory via mmap and never touched it, so the system allocates the physical pages lazily, only when a page fault occurs.
  • On Linux, to have physical RAM committed immediately, mmap must be called with the MAP_POPULATE flag. Alternatives such as touching the pages right after allocation, or calling mlock, can also be considered (see the sketch below).
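
A minimal sketch of the lazy vs. eager behaviour (the sizes and the program itself are illustrative, not part of memleak or TiFlash):

#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

static void pause_and_report(const char *step) {
    // Leave time to inspect VmRSS in /proc/<pid>/status from another shell.
    std::printf("%s (pid %d), sleeping 10s...\n", step, (int)getpid());
    sleep(10);
}

int main() {
    const size_t len = 64UL << 20;  // 64 MiB

    // Lazy mapping: only virtual address space is reserved; RSS barely
    // changes until the pages are actually touched (page faults).
    void *lazy = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    pause_and_report("after lazy mmap");

    // MAP_POPULATE asks the kernel to pre-fault the pages, so RSS grows immediately.
    void *eager = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    pause_and_report("after mmap with MAP_POPULATE");

    // Touching the lazy mapping has the same effect, one page fault at a time.
    if (lazy != MAP_FAILED)
        std::memset(lazy, 0, len);
    pause_and_report("after touching the lazy mapping");

    if (lazy != MAP_FAILED) munmap(lazy, len);
    if (eager != MAP_FAILED) munmap(eager, len);
    return 0;
}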

Further experiments

For a more precise experiment we cannot use TiFlash; we need to control the allocations exactly, so we designed the following experiment.

Compile test.cpp from https://github.com/CalvinNeo/jemalloc/tree/debug to get test2. It allocates 512 KiB five times, one allocation per round; a rough sketch of such a program follows.
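
The following is only a hypothetical approximation of what such a test program might look like, written to match the stacks shown later (keep_allocate, operator new); the actual test.cpp in the repository may differ:

#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Keep the allocations reachable so memleak's outstanding mode counts them.
static std::vector<char *> g_live;

static void keep_allocate() {
    for (int i = 0; i < 5; ++i) {
        char *p = new char[512 * 1024];                 // 512 KiB per round
        // Touch every page so RSS actually grows (see MAP_POPULATE above).
        for (size_t off = 0; off < 512 * 1024; off += 4096)
            p[off] = 1;
        g_live.push_back(p);
        std::printf("allocated round %d\n", i + 1);
        std::this_thread::sleep_for(std::chrono::seconds(5));  // let memleak sample
    }
}

int main() {
    keep_allocate();
    // Stay alive so the profiler can keep sampling.
    std::this_thread::sleep_for(std::chrono::minutes(10));
    return 0;
}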

nohup ./test2  > b.log 2>&1 &
sudo python /ebpf-heap-profile/memleak.py 5 1 --pid `ps aux | grep test2 | grep -v grep | awk '{print $2}'` --obj /jemalloc/test2 > mleak.txt
kill `ps aux | grep test2 | grep -v grep | awk '{print $2}'`

What the program's output means:

  • RSS: the VmRSS value read from /proc/self/status
  • stats.allocated
    The memory actually requested by the application; it excludes jemalloc's internal fragmentation and also excludes memory that has already been freed but not yet returned to the OS by jemalloc.
  • stats.mapped
    The total amount of virtual address space jemalloc has mapped from the OS via mmap.
  • stats.retained
    Memory that jemalloc has already obtained from the OS via mmap but that is currently not used by any arena; it is simply kept around in case it is needed again.
stats.mapped
└── all virtual memory managed by jemalloc
    ├── stats.allocated        ← in use by the application
    ├── free memory inside arenas
    │   ├── dirty
    │   └── muzzy
    └── stats.retained         ← held back by jemalloc, not handed out
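
These counters can be read through jemalloc's mallctl interface. Below is a minimal sketch of such a statistics routine, presumably similar to what the test program does, assuming the binary links against an unprefixed jemalloc (the statistics keys are jemalloc's documented names; the printing format is illustrative):

#include <cstdio>
#include <jemalloc/jemalloc.h>

// Print RSS (from /proc/self/status) next to jemalloc's counters, in KiB.
static void print_stats() {
    // jemalloc caches its statistics; bump the epoch to refresh them first.
    uint64_t epoch = 1;
    size_t epoch_sz = sizeof(epoch);
    mallctl("epoch", &epoch, &epoch_sz, &epoch, epoch_sz);

    size_t allocated = 0, mapped = 0, retained = 0, sz = sizeof(size_t);
    mallctl("stats.allocated", &allocated, &sz, nullptr, 0);
    mallctl("stats.mapped", &mapped, &sz, nullptr, 0);
    mallctl("stats.retained", &retained, &sz, nullptr, 0);

    long rss_kib = 0;
    if (FILE *f = std::fopen("/proc/self/status", "r")) {
        char line[256];
        while (std::fgets(line, sizeof(line), f))
            if (std::sscanf(line, "VmRSS: %ld kB", &rss_kib) == 1)
                break;
        std::fclose(f);
    }

    std::printf("rss %ld allocated %zu retained %zu mapped %zu\n",
                rss_kib, allocated / 1024, retained / 1024, mapped / 1024);
}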

The recorded RSS and jemalloc statistics are as follows (the first line summarizes the deltas):

RSS 2560 KiB, allocated 3372 - 108 = 3264 KiB, mapped 15700 - 12408 = 3292 KiB
rss 19292 allocated 108 retained 1928 mapped 12408, (19292, 108, 1928, 12408)

rss 19292 allocated 806 retained 1220 mapped 13116, (0, 698, 18014398509481276, 708)
rss 19292 allocated 1448 retained 572 mapped 13764, (0, 641, 18014398509481336, 648)
rss 21852 allocated 2088 retained 2488 mapped 14408, (2560, 640, 1916, 644)
rss 21852 allocated 2728 retained 1844 mapped 15052, (0, 640, 18014398509481340, 644)
rss 21852 allocated 3372 retained 1196 mapped 15700, (0, 644, 18014398509481336, 648)

rss 21852 allocated 4012 retained 3624 mapped 16344, (0, 640, 2428, 644)
rss 21852 allocated 4652 retained 2980 mapped 16988, (0, 640, 18014398509481340, 644)

From the above we can conclude that allocating 5 × 512 KiB = 2560 KiB grew RSS by 21852 - 19292 = 2560 KiB, while allocated grew by 3372 - 108 = 3264 KiB and mapped grew by 15700 - 12408 = 3292 KiB.

memleak's output is shown below; everything above the ===============*********=============== separator is the combined-mode result, and everything below it is the outstanding-mode result.

Attaching to pid 37627, Ctrl+C to quit.
[13:38:02] Top 99 stacks with outstanding allocations:
64 bytes in 1 allocations from stack
operator new(unsigned long)+0x9 [test2]
...
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*&&)+0x2e [test2]
keep_allocate()+0x6c [test2]
main+0x9 [test2]
__libc_start_call_main+0x80 [libc.so.6]
160 bytes in 5 allocations from stack
operator new(unsigned long)+0x9 [test2]
main+0x9 [test2]
__libc_start_call_main+0x80 [libc.so.6]
2621445 bytes in 5 allocations from stack
operator new(unsigned long)+0x9 [test2]
===============*********===============
[13:38:02] Top 99/11 stacks with outstanding allocations:
64 bytes in 1 allocations from stack
0x0000000000472949 operator new(unsigned long)+0x9 [test2]
...
0x0000000000403ffc keep_allocate()+0x6c [test2]
0x0000000000404081 main+0x9 [test2]
0x00007f210c029590 __libc_start_call_main+0x80 [libc.so.6]
160 bytes in 5 allocations from stack
0x0000000000472949 operator new(unsigned long)+0x9 [test2]
0x0000000000404081 main+0x9 [test2]
0x00007f210c029590 __libc_start_call_main+0x80 [libc.so.6]
2621445 bytes in 5 allocations from stack
0x0000000000472949 operator new(unsigned long)+0x9 [test2]

A brief introduction to memleak

blazesym

memleak uses blazesym to resolve the symbols:

  • If the binary is not compiled with -g, line numbers cannot be resolved.
  • If the binary is stripped, only raw addresses are available.

stack_id

Memleak uses stack_id to track different call stacks.

  1. Nature and Storage Mechanism
  • stack_id is an integer assigned by the kernel through interfaces such as bpf_get_stackid() against a BPF_MAP_TYPE_STACK_TRACE map.
  • This map uses the hash of the call stack as the key, stores the compressed stack information, and returns a unique ID.
  • The same call stack pattern (i.e., an identical sequence of instruction addresses at each level) always receives the same stack_id.
  2. Core Role
  • Avoids repeatedly transmitting complete call stack data by replacing it with a lightweight ID.
  • Enables efficient indexing and correlation of stack traces in performance analysis or debugging scenarios.
  3. Lifecycle
  • Created in BPF maps and persists as long as the BPF program is running.
  • May be freed automatically when the map is destroyed, or deleted manually.
  4. Usage Constraints
  • The validity of a stack_id is context-dependent, which means it is meaningful only within the same BPF map.
    For example,

    BPF_STACK_TRACE(stack_traces, 10240);
  • User-space tools typically resolve these IDs back to symbolic stack traces using additional metadata (e.g., /proc/kallsyms or debug symbols).
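
As a rough sketch of how this fits together inside a bcc-based tool like memleak (written in bcc's restricted C; the map and probe names here are illustrative, not memleak's actual source):

BPF_STACK_TRACE(stack_traces, 10240);   // stack hash -> compressed frames
BPF_HASH(bytes_by_stack, int, u64);     // hypothetical: outstanding bytes per stack_id

int alloc_probe(struct pt_regs *ctx, size_t size) {
    // A negative return value (e.g. -EEXIST on a hash collision, -EFAULT when
    // the user stack cannot be read) means this allocation gets no stack.
    int stack_id = stack_traces.get_stackid(ctx, BPF_F_USER_STACK);
    if (stack_id < 0)
        return 0;
    u64 zero = 0;
    u64 *total = bytes_by_stack.lookup_or_try_init(&stack_id, &zero);
    if (total)
        __sync_fetch_and_add(total, size);
    return 0;
}

User space then walks bytes_by_stack, looks each stack_id up in stack_traces, and resolves the addresses to symbols (here via blazesym).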

Environment requirements

  • The kernel has to be compiled with BPF support:

    CONFIG_BPF=y
    CONFIG_BPF_SYSCALL=y
    CONFIG_BPF_JIT=y
    CONFIG_HAVE_EBPF_JIT=y
    CONFIG_BPF_EVENTS=y
  • The Docker container running TiFlash has to be started with the privileges required by BPF:

    securityContext:
      privileged: true
      capabilities:
        add:
          - SYS_ADMIN
          - SYS_RESOURCE
          - SYS_PTRACE
          - NET_ADMIN
          - SYSLOG
          - IPC_LOCK
      seccompProfile:
        type: Unconfined
  • The kernel configs under /boot (e.g. /boot/config-$(uname -r)) need to be accessible.

  • kernel-devel has to be installed, and its version has to match the running kernel:

    sh-5.1# yum install kernel-devel