Skip to content

Linux Tracing

Version History
Date Description
Sep 6, 2020 Add more eBPF
Jun 10, 2019 Initial version

Notes about on various tracers and profilers from the Linux kernel. Also links to some previous notes about the profilers/tracers in LegoOS: notes, and profile points.

Under Linux development envionment, we have:

  • ftrace
  • kprobe
  • uprobe
  • perf_event
  • tracepoints
  • eBPF

For all these tools, I tend to think this way:

  • Tracing needs two parts, 1) Mechanims to get data and do callback. This means we need a way to let our tracing/profiling code got invoked on a running system. This can be static or dynamic. Static means we added our tracing code to source code, like tracepoints. Dynamic means we added our tracing code when system is running, like ftrace and kprobe. 2) Do our stuff within callback. All of them provide some sort of handling. But eBPF is the most extensive one.
  • For example, ftrace, kprobe, and perf_event include the callback facilities, although they are not just limited to this. ftrace has the call mount way to do callback on every single function invocation. kprobe dynamically patch instructions and to do callback within exception handlers. perf_event can let CPU fire NMI interrupt. Those are all mechanisms to catch perf data.
  • In all, ftrace, kprobe, uprobe, perf_event, tracepoints all have mechanisms to get data and do callback. ftrace is not programmable by normal users, it only prints the info. kprobe allows us to attach customized pre-/post-handlers. perf_event is not programmable, it only reports numbers. Unlike all above subsystems, eBPF itself cannot intercept any programs, but it can be attached to any of the above probes and run customized programs. That’s why eBPF looks so versatile!

The BPF Performance Tools book section 2 also takes a deep dive into this topic, and it links all subsystems together with a bit of history as well. Also see the blog from Julia: Linux tracing systems & how they fit together.

ftrace:

  • Mechanism
    • For each un-inlined function, gcc inserts a call mcount, or a call fentry instruction at the very beginning. This means whenever a function is called, the mcount() or the fentry() callback will be invoked.
    • Having these call instructions introduce a lot overheads. So by default kernel replace call with nop. Only after we echo something > setup_filter_functions will the ftrace code replace nop with call. Do note, Linux uses the linker magic again here, check Steven’s slides.
    • You can do a objdump vmlinux -d, and able to see the following instructions for almost all functions: callq ffffffff81a01560 <__fentry__>.
    • x86 related code: arch/x86/kernel/ftrace_64.S, arch/x86/kernel/ftrace.c
    • Questions: it seems we can know when a function got called by using fentry, but how can we know the function has returned? The trick is: the returning address is pushed to stack when a function got called. So ftrace, again, can replace that return address, so it can catch the exit time, and calculate the latency of a function. Neat!!
  • Resources
  • Usage
    • Files under /sys/kernel/debug/tracing/*
    • perf help ftrace

kprobe:

  • Mechanism
    • Kprobe replaces the original assembly instruction with an int3 trap instruction. So when we ran into the PC of the original instruction, an int3 CPU exception will happen. Within do_in3(), kernel will callback to core kprobe layer to do pre-handler. After singlestep, CPU have debug exception. Kernel walks into do_debug(), where kprobe run post-handler.
    • Kprobe is powerful, because it’s able to trace almost everything at instruction level.
    • Kprobe can NOT touch things inside entry.S. It needs a valid pt_regs to operate.
  • Resources

eBPF:

  • Mechanism
    • I think the most important thing is to understand what’s the relationship between eBPF and the others.
    • Part I: Hook. eBPF attach its program to kprobe/uprobe/ftrace/perf_event. You can think eBPF of a generic callback layer for kprobe/uprobe/ftrace/perf_event. It’s essentially the second part of tracing as we mentioned above. (see include/uapi/linux/bpf.h, you can find BPF_PROG_TYPE_KPROBE, BPF_PROG_TYPE_PERF_EVENT)
    • Part II Run: eBPF run program when the above hook got invoked. eBPF is event-driven. The program can be user-written eBPF code. Other articles explained it well.
  • Resources
  • To use eBPF, use:

perf:

Trace in real time:

Print the number of page faults happen in every one second:

perf stat -e "page-faults" -I 1000 -a -- sleep 10

Print the numberf of mmap syscall happen in every one second:

perf stat -e "syscalls:sys_enter_mmap" -I 1000 -a -- sleep 10

Dynamically trace kernel functions:

perf probe --add do_anonymous_page
perf stat -I 5000 -e "page-faults,probe:do_anonymous_page" -- sleep 10
perf probe --del=probe:do_anonymous_page


Last update: September 7, 2020

Comments