
Linux Tracing

Version History
  • Jan 18, 2021: Minor update
  • Sep 6, 2020: Add more eBPF
  • Jun 10, 2019: Initial version

This blog explains how various Linux tracers work, focusing on their core low-level mechanisms and how they relate to each other.


Intro

In Linux, we have:

  • ftrace
  • kprobe
  • uprobe
  • perf_event
  • tracepoints
  • eBPF

We can think about all of these tools this way:

  • Tracing needs two parts. 1) A mechanism to get data and invoke a callback: a way to have our tracing/profiling code run on a live system. This can be static or dynamic. Static means the tracing hooks are compiled into the source code, like tracepoints. Dynamic means the hooks are inserted while the system is running, like ftrace and kprobe. 2) Doing our work inside the callback. All of these subsystems provide some form of handling, but eBPF is by far the most extensive.
  • For example, ftrace, kprobe, and perf_event all include callback facilities, although they are not limited to that. ftrace uses the mcount/fentry call at the start of every function to run a callback on each function invocation. kprobe dynamically patches instructions and runs callbacks inside exception handlers. perf_event can have the CPU fire an NMI. These are all mechanisms for catching performance data.
  • In all, ftrace, kprobe, uprobe, perf_event, and tracepoints each have a mechanism to get data and invoke a callback. ftrace is not programmable by normal users; it only prints information. kprobe lets us attach customized pre-/post-handlers. perf_event is not programmable either; it only reports numbers. Unlike all of the above subsystems, eBPF cannot intercept anything by itself, but it can be attached to any of the above probes and run customized programs. That’s why eBPF looks so versatile!
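The two-part model above can be sketched in userspace Python. This is only a hedged analogy, not kernel code: a "static" probe is a hook written into the function itself (like a tracepoint), a "dynamic" probe patches an existing function at runtime (like kprobe/ftrace), and both just invoke registered callbacks. All names here are made up for illustration.

```python
# Userspace analogy of the two-part tracing model (not kernel code).
# Part 1: a mechanism that invokes callbacks; Part 2: the callback itself.

callbacks = []  # registered tracing callbacks ("part 2")

def emit(event):
    # Invoke every registered callback with the event data.
    for cb in callbacks:
        cb(event)

# Static probe: the hook is compiled into the source, like a tracepoint.
def handle_request(n):
    emit(("handle_request:enter", n))
    return n * 2

# Dynamic probe: patch an existing function at runtime, like kprobe/ftrace.
def sqr(x):
    return x * x

def attach_dynamic(func, name):
    def patched(*args):
        emit((name + ":enter", args))
        return func(*args)
    return patched

sqr = attach_dynamic(sqr, "sqr")  # "patch" the target at runtime

events = []
callbacks.append(events.append)   # our "tracer" just records events

handle_request(21)
sqr(5)
print(events)  # [('handle_request:enter', 21), ('sqr:enter', (5,))]
```

Note how the tracer itself (the list-appending callback) is decoupled from the mechanism that fires it; that separation is exactly what lets eBPF plug into many different probe types.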

Chapter 2 of the BPF Performance Tools book also takes a deep dive into this topic, and it links all the subsystems together with a bit of history as well. Also see Julia Evans' blog: Linux tracing systems & how they fit together.

ftrace

  • Mechanism
    • For each un-inlined function, gcc inserts a call mcount (or, with newer toolchains, a call __fentry__) instruction at the very beginning. This means whenever a function is called, the mcount() or __fentry__() callback is invoked.
    • Having these call instructions everywhere introduces a lot of overhead, so by default the kernel replaces each call with a nop. Only after we echo a function name into set_ftrace_filter does ftrace patch the nop back into a call. Do note, Linux uses linker magic again here; check Steven Rostedt's slides.
    • You can run objdump -d vmlinux and see the following instruction at the start of almost every function: callq ffffffff81a01560 <__fentry__>.
    • x86 related code: arch/x86/kernel/ftrace_64.S, arch/x86/kernel/ftrace.c
    • Question: with fentry we know when a function is called, but how do we know when it returns? The trick: the return address is pushed onto the stack when the function is called, so ftrace can replace that return address with its own trampoline. This lets it catch the exit time as well and calculate the latency of the function. Neat!!
  • Resources
  • Usage
    • Files under /sys/kernel/debug/tracing/*
    • perf help ftrace
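The entry-plus-return-hook idea behind the function_graph tracer can be mimicked in userspace Python. This is a hedged analogy: instead of patching the return address on the stack, we use Python's sys.settrace, which also delivers a callback at both function call and return, so we can compute latency the same way. The function name "work" is illustrative.

```python
import sys, time

# Userspace analogy of fentry + return-address patching: a callback on
# entry starts a timer; a callback on return computes the latency.

latencies = {}

def tracer(frame, event, arg):
    if event == "call" and frame.f_code.co_name == "work":
        start = time.perf_counter()              # analog: fentry callback
        def on_return(frame, event, arg):
            if event == "return":                # analog: hooked return path
                latencies[frame.f_code.co_name] = time.perf_counter() - start
            return on_return                     # keep tracing this frame
        return on_return

def work():
    time.sleep(0.05)

sys.settrace(tracer)
work()
sys.settrace(None)
print(latencies)
```

The local trace function returning itself is the Python counterpart of ftrace leaving its trampoline installed until the function actually exits.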

kprobe

  • Mechanism
    • Kprobe replaces the original assembly instruction with an int3 trap instruction. When execution reaches the PC of the original instruction, an int3 CPU exception fires. In do_int3(), the kernel calls back into the kprobe core layer to run the pre-handler. The original instruction is then single-stepped, which raises a debug exception; the kernel walks into do_debug(), where kprobe runs the post-handler.
    • Kprobe is powerful because it can trace almost everything, at instruction granularity.
    • Kprobe can NOT touch things inside entry.S: it needs a valid pt_regs to operate.
  • Resources
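The pre-/post-handler flow can be sketched in userspace Python. This is a rough analogy only: a real kprobe patches the target instruction with int3, runs the pre-handler from do_int3(), single-steps the instruction, and runs the post-handler from do_debug(); here we simply wrap a function. The kernel function name is borrowed for illustration.

```python
# Rough userspace analogy of kprobe's pre-/post-handler flow.

log = []

def register_kprobe(func, pre_handler, post_handler):
    def probed(*args):
        pre_handler(args)            # analog: do_int3() -> pre-handler
        ret = func(*args)            # analog: single-step the instruction
        post_handler(args, ret)      # analog: do_debug() -> post-handler
        return ret
    return probed

def do_anonymous_page(addr):         # stand-in for a kernel function
    return addr + 0x1000

do_anonymous_page = register_kprobe(
    do_anonymous_page,
    pre_handler=lambda args: log.append(("pre", args)),
    post_handler=lambda args, ret: log.append(("post", ret)),
)

do_anonymous_page(0x4000)
print(log)  # [('pre', (16384,)), ('post', 20480)]
```

The pre-handler sees the arguments (a real kprobe pre-handler sees pt_regs), and the post-handler sees the result, mirroring the two exception-driven callbacks.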

eBPF

  • Mechanism
    • I think one of the most important things is to understand the relationship between eBPF and the others.
    • Part I: Hook. eBPF attaches its programs to kprobe/uprobe/ftrace/perf_event. You can think of eBPF as a generic callback layer for kprobe/uprobe/ftrace/perf_event. It is essentially the second part of tracing we mentioned above. (See include/uapi/linux/bpf.h; you can find BPF_PROG_TYPE_KPROBE, BPF_PROG_TYPE_PERF_EVENT.)
    • Part II: Run. eBPF runs user-supplied programs when the above hooks are invoked. eBPF is event-driven.
  • Usually, as a user, we do not need to write and load eBPF programs directly. That process is quite involved: you need to compile programs into eBPF bytecode and then use the bpf() syscall to load them into the kernel. Quite a few higher-level frameworks have been introduced. For example, bcc is a layer on top of raw eBPF that smooths this process. bpftrace is a layer above bcc, where users write short scripts that drive eBPF. There are more frameworks in this space, but once you understand how things work underneath, the high-level frameworks are not hard to understand and use.
  • Resources
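The "eBPF = user-supplied program attached to an existing hook, plus shared maps" idea can be sketched the same way. This is a hedged userspace analogy: real eBPF programs are verified bytecode loaded via the bpf() syscall, and maps live in the kernel; all names below are made up.

```python
# Userspace analogy: eBPF itself intercepts nothing; it attaches a small
# user-supplied program to an existing hook and aggregates state in a map.

hooks = {}          # existing probe points (kprobe/uprobe/ftrace/...)

def attach(hook_name, program):
    hooks.setdefault(hook_name, []).append(program)

def fire(hook_name, ctx):
    # The probe subsystem invokes every attached "program" on each event.
    for program in hooks.get(hook_name, []):
        program(ctx)

# "BPF map": state shared between the program and the user.
counts = {}

def count_syscalls(ctx):
    counts[ctx["name"]] = counts.get(ctx["name"], 0) + 1

attach("sys_enter_mmap", count_syscalls)

for _ in range(3):
    fire("sys_enter_mmap", {"name": "mmap"})

print(counts)  # {'mmap': 3}
```

The probe layer (fire) already existed before eBPF; what eBPF adds is that count_syscalls can be supplied by the user at runtime, which is the whole point of its versatility.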

perf

Trace in real time:

Print the number of page faults in every one-second interval:

```bash
perf stat -e "page-faults" -I 1000 -a -- sleep 10
```

Print the number of mmap syscalls in every one-second interval:

```bash
perf stat -e "syscalls:sys_enter_mmap" -I 1000 -a -- sleep 10
```

Dynamically trace kernel functions:

```bash
perf probe --add do_anonymous_page
perf stat -I 5000 -e "page-faults,probe:do_anonymous_page" -- sleep 10
perf probe --del=probe:do_anonymous_page
```
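perf's sampling mode can also be mimicked in userspace with a timer signal. This is a hedged analogy: a real perf_event programs the PMU and can fire an NMI, which works even when interrupts are disabled; here an ordinary SIGALRM handler records where the program was when the "interrupt" fired.

```python
import signal, time

# Userspace analogy of perf's timer/NMI sampling: a periodic signal
# interrupts the program, and the handler records the current function.

samples = []

def on_sample(signum, frame):
    # frame points at the interrupted Python code, like a sampled PC.
    samples.append(frame.f_code.co_name)

signal.signal(signal.SIGALRM, on_sample)
signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)  # ~100 Hz sampling

def busy():
    end = time.monotonic() + 0.2
    x = 0
    while time.monotonic() < end:
        x += 1
    return x

busy()
signal.setitimer(signal.ITIMER_REAL, 0, 0)  # stop sampling
print(len(samples), "samples collected")
```

Counting how often each function name appears in samples is the same aggregation perf does to build its profile.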

References

I have built the LegoOS profilers and profile points before.

The kernel maintains a top-level trace index file here.


Last update: January 19, 2021
