Linux Tracing

Version History
Date Description
Jan 18, 2021 Minor update
Sep 6, 2020 Add more eBPF
Jun 10, 2019 Initial version

This blog tries to explain how various Linux tracers work, especially their core low-level mechanisms and how they relate to each other.


Intro

In Linux, we have:

  • ftrace
  • kprobe
  • uprobe
  • perf_event
  • tracepoints
  • eBPF

We can think about all of these tools this way:

  • Tracing needs two parts. 1) A mechanism to get data and invoke a callback. This means we need a way to get our tracing/profiling code invoked on a running system. This can be static or dynamic. Static means we add our tracing code to the source code, like tracepoints. Dynamic means we add our tracing code while the system is running, like ftrace and kprobe. 2) Doing our work within the callback. All of them provide some sort of handling, but eBPF is the most extensive one.
  • For example, ftrace, kprobe, and perf_event include the callback facilities, although they are not limited to this. ftrace uses the call mcount approach to do a callback on every single function invocation. kprobe dynamically patches instructions and does the callback within exception handlers. perf_event can let the CPU fire an NMI. Those are all mechanisms to capture performance data.
  • In all, ftrace, kprobe, uprobe, perf_event, and tracepoints all have mechanisms to get data and do callbacks. ftrace is not programmable by normal users; it only prints the info. kprobe allows us to attach customized pre-/post-handlers. perf_event is not programmable; it only reports numbers. Unlike all of the above subsystems, eBPF itself cannot intercept any program, but it can be attached to any of the above probes and run customized programs. That’s why eBPF looks so versatile!
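
To make the static vs. dynamic distinction a bit more concrete, here is a minimal toy sketch in Python. It is not kernel code: ToyKernel, its symbol table, and the alloc_page/free_page names are all made up for illustration. One function has its trace call written into the source (static, like a tracepoint), the other gets a callback wrapped around it at run time (dynamic, loosely like ftrace or kprobe).

```python
# Toy model only: plain Python callables stand in for kernel functions.
# ToyKernel and all function names below are hypothetical.

class ToyKernel:
    def __init__(self):
        self.log = []
        # "Symbol table": name -> function currently installed under that name.
        self.symbols = {
            "alloc_page": self._alloc_page,
            "free_page": self._free_page,
        }

    def call(self, name, *args):
        return self.symbols[name](*args)

    def _alloc_page(self, order):
        # Static hook: the trace call was written into the source ahead of time.
        self.log.append(("static", "alloc_page", (order,)))
        return 1 << order

    def _free_page(self, order):
        # No tracing code here; a dynamic tracer has to add it at run time.
        return 1 << order

    def attach_dynamic(self, name, callback):
        """Dynamic hook: wrap an already-installed function at run time."""
        original = self.symbols[name]

        def wrapper(*args):
            callback(name, args)        # part 2: user-defined handling
            return original(*args)

        self.symbols[name] = wrapper


kernel = ToyKernel()
kernel.call("alloc_page", 2)            # static hook fires
kernel.call("free_page", 3)             # nothing is recorded yet
kernel.attach_dynamic(
    "free_page", lambda name, args: kernel.log.append(("dynamic", name, args))
)
pages = kernel.call("free_page", 3)     # dynamic hook fires now
print(kernel.log)
```

The point of the sketch is only that the first part (getting invoked) and the second part (what the callback does) are separable, which is the split the rest of this post relies on.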

The BPF Performance Tools book section 2 also takes a deep dive into this topic, and it links all subsystems together with a bit of history as well. Also see the blog from Julia: Linux tracing systems & how they fit together.

ftrace

  • Mechanism
    • For each non-inlined function, gcc (when building with -pg) inserts a call mcount instruction, or a call __fentry__ instruction with -mfentry, at the very beginning. This means whenever a function is called, the mcount() or __fentry__() callback will be invoked.
    • Having these call instructions introduces a lot of overhead. So by default the kernel replaces each call with a nop. Only after we enable tracing for a function (for example, echo a function name > set_ftrace_filter and select a tracer in current_tracer) will the ftrace code replace the nop with a call. Do note, Linux uses the linker magic again here, check Steven’s slides.
    • You can run objdump -d vmlinux and see the following instruction at the start of almost all functions: callq ffffffff81a01560 <__fentry__>.
    • x86 related code: arch/x86/kernel/ftrace_64.S, arch/x86/kernel/ftrace.c
    • Question: it seems we can know when a function gets called by using fentry, but how can we know the function has returned? The trick is: the return address is pushed onto the stack when a function gets called. So ftrace, again, can replace that return address so that it catches the exit time and calculates the latency of the function. Neat!!
  • Resources
  • Usage
    • Files under /sys/kernel/debug/tracing/*
    • perf help ftrace
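
The two tricks in the mechanism above (nop/call patching at function entry, and swapping the return address to catch function exit) can be sketched with a small Python toy. This is a simulation under my own assumptions, not how the kernel implements it: ToyFtrace, run(), the tick-based clock, and the string "addresses" are all hypothetical stand-ins.

```python
# Toy simulation of the ftrace entry/exit tricks. All names are hypothetical.

class ToyFtrace:
    def __init__(self):
        self.entry_slot = {}     # function name -> "nop" or "call"
        self.shadow_stack = []   # saved (name, real return address, entry time)
        self.latencies = []      # (name, ticks spent in the function)
        self.clock = 0           # fake clock, advanced by function bodies

    def register(self, name):
        self.entry_slot[name] = "nop"     # default: tracing disabled

    def set_filter(self, name):
        self.entry_slot[name] = "call"    # patch nop -> call

    def fentry(self, name, stack):
        # stack[-1] is the return address pushed by the caller.
        self.shadow_stack.append((name, stack[-1], self.clock))
        stack[-1] = "return_trampoline"   # hijack the return path

    def return_trampoline(self):
        name, real_ret, t0 = self.shadow_stack.pop()
        self.latencies.append((name, self.clock - t0))
        return real_ret                   # resume at the real caller


def run(ftrace, name, body_ticks, return_address):
    """Simulate calling one function and return where execution resumes."""
    stack = [return_address]              # the call instruction pushed this
    if ftrace.entry_slot[name] == "call":
        ftrace.fentry(name, stack)
    ftrace.clock += body_ticks            # the function body executes
    target = stack.pop()                  # ret pops the return address
    if target == "return_trampoline":
        target = ftrace.return_trampoline()
    return target


ft = ToyFtrace()
ft.register("func_a")
ft.register("func_b")
ft.set_filter("func_a")                   # only func_a is traced

resumed_a = run(ft, "func_a", 7, "caller+0x10")
resumed_b = run(ft, "func_b", 3, "caller+0x20")
print(ft.latencies)
```

Note that the traced function still returns to its real caller; the trampoline only gets a chance to look at the clock on the way out.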

kprobe

  • Mechanism
    • Kprobe replaces the original assembly instruction with an int3 trap instruction. So when the PC reaches the address of the original instruction, an int3 CPU exception happens. Within do_int3(), the kernel calls back into the core kprobe layer to run the pre-handler. After single-stepping the original instruction, the CPU raises a debug exception. The kernel enters do_debug(), where kprobe runs the post-handler.
    • Kprobe is powerful, because it’s able to trace almost everything at the instruction level.
    • Kprobe can NOT touch things inside entry.S. It needs a valid pt_regs to operate.
  • Resources
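
The int3 flow described above can be sketched as a toy in Python. Again this is only a model under my own assumptions: ToyKprobes, the list-of-strings "kernel text", and the regs dict are hypothetical, and real kprobes deal with many details (instruction boundaries, optimized jump probes, reentrancy) that are ignored here.

```python
# Toy model of the kprobe breakpoint flow. All names are hypothetical.

class ToyKprobes:
    def __init__(self, text):
        self.text = list(text)   # "kernel text": one instruction per slot
        self.probes = {}         # addr -> (saved instruction, pre, post)

    def register(self, addr, pre=None, post=None):
        # Save the original instruction and patch in the trap.
        self.probes[addr] = (self.text[addr], pre, post)
        self.text[addr] = "int3"

    def unregister(self, addr):
        saved, _, _ = self.probes.pop(addr)
        self.text[addr] = saved  # restore the original instruction

    def run(self):
        executed = []
        for pc, insn in enumerate(self.text):
            regs = {"pc": pc}
            if insn == "int3":
                saved, pre, post = self.probes[pc]
                if pre:
                    pre(regs)            # breakpoint exception -> pre-handler
                executed.append(saved)   # single-step the original instruction
                if post:
                    post(regs)           # debug exception -> post-handler
            else:
                executed.append(insn)
        return executed


original = ["push", "mov", "add", "ret"]
hits = []
kp = ToyKprobes(original)
kp.register(
    2,
    pre=lambda regs: hits.append(("pre", regs["pc"])),
    post=lambda regs: hits.append(("post", regs["pc"])),
)
patched_text = list(kp.text)
executed = kp.run()
kp.unregister(2)
print(patched_text, executed, hits)
```

The traced code still executes the same instruction stream; the only difference is that the handlers get to run around the probed instruction.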

eBPF

  • Mechanism
    • I think one of the most important things is to understand the relationship between eBPF and the others.
    • Part I: Hook. eBPF attaches its programs to kprobe/uprobe/ftrace/perf_event. You can think of eBPF as a generic callback layer for kprobe/uprobe/ftrace/perf_event. It’s essentially the second part of tracing we mentioned above. (see include/uapi/linux/bpf.h, you can find BPF_PROG_TYPE_KPROBE, BPF_PROG_TYPE_PERF_EVENT)
    • Part II: Run. eBPF runs user-supplied programs when the above hooks are invoked. eBPF is event-driven.
  • Usually as users, we do not need to write and load eBPF programs directly. That process is quite involved: you need to compile programs into eBPF bytecode and then use the bpf() syscall to load them into the kernel. Quite a few higher-level frameworks have been introduced. For example, bcc is a layer on top of raw eBPF that smooths the process. bpftrace is a layer even higher than bcc, where users can write scripts to control eBPF. There are more frameworks in this space. Once you understand how it works underneath, it is not hard to understand and use the high-level frameworks.
  • Resources
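
The "hook, then run" split can be sketched as a toy in Python. This is only a model of the idea that one user-supplied program can sit behind several different hook providers; ToyBpf, attach(), fire(), and the hook name strings are hypothetical and are not the real bpf() syscall or bcc API.

```python
# Toy model of eBPF as a generic callback layer. All names are hypothetical.
from collections import Counter


class ToyBpf:
    def __init__(self):
        self.attached = {}       # hook name -> list of attached programs

    def attach(self, hook, prog):
        # Part I: Hook. Remember which program sits behind which hook.
        self.attached.setdefault(hook, []).append(prog)

    def fire(self, hook, ctx):
        # Part II: Run. Called by the hook provider (kprobe, tracepoint,
        # perf_event, ...) whenever it triggers; nothing runs otherwise.
        for prog in self.attached.get(hook, ()):
            prog(ctx)


counts = Counter()

def count_by_comm(ctx):
    """User-supplied program: count events per process name."""
    counts[ctx["comm"]] += 1


bpf = ToyBpf()
# The same program is attached behind two different kinds of hooks.
bpf.attach("kprobe:do_anonymous_page", count_by_comm)
bpf.attach("tracepoint:syscalls:sys_enter_mmap", count_by_comm)

bpf.fire("kprobe:do_anonymous_page", {"comm": "bash"})
bpf.fire("tracepoint:syscalls:sys_enter_mmap", {"comm": "bash"})
bpf.fire("tracepoint:syscalls:sys_enter_mmap", {"comm": "vim"})
bpf.fire("perf_event:page-faults", {"comm": "vim"})   # nothing attached here
print(dict(counts))
```

The program itself never intercepts anything; it only runs when one of the existing hook mechanisms fires, which is the relationship described above.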

perf

Trace in real time:

Print the number of page faults that happen each second:

perf stat -e "page-faults" -I 1000 -a -- sleep 10

Print the number of mmap syscalls that happen each second:

perf stat -e "syscalls:sys_enter_mmap" -I 1000 -a -- sleep 10

Dynamically trace kernel functions:

perf probe --add do_anonymous_page
perf stat -I 5000 -e "page-faults,probe:do_anonymous_page" -- sleep 10
perf probe --del=probe:do_anonymous_page

References

I have previously built the LegoOS profilers and profile points.

The kernel maintains a top-level trace index file here.