Linux Tracing
Version History
Date | Description |
---|---|
Jan 18, 2021 | Minor update |
Sep 6, 2020 | Add more eBPF |
Jun 10, 2019 | Initial version |
This blog tries to explain how various Linux tracers work, especially their core low-level mechanisms and their relationships with each other.
Intro
In Linux, we have:
- ftrace
- kprobe
- uprobe
- perf_event
- tracepoints
- eBPF
For all of these tools, we can think about tracing this way:
- Tracing needs two parts:
  1) A mechanism to get data and do a callback. This means we need a way to get our tracing/profiling code invoked on a running system. This can be static or dynamic. Static means we add our tracing code to the source code, like tracepoints. Dynamic means we add our tracing code while the system is running, like ftrace and kprobe.
  2) Doing our own work within the callback. All of them provide some sort of handling, but eBPF is the most extensive one.
- For example, `ftrace`, `kprobe`, and `perf_event` include the callback facilities, although they are not limited to this. `ftrace` has the `call mcount` way to do a callback on every single function invocation. `kprobe` dynamically patches instructions and does its callbacks within exception handlers. `perf_event` can let the CPU fire an NMI interrupt. Those are all mechanisms to catch perf data.
- In all, ftrace, kprobe, uprobe, perf_event, and tracepoints all have mechanisms to get data and do callbacks. ftrace is not programmable by normal users; it only prints the info. kprobe allows us to attach customized pre-/post-handlers. perf_event is not programmable; it only reports numbers. Unlike all of the above subsystems, eBPF itself cannot intercept any programs, but it can be attached to any of the above probes and run customized programs. That's why eBPF looks so versatile!
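The two parts above can be sketched as a toy model in plain Python. This is only an illustration, not how the kernel implements any of these tracers; `HOOKS`, `attach`, `fire`, `patch_in`, and the `KERNEL` symbol table are hypothetical names invented for this sketch. The static hook is written into the traced function's source (like a tracepoint), while the dynamic hook is patched in at run time (like ftrace or kprobe). In both cases, the second part is simply the callback we attach.

```python
from collections import defaultdict

# Part 1 (mechanism): a registry mapping hook names to attached callbacks.
HOOKS = defaultdict(list)


def attach(hook_name, callback):
    """Part 2: register our own code to run whenever the hook fires."""
    HOOKS[hook_name].append(callback)


def fire(hook_name, **data):
    """Invoke every callback attached to this hook."""
    for callback in HOOKS[hook_name]:
        callback(hook_name, data)


def handle_page_fault(address):
    # Static hook: the call to fire() is part of the source code itself.
    fire("page_fault", address=address)
    return address & ~0xFFF


def do_mmap(length):
    # This function contains no tracing code of its own.
    return length


# Hypothetical "symbol table" so that callers always go through a lookup.
KERNEL = {"do_mmap": do_mmap}


def patch_in(table, func_name):
    """Dynamic hook: wrap an existing function while the system is running."""
    original = table[func_name]

    def patched(*args):
        fire("dyn:" + func_name, args=args)
        return original(*args)

    table[func_name] = patched


events = []
attach("page_fault", lambda name, data: events.append((name, data)))
attach("dyn:do_mmap", lambda name, data: events.append((name, data)))

handle_page_fault(0x1234)      # static hook fires
patch_in(KERNEL, "do_mmap")    # install the dynamic hook at "run time"
KERNEL["do_mmap"](4096)        # dynamic hook fires
print(events)
```

Note that the handlers here only append to a list; what differs between the real tracers is mostly how much freedom the handler side gives us.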
The BPF Performance Tools book section 2 also takes a deep dive into this topic, and it links all subsystems together with a bit of history as well. Also see the blog from Julia: Linux tracing systems & how they fit together.
ftrace
- Mechanism
  - For each un-inlined function, gcc inserts a `call mcount` or a `call __fentry__` instruction at the very beginning. This means that whenever a function is called, the `mcount()` or the `__fentry__()` callback will be invoked.
  - Having these `call` instructions introduces a lot of overhead. So by default the kernel replaces each `call` with a `nop`. Only after we `echo` something into `set_ftrace_filter` (and select a tracer in `current_tracer`) will the ftrace code replace the `nop` with a `call`. Do note that Linux uses the linker magic again here; check Steven's slides.
  - You can run `objdump vmlinux -d` and see the following instruction at the start of almost all functions: `callq ffffffff81a01560 <__fentry__>`.
  - x86 related code: `arch/x86/kernel/ftrace_64.S`, `arch/x86/kernel/ftrace.c`
  - Question: it seems we can know when a function gets called by using fentry, but how can we know that the function has returned? The trick is that the return address is pushed onto the stack when a function is called. So ftrace, again, can replace that return address with its own hook, which lets it catch the exit time and calculate the latency of the function. Neat!!
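The patching and return-address trick described above can be modeled in a few lines of Python. This is a toy sketch under loud assumptions: `TracedFunction`, `fentry`, `set_filter`, and `latencies` are made-up names, the "prologue" is just a string instead of real machine code, and the "return address" is modeled as a closure that the result passes through on its way back to the caller.

```python
import time


class TracedFunction:
    """Toy stand-in for a kernel function with a patchable first instruction."""

    def __init__(self, name, body):
        self.name = name
        self.body = body
        self.prologue = "nop"  # default: the entry call is patched out

    def __call__(self, *args):
        if self.prologue == "call":
            return fentry(self, args)
        return self.body(*args)


latencies = []  # (function name, seconds), filled in by the return hook


def fentry(func, args):
    """Entry callback: note the entry time, then hook the return path."""
    start = time.perf_counter()

    def return_trampoline(result):
        # Runs where the original return address would have been used.
        latencies.append((func.name, time.perf_counter() - start))
        return result  # hand the value back to the real caller

    return return_trampoline(func.body(*args))


def set_filter(functions, names):
    """Patch nop -> call only for the selected functions, call -> nop otherwise."""
    for func in functions:
        func.prologue = "call" if func.name in names else "nop"


do_work = TracedFunction("do_work", lambda n: sum(range(n)))
idle = TracedFunction("idle", lambda: None)

do_work(1000)                             # not traced: prologue is still a nop
set_filter([do_work, idle], {"do_work"})
result = do_work(1000)                    # traced: entry and return are both seen
idle()                                    # still a nop, so nothing is recorded
print(result, [name for name, _ in latencies])
```

The point of the model is that untraced functions pay almost nothing (a `nop`), and only the functions we select get the entry and return hooks.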
- Resources
- Usage
  - Files under `/sys/kernel/debug/tracing/*`
  - `perf help ftrace`
kprobe
- Mechanism
  - Kprobe replaces the original assembly instruction with an `int3` trap instruction. So when execution reaches the PC of the original instruction, an int3 CPU exception happens. Within `do_int3()`, the kernel calls back into the core kprobe layer to run the `pre-handler`. The original instruction is then single-stepped, after which the CPU raises a debug exception. The kernel enters `do_debug()`, where kprobe runs the `post-handler`.
  - Kprobe is powerful because it is able to trace almost everything at the instruction level.
  - Kprobe can NOT touch things inside `entry.S`. It needs a valid `pt_regs` to operate.
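To make the breakpoint flow above concrete, here is a minimal Python simulation. It is a sketch, not kernel code: `ToyCPU`, `INT3`, and the three "instructions" are hypothetical, each instruction is a Python function acting on a register dictionary, and the snapshot passed to the handlers plays the role of `pt_regs`.

```python
INT3 = object()  # sentinel standing in for the one-byte breakpoint instruction


def inc(regs):
    regs["acc"] += 1


def times10(regs):
    regs["acc"] *= 10


def add5(regs):
    regs["acc"] += 5


class ToyCPU:
    def __init__(self, text):
        self.text = list(text)  # "instruction memory": one entry per address
        self.probes = {}        # address -> (saved instruction, pre, post)
        self.regs = {"pc": 0, "acc": 0}

    def register_kprobe(self, addr, pre_handler, post_handler):
        # Save the original instruction and overwrite it with a breakpoint.
        self.probes[addr] = (self.text[addr], pre_handler, post_handler)
        self.text[addr] = INT3

    def unregister_kprobe(self, addr):
        # Restore the saved instruction.
        self.text[addr] = self.probes.pop(addr)[0]

    def do_int3(self):
        original, pre, post = self.probes[self.regs["pc"]]
        pre(dict(self.regs))    # breakpoint trap -> pre-handler
        original(self.regs)     # single-step the saved original instruction
        post(dict(self.regs))   # debug trap after the step -> post-handler

    def run(self):
        while self.regs["pc"] < len(self.text):
            insn = self.text[self.regs["pc"]]
            if insn is INT3:
                self.do_int3()
            else:
                insn(self.regs)
            self.regs["pc"] += 1


log = []
cpu = ToyCPU([inc, times10, add5])
cpu.register_kprobe(
    1,
    lambda regs: log.append(("pre", regs["acc"])),
    lambda regs: log.append(("post", regs["acc"])),
)
cpu.run()
print(log, cpu.regs["acc"])
```

The pre-handler sees the registers before the probed instruction runs and the post-handler sees them after it, while the program's final result is the same as it would be without the probe.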
- Resources
eBPF
- Mechanism
  - I think one of the most important things is to understand the relationship between eBPF and the others.
  - Part I: Hook. eBPF attaches its programs to kprobe/uprobe/ftrace/perf_event. You can think of eBPF as a generic callback layer for kprobe/uprobe/ftrace/perf_event. It's essentially the second part of tracing we mentioned above. (See `include/uapi/linux/bpf.h`, where you can find `BPF_PROG_TYPE_KPROBE` and `BPF_PROG_TYPE_PERF_EVENT`.)
  - Part II: Run. eBPF runs user-supplied programs when the above hooks are invoked. eBPF is event-driven.
  - Usually, as users, we do not need to write and load eBPF programs directly. That process is quite involved: you need to compile programs into eBPF bytecode and then use the `bpf()` syscall to load them into the kernel. Quite a lot of higher-level frameworks have been introduced. For example, bcc is a layer on top of raw eBPF that smooths the process. bpftrace is a layer even higher than bcc, where users can write scripts to control eBPF. There are more frameworks in this space. Once you understand how it works underneath, it is not hard to understand and use the high-level frameworks.
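As a rough illustration of the "hook, then run" split, here is a small bcc sketch. Treat it as an example under assumptions rather than a recipe: it needs root, a working bcc installation, and kernel headers; the function name `count_anonymous_faults` and the BPF function `count_fault` are made up for this post; and `do_anonymous_page` is borrowed from the perf example later in this post and may not be probe-able on every kernel build.

```python
# The eBPF program (the "run" part), written in bcc's restricted C dialect.
BPF_TEXT = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(counts, u32, u64);

int count_fault(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);
    return 0;
}
"""


def count_anonymous_faults(seconds=5):
    """Attach count_fault to a kprobe and print per-process hit counts."""
    # Imported lazily so this file can be read or imported without bcc installed.
    import time
    from bcc import BPF

    b = BPF(text=BPF_TEXT)  # compile to bytecode and load it into the kernel
    # The "hook" part: bcc creates the kprobe and attaches the program to it.
    b.attach_kprobe(event="do_anonymous_page", fn_name="count_fault")
    time.sleep(seconds)
    for pid, count in b["counts"].items():
        print(pid.value, count.value)
```

Calling `count_anonymous_faults()` as root should print one line per process that hit the probe during the window. The compile and load steps that the text describes as involved are all hidden behind `BPF(text=...)`.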
- Resources
perf
- The perf tool is simply amazing. It not only uses the CPU PMU, but is also integrated with ftrace/kprobe/eBPF.
- perf is a tool to present data, but also a tool to collect data.
- Good references
Trace in real time:
Print the number of page faults that happen every second:
perf stat -e "page-faults" -I 1000 -a -- sleep 10
Print the number of `mmap` syscalls that happen every second:
perf stat -e "syscalls:sys_enter_mmap" -I 1000 -a -- sleep 10
Dynamically trace kernel functions:
perf probe --add do_anonymous_page
perf stat -I 5000 -e "page-faults,probe:do_anonymous_page" -- sleep 10
perf probe --del=probe:do_anonymous_page
References
I have built the LegoOS profilers and profile points before.
The kernel maintains a top-level trace index file here.