Linux/LegoOS x86 Floating Point Unit
|Jan 9, 2021
|Repolished after reading the why mmap is faster syscall post. Indeed, the difference is that mmap uses a user-level AVX-aided memmove while the kernel cannot. That reminded me of this post, so I decided to move it here.
|Feb 22, 2018
This blog documents how the kernel deals with the x86 FPU at a high level.
The FPU is heavily used by user-level code, but not by the kernel.
You may not use it directly, but glibc is using it all over the place, e.g. the
The x86 FPU is a really complex technology designed by Intel.
Of course its performance is good and it is widely used, but the legacy compatibility? Hmm, not so yummy.
The current x86 FPU code is well-written. Even though I don’t understand some of the low-level code, I do enjoy reading it. The naming convention, the code organization, the file organization, the header files, it is a nice piece of art.
Below I will briefly list kernel subsystems that use FPU. My understanding is based on code before the 2019 FPU patch, so some facts may have changed already.
FPU detection and init happen during early boot.
struct fpu is a dynamically-sized structure.
Its size depends on which features the underlying CPU supports.
struct fpu is part of task_struct, so task_struct is dynamically-sized as well
(task_struct -> thread_struct -> fpu).
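To make the dynamic sizing concrete, here is a minimal user-space sketch, not the kernel's code. All names (`task_sketch`, `fpu_state_size`, `task_sketch_size`, `task_sketch_alloc`) are hypothetical. The only idea taken from the text above is that the register save area sits at the very end of the task structure, so its size can be decided at boot rather than at compile time. The kernel nests this as task_struct -> thread_struct -> fpu; I flatten it into one struct here to stay within standard C.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for the size detected at boot.
 * 512 bytes is the size of the legacy FXSAVE area; a CPU with more
 * features enabled needs a larger area.
 */
size_t fpu_state_size = 512;

struct task_sketch {
    int pid;
    /* ... other fixed-size fields would go here ... */
    unsigned int fpu_last_cpu;
    /* Must stay last: the save area is sized at run time. */
    unsigned char fpu_state[];
};

/* Size of one task object for the currently detected FPU features. */
size_t task_sketch_size(void)
{
    return sizeof(struct task_sketch) + fpu_state_size;
}

/* Allocate a zeroed task with room for the FPU save area. */
struct task_sketch *task_sketch_alloc(int pid)
{
    struct task_sketch *t = calloc(1, task_sketch_size());

    if (t)
        t->pid = pid;
    return t;
}
```

The design point is that nothing after `fpu_state` can exist in the struct, which is why the FPU part has to be the last member all the way up the nesting chain.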
cpu_init() also calls back into the FPU code to initialize the local CPU's FPU.
The FPU has a huge number of registers.
Each thread has its own FPU context.
However, the CPU itself will not save or restore any FPU registers automatically;
it is software's duty to save and restore the FPU context properly.
And FPU context/registers are saved into
Thus whenever we switch tasks, we also need to switch the FPU context
(note: not always; it is optional, because the kernel uses a lazy switching trick).
__visible struct task_struct *
__switch_to(struct task_struct *prev_p, struct task_struct *next_p)
fpu_switch = switch_fpu_prepare(prev_fpu, next_fpu, cpu);
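The lazy trick is easier to see in a toy model. The sketch below is my own user-space simulation, not the kernel's implementation: `cpu_sketch`, `fpu_ctx`, `switch_fpu_lazy` and `fpu_use` are all hypothetical names, and the `trap_on_use` flag only plays the role that CR0.TS plays on real hardware. On a switch we save the outgoing context if it is live in the registers, but we defer the restore until the incoming task actually touches the FPU.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NREGS 8                    /* hypothetical tiny register file */

struct fpu_ctx {                   /* per-thread saved FPU context */
    double regs[NREGS];
};

struct cpu_sketch {
    double regs[NREGS];            /* the "hardware" FPU registers */
    struct fpu_ctx *owner;         /* whose values are live in regs */
    bool trap_on_use;              /* stand-in for CR0.TS */
    unsigned long restores;        /* number of restores performed */
};

void cpu_sketch_init(struct cpu_sketch *cpu)
{
    memset(cpu, 0, sizeof(*cpu));
    cpu->trap_on_use = true;       /* nobody owns the registers yet */
}

/* Context switch: save the outgoing state if live, defer the restore. */
void switch_fpu_lazy(struct cpu_sketch *cpu,
                     struct fpu_ctx *prev, struct fpu_ctx *next)
{
    if (prev && cpu->owner == prev)
        memcpy(prev->regs, cpu->regs, sizeof(prev->regs));
    /* If next's values are still in the registers, no trap is needed. */
    cpu->trap_on_use = (cpu->owner != next);
}

/*
 * A task touches the FPU. If the trap flag is set, this models the
 * device-not-available handler restoring the task's context first.
 */
double *fpu_use(struct cpu_sketch *cpu, struct fpu_ctx *cur)
{
    if (cpu->trap_on_use) {
        memcpy(cpu->regs, cur->regs, sizeof(cpu->regs));
        cpu->owner = cur;
        cpu->trap_on_use = false;
        cpu->restores++;
    }
    return cpu->regs;
}
```

In this model, if task A is switched out and back in while nothing else touched the FPU, A pays no restore at all, which is the whole point of being lazy.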
- fork() and clone(): When a new thread or process is created, the FPU context is copied from the calling thread.
- execve(): during this syscall, the FPU context will be cleared.
- exit(): When a thread exits, FPU will do cleanup based on whether
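The fork and execve cases in the list above can be sketched in a few lines. This is a simplified user-space model under my own assumptions: `fpu_ctx_sketch`, `fpu_sketch_fork` and `fpu_sketch_exec` are hypothetical names, the fixed 512-byte area is just for illustration, and zeroing the state is a simplification because the real init state is not all zeros. The exit case is not modelled.

```c
#include <string.h>

#define FPU_STATE_BYTES 512        /* hypothetical fixed size for the sketch */

struct fpu_ctx_sketch {
    int initialized;               /* has this task used the FPU yet? */
    unsigned char state[FPU_STATE_BYTES];
};

/* fork()/clone(): the child starts with a copy of the caller's context. */
void fpu_sketch_fork(struct fpu_ctx_sketch *child,
                     const struct fpu_ctx_sketch *parent)
{
    memcpy(child, parent, sizeof(*child));
}

/* execve(): the new program starts from a cleared context. */
void fpu_sketch_exec(struct fpu_ctx_sketch *cur)
{
    memset(cur, 0, sizeof(*cur));
}
```

The copy is by value, so later FPU activity in the child does not leak back into the parent's saved context.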
device not available exception, which may be triggered if lazyfpu is used.
do_coprocessor_error() are some math related exceptions.
The kernel needs to set up a sigframe for user-level signal handlers.
A sigframe is a contiguous piece of stack memory that consists of the general-purpose registers AND the FPU registers.
So the signal handling code has to call back into the FPU code to set up and copy the FPU registers to the in-stack sigframe.
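Here is a rough sketch of what "contiguous frame on the stack" means. It is an illustration under my own assumptions, not the kernel's layout: `gp_regs_sketch`, `fpu_regs_sketch`, `sigframe_sketch` and `setup_sigframe_sketch` are hypothetical names, the register sets are tiny placeholders, and the real frame has more fields plus alignment requirements that I ignore here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct gp_regs_sketch  { uint64_t ip, sp, flags; };    /* placeholder set */
struct fpu_regs_sketch { unsigned char bytes[512]; };  /* placeholder area */

/* One contiguous frame: general purpose registers, then FPU registers. */
struct sigframe_sketch {
    struct gp_regs_sketch  gp;
    struct fpu_regs_sketch fpu;
};

/*
 * Carve a frame out of a downward-growing user stack and fill it.
 * Returns the new stack pointer, or NULL if the frame does not fit.
 */
unsigned char *setup_sigframe_sketch(unsigned char *stack_base,
                                     unsigned char *sp,
                                     const struct gp_regs_sketch *gp,
                                     const struct fpu_regs_sketch *fpu)
{
    if ((size_t)(sp - stack_base) < sizeof(struct sigframe_sketch))
        return NULL;

    sp -= sizeof(struct sigframe_sketch);
    /* The generic signal code would fill in this part ... */
    memcpy(sp + offsetof(struct sigframe_sketch, gp), gp, sizeof(*gp));
    /* ... and this is the part that needs help from the FPU code. */
    memcpy(sp + offsetof(struct sigframe_sketch, fpu), fpu, sizeof(*fpu));
    return sp;
}
```

The user-level handler then runs with its stack pointer below this frame, and returning from the handler walks the same frame in the other direction to put both register sets back.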
Signal handling is another beast.
Compatibility is a heavy thing to carry. But it is also a nice thing for marketing. No one can deny the success of Intel on its backward compatibility. Bad for low-level system developers.
- 2019 FPU patch: https://lkml.org/lkml/2019/4/3/877