Feb 2018¶
02/28 Wed¶
- patch fork, and cow handler
- debug pcache, while running python hello world
- add vDSO, gettimeofday
So, it is the end of the day. After adding the wp handler, I now have the whole picture of pcache activities and the interactions between them. The reclaim, zap, move, copy, and add operations need to be carefully synchronized, and so does the refcount etc. I feel the ground rule is: we need to make sure a PCM that a function is currently using cannot suddenly become invalid due to other operations. This has to be synced by: refcount, lock, flags. Oh well, mm is hard with SMP, but also fun.
We are very close to having a fully working OS.
I did not have time to look into the python hello world bug. It is a very serious one; it may also flush out some deeper root-cause bugs.
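The ground rule above can be sketched in miniature. This is a hypothetical userland model, not LegoOS's real API (the names `pcm`, `get_pcm`, `put_pcm` are invented for illustration): a PCM may only be touched while a reference is held, and a PCM whose refcount has dropped to zero can never be re-pinned, so reclaim cannot yank it out from under a concurrent operation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified model of the pcache rule: a PCM may only
 * be used while a reference is held, so eviction cannot free it out
 * from under a concurrent user. All names are made up; the real
 * LegoOS structures differ. */
struct pcm {
	atomic_int refcount;	/* 0 means the line is being reclaimed */
	bool valid;
};

/* Like refcount_inc_not_zero(): only pin the PCM if nobody has
 * already started tearing it down. */
static bool get_pcm(struct pcm *p)
{
	int old = atomic_load(&p->refcount);
	while (old > 0) {
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
			return true;
	}
	return false;
}

static void put_pcm(struct pcm *p)
{
	if (atomic_fetch_sub(&p->refcount, 1) == 1)
		p->valid = false;	/* last ref dropped: reclaim may proceed */
}
```

The point of the compare-and-exchange loop is exactly the rule in the entry above: once reclaim has driven the count to zero, no later `get_pcm` can resurrect the line.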
02/27 Tue¶
Spent two days on the CS527 course project; implemented a small SSHD and SSH client. And we have to inject exactly five bugs, or vulnerabilities, into the systems. Lol, it is really hard to intentionally plant BUGs!
Anyway, back to Lego. Since others are having a hard time compiling programs statically, I will try to add the dynamic loader today.
The interpreter: `/lib64/ld-linux-x86-64.so.2`.

``` hl_lines="11 15 17 18"
Linux seq.c maps (no randomization):
00400000-00401000 r-xp 00000000 fd:00 18752683 /root/ys/LegoOS/usr/a.out
00600000-00601000 r--p 00000000 fd:00 18752683 /root/ys/LegoOS/usr/a.out
00601000-00602000 rw-p 00001000 fd:00 18752683 /root/ys/LegoOS/usr/a.out
00602000-00604000 rw-p 00000000 00:00 0 [heap]
7ffff7a18000-7ffff7bd0000 r-xp 00000000 fd:00 55051990 /usr/lib64/libc-2.17.so
7ffff7bd0000-7ffff7dd0000 ---p 001b8000 fd:00 55051990 /usr/lib64/libc-2.17.so
7ffff7dd0000-7ffff7dd4000 r--p 001b8000 fd:00 55051990 /usr/lib64/libc-2.17.so
7ffff7dd4000-7ffff7dd6000 rw-p 001bc000 fd:00 55051990 /usr/lib64/libc-2.17.so
7ffff7dd6000-7ffff7ddb000 rw-p 00000000 00:00 0
7ffff7ddb000-7ffff7dfc000 r-xp 00000000 fd:00 55051983 /usr/lib64/ld-2.17.so
7ffff7fde000-7ffff7fe1000 rw-p 00000000 00:00 0
7ffff7ff9000-7ffff7ffa000 rw-p 00000000 00:00 0
7ffff7ffa000-7ffff7ffc000 r-xp 00000000 00:00 0 [vdso]
7ffff7ffc000-7ffff7ffd000 r--p 00021000 fd:00 55051983 /usr/lib64/ld-2.17.so
7ffff7ffd000-7ffff7ffe000 rw-p 00022000 fd:00 55051983 /usr/lib64/ld-2.17.so
7ffff7ffe000-7ffff7fff000 rw-p 00000000 00:00 0
7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0 [stack]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
```

``` hl_lines="25"
lego after loading
00400000-00401000 r-xp 00000000 /root/ys/LegoOS/usr/a.out
00600000-00602000 rw-p 00000000 /root/ys/LegoOS/usr/a.out
00602000-00604000 rw-p 00000000 [heap]
7ffff7ddb000-7ffff7dfc000 r-xp 00000000 /lib64/ld-linux-x86-64.so.2
7ffff7ffc000-7ffff7ffe000 rw-p 00021000 /lib64/ld-linux-x86-64.so.2
7ffff7ffe000-7ffff7fff000 rw-p 00000000
7ffffffde000-7ffffffff000 rw-p 00000000 [stack]

[ 2066.379224] **** Finish dump final mm
[ 2066.426023] handle_p2m_execve(): reply_status: OKAY, new_ip: 0x7ffff7ddc170, new_sp: 0x7fffffffede0
[ 2066.628949] handle_p2m_pcache_miss() cpu 4 I nid:0 pid:32 tgid:32 flags:150 vaddr:0x7ffff7ddc170
[ 2066.732034] handle_p2m_pcache_miss() cpu 4 O nid:0 pid:32 tgid:32 flags:150 vaddr:0x7ffff7ddc170
[ 2066.934947] handle_p2m_pcache_miss() cpu 4 I nid:0 pid:32 tgid:32 flags:51 vaddr:0x7fffffffedd8
[ 2067.036978] handle_p2m_pcache_miss() cpu 4 O nid:0 pid:32 tgid:32 flags:51 vaddr:0x7fffffffedd8
[ 2067.238842] handle_p2m_pcache_miss() cpu 4 I nid:0 pid:32 tgid:32 flags:50 vaddr:0x7ffff7ffce00
[ 2067.340880] handle_p2m_pcache_miss() cpu 4 O nid:0 pid:32 tgid:32 flags:50 vaddr:0x7ffff7ffce00
[ 2067.542747] handle_p2m_pcache_miss() cpu 4 I nid:0 pid:32 tgid:32 flags:51 vaddr:0x7ffff7ffd9a8
[ 2067.644774] handle_p2m_pcache_miss() cpu 4 O nid:0 pid:32 tgid:32 flags:51 vaddr:0x7ffff7ffd9a8
[ 2067.846640] handle_p2m_pcache_miss() cpu 4 I nid:0 pid:32 tgid:32 flags:50 vaddr:0x7ffff7ddb8e0
[ 2067.948679] handle_p2m_pcache_miss() cpu 4 O nid:0 pid:32 tgid:32 flags:50 vaddr:0x7ffff7ddb8e0
[ 2068.355424] ------------[ cut here ]------------
[ 2068.408568] WARNING: CPU: 4 PID: 31 at managers/memory/handle_pcache/fault.c:54 handle_p2m_pcache_miss+0x29d/0x380
[ 2068.532327] src_nid:0,pid:32,vaddr:0x7ffff7e0e000
[ 2068.588487] CPU: 4 PID: 31 Comm: mc-manager 4.0.0-lego-ys+ #100
[ 2068.659207] Stack:
```
```
[root@wuklab13: lib64] $ ll ld-*
-rwxr-xr-x 1 root root 164112 Nov 30 13:53 ld-2.17.so
lrwxrwxrwx 1 root root 10 Jan 8 12:34 ld-linux-x86-64.so.2 -> ld-2.17.so
[root@wuklab13: lib64]
```
It turns out there was a bug in the mmap code: I forgot to increment the file refcount when a file-backed vma is created, so a put_file in the loader accidentally freed the ld-linux file. Bug fixed; the dynamic loader works like a charm.
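The fix can be modeled in plain C. This is a hypothetical toy simulation, not LegoOS code (the `struct file`, `struct vma`, and helper names here are invented): once the vma takes its own reference on the file, the loader's later put_file no longer drops the last reference.

```c
#include <assert.h>

/* Hypothetical userland model of the bug: a file-backed vma must take
 * its own reference on the file, otherwise the loader's put_file()
 * drops the last reference and frees the file the vma still maps. */
struct file { int refcount; };

static struct file *get_file(struct file *f) { f->refcount++; return f; }

/* Returns 1 if this put dropped the last reference (file freed). */
static int put_file(struct file *f) { return --f->refcount == 0; }

struct vma { struct file *vm_file; };

/* The fix: a file-backed vma pins the file for its own lifetime. */
static void vma_set_file(struct vma *v, struct file *f)
{
	v->vm_file = get_file(f);
}
```

Without the `get_file()` inside `vma_set_file()`, the loader's own put would free the interpreter while its mappings were still live, which is exactly the crash above.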
02/24 Sat¶
Well. PhDs do not have weekends. Anyway, it is Saturday after all, so I relaxed a little bit. I was looking into the pcache issue, and also added our own in-kernel version of strace.
02/23 Fri¶
Solved FPU BUG¶
`current` is fine. I should not compare the old implementation with the new per-cpu current. I forgot that the kernel stack is switched in `__switch_to_asm`. This means in `__switch_to()`, we are actually using `next_p`'s kernel stack. So there is a small time frame where `current_thread_info()` points to `next_p`, while `current_task` is still `prev_p`. Since interrupts are disabled during context switch, we are good with this mismatch.
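That window can be modeled in a few lines. A hypothetical single-CPU simulation (all names invented for illustration, not the real kernel types): `current` derived from the stack disagrees with the per-cpu `current_task` between the stack switch and the per-cpu write.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the window described above:
 * current_thread_info() is derived from the kernel stack, while
 * current_task is a per-cpu variable written later in __switch_to().
 * Between the stack switch (in __switch_to_asm) and that per-cpu
 * write, the two views of "current" disagree. */
struct task { int pid; };
struct thread_info { struct task *task; };

static struct task *percpu_current_task;   /* this-cpu current_task */
static struct thread_info *cpu_stack_ti;   /* what the stack implies */

static struct task *current_from_stack(void)  { return cpu_stack_ti->task; }
static struct task *current_from_percpu(void) { return percpu_current_task; }
```

The mismatch is harmless only because, as noted above, interrupts are off for the whole window, so nothing can observe the two disagreeing views.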
Ruling out `current`, the only thing left is the `fpu__copy` warning, which happens during `copy_process()`. One weird thing is that this function had been called multiple times before it showed a warning. The system itself uses this function to create a lot of background threads, which are fine. Only when it was triggered by `sys_clone` did we get the warning:
```c hl_lines="3 4 5 31 32 33 34"
[ 3213.055639] CPU: 6 PID: 17 sys_clone+0x0/0x30
[ 3213.056584] new task_struct: ffff88083e4c9838
[ 3213.057530] arch_dup_task_struct cpu6 dst:ffff88083e4c9838 17 word_count-seq src:ffff88083e457838 17 word_count-seq
[ 3213.059536] TRAP do_general_protection in CPU6, error_code: 0 current:ffff88083e457838 17 word_count-seq
[ 3213.061289] fixup_exception pid(17) cpu(6) insn:0xffffffff81009a21(fpu__copy+0x81/0x260) fixup:0xffffffff8105d9b2(__fixup_text_start+0xc2/0x322) handler:ex_handler_default+0x0/0x20
[ 3213.064114] ------------[ cut here ]------------
[ 3213.065040] WARNING: CPU: 6 PID: 17 at ./arch/x86/include/asm/fpu/internal.h:354 fpu__copy+0xc3/0x260
[ 3213.066760] CPU: 6 PID: 17 Comm: word_count-seq 4.0.0-lego+ #6
[ 3213.067855] Stack:
[ 3213.068424] ffff88083e4c7dd0 ffffffff810124b5 ffff88083e4c9bf8 ffff88083e4c9c38
[ 3213.070133] ffff88083e4c9838 00007ffff7ffd700 ffff88083e4c7de0 ffffffff8101258f
[ 3213.071775] ffff88083e4c7e08 ffffffff81009a63 ffff88083e457838 ffff88083e4c9838
[ 3213.073419] ffff88083e457838 ffff88083e4c7e40 ffffffff81000ebb ffff88083e457838
[ 3213.075057] ffff880800000011 ffff88083e457a68 00000000003d0f00 ffff88083e457838
[ 3213.076703] Call Trace:
[ 3213.077295] <TSK>
[ 3213.077828] [<ffffffff810124c1>] __warn.constprop.0+0x91/0xd0
[ 3213.078855] [<ffffffff8101258f>] warn_slowpath_null+0xf/0x20
[ 3213.081653] [<ffffffff81009a63>] fpu__copy+0xc3/0x260
[ 3213.082543] [<ffffffff81000ebb>] arch_dup_task_struct+0x7b/0x90
[ 3213.083667] [<ffffffff8101d32e>] copy_process+0x14e/0x10e0
[ 3213.084618] [<ffffffff8103a3c6>] ? n_tty_write+0x166/0x3c0
[ 3213.085564] [<ffffffff8101e2e6>] do_fork+0x26/0x140
[ 3213.086439] [<ffffffff8101e4a0>] ? sys_vfork+0x40/0x40
[ 3213.087333] [<ffffffff8101e4a0>] ? sys_vfork+0x40/0x40
[ 3213.088232] [<ffffffff8101e4c9>] sys_clone+0x29/0x30
[ 3213.089109] [<ffffffff8100e719>] do_syscall_64+0x69/0xf0
[ 3213.090030] [<ffffffff8100d5ec>] entry_SYSCALL64_slow_path+0x25/0x25
[ 3213.091078] <EOT>
[ 3213.091580] ---[ end trace 0000000000000000 ]---
[ 3213.093250] TRAP do_general_protection in CPU7, error_code: 0 current:ffff88083fd0f008 0 swapper/7
[ 3213.096526] fixup_exception pid(0) cpu(7) insn:0xffffffff81000c62(__switch_to+0x452/0x630) fixup:0xffffffff8105d922(__fixup_text_start+0x32/0x322) handler:ex_handler_default+0x0/0x20
[ 3213.101241] ------------[ cut here ]------------
[ 3213.103285] WARNING: CPU: 7 PID: 0 at ./arch/x86/include/asm/fpu/internal.h:369 __switch_to+0x47e/0x630
```
So, dig into `fpu__copy()` and find out why it fails at this particular point. Glad I have something to dig into.
The instruction that leads to the GP is:
```
ffffffff8100b0f5:	48 0f ae 27          	xsave64 (%rdi)
```

which is generated by:

```c
#define XSTATE_XSAVE(st, lmask, hmask, err)				\
	asm volatile(ALTERNATIVE_2(XSAVE,				\
				   XSAVEOPT, X86_FEATURE_XSAVEOPT,	\
				   XSAVES, X86_FEATURE_XSAVES)		\
		     "\n"						\
		     "xor %[err], %[err]\n"				\
		     "3:\n"						\
		     ".pushsection .fixup,\"ax\"\n"			\
		     "4: movl $-2, %[err]\n"				\
		     "jmp 3b\n"						\
		     ".popsection\n"					\
		     _ASM_EXTABLE(661b, 4b)				\
		     : [err] "=r" (err)					\
		     : "D" (st), "m" (*st), "a" (lmask), "d" (hmask)	\
		     : "memory")

static inline void copy_xregs_to_kernel(struct xregs_state *xstate)
{
	u64 mask = -1;
	u32 lmask = mask;
	u32 hmask = mask >> 32;
	int err;

	XSTATE_XSAVE(xstate, lmask, hmask, err);

	WARN_ON_FPU(err);
}
```
From the SDM on `XSAVE`: "Use of a destination operand not aligned to 64-byte boundary (in either 64-bit or 32-bit modes) results in a general-protection (#GP) exception. In 64-bit mode, the upper 32 bits of RDX and RAX are ignored."
`%rdi` is `struct xregs_state *xstate` in the above code. Thus, check if `xstate` is 64-byte aligned. Of course, it is not:

```
[10894.999997] copy_xregs_to_kernel CPU6 xstate: ffff88083e4c8c38
```
Hehe. Criminal identified. But why? The xstate structure is already marked as `__attribute__((aligned(64)))` in the code. It is the task_struct which is NOT 0x40-aligned. But god why? Because we currently use `kmalloc` to allocate the new task_struct, whose minimum alignment is 8 bytes. Anyway, use `__alloc_pages` instead.
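The root cause is easy to reproduce in userland. A sketch, assuming C11's `aligned_alloc` as a stand-in for the page-based allocation used as the fix (the helper names here are made up): a generic allocator only guarantees a small minimum alignment, while the xsave area needs 64 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Userland illustration of the alignment bug: XSAVE requires its
 * destination to be 64-byte aligned, but a generic allocator only
 * guarantees a small minimum alignment (8 bytes for our kmalloc).
 * aligned_alloc() (C11) plays the role of the page-based allocation
 * used as the fix. */
static int is_aligned(const void *p, size_t a)
{
	return ((uintptr_t)p % a) == 0;
}

static void *alloc_xstate_area(size_t size)
{
	/* aligned_alloc requires size to be a multiple of alignment */
	size_t rounded = (size + 63) & ~(size_t)63;
	return aligned_alloc(64, rounded);
}
```

Note the faulting address above, `ffff88083e4c8c38`, ends in 0x38: not a multiple of 0x40, exactly the condition the SDM says raises #GP.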
Such a deeply hidden bug. It took me almost a month to find.
IB¶
Seen this during boot (at both P and M, although Lego continued running correctly):

```c hl_lines="8"
[54017.712533] *** NodeID Hostname LID QPN
[54017.770776] *** -------------------------------------
[54017.834220] *** 0 wuklab12 13 72
[54017.892462] *** 1 wuklab14 16 72 <---
[54017.955906] *** 2 wuklab16 20 74
[54018.014149] ***
[54074.552844] *** Start establish connection (mynodeid: 1)
[54102.554407] ib_process_mad mad_ifc fails
[54130.960691] *** recvpollcq runs on CPU2
[54131.070918] *** Successfully built QP for node 0 [LID: 13 QPN: 72]
[54131.152936] *** Successfully built QP for node 2 [LID: 20 QPN: 74]
[54161.228245] *** FIT layer ready to go!
[54161.272034] ***
```
Another one:
```c
[ 1966.930409] ***
[ 1966.951210] *** FIT_initial_timeout_s: 30
[ 1967.002168] *** FIT_local_id: 0
[ 1967.052087] ***
[ 1967.072887] *** NodeID Hostname LID QPN
[ 1967.131126] *** -------------------------------------
[ 1967.194567] *** 0 wuklab12 13 72 <---
[ 1967.258005] *** 1 wuklab14 16 72
[ 1967.316244] *** 2 wuklab16 20 74
[ 1967.374484] ***
[ 2032.926448] *** Start establish connection (mynodeid: 0)
[ 2032.996068] Fail to modify qp[6]
[ 2033.032572] Fail to do client_init_ctx
[ 2033.077287] client_establish_conn: ctx (null) fail to init_interface
[ 2033.164646] ibapi_establish_conn: ctx (null) fail to init_interface
[ 2033.250967] ***
[ 2035.620167] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
[ 2035.713763] IP: [<ffffffff8105c589>] client_send_reply_with_rdma_write_with_imm+0x69/0x3b0
[ 2035.812562] PGD 0
[ 2035.836482] Oops: 0002 [#1] SMP PROCESSOR
[ 2035.884321] CPU: 0 PID: 1 Comm: kernel_init 4.0.0-lego-ys+ #253
[ 2035.955041] RIP: 0010:[<ffffffff8105c589>] [<ffffffff8105c589>] client_send_reply_with_rdma_write_with_imm+0x69/0x3b0
...
[ 2037.313267] <TSK>
[ 2037.336146] [<ffffffff8105a377>] ibapi_send_reply_timeout+0x57/0x70
[ 2037.411025] [<ffffffff81033d24>] ? net_send_reply_timeout+0x94/0x132
[ 2037.486944] [<ffffffff81033d24>] net_send_reply_timeout+0x94/0x132
```
pcache¶
Running word_count-pthread with a 100MB dataset, finally got some reasonable bug:

```c
[54211.243181] pcache_evict_line(): pset: ffff88207f86e3c0, for uva: 0x7ffff1b8f000
[54211.385654] pcache:ffff88207f86e3a8 mapcount:8 refcount:0 flags:()
[54211.510447] pcache dumped because: PCACHE_BUG_ON_PCM(!PcacheLocked(pcm))
[54212.080336] BUG: failure at managers/processor/pcache/evict.c:240/pcache_evict_line()!
[54212.664785] Kernel Panic - not syncing: BUG!
[54212.715742] CPU: 8 PID: 81 Comm: word_count-pthr 4.0.0-lego-ys+ #252
...
[54213.391706] <TSK>
[54213.414584] [<ffffffff81024180>] panic+0xc2/0xeb
[54213.524818] [<ffffffff8101b81c>] ? task_tick_rt+0x2c/0xd0
[54213.589295] [<ffffffff81018f75>] ? scheduler_tick+0x55/0x60
[54213.655850] [<ffffffff81016625>] ? tick_handle_periodic+0x45/0x70
[54213.728647] [<ffffffff81006634>] ? apic_timer_interrupt+0x54/0x90
[54213.801443] [<ffffffff8100e22a>] ? smp__apic_timer_interrupt+0x6a/0x70
[54213.879439] [<ffffffff8101256d>] ? printk+0x11d/0x1b0
[54214.103027] [<ffffffff8102ecf4>] pcache_evict_line+0x134/0x220
[54214.172703] [<ffffffff8102c6ae>] pcache_alloc+0x22e/0x2e0
[54214.237179] [<ffffffff8102be0a>] common_do_fill_page+0x2a/0x1f0
[54214.307895] [<ffffffff8102baf0>] ? move_page_tables+0x4c0/0x4c0
[54214.378612] [<ffffffff8102c172>] pcache_handle_fault+0x1a2/0x3a0
[54214.450367] [<ffffffff8100fc02>] do_page_fault+0xa2/0x1a0
[54214.514843] [<ffffffff8100d85f>] page_fault+0x1f/0x30
[54214.575161] [<ffffffff81034842>] ? copy_user_enhanced_fast_string+0x2/0x10
[54214.657316] [<ffffffff81032368>] ? seq_read+0x248/0x360
[54214.719714] [<ffffffff810307af>] sys_read+0x3f/0xc0
[54214.777949] [<ffffffff81030770>] ? sweep_pset_lru+0x220/0x220
[54214.846587] [<ffffffff8100e619>] do_syscall_64+0x69/0xf0
[54214.910022] [<ffffffff8100d4ec>] entry_SYSCALL64_slow_path+0x25/0x25
[54214.985939] <EOT>
```
Another one:
```c
[ 735.393244] pcache_evict_line(): pset: ffff88207f86e3c0, for uva: 0x7ffff1b8fd90
[ 735.537804] pcache:ffff88207f86e3a8 mapcount:8 refcount:0 flags:()
[ 735.663642] pcache dumped because: PCACHE_BUG_ON_PCM(!PcacheLocked(pcm))
```
Do note this happens after the computation, when phoenix creates a lot of threads to sort the results.
Both bugs happen on the same set, the same user page. The pcache is clearly corrupted: mapcount:8, refcount:0, flags:().
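A minimal sketch of the invariant that dump violates, assuming (as the dump suggests) that every mapping of a pcache line holds a reference, so mapcount can never exceed refcount; the struct and helper here are hypothetical, not LegoOS's real ones.

```c
#include <assert.h>

/* Hypothetical model of the corruption check: if every mapping of a
 * pcache line holds a reference, then mapcount > refcount is
 * impossible in a healthy system. mapcount:8 refcount:0 therefore
 * proves the metadata has been corrupted or freed underneath us. */
struct pcm { int mapcount; int refcount; };

static int pcm_is_corrupted(const struct pcm *p)
{
	return p->mapcount > p->refcount;
}
```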
Come back after dinner. Remember to check the alternatives patching, because the XSAVE above should have been patched to XSAVEOPT. Make sure it does not override other memory. Also, check the linker script; do not forget to link any sections.
Several more bug logs are on wuklab13 and wuklab15: `022318-*`. I'm really tired today after fixing the FPU bug. But I'm also pretty confident pcache is something I'm able to debug, even though it is hard in the SMP case.
Anyway, I'm gonna call it a day.
02/22 Thur¶
- context switch fpu, signal compat check, all good.
- make `current` use percpu `current_task`, so all code in Lego is consistent.
- checked `entry_SYSCALL_64` again, which looks good to me.
- The only concern is `rsp_scratch` and `current_top_of_stack`, which are per-cpu variables. If these per-cpu variables are set up wrong, then we are doomed.
- Also check if per-cpu is all cleared up?
- try big syscall lock
- does x86 have to use different kernel stacks? Interrupts use a different stack in Linux; do we have to as well???
- check `current` is correct, compare with the old implementation.
First of all, the FPU is definitely functional for now.
Since I replaced the current macro today, I added some code to check whether this current matches our old implementation:
```
static __always_inline struct task_struct *get_current(void)
{
return this_cpu_read_stable(current_task);
}
//#define current get_current()
#define current							\
({								\
	struct task_struct *old = current_thread_info()->task;	\
	struct task_struct *new = get_current();		\
								\
	if (old != new) {					\
		printk("%s:%d() cpu:%d old:%pS %d %s new:%pS %d %s\n", \
		       __func__, __LINE__, smp_processor_id(),	\
		       old, old->pid, old->comm,		\
		       new, new->pid, new->comm);		\
		BUG();						\
	}							\
	get_current();						\
})
```
Combined with some FPU warnings, it now looks like this:
```c hl_lines="1 5 28 29 30"
[ 3273.748819] CPU:5 PID:32 sys_clone+0x0/0x30
[ 3273.800808] alloc_task_struct_node: size:740 ffff88107e831838
[ 3273.869451] arch_dup_task_struct() CPU5 current:32 new: ffff88107e831838 old: ffff88107e827838 32
[ 3273.975533] ------------[ cut here ]------------
[ 3274.030651] WARNING: CPU: 5 PID: 32 at ./arch/x86/include/asm/fpu/internal.h:354 fpu__copy+0xe2/0x310
[ 3274.140895] CPU: 5 PID: 32 Comm: word_count-pthr 4.0.0-lego-ys-gdbe6dbe-dirty #249
[ 3274.231377] Stack:
[ 3274.255298] ffff88107e82fd68 ffffffff81016dbf 00000000ffffffff 0000000000000000
[ 3274.342659] 00000000ffffffff 0000000000000000 ffff88107e831bf8 ffff88107e831c38
[ 3274.430021] ffff88107e831838 000000207fe64000 ffff88107e82fd78 ffffffff810170af
[ 3274.517382] ffff88107e82fdc0 ffffffff8100b052 0000000000000020 ffff88107e831838
[ 3274.604745] ffff88107e827838 ffff88107e827838 ffff88107e831838 ffff88107e827838
[ 3274.692106] Call Trace:
[ 3274.721229]
[ 3275.871493] ------------[ cut here ]------------
[ 3275.926614] BUG: failure at arch/x86/kernel/traps.c:186/do_general_protection()!
[ 3276.015018] Kernel Panic - not syncing: BUG!
[ 3276.065978] panic:107() cpu:6 old:0xffff88107e831838 33 word_count-pthr new:0xffff88107fcaf008 0 swapper/6
```
Based on the switch code:

```c
__switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
	...
	this_cpu_write(current_task, next_p);
	...
}
```
Based on log line 30, `load_sp0()` already happened, which means `this_cpu_write(...)` happened too. If `this_cpu_write(...)` happened, then log line 30's `new` should have been updated to `0xffff88107e831838`. Something wrong with percpu?
02/21 Wed¶
- irq_regs, old code, check
- signal frame, and fpu hook together. Done
- `in_interrupt()`, it is empty, TODO
- check arch/x86/Makefile, it introduces a lot of FPU flags.
- added more than 4K lines today. Damn FPU. Ugh, go home, sleep.
02/20 Tue Cloudy¶
Not too many sunny days recently. Well, continue yesterday's work. I don't think I can easily find out why so many `/proc/meminfo` opens happened. Instead, I'm trying to enable `flush_thread` in P's exec code.
Along the way, I found some issues related to `__ARCH_HAS_SA_RESTORER` in the signal code. I need to check whether these x86 macros are defined, because Lego does not port them.
Well, it turns out flush_thread does not make much difference. Next I'm going to try disabling `exit_thread`, which uses `fpu__drop()`.
Hmm, disabling `exit_thread` also does not work.
02/19 Mon Rainy¶
It is another week. I cannot deny I'm a little tired of this bug. Tried so many possible solutions, but none of them work. Well, today I first need to test the vma changes (the pgoff and anon_vma thing), especially vma merge and split.
This morning I fixed a bug in the kernel_init process: make kernel_init able to run on all possible CPUs. Because the first user process is forked from kernel_init, it is quite important that it gets the right cpu affinity:

```c
static int kernel_init(void *unused)
{
	...
	set_cpus_allowed_ptr(current, cpu_possible_mask);
	...
}
```
Well, interestingly, the unmodified word_count-pthread succeeded with a 50MB dataset... with or without any DEBUG option! Amazing! I need to find out why cpus_allowed becomes 0 at the beginning of kernel_init, because `init_task` actually has:

```c
	.cpus_allowed	= CPU_MASK_ALL,
	.nr_cpus_allowed= NR_CPUS,
```
Things to do next:
- check why the cpus_allowed changed
- check why word_count-pthread opens `/dev/../cpu` so many times. Anything wrong with our `copy_files`, or open, close?
- here is an idea: to verify whether the FPU code is correct, run some scientific benchmarks.
Okay, findings:
1. cpus_allowed is fine. It is reset inside `sched_init()`, when it makes `init_task` the `idle` thread. Thus it is reasonable to set cpus_allowed again in the `kernel_init` thread. And it should have NOTHING to do with the bug.
2. About the second, check the following log:

```c hl_lines="35 36 61 62 63 64 65 66 67"
[11838.364543] STDOUT: ---[ Wordcount: Running... ]---
[11838.422886] STDOUT: ---[
]---
[11838.463445] SYSC_open(cpu5 pid:32): f_name: /root/ys/phoenix/phoenix-2.0/tests/word_count/word_count_datafiles/word_50MB.txt, flags: 0, mode: 900
[11838.619460] SYSC_open(cpu5 pid:32): fd: 3
[11838.667406] SYSC_open(cpu5 pid:32): f_name: /sys/devices/system/cpu/online, flags: 80000, mode: 0
[11838.773351] SYSC_open(cpu5 pid:32): fd: 4
[11838.821239] seq_file: dest_uva: 00007fffffffc8d0, nr_chars: 5 string: [0-23 ]
[11838.913791] SYSC_close(cpu5 pid:32): fd: 4
[11838.962622] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
[11840.223255] STDOUT: ---[ Word Count: Computation Completed 1.555581 sec
]---
[11840.309678] SYSC_open(cpu5 pid:32): f_name: /sys/devices/system/cpu/online, flags: 80000, mode: 0
[11840.415754] SYSC_open(cpu5 pid:32): fd: 4
[11840.463593] seq_file: dest_uva: 00007fffffffc8a0, nr_chars: 5 string: [0-23 ]
[11840.556147] SYSC_close(cpu5 pid:32): fd: 4
[11840.605024] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
[11840.677821] STDOUT: ---[ THe number of processors is 24
รô
]---
[11840.753769] SYSC_open(cpu7 pid:80): f_name: /proc/meminfo, flags: 80000, mode: 1b6
[11840.844212] SYSC_open(cpu19 pid:92): f_name: /proc/meminfo, flags: 80000, mode: 1b6
[11840.935728] SYSC_open(cpu7 pid:80): fd: 4
[11840.983567] SYSC_open(cpu19 pid:92): fd: 5
[11841.032444] seq_file:
dest_uva: 00007ffff444c000, nr_chars: 172
string: [MemTotal: 115355128 kB
MemFree: 115355128 kB
MemAvailable: 115355128 kB
DirectMap4k: 5812 kB
DirectMap2M: 1861632 kB
DirectMap1G: 134217728 kB
]
[11841.305953] seq_file:
dest_uva: 00007ffff444b000, nr_chars: 172
string: [MemTotal: 115355128 kB
MemFree: 115355128 kB
MemAvailable: 115355128 kB
DirectMap4k: 5812 kB
DirectMap2M: 1861632 kB
DirectMap1G: 134217728 kB
]
[11841.579460] SYSC_close(cpu7 pid:80): fd: 4
[11841.628339] SYSC_close(cpu19 pid:92): fd: 5
[11841.678257] SYSC_close(): [4] -> [/proc/meminfo]
[11841.733375] SYSC_close(): [5] -> [/proc/meminfo]
[11841.788493] SYSC_open(cpu18 pid:91): f_name: /proc/meminfo, flags: 80000, mode: 1b6
[11841.880008] SYSC_open(cpu6 pid:102): f_name: /proc/meminfo, flags: 80000, mode: 1b6
[11841.971523] SYSC_open(cpu12 pid:85): f_name: /proc/meminfo, flags: 80000, mode: 1b6
[11842.063040] SYSC_open(cpu0 pid:97): f_name: /proc/meminfo, flags: 80000, mode: 1b6
[11842.153516] SYSC_open(cpu14 pid:87): f_name: /proc/meminfo, flags: 80000, mode: 1b6
[11842.245032] SYSC_open(cpu16 pid:89): f_name: /proc/meminfo, flags: 80000, mode: 1b6
[11842.336548] SYSC_open(cpu4 pid:100): f_name: /proc/meminfo, flags: 80000, mode: 1b6
[11842.428064] SYSC_open(cpu16 pid:89): fd: 9
[11842.476942] SYSC_open(cpu4 pid:100): fd: 10
[11842.526860] seq_file:
dest_uva: 00007ffff444c000, nr_chars: 172
string: [MemTotal: 115355128 kB
MemFree: 115355128 kB
MemAvailable: 115355128 kB
DirectMap4k: 5812 kB
DirectMap2M: 1861632 kB
DirectMap1G: 134217728 kB
]
[11842.800368] seq_file:
dest_uva: 00007ffff444b000, nr_chars: 172
string: [MemTotal: 115355128 kB
MemFree: 115355128 kB
MemAvailable: 115355128 kB
DirectMap4k: 5812 kB
DirectMap2M: 1861632 kB
DirectMap1G: 134217728 kB
]
[11843.073877] SYSC_close(cpu16 pid:89): fd: 9
```
However, in a normal Linux execution:
```c hl_lines="11 14 15"
strace -C -o strace_2 ./word_count-pthread ./word_count_datafiles/word_50MB.txt

% time     seconds  usecs/call     calls    errors syscall
 86.41    0.052074        1736        30           futex
  6.89    0.004151          67        62           munmap
  2.47    0.001490          17        88           mmap
  2.12    0.001278          14        93           clone
  1.51    0.000912          14        64           mprotect
  0.19    0.000117           7        16           write
  0.15    0.000092          46         2           open

$ cat strace_2 | grep open
open("./word_count_datafiles/word_50MB.txt", O_RDONLY) = 3
open("/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 4
```
3. It opened `/proc/meminfo` way too many times. In a normal Linux execution, this does not happen. Is it because our meminfo is faked, so glibc is complaining? But then why does it not open meminfo while running in Linux? Or did our entry assembly mess up some stuff on the stack, so the return path changed?

4. Oh, about the FPU. It reminds me that our `flush_thread` function actually had an issue before: when I enabled this function during loading in P, P would crash. Within `flush_thread`, there is an `fpu_clear`!!! So, check this tomorrow! (12:00am, need to go home)
02/18 Sun Sunny¶
It is a nice day. Yesterday I changed one line of code in the mmap code path: changed the anonymous vma's pgoff from some value to 0. The result is that I got several successful word_count-pthread (bound to one core) tests. However, it still fails with the unmodified word_count-pthread.
This brought me to inspect the pgoff manipulation code and all of mmap.c. We ported everything from Linux with almost zero modification. That means we ported all that useless `anon_vma` and pgoff code, which is used a lot by the vma_merge and vma_split code. The thing is: our memory manager and our vma code do not need the `anon_vma` structure, and do not maintain pgoff. Thus, I'm a little worried the Linux code may be doing something crazy behind our back: messing up vmas and pages, so a pcache miss gets some wrong pages.
Well. Lego does not use `anon_vma`, and pgoff should only be used by file-backed vmas. So I decided to remove `anon_vma` from our code and make sure pgoff is used properly. Of course, the goal is to make vma merge, split, and copy do the things we intended.
Lesson learned.
02/17 Sat Snowy¶
Fixed the bss bug. It came from the loader: we did not implement the `lego_clear_user` function, so some parts of .bss were non-zero.
Bad news: word_count-pthread still fails at the same fpu instruction. Have to look into the memory code more.
This is actually a fun debugging story. We should always add TODO or XXX or some warnings to unfinished code, no matter what. Lesson learned.
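What the missing `lego_clear_user` must do can be sketched in userland. This is a hypothetical stand-in (plain `memset` in place of a real copy-to-user loop, and the segment-loading helper is invented), showing why skipping the zero-fill leaves stale bytes in .bss.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch of what the missing lego_clear_user() must do:
 * after copying the file-backed part of a segment, the loader has to
 * zero the tail up to p_memsz, which is where .bss lives. Without
 * it, .bss inherits whatever garbage was already in the page. */
static unsigned long lego_clear_user(void *dst, unsigned long n)
{
	memset(dst, 0, n);	/* real version writes to user space */
	return 0;		/* 0 == no bytes left uncleared */
}

/* Load one segment: filesz bytes come from the image, the rest of
 * memsz must be zero-filled. */
static void load_segment(char *mem, const char *image,
			 unsigned long filesz, unsigned long memsz)
{
	memcpy(mem, image, filesz);
	if (memsz > filesz)
		lego_clear_user(mem + filesz, memsz - filesz);
}
```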
02/16 Fri Cloudy¶
Yilun found a major loader bug yesterday: the `.bss` section variables are not 0 in the `iozone` benchmark. I did not encounter this issue before with simple test programs. This is pretty serious.
02/15 Thur Rainy¶
Today is Chinese New Year.
Lines 7 and 8 show the uvas belong to the same page. Need to revisit the `get_arg_pages` etc. functions.
``` hl_lines="7 8"
[ 108.393991] handle_p2m_execve(): pid:22,argc:2,envc:2,file:/root/ys/phoenix/phoenix-2.0/tests/word_count/word_count-pthread
[ 108.395255] argc[0] (len: 65): /root/ys/phoenix/phoenix-2.0/tests/word_count/word_count-pthread
[ 108.396329] argc[1] (len: 82): /root/ys/phoenix/phoenix-2.0/tests/word_count/word_count_datafiles/word_100MB.txt
[ 108.397530] envc[0] (len: 7): HOME=/
[ 108.398069] envc[1] (len: 11): TERM=linux
[ 108.398640] __bprm_mm_init vma: ffff88083effe6b8
[ 108.399226] faultin_page vma: ffff88083effe6b8 uva: 0x7fffffffefed
[ 108.399949] faultin_page vma: ffff88083effe6b8 uva: 0x7fffffffef94
```
Well, this is 100% fine. I wrote this loader code a long time ago and needed some time to pick it up again. After reading the loader code, especially the `copy_strings` function, I found this is okay: copy_strings will be invoked three times, so `faultin_page` will be invoked at least three times. That is why it went into that pte fault-handling code.
Although I actually think `copy_strings` should not use `faultin_page`; instead, it should use `get_user_pages`, which walks the page table first and then goes to `handle_lego_mm_fault`.
02/14 Wed Rainy¶
Hmm, tried to make kmalloc behave as kzalloc, and bound all threads to one core; still got the same old bug:

``` hl_lines="1"
  42731a:	f3 0f 6f 16          	movdqu (%rsi),%xmm2

[93182.657376] word_count-pthr[85] general protection ip:42731a sp:7fffe3ffed28 error:0
[93182.747959] CPU: 8 PID: 85 Comm: word_count-pthr 4.0.0-lego+ #170
[93182.820758] RIP: 0033:[<000000000042731a>] [<000000000042731a>] 0x42731a
[93182.901878] RSP: 002b:00007fffe3ffed28 EFLAGS: 00010283
[93182.965317] RAX: 000000000000001f RBX: 00007ffff001b010 RCX: 0000000000000005
[93183.050596] RDX: 0000000000000000 RSI: 5345485355420045 RDI: 00007ffff294791f
[93183.135876] RBP: 00007ffff294791f R08: 000000000000ffff R09: 0000000000000008
[93183.221156] R10: fffffffffffff048 R11: 00000000004acfc0 R12: 0000000000001cde
[93183.306435] R13: 00000000006e4a8c R14: 0000000000001cd7 R15: 0000000000001cda
[93183.391716] FS: 00007fffe3fff700(0000) GS:ffff88107fc80000(0000) knlGS:0000000000000000
[93183.488434] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[93183.557075] CR2: 00007ffff27a4000 CR3: 000000107e924000 CR4: 00000000000406a0
```

``` hl_lines="1"
  427377:	66 0f 6f 17          	movdqa (%rdi),%xmm2

[93180.527248] word_count-pthr[93]: segfault at 0x0 ip 0000000000427377 sp 00007fffdfff6d28 error 4
[93180.630314] CPU: 8 PID: 93 Comm: word_count-pthr 4.0.0-lego+ #170
[93180.703114] RIP: 0033:[<0000000000427377>] [<0000000000427377>] 0x427377
[93180.784234] RSP: 002b:00007fffdfff6d28 EFLAGS: 00010297
[93180.847674] RAX: 0000000000000000 RBX: 000000000073c4c0 RCX: 000000000000000d
[93180.932953] RDX: 000000000000ffff RSI: 00007ffff4999070 RDI: 0000000000000000
[93181.018233] RBP: 00007ffff499907d R08: 000000000000ffff R09: 0000000000000000
[93181.103513] R10: 0000000000427760 R11: 00007ffff49982c0 R12: 0000000000000118
[93181.188791] R13: 00000000006e4aac R14: 0000000000000116 R15: 0000000000000117
[93181.274072] FS: 00007fffdfff7700(0000) GS:ffff88107fc80000(0000) knlGS:0000000000000000
[93181.370790] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[93181.439430] CR2: 0000000000000000 CR3: 000000107e924000 CR4: 00000000000406a0
```
Tried several ways to ensure memory safety. It still failed even with all of them enabled. So, is memory safety actually ensured and something else is wrong?
- force `alloc_pages` to use `__GFP_ZERO`
- make `kmalloc` behave as `kzalloc`
- make `kfree` empty
I also suspect `munmap` may free extra, wrong pgtable entries. Although I've gone through all the code and checked, in addition to the above things, I'm going to:
- make munmap dummy (no p2m_munmap, return 0 directly)
Failed.
Next, I’m going to:
- add a checksum for every page transferred across the network.
- add warnings for abnormal cases
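The per-page checksum idea can be sketched quickly. This is a minimal sketch, assuming an Adler-32-style sum is enough to catch corruption in transit; the function name is invented and the real choice of checksum is open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-page checksum: compute it on the sender, send it
 * alongside the page, and recompute on the receiver. Any mismatch
 * means the page was corrupted somewhere on the network path. */
static uint32_t page_csum(const uint8_t *page, size_t len)
{
	uint32_t a = 1, b = 0;	/* Adler-32-style running sums */
	for (size_t i = 0; i < len; i++) {
		a = (a + page[i]) % 65521;
		b = (b + a) % 65521;
	}
	return (b << 16) | a;
}
```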
Bang! I found something while running P+M:
```c hl_lines="9 10 11 12 13"
[ 115.727597] Memory-component manager is up and running.
[ 116.691723] handle_p2m_fork(): nid:0,pid:22,tgid:22,parent_tgid:1
[ 116.697038] handle_p2m_fork(): reply: 0:OKAY
[ 116.791088] handle_p2m_execve(): pid:22,argc:2,envc:2,file:/root/ys/phoenix/phoenix-2.0/tests/word_count/word_count-pthread
[ 116.792357] argc[0] (len: 65): /root/ys/phoenix/phoenix-2.0/tests/word_count/word_count-pthread
[ 116.793439] argc[1] (len: 82): /root/ys/phoenix/phoenix-2.0/tests/word_count/word_count_datafiles/word_100MB.txt
[ 116.794653] envc[0] (len: 7): HOME=/
[ 116.795196] envc[1] (len: 11): TERM=linux
[ 116.795772] __bprm_mm_init vma: ffff88083effe6b8
[ 116.796209] faultin_page vma: ffff88083effe6b8
[ 116.796729] faultin_page vma: ffff88083effe6b8
[ 116.797150] handle_pte_fault vma: ffff88083effe6b8 entry: 0xffff88083e8c1067
[ 116.798044] pte:ffff88083e8c0ff0 pfn:0x8083e8c1 flags:(present|writable|user|accessed|dirty|softw4|pkey0|pkey1|pkey2|pkey3|nx|0x3ff800000000000)
[ 116.799462] ------------[ cut here ]------------
[ 116.800049] WARNING: CPU: 4 PID: 15 at managers/memory/vm/fault.c:148 handle_lego_mm_fault+0x4d8/0x550
[ 116.801148] CPU: 4 PID: 15 Comm: mc-manager 4.0.0-lego+ #78
[ 116.801818] Stack:
[ 116.802179] ffff88083e893c50 ffffffff8100e827 00007fffffffef94 ffff88083effe6b8
[ 116.803283] ffff88083e894008 ffff88083e8c1067 ffff88083e893c60 ffffffff8100e91f
[ 116.804387] ffff88083e893cf0 ffffffff8102b008 0000000000000031 ffff88083e893cf0
[ 116.805488] 0000000000000296 00003fffffe00000 ffff800000000067 ffff88083e893d50
[ 116.806590] ffff880000000001 ffffffff81066798 ffff88083effe6b8 ffff88083e893d50
[ 116.807691] Call Trace:
[ 116.808087] <TSK>
[ 116.808448] [<ffffffff8100e836>] __warn.constprop.0+0xa6/0x100
[ 116.809126] [<ffffffff8100e91f>] warn_slowpath_null+0xf/0x20
[ 116.809802] [<ffffffff8102b008>] handle_lego_mm_fault+0x4d8/0x550
[ 116.810505] [<ffffffff8102cfe3>] faultin_page+0x43/0xb0
[ 116.811131] [<ffffffff8102dab1>] copy_strings.isra.1+0xe1/0x130
[ 116.811819] [<ffffffff8102dd1e>] exec_loader+0x21e/0x350
[ 116.812457] [<ffffffff8102680a>] handle_p2m_execve+0x1aa/0x290
```
This is a temporary stack vma that the loader created for saving argv and envp. So, this vma was created here:
```c
static int __bprm_mm_init(struct lego_binprm *bprm)
{
	...
	bprm->vma = vma = kzalloc(sizeof(*vma), GFP_KERNEL);
	...
}
```
Then `copy_strings` will call `faultin_page` to populate a page for a specific user virtual address:
```c
int faultin_page(struct vm_area_struct *vma, unsigned long start,
		 unsigned long flags, unsigned long *kvaddr)
{
	...
	ret = handle_lego_mm_fault(vma, start, flags, kvaddr);
	...
}
```
Eventually, `handle_lego_mm_fault` will call `handle_pte_fault`:
```c
static int handle_pte_fault(struct vm_area_struct *vma, unsigned long address,
			    unsigned int flags, pte_t *pte, pmd_t *pmd,
			    unsigned long *mapping_flags)
{
	...
	if (!pte_present(entry)) {
		...
	}
	...
}
```
Apparently, the pte is wrong! But I don't have time today; continue tomorrow. Hmm, I forgot that we are saving kernel virtual addresses in the pte. A quick look at the lego_pud_alloc things suggests they will have some issues. I definitely need to check all this stuff tomorrow. I haven't touched this part for too long!
02/13 Tue Sunny¶
Checking our SLOB allocator today. I found Yutong's code is using `set_page_private` when SLOB gets a new page from buddy. This private field is only intended to be used by buddy to record the order. This mixed usage will confuse buddy and create bugs.
Even though I removed the `set_page_private(page, 0)` after `free_page`, word_count-pthread still fails. Damn.
02/12 Mon Cloudy¶
Added this commit 4cb3a8b6a943c90714fd9bb5e5465ee315f0aa30:
```
memory: Use kzalloc instead of kmalloc in __bprm_mm_init (loader)
```
The story is:
I patched the stop_machine code today and tried to run the code with P+M on a VM; everything worked fine. However, when I tried to run the new code with P+M+S on a physical machine, M crashed at a very weird point:
``` hl_lines="7 12 13 14 15 16"
[ 7791.998168] handle_p2m_execve(): pid:81,argc:2,envc:2,file:/root/ys/phoenix/phoenix-2.0/tests/word_count/word_count-pthread
[ 7792.129312] BUG: unable to handle kernel NULL pointer dereference at 0000000000000031
[ 7792.222889] IP: [<ffffffff8102c180>] handle_lego_mm_fault+0x160/0x4b0
[ 7792.299842] PGD 0
[ 7792.323760] Oops: 0000 [#1] PREEMPT SMP MEMORY
[ 7792.376794] CPU: 4 PID: 79 Comm: mc-manager 4.0.0-lego+ #29
[ 7792.443349] RIP: .. [<ffffffff8102c180>] handle_lego_mm_fault+0x160/0x4b0
......
....
[ 7793.750506] Call Trace:
[ 7793.779623]  <TSK>
[ 7793.802501]  [<ffffffff810053f4>] ? apic_timer_interrupt+0x54/0x90
[ 7793.875295]  [<ffffffff8102e469>] faultin_page+0x9/0x70
[ 7793.936649]  [<ffffffff8102ef01>] copy_strings.isra.1+0xe1/0x130
[ 7794.007362]  [<ffffffff8102f11e>] exec_loader+0x1ce/0x340
[ 7794.070796]  [<ffffffff81027def>] handle_p2m_execve+0x12f/0x200
[ 7794.140469]  [<ffffffff810274fb>] mc_manager+0x1ab/0x2b0
[ 7794.202864]  [<ffffffff81027350>] ? bitmap_fill+0x33/0x33
[ 7794.266298]  [<ffffffff8101c6b7>] kthread+0x107/0x130
[ 7794.325572]  [<ffffffff8101c5b0>] ? __kthread_parkme+0x90/0x90
[ 7794.394205]  [<ffffffff8100b462>] ret_from_fork+0x22/0x30
```
So the faulting source code is:
```c hl_lines="5 6"
static int handle_pte_fault(struct vm_area_struct *vma, unsigned long address,
			    unsigned int flags, pte_t *pte, pmd_t *pmd)
{
	...
	if (vma->vm_ops && vma->vm_ops->fault)
		return do_linear_fault(vma, address, flags,
				       pte, pmd, entry);
	...
}
```
Something wrong with `vma`? At this loader stage, this vma is a temporary stack vma created for saving `argv` and `envp`. So I looked back into the code that created this vma:
```c hl_lines="8"
managers/memory/loader/core.c:
static int __bprm_mm_init(struct lego_binprm *bprm)
{
	int err;
	struct vm_area_struct *vma = NULL;
	struct lego_mm_struct *mm = bprm->mm;

	bprm->vma = vma = kmalloc(sizeof(*vma), GFP_KERNEL);
	...
}
```
The code after this does NOT do the necessary cleanup. The `vm_ops` field happens to have some garbage value from its last user, so it is non-NULL, the `vma->vm_ops` check above passes, and the kernel tries to read `vma->vm_ops->fault`. And that, my friend, is where garbage turns into a crash.
This suggests a whole series of potential bugs. Ugh, memory safety!
02/09 Fri Cloudy¶
Tried to modify the Phoenix code: replace `realloc` with `malloc`+`memcpy`. The `mremap` syscall is thus avoided, but it still hits a general protection fault. Same as yesterday: corrupted at `__strcmp_sse42`, with corrupted `RSI` or `RDI`. So I guess it is not about `mremap` itself at all. I will follow yesterday's checking list.
02/08 Thur Cloudy¶
``` hl_lines="12 37"
00000000004272d0 <__strcmp_sse42>:
4272d0: 89 f1 mov %esi,%ecx
4272d2: 89 f8 mov %edi,%eax
4272d4: 48 83 e1 3f and $0x3f,%rcx
4272d8: 48 83 e0 3f and $0x3f,%rax
4272dc: 83 f9 30 cmp $0x30,%ecx
4272df: 77 3f ja 427320 <__strcmp_sse42+0x50>
4272e1: 83 f8 30 cmp $0x30,%eax
4272e4: 77 3a ja 427320 <__strcmp_sse42+0x50>
4272e6: f3 0f 6f 0f movdqu (%rdi),%xmm1
* 4272ea: f3 0f 6f 16 movdqu (%rsi),%xmm2
4272ee: 66 0f ef c0 pxor %xmm0,%xmm0
4272f2: 66 0f 74 c1 pcmpeqb %xmm1,%xmm0
4272f6: 66 0f 74 ca pcmpeqb %xmm2,%xmm1
4272fa: 66 0f f8 c8 psubb %xmm0,%xmm1
4272fe: 66 0f d7 d1 pmovmskb %xmm1,%edx
427302: 81 ea ff ff 00 00 sub $0xffff,%edx
427308: 0f 85 42 0d 00 00 jne 428050 <__strcmp_sse42+0xd80>
42730e: 48 83 c6 10 add $0x10,%rsi
427312: 48 83 c7 10 add $0x10,%rdi
427316: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
42731d: 00 00 00
427320: 48 83 e6 f0 and $0xfffffffffffffff0,%rsi
427324: 48 83 e7 f0 and $0xfffffffffffffff0,%rdi
427328: ba ff ff 00 00 mov $0xffff,%edx
42732d: 45 31 c0 xor %r8d,%r8d
427330: 83 e1 0f and $0xf,%ecx
427333: 83 e0 0f and $0xf,%eax
427336: 66 0f ef c0 pxor %xmm0,%xmm0
42733a: 39 c1 cmp %eax,%ecx
42733c: 74 32 je 427370 <__strcmp_sse42+0xa0>
42733e: 77 07 ja 427347 <__strcmp_sse42+0x77>
427340: 41 89 d0 mov %edx,%r8d
427343: 91 xchg %eax,%ecx
427344: 48 87 f7 xchg %rsi,%rdi
* 427347: 66 0f 6f 17 movdqa (%rdi),%xmm2
(RDI: 0000000000000000)
```
Frustrating! What is wrong with the multithreaded program? Broken FPU-switch code? Inappropriate TLB flush? IB corrupting memory? What? Ugh?
I'm done with random guessing at these frustrating general protection faults and segfaults. I need to first make sure the underlying kernel is 100% correct. This is a checklist:
- fpu save/restore
- always fail at some XMM instruction
- always with corrupted RDI or RSI
- switch_to_asm
- %gs and %fs
- switch_mm (pgd)
- stack frame
- set_arch_tls (%fs)
- glibc’s way of using per thread data
- some cpu may miss tlb flush
- kernel entry/exit assembly
- current_task macro
- stack_stratch
- per-cpu data in entry.S
- futex
- clear_tid
- set_tid
- shared mm
- robust list
- interrupts
- vector array
- APIC setup
- IO-APIC
- timer interrupt
- cpu_init and Trampoline
- faked kernel version
- P side pgfault handling code (SMP)
- and M side pgfault handling (SMP)
- mremap, munmap
- check pgtable boundary
- In all, check SMP implications
Is there any code that is solely used to test whether the underlying kernel has the appropriate behaviors? Like glibc test code?
How to protect kernel virtual memory? Any existing solutions in Linux?
What are the implications of multiple CPUs entering the kernel at the same time? How can it corrupt user pages? Maybe: kernel entry code, per-cpu data in entry code, FPU code, switch_to, scheduler.
Why does it always fail at FPU code, i.e., the strcmp function? I failed to compile without SSE; any solution? And how would disabling it hurt performance?
02/07 Wed Cloudy¶
20:07
Pushed a small patch on the mremap issue. Hope it will work. mremap really makes the whole thing very interesting; it will be a very good research finding on combining virtual caches and the operating system. Need to go to the gym with a friend; will be back on debugging late tonight.
9:30
Have two meetings today, and a security class, so I won't have much time for coding during the daytime.
02/06 Tue Sunny¶
Well. We've ruled out both `smp_call_function` and `workqueue` yesterday with Yiying's help. But the multithreaded word-count still fails :-(
Single-threaded word-count just finished a 4GB dataset (with an 8GB pcache). So what could still be wrong with the multithreaded one?
- chill
- check exit code (checked)
- check pcache's usage of task_struct; it should always use the group_leader
- check cpu boot code and check the switch code again
- I believe pinpointing the issue in multithreaded word-count can solve a lot of issues; it must be some thread creation, removal, or scheduling thing.
- How about adding a lock for ibapi to make it sequential? Sweet, I tried, and finally it is a bug that we are able to debug.
22:39
Done for today. I'm trying to patch `move_pte` and `pcache_move_pte`. Although in theory we definitely need to patch them, I keep thinking the code before should not trigger any serious bugs or memory corruption. Ugh. Maybe it is concurrent `mremap`: one thread remaps from A to B, while another remaps from C to A. It is possible. But my dead brain cannot think about this anymore. I'm going to hit the gym and do some squats.
17:01
Criminal found: `mremap()` and virtual cache did the crime. Interesting: I have not seen any research paper, tech report, write-up, or code about this, not even in the OVC paper, which, by the way, I think must have considered this case. Otherwise a mremap would simply crash its virtual cache. Many thanks go to my smoke-and-think time.
15:14
Something new came up! After adding a spinlock for ibapi, this showed up (I tried one more time after this, and it did not show up again). We are lucky to catch this. At least I know where to look. Also, this is definitely triggered by `mremap`; it seems to be overlapping `mremap()` calls. One thing I do not know is which thread triggered this bug. The sweep thread? Because mremap-related pcache rmap functions do not use `rmap_get_locked_pte`.
```
[ 3826.048774] normal_p2s_open(): f_name: word_100MB.txt, mode: 04400, flags: 0
[ 3827.891622] SYSC_mremap(cpu18): move: [0x7fffe5788000 - 0x7fffe5806000] -> [0x7fffe531b000 - 0x7fffe5399000]
[ 3828.178643] SYSC_mremap(cpu14): move: [0x7fffe5941000 - 0x7fffe5980000] -> [0x7fffe57c7000 - 0x7fffe5806000]

**** ERROR: mismatched PTE and rmap ****
rmap->owner_process: word_count-pthr
uva: 0x7fffe57c8000
ptep: ffff88107efe0e40, rmap->page_table: ffff88107efe0e40
**** pcache_pfn: 0x1257c8, pte_pfn: 0x125942
```
14:00
- `word_count-pthread`: 100MB dataset
- `pcache`: 8GB, 8-way
- `victim`: 8 entries
```
[ 1294.845313] STDOUT: ---[
Wordcount: Running...
]---
[ 1294.903661] STDOUT: ---[
o;
]---
[ 1294.946301] normal_p2s_open(): f_name: /root/ys/phoenix/phoenix-2.0/tests/word_count/word_count_datafiles/word_100MB.txt, mode: 04400, flags: 0
[ 1295.100517] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
[ 1295.594658] word_count-pthr[59] general protection ip:4272ea sp:7ffff1b8ed28 error:0
[ 1295.685236] CPU: 10 PID: 59 Comm: word_count-pthr 4.0.0-lego+ #113
[ 1295.759070] RIP: 0033:[<00000000004272ea>] [<00000000004272ea>] 0x4272ea
[ 1295.840184] RSP: 002b:00007ffff1b8ed28 EFLAGS: 00010283
[ 1295.903621] RAX: 000000000000000f RBX: 00007fffe5a3d010 RCX: 0000000000000001
[ 1295.988893] RDX: 0000000000000000 RSI: 4854005942004441 RDI: 00007ffff1c1e80f
[ 1296.074166] RBP: 00007ffff1c1e80f R08: 0000000000000000 R09: 0000000000000010
[ 1296.211435] R10: 0000000000427ce0 R11: 00007ffff1bbb3ba R12: 0000000000001de4
[ 1296.296711] R13: 00000000006e4a80 R14: 0000000000001d9e R15: 0000000000001dc1
[ 1296.433978] FS: 00007ffff1b8f700(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
[ 1296.582686] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1296.963297] CR2: 00007ffff1c1e000 CR3: 000000207fd8a000 CR4: 00000000000406a0
```
So what is this `ip:4272ea`? Objdump the binary: it is the starred `movdqu (%rsi),%xmm2` line in the `__strcmp_sse42` listing under 02/08 above.
You can see `%rsi` has some garbage value: `RSI: 4854005942004441`. Something went wrong. Could it be our FPU? I'm not quite sure. If the FPU code has an error, why does the single-threaded one succeed? Why does it only show up in multithreaded programs?
02/05 Mon Sunny¶
From yesterday's testing of Phoenix, it looks like something is wrong in `smp_call_function()`. These are invoked through TLB flushes, which were in turn invoked by `mremap` or `munmap`. The warning from smp is:
```
[ 1260.586696] WARNING: CPU: 0 PID: 73 at kernel/smp.c:129 generic_smp_call_function_single_interrupt+0xb8/0x160
[ 1260.705251] CPU: 0 PID: 73 Comm: word_count-pthr 4.0.0-lego+ #99
[ 1260.777008] Stack:
[ 1260.800927]  ffff88207fdffef8 ffffffff8100ec67 ffff88107fc00000 ffff88107fc00000
[ 1260.888283]  ffffffff8100d410 ffff88207fe23df0 ffff88207fdfff08 ffffffff8100ed5f
[ 1260.975639]  ffff88207fdfff38 ffffffff8100fe68 00007fffe58c3010 0000000000000f96
[ 1261.062995]  000000000000f960 0000000000000f95 ffff88207fdfff48 ffffffff810020dd
[ 1261.150351]  00007ffff58869c1 ffffffff8100b2e9 0000000000000f96 0000000000000f95
[ 1261.237707] Call Trace:
[ 1261.266825]  <TSK>
[ 1261.289704]  [<ffffffff8100ec76>] __warn.constprop.0+0xa6/0x100
[ 1261.359381]  [<ffffffff8100d410>] ? pgd_free+0x90/0x90
[ 1261.419699]  [<ffffffff8100ed5f>] warn_slowpath_null+0xf/0x20
[ 1261.487295]  [<ffffffff8100fe68>] generic_smp_call_function_single_interrupt+0xb8/0x160
[ 1261.581931]  [<ffffffff810020dd>] call_function_interrupt+0x1d/0x20
[ 1261.655767]  [<ffffffff8100b2e9>] smp__call_function_interrupt+0x69/0x70
```
So I decided to look into smp.c a little bit to find out if there is something wrong (I wrote it a long time ago). The warning itself is real; it means some inconsistent behavior. I saw `alloc_percpu` stuff during `call_function_init`, hence I probably also need to check the percpu code a little because I'm not sure I ported all the functionality.
In all, today's task: check the `percpu` and `smp_call_function` code. Especially the `percpu` code; it is crucial, and it is very hard to trace real bugs back to it.
Well… things changed. I found a more serious bug: something about `cpuhotplug`, even though Lego is not using it. cpuhotplug is a set of implicit callbacks into all the different subsystems that want to do some initialization work on each offline→online cpu.
Let us dig into how secondary cpu boots:
```c
Trampoline.. setup 64bit mode
start_secondary()
    smp_callin()
        notify_cpu_starting()
            ...
            while (st->state < target) {
                st->state++;
                cpuhp_invoke_callback(cpu, st->state, true, NULL);
            }
```
See? There will be some callbacks! What are those callbacks, exactly? Well, they are predefined in kernel/cpu.c. To save the trouble of reading code, I just printed which functions get executed; the log is:
```
[ 0.118235] cpuhp_invoke_callback(): 136 CPU:0 page_writeback_cpu_online+0x0/0x20
[ 0.368478] cpuhp_invoke_callback(): 136 CPU:1 smpboot_create_threads+0x0/0x90
[ 0.370196] cpuhp_invoke_callback(): 136 CPU:1 perf_event_init_cpu+0x0/0xa0
[ 0.370403] cpuhp_invoke_callback(): 136 CPU:1 workqueue_prepare_cpu+0x0/0x80
[ 0.371112] cpuhp_invoke_callback(): 136 CPU:1 hrtimers_prepare_cpu+0x0/0x60
[ 0.371339] cpuhp_invoke_callback(): 136 CPU:1 smpcfd_prepare_cpu+0x0/0x80
[ 0.371584] cpuhp_invoke_callback(): 136 CPU:1 relay_prepare_cpu+0x0/0xe0
[ 0.371794] cpuhp_invoke_callback(): 136 CPU:1 rcutree_prepare_cpu+0x0/0x170
[ 0.372333] cpuhp_invoke_callback(): 136 CPU:1 notify_prepare+0x0/0xa0
[ 0.372744] cpuhp_invoke_callback(): 136 CPU:1 bringup_cpu+0x0/0x100
[ 0.008000] cpuhp_invoke_callback(): 136 CPU:1 sched_cpu_starting+0x0/0x60
[ 0.926124] cpuhp_invoke_callback(): 136 CPU:1 smpboot_unpark_threads+0x0/0x90
[ 0.926124] cpuhp_invoke_callback(): 136 CPU:1 perf_event_init_cpu+0x0/0xa0
[ 0.927028] cpuhp_invoke_callback(): 136 CPU:1 workqueue_online_cpu+0x0/0x2a0
[ 0.927768] cpuhp_invoke_callback(): 136 CPU:1 rcutree_online_cpu+0x0/0x70
[ 0.928045] cpuhp_invoke_callback(): 136 CPU:1 notify_online+0x0/0x20
[ 0.928256] cpuhp_invoke_callback(): 136 CPU:1 page_writeback_cpu_online+0x0/0x20
[ 0.928527] cpuhp_invoke_callback(): 136 CPU:1 sched_cpu_activate+0x0/0x190

[ 0.929084] cpuhp_invoke_callback(): 136 CPU:2 smpboot_create_threads+0x0/0x90
[ 0.930240] cpuhp_invoke_callback(): 136 CPU:2 perf_event_init_cpu+0x0/0xa0
[ 0.930434] cpuhp_invoke_callback(): 136 CPU:2 workqueue_prepare_cpu+0x0/0x80
[ 0.931070] cpuhp_invoke_callback(): 136 CPU:2 hrtimers_prepare_cpu+0x0/0x60
[ 0.931264] cpuhp_invoke_callback(): 136 CPU:2 smpcfd_prepare_cpu+0x0/0x80
[ 0.931464] cpuhp_invoke_callback(): 136 CPU:2 relay_prepare_cpu+0x0/0xe0
[ 0.931649] cpuhp_invoke_callback(): 136 CPU:2 rcutree_prepare_cpu+0x0/0x170
[ 0.932245] cpuhp_invoke_callback(): 136 CPU:2 notify_prepare+0x0/0xa0
[ 0.932475] cpuhp_invoke_callback(): 136 CPU:2 bringup_cpu+0x0/0x100
[ 0.008000] cpuhp_invoke_callback(): 136 CPU:2 sched_cpu_starting+0x0/0x60
[ 1.005023] cpuhp_invoke_callback(): 136 CPU:2 smpboot_unpark_threads+0x0/0x90
[ 1.005065] cpuhp_invoke_callback(): 136 CPU:2 perf_event_init_cpu+0x0/0xa0
[ 1.005408] cpuhp_invoke_callback(): 136 CPU:2 workqueue_online_cpu+0x0/0x2a0
[ 1.005729] cpuhp_invoke_callback(): 136 CPU:2 rcutree_online_cpu+0x0/0x70
[ 1.006029] cpuhp_invoke_callback(): 136 CPU:2 notify_online+0x0/0x20
[ 1.006206] cpuhp_invoke_callback(): 136 CPU:2 page_writeback_cpu_online+0x0/0x20
[ 1.006549] cpuhp_invoke_callback(): 136 CPU:2 sched_cpu_activate+0x0/0x190
```
Interesting! Currently, Lego needs to add `smpboot_create_threads()`, `workqueue_prepare_cpu()`, `bringup_cpu()`, `smpboot_unpark_threads()`, and `workqueue_online_cpu()`.
These hidden things are really hard to find and not easy to track during boot. Especially during boot, subsystems would normally do something like `for_each_online_cpu` and init each cpu one by one. But I guess, after adding support for cpu hotplug, the code kind of merged: some work is executed whenever a cpu is torn down or brought up. And bang, why not use the same set of hotplug callbacks during boot, right?
Well.