
April 2018

05/04 Fri

We made it. We’ve done our part; now it depends on the reviewers. Please be merciful, our hard work deserves something good.

04/29 Sun

Rolling.

04/26 Thur

Fix the victim pte_same issue in SMP race cases. SMP is really a pain, how many times now? But… another victim refcount bug showed up in SMP. First log in 0426-w15-½

```c 0426-w15-⅓
[ 206.381646] CPU12 PID28 victim:ffff88207ff69120 index:4 refcount:0 nr_fill:0 locked:0 flags:(0x2e)(allocated|usable|hasdata|waitflush) pcm: (null) pset:ffff88207ff72000
[ 206.416658] CPU12 PID28 hit[0] owner:21 m_nid:1 rep_nid:1 addr: 0x7fffd0000000
[ 206.433431] CPU12 PID28 victim:ffff88207ff69120 index:4 refcount:0 nr_fill:0 locked:0 flags:(0x4e)(allocated|usable|hasdata|flushed) pcm: (null) pset:ffff88207ff72000
[ 206.468429] CPU12 PID28 rmap to pset:ffff88207ff72000 set_idx: 0 nr_lru:63
[ 206.484425] CPU12 PID28 victim dumped because: PCACHE_BUG_ON_VICTIM(!VictimAllocated(v) || !VictimUsable(v) || !VictimFlushed(v) || VictimWriteback(v) || VictimLocked(v))
[ 206.543952] CPU: 12 PID: 28 Comm: python 4.0.0-lego+ #274
[ 206.521849] WARNING: CPU: 12 PID: 28 at managers/processor/pcache/victim.c:196 __put_victim_nolist+0xa5/0xd0
[ 206.722631] [] __put_victim_nolist+0xa5/0xd0
[ 206.729127] [] victim_try_fill_pcache+0x2d9/0x460
[ 206.736107] [] ? victim_insert_hit_entry+0x170/0x170
[ 206.743378] [] pcache_handle_fault+0x18a/0x750

[ 206.399206] CPU8 PID19 victim:ffff88207ff69120 index:4 refcount:0 nr_fill:0 locked:0 flags:(0x4e)(allocated|usable|hasdata|flushed) pcm: (null) pset:ffff88207ff72000
[ 206.425092] CPU8 PID19 hit[0] owner:21 m_nid:1 rep_nid:1 addr: 0x7fffd0000000
[ 206.450977] CPU8 PID19 victim:ffff88207ff69120 index:4 refcount:0 nr_fill:0 locked:0 flags:(0x4e)(allocated|usable|hasdata|flushed) pcm: (null) pset:ffff88207ff72000
[ 206.476475] CPU8 PID19 rmap to pset:ffff88207ff72000 set_idx: 0 nr_lru:63
[ 206.501779] CPU8 PID19 victim dumped because: PCACHE_BUG_ON_VICTIM(victim_ref_count(v) == 0)
[ 206.549963] CPU: 8 PID: 19 Comm: kvictim_flushd 4.0.0-lego+ #274
[ 206.532803] WARNING: CPU: 8 PID: 19 at ./include/processor/pcache_victim.h:119 __victim_flush_func+0x1e4/0x1f0
```
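The second dump shows kvictim_flushd looking at a victim whose refcount has already dropped to zero while the fault path on CPU12 was putting it. One common way to close that kind of window, shown here as a minimal sketch with made-up names (victim_get_unless_zero, victim_flush_one and the struct fields are illustrative, not the real LegoOS victim code), is to let the flusher take a reference only when the count is still non-zero:

```c
/*
 * Hypothetical sketch, not the actual fix: the flusher only works on a
 * victim whose refcount it can still raise, so it can never observe a
 * victim with refcount == 0 that another CPU is in the middle of freeing.
 */
static inline bool victim_get_unless_zero(struct pcache_victim_meta *v)
{
	/* Returns false if the count already dropped to zero. */
	return atomic_inc_not_zero(&v->_refcount);
}

static void flush_one_victim(struct pcache_victim_meta *v)
{
	if (!victim_get_unless_zero(v))
		return;			/* being torn down elsewhere, skip */

	victim_flush_one(v);		/* safe: we hold a reference */
	put_victim(v);			/* may be the final put */
}
```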

04/25

Stay humble. Be real.

04/22 Sun

Testing. Working hard!

04/21 Sat

Another major bug report in 0421-w15-19. Rmap corrupted. Lock issue?

Fixed. It is a handle_m2m_fork bug.

```c
pcache_miss_error+0x20
```

Keep it going.

I cannot remember how many times I have seen this bug. And I have no idea.

```c
[ 714.144354] IP: [<ffffffffffff8100>] 0xffffffffffff8100
[ 714.150171] PGD 115c067 PUD 115e067 PMD 0
[ 714.154729] Oops: 0010 [#1] SMP PROCESSOR
[ 714.159189] CPU: 0 PID: 15 Comm: ib_mad_completi 4.0.0-lego+ #245
[ 714.165976] BUG: unable to handle kernel paging request at ffffffffffff8100
[ 714.173732] IP: [<ffffffffffff8100>] 0xffffffffffff8100
[ 714.179549] PGD 115c067 PUD 115e067 PMD 0
[ 714.184106] RIP: 0010:[<ffffffffffff8100>] [<ffffffffffff8100>] 0xffffffffffff8100
[ 714.192638] RSP: 0000:ffff88103e88fc80 EFLAGS: 00010046
[ 714.198552] RAX: 6e82000000000098 RBX: 7b0bffffffffffff RCX: 0000000000000001
[ 714.206503] RDX: ffff88103e88fd28 RSI: 0000000000000000 RDI: 44c0ffffffff8116
[ 714.214453] RBP: ffff88103e88fcd0 R08: 000000000000001f R09: ffff88103e8643c0
[ 714.222403] R10: ffff88103e88fe68 R11: 0000000000000001 R12: a9670000018d71ba
[ 714.230354] R13: 0000000000000000 R14: ffff88103e85d0f8 R15: ffff88103dd58000
[ 714.238304] Oops: 0010 [#2] SMP PROCESSOR
[ 714.242763] FS: 0000000000000000(0000) GS:ffff88107fc00000(0000) knlGS:0000000000000000
[ 714.251781] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 714.258180] CR2: ffffffffffff8100 CR3: 0000000001159000 CR4: 00000000000406b0
[ 714.266130] CPU: 10 PID: 20 Comm: python 4.0.0-lego+ #245
[ 714.272141] RIP: 0010:[<ffffffffffff8100>] [<ffffffffffff8100>] 0xffffffffffff8100
[ 714.280673] RSP: 0018:ffff88103dd8fe10 EFLAGS: 00010202
[ 714.286588] RAX: ffff88101fa54270 RBX: 00000000000c92a6 RCX: 0000000000000002
[ 714.294538] RDX: 00000000ffffffff RSI: 0000000000000000 RDI: 44c0ffffffff8116
[ 714.302488] RBP: ffff88103dd8fe20 R08: ffff88101fa6f000 R09: ffff88101fa54400
[ 714.310439] R10: ffff880000000000 R11: 00000000407e9c00 R12: ffff88101fa54000
[ 714.318389] R13: ffff88103dd68000 R14: ffff88101fa60000 R15: ffff88101fa54000
[ 714.326339] Stack:
[ 714.328569] FS: 00007ffff7fdf740(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
[ 714.337585] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 714.343984] CR2: ffffffffffff8100 CR3: 000000103dd9a000 CR4: 00000000000406a0
[ 714.351936] Stack:
[ 714.354165] ffffffff810157f9 00000000003d0f00 ffff88103dd8fec0 ffffffff8101dde5
[ 714.362309] ffff88103dd8fe68 ffffffff81036788 0000000000000038 0000000000000038
[ 714.370453] 00007fffd89a79c0 ffff88101fa541c0 ffff88101fa54188 0000000000000000
[ 714.378598] 000000101fa60000 00007fffd89a79d0 00007fffd89a7700 0000000000000000
[ 714.386742] 00007fffd89a6fb0 ffff88103dd8ff58 0000000000000038 00000000003d0f00
[ 714.394886] Call Trace:
[ 714.397600] ffffffff81014f37 0000000000000086 ffff88107fc05d80 ffff88103e864000
[ 714.405745] 0000000000000000 ffff88107fc04980 0000000000000000 0000000000000000
[ 714.413889] ffff88103e85d0f8 ffff88103dd58000 ffff88103e88fce8 ffffffff81016bb7
[ 714.422034] 000000007fc05d80 ffff88103e88fd10 ffffffff81006754 ffffffffffff0000
[ 714.430177] ffff88107fc05d80 ffff88103e864000 ffff88103e88fe00 ffffffff8100e4ea
[ 714.438321] Call Trace:
[ 714.441037] <TSK>
[ 714.443169] [<ffffffff810157f9>] ? ktime_get+0x19/0x60
[ 714.448890] [<ffffffff8101dde5>] copy_process+0x2c5/0x1170
[ 714.454998] [<ffffffff81036788>] ? strace_printflags+0x88/0xc0
[ 714.461495] <TSK>
[ 714.463627] [<ffffffff81014f37>] ? update_wall_time+0x47/0x6b0
[ 714.470123] [<ffffffff81016bb7>] tick_handle_periodic+0x67/0x70
[ 714.476716] [<ffffffff81006754>] apic_timer_interrupt+0x55/0x90
[ 714.483309] [<ffffffff8101ecb6>] do_fork+0x26/0x160
[ 714.488738] [<ffffffff8101eea9>] sys_clone+0x29/0x30
[ 714.494265] [<ffffffff8100e8ad>] do_syscall_64+0x3d/0xd0
[ 714.500180] [<ffffffff8100d7ac>] entry_SYSCALL64_slow_path+0x25/0x25
[ 714.507257] [<ffffffff8100e4ea>] smp__apic_timer_interrupt+0x6a/0x70
[ 714.514335] <EOT>
```

04/20 Fri

Glad TF finally working now!

Keep seeing this message from the kernel. It has happened many, many times. Very deterministic.

```c
BUG: unable to handle kernel paging request at ffffffffffff8100
```

04/19 Thur

  • Patched clflush to use tgid, n_nid directly without task_struct.
  • In a 256M excache today (0419-w15-4), a timeout happened first, which is handled as a segfault to kill all threads. In an eviction->victim_prepare_hits, get_memory_nodes() encountered the NULL again. Looks like the thread_group->mm was cleared earlier.

04/18 Wed

  • Try best to fix the pipe bug. (I found it by using my old way of debugging: writing a function that tests whether a PTE is corrupted, and calling it around the syscall enter/exit. That helps find which syscall corrupts memory. I have used this stupid technique to find many hard-to-find memory corruption bugs; see the sketch after this list.)
    • do_close_on_exec
    • dup2
  • Re-read Yutong’s patch again. It touches a lot of handler code. This has to be verified before using any nowait reply.

  • pipe’s wakeup may have an issue?

  • 0418-w15-41. 39sec
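For reference, here is roughly what that syscall-boundary PTE check looks like. This is a minimal sketch, assuming Linux-style page-table walkers; check_pte_sanity and watched_vaddr are made-up names, not the actual helpers used:

```c
/*
 * Sketch of the "bisect by syscall" trick described above (illustrative
 * names).  Call it on syscall enter and exit: the first syscall whose
 * exit-side check fires is the one that corrupted the page table.
 */
static void check_pte_sanity(struct mm_struct *mm, unsigned long addr,
			     const char *where)
{
	pgd_t *pgd;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;

	pgd = pgd_offset(mm, addr);
	if (pgd_none(*pgd))
		return;
	pud = pud_offset(pgd, addr);
	if (pud_none(*pud))
		return;
	pmd = pmd_offset(pud, addr);
	if (pmd_none(*pmd))
		return;
	pte = pte_offset_kernel(pmd, addr);

	/* A non-zero pte that maps pfn 0 is the corruption signature seen in the logs. */
	if (pte_val(*pte) && pte_pfn(*pte) == 0) {
		pr_err("%s: corrupted pte %#lx for addr %#lx\n",
		       where, pte_val(*pte), addr);
		dump_stack();
	}
}

/* In the syscall path, roughly:
 *	check_pte_sanity(current->mm, watched_vaddr, "enter");
 *	...dispatch the syscall...
 *	check_pte_sanity(current->mm, watched_vaddr, "exit");
 */
```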

04/17 Tue

Checking list:

  1. -pcache: ibapi use va or pa, does it matter?-
    • No, I changed it to use the VA. Then we no longer need the PA reply.
  2. =ib_mad, does it really corrupt Memory=
    • Still not sure. It should be something coming from ib_poll_cq.
  3. M side per PTE lock, check if the lock is really the same lock!
  4. -Mail I20. Check CPT.-
  5. Dist-VMA
    • First make sure TF + no-dist-vma works on my own setting. Though sometimes random bugs happen (I suspect it is IB).
    • Then turn on dist-vma
      • w/wo zerofill
      • w/wo kfree
      • w/wo all-zero Debug.
      • w/wo M side per PTE lock
  6. Change most handlers to use TX buffer. Reduce the random mismatched reply case.
  7. P side watchdog patch: what to print
  8. -It looks like it is easier to hit bugs when I turn on the debug counter printing. I probably should check the buffer management. All following tests have zerofill:-

    • w print
      • F 0417-w15-2(rmap_walk list_for_each_entry #GP)
      • F 0417-w15-3(pcache_copy_page_range corrupted PTE)
      • F 0417-w15-4(fit_poll_cq+0x39 ib_poll_cq() …)
      • F 0417-w15-5(pcache_copy_page_range corrupted PTE)
    • wo strace exit:
      • S 0417-w15-6(each 100 steps take ~39s; Linux is ~34s)
      • S 0417-w15-7(filling shuffle data, that works)
      • F 0417-w15-8(pcache_copy_page_range+0x5d1)
      • F 0417-w15-9(rmap_walk+0x47 #GP)
    • disable strace:
      • F 0417-w15-10(pcache_copy_page_range+0x5d1)
    • Conclusion
      • it has nothing to do with the strace thing.
      • most of them fail around nr_reqs=19103
  9. Why does the pcache_copy_page_range bug always happen after some fork, execve?

    • w strace (fork, vfork, clone, execve)
      • F 0417-w15-11 (pcache_cp_pg_range). Understand its flow. Back to make sure the P side per PTE lock is correct. If it is the pcache_cp fault, it always fails at nr_reqs=19103. And it is: 1) python fork, 2) execve sh.
      • S 0417-w15-12. With global PTE lock. Passed the failed stage above.
      • F 0417-W15-13. With global PTE lock. Failed at pcache_cp. Same place. (Since the global PTE lock also fails, it is not the lock issue. Someone is still writing to the wrong memory.)
      • F 0417-w15-14. With global PTE lock. Same place. Found that I printed misleading debug info. Modified it a little bit to print the actual pte content. Hope to get some valid info next round.
      • F 0417-w15-15. Same place. copy: addr: 0x7fffdca07000, ptecont: 0x8800000000000. zap: ptent: 0x340 address: 0x7fffdca08000.
      • F 0417-w15-16. Well. BUG in the ib_mad_send handler. I added the same check in ib_mad_receive. This is really just used to catch it, not fix it.
      • F 0417-w15-17. Again, addr: 0x7fffdc207000, ptecont: 0x8800000000000
      • F 0417-w15-18. addr: 0x7fffdca07000, ptecont: 0x8800000000000
    • Conclusion
      • Only these two addresses
      • addr: 0x7fffdca07000, ptecont: 0x8800000000000
      • pte:ffff88103ea87038 (0x8800000000000) pfn:0x0 flags:(0x8800000000000)
      • addr: 0x7fffdc207000, ptecont: 0x8800000000000
      • pte:ffff88103ea97038 (0x8800000000000) pfn:0x0 flags:(0x8800000000000)
  10. Bug found. In pipe_read/write. It somehow corrupted memory. Damn.

  11. -Another first thing, check this weird log.. : Hmm, this log should be fine. mad_post is after recv_done_handler. So even if we detect corrupted memory in handler, it has nothing to do with mad_post. The root cause should come from ib_poll_cq, that is where we pass wc to, and where the wc.wr_id was filled in.- ```c [ 3850.911144] ib_mad_recv_done_handler(): c1: 2060 c2: 12 wc->wr_id: 0xffff88103eea1398 [ 3850.921881] ib_mad_post_receive_mads(): c1: 2060 c2: 13 recv_wr.wr_id: 0xffff88103eea1008 recv_queue: ffff88103ee42520 [ 3850.933620] ib_mad_completion_handler 2377 got successful send cq op 0 mad_got_one 13 [ 3850.942346] ib_mad_completion_handler 2383 got successful recv cq op 128 mad_got_one 14 [ 3850.951266] ib_mad_recv_done_handler(): c1: 2061 c2: 13 wc->wr_id: 0xffff88103eea1560 [ 3850.961999] ib_mad_post_receive_mads(): c1: 2061 c2: 14 recv_wr.wr_id: 0xffff88103eea11d0 recv_queue: ffff88103ee42520 [ 3850.973737] ib_mad_completion_handler 2377 got successful send cq op 0 mad_got_one 14 [ 3851.257563] ib_mad_completion_handler 2383 got successful recv cq op 128 mad_got_one 15 [ 3851.266295] ib_mad_recv_done_handler(): c1: 2062 c2: 14 wc->wr_id: 0xffff88103eea1728 [ 3851.277029] ib_mad_post_receive_mads(): c1: 2062 c2: 15 recv_wr.wr_id: 0xffff88103eea1398 recv_queue: ffff88103ee42520 [ 3851.288767] ib_mad_completion_handler 2377 got successful send cq op 0 mad_got_one 15 [ 3851.297493] ib_mad_completion_handler 2383 got successful recv cq op 128 mad_got_one 16 [ 3851.306413] ib_mad_recv_done_handler(): c1: 2063 c2: 15 wc->wr_id: 0xffff88103eea18f0 [ 3851.317147] ib_mad_post_receive_mads(): c1: 2063 c2: 16 recv_wr.wr_id: 0xffff88103eea1560 recv_queue: ffff88103ee42520 [ 3851.328886] ib_mad_completion_handler 2377 got successful send cq op 0 mad_got_one 16 [ 3851.903180] ib_mad_completion_handler 2383 got successful recv cq op 128 mad_got_one 17 [ 3851.911913] ib_mad_recv_done_handler(): c1: 2064 c2: 16 wc->wr_id: 0xffff88103eea1ab8 [ 3851.922646] ib_mad_post_receive_mads(): c1: 2064 c2: 17 recv_wr.wr_id: 0xffff88103eea1728 recv_queue: ffff88103ee42520 [ 3851.934384] ib_mad_completion_handler 2377 got successful send cq op 0 mad_got_one 17 [ 3851.943110] ib_mad_completion_handler 2383 got successful recv cq op 128 mad_got_one 18 [ 3851.952030] ib_mad_recv_done_handler(): c1: 2065 c2: 17 wc->wr_id: 0xffff88103eea1c80 [ 3851.962764] ib_mad_post_receive_mads(): c1: 2065 c2: 18 recv_wr.wr_id: 0xffff88103eea18f0 recv_queue: ffff88103ee42520 [ 3851.974502] ib_mad_completion_handler 2377 got successful send cq op 0 mad_got_one 18 [ 3864.723128] *** FIT layer ready to go! 
[ 3864.727206] *** [ 3867.339488] Processor LLC Configurations: [ 3867.343760] PhysStart: 0x100000000 [ 3867.348705] VirtStart: 0xffff880100000000 [ 3867.354329] Registered Size: 0x400000000 [ 3867.359274] Actual Used Size: 0x208000000 [ 3867.364219] NR cachelines: 2097152 [ 3867.368776] Associativity: 8 [ 3867.372751] NR Sets: 262144 [ 3867.377210] Cacheline size: 4096 B [ 3867.381672] Metadata size: 64 B [ 3867.385937] NR cacheline bits: 12 [ 0 - 11] 0x0000000000000fff [ 3867.392821] NR set-index bits: 18 [12 - 29] 0x000000003ffff000 [ 3867.399705] NR tag bits: 34 [30 - 63] 0xffffffffc0000000 [ 3867.406588] NR pages for data: 2097152 [ 3867.411147] NR pages for meta: 32768 [ 3867.415509] Cacheline (pa) range: [ 0x100000000 - 0x2ffffffff] [ 3867.423848] Metadata (pa) range: [ 0x300000000 - 0x307ffffff] [ 3867.432186] Cacheline (va) range: [0xffff880100000000 - 0xffff8802ffffffff] [ 3867.440524] Metadata (va) range: [ ffff880300000000 - 0xffff880307ffffff] [ 3867.448862] pcache_set_map(064B): [ ffff88207ec00000 - 0xffff88207fbfffff] [ 3867.457201] Way cache stride: 0x40000000 [ 3867.462048] Memmap $ semantic: memblock reserved [ 3867.468156] NR victim $ entries: 8 [ 3867.472725] newpid: 1 home:1 replica: 1 [ 3867.476980] p2m_fork(cpu0): I cur:1-kernel_init new:20 [ 3867.482718] p2m_fork(cpu0): O succeed cur:1-kernel_init new:20 [ 3867.489197] Processor: Processor manager is running. [ 3867.494724] Online CPU: 0,2,4,6,8,10,12,14,16,18,20,22 [ 3867.500444] Active CPU: 0,2,6,10,12,14,16,18,20,22 [ 3867.505777] [0] Thread[kvictim_flushd:19] pinned at CPU 8 [ 3867.511982] [1] Thread[recvpollcq:17] pinned at CPU 4 [ 3867.539217] do_close_on_exec(): TODO, not implemented. [ 3867.549209] STDOUT: —[ Before execv^V ]— [ 3867.553870] STDOUT: —[

e

[ 3867.557880] newpid: 20 home:1 replica: 1 [ 3867.562248] p2m_fork(cpu10): I cur:20-exe.o new:21 [ 3867.567560] p2m_fork(cpu10): O succeed cur:20-exe.o new:21 [ 3867.573670] CPU12 PID21 sys_execve [ 3867.578681] do_close_on_exec(): TODO, not implemented. [ 3867.584215] CPU12 PID21 sys_execve = 0, 0x0 [ 3867.599867] BUG: unable to handle kernel paging request at 0000000408446080 [ 3867.607436] IP: [] task_tick_rt+0x1f/0xd0 ```

04/16 Mon

Make dist-vma work with TF first. Tough work.

0416-w14-7: 1) do_wp_page triggered, 2) deadlock on the per PTE lock. This really should not happen; it is a single worker. Basically it means the page->lock is not initialized. Probably our per PTE lock implementation is wrong.

```c
[ 5220.250552] hb: worker[0] CPU 4 stucked
[ 5220.254819] hb: common_header [op=0x20000000 src_nid:0]
[ 5220.260734] hb: msg [pid=21,tgid=21,flags=0x51,vaddr=0x7fff7b7fdfb8]
[ 5220.267911] CPU: 4 PID: 31 Comm: thpool-worker0 4.0.0-lego-ys+ #237
[ 5220.274890] RIP: 0010:[] [] handle_lego_mm_fault+0x373/0x4f0

handle_lego_mm_fault+0x373/0x4ee:
arch_spin_lock at arch/x86/include/asm/spinlock.h:21
(inlined by) spin_lock at include/lego/spinlock.h:72
(inlined by) do_anonymous_page at managers/memory/vm/fault.c:115
(inlined by) handle_pte_fault at managers/memory/vm/fault.c:142
(inlined by) handle_lego_mm_fault at managers/memory/vm/fault.c:225
```
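If that suspicion is right, the fix is mechanical: the per-PTE lock has to be spin_lock_init()'ed before the structure carrying it becomes visible to anyone else. A minimal sketch, with struct lego_pte_meta and alloc_pte_meta as made-up names rather than the real code:

```c
/*
 * Hypothetical sketch, not the real LegoOS structures: an uninitialized
 * spinlock word contains whatever the allocator left there, and the first
 * spin_lock() on it can spin forever, which from the outside looks exactly
 * like a single-worker "deadlock" in handle_lego_mm_fault().
 */
struct lego_pte_meta {
	spinlock_t	ptl;	/* per-PTE lock guarding this entry */
	/* ... */
};

static struct lego_pte_meta *alloc_pte_meta(void)
{
	struct lego_pte_meta *m = kzalloc(sizeof(*m), GFP_KERNEL);

	if (!m)
		return NULL;

	/* Must run before any other thread can find this meta. */
	spin_lock_init(&m->ptl);
	return m;
}
```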

An IB bug during a normal run (P M S TF); this is REALLY weird:

```c
[ 395.259560] CPU12 PID21 sys_execve
[ 395.263345] BUG: unable to handle kernel NULL pointer dereference at 00000000000001a0
[ 395.272068] IP: [] fit_poll_cq+0x39/0x530

git:(test_vma)] $ ./scripts/faddr2line vmImage fit_poll_cq+0x39
fit_poll_cq+0x39/0x523:
ib_poll_cq at include/rdma/ib_verbs.h:1614
(inlined by) fit_poll_cq at net/lego/fit_internal.c:1671
```

Caught the ib_mad bug once… and mlx4_error follows. I added more checking where the mad_queue is assigned.

```c
[ 787.471385] ib_mad_completion_handler 2365 got successful recv cq op 128 mad_got_one 15
[ 787.480124] BUG! mad_list: ffff88103eea1728 mad_queue: (null)
[ 787.487491] ------------[ cut here ]------------
[ 787.492630] WARNING: CPU: 0 PID: 15 at drivers/infiniband/core/mad.c:1909 ib_mad_completion_handler+0xa56/0xab0
```

04/15 Sun

Trying TF myself.

Had a bug report on 0415-w15-5, on fork, execve etc. ```c [ 317.436811] newpid: 22 home:1 replica: 1 [ 317.477701] pte:ffff88103e94a038 pfn:0x0 flags:() [ 317.482752] pte dumped because: corrupted [ 317.487213] ------------[ cut here ]------------ [ 317.492352] WARNING: CPU: 14 PID: 22 at managers/processor/pgtable.c:365 pcache_copy_page_range+0x5d1/0x6c0 [ 317.503213] CPU: 14 PID: 22 Comm: python 4.0.0-lego+ #93 [ 317.552082] Call Trace: [ 317.554799] [ 317.556930] [] __warn.constprop.0+0x91/0xd0 [ 317.563330] [] warn_slowpath_null+0xf/0x20 [ 317.569634] [] pcache_copy_page_range+0x5d1/0x6c0 [ 317.576615] [] fork_dup_pcache+0x27/0x30 [ 317.582723] [] copy_process+0xcf4/0x1140 [ 317.588833] [] do_fork+0x26/0x160 [ 317.594264] [] sys_clone+0x29/0x30 [ 317.599789] [] do_syscall_64+0x3d/0xd0 [ 317.605705] [] entry_SYSCALL64_slow_path+0x25/0x25 [ 317.612782] [ 317.614917] —[ end trace 0000000000000000 ]— [ 317.625561] p2m_fork(cpu14): I cur:22-python new:36 [ 330.209312] p2m_fork(cpu14): O succeed cur:22-python new:36

[ 330.310909] ------------[ cut here ]------------ [ 330.315864] BUG: failure at managers/processor/pcache/rmap.c:804/pcache_zap_pte()! [ 330.324302] Kernel Panic - not syncing: BUG! [ 330.329050] CPU: 0 PID: 36 Comm: python 4.0.0-lego+ #93 [ 330.377824] Call Trace: [ 330.380540] [ 330.382672] [] panic+0xc2/0x105 [ 330.387908] [] ? task_tick_rt+0x2c/0xd0 [ 330.393920] [] ? scheduler_tick+0x55/0x60 [ 330.400126] [] ? tick_handle_periodic+0x45/0x70 [ 330.406913] [] ? apic_timer_interrupt+0x54/0x90 [ 330.413700] [] ? smp__apic_timer_interrupt+0x6a/0x70 [ 330.420973] [] ? printk+0x11d/0x1b0 [ 330.426597] [] pcache_zap_pte+0x14c/0x190 [ 330.432802] [] ? __pcache_remove_rmap_one+0x70/0x70 [ 330.439978] [] unmap_page_range+0x325/0x3f0 [ 330.446379] [] release_pgtable+0x1e/0x40 [ 330.452487] [] pcache_process_exit+0x18/0x20 [ 330.458984] [] mmput+0x34/0xb0 [ 330.464123] [] do_execve+0x42d/0x760 [ 330.469845] [] sys_execve+0x9/0x10 [ 330.475371] [] do_syscall_64+0x3d/0xd0 [ 330.481286] [] entry_SYSCALL64_slow_path+0x25/0x25 [ 330.488364] [ 330.490501] —[ end Kernel panic - not syncing: BUG! one more [ 369.223161] newpid: 22 home:1 replica: 1 [ 369.264307] pte:ffff88103ea41038 (0x0) pfn:0x0 flags:() [ 369.269938] pte dumped because: corrupted [ 369.274399] ------------[ cut here ]------------ [ 369.279538] WARNING: CPU: 14 PID: 22 at managers/processor/pgtable.c:365 pcache_copy_page_range+0x5d1/0x6c0 [ 369.290398] CPU: 14 PID: 22 Comm: python 4.0.0-lego+ #94 [ 369.296310] Stack: [ 369.341976] [ 369.344107] [] __warn.constprop.0+0x91/0xd0 [ 369.350508] [] warn_slowpath_null+0xf/0x20 [ 369.356809] [] pcache_copy_page_range+0x5d1/0x6c0 [ 369.363790] [] fork_dup_pcache+0x27/0x30 [ 369.369897] [] copy_process+0xcf4/0x1140 [ 369.376006] [] do_fork+0x26/0x160 [ 369.381435] [] sys_clone+0x29/0x30 [ 369.386960] [] do_syscall_64+0x3d/0xd0 [ 369.392875] [] entry_SYSCALL64_slow_path+0x25/0x25 [ 369.399952] [ 369.402086] —[ end trace 0000000000000000 ]—

[ 369.412750] p2m_fork(cpu14): I cur:22-python new:36 [ 369.418215] p2m_fork(cpu14): O succeed cur:22-python new:36 [ 369.500829] ptent: 0x340 address: 0x7fffe2408000 [ 369.505783] pte:ffff88103dbe5040 (0x340) pfn:0x0 flags:(dirty|global|softw1) [ 369.513637] pte dumped because: corrupted [ 369.518095] ------------[ cut here ]------------ [ 369.523236] BUG: failure at managers/processor/pcache/rmap.c:808/pcache_zap_pte()! [ 369.531672] Kernel Panic - not syncing: BUG! ```

04/14 Sat

  1. Check whether page-table pages, and the pages themselves, are freed in munmap, at both P and M. Need to confirm. Will they do harm?
  2. Implement replication
  3. Add IB counter

04/13 Fri

Patched the M side pgtable to use per PTE/PMD locks, so thpool in M will not be bottlenecked by the page_table_lock.
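The split-lock idea is the same one Linux uses for its page tables. A rough sketch, under the assumption that the lock can live in the struct page backing each PTE table (lego_pte_lockptr and the field layout are illustrative, not the actual patch):

```c
/*
 * Rough sketch of a per-PMD split lock (illustrative names): instead of one
 * mm-wide page_table_lock, each PTE table carries its own lock, so thpool
 * workers faulting on different PMD ranges never contend with each other.
 */
static inline spinlock_t *lego_pte_lockptr(struct lego_mm_struct *mm, pmd_t *pmd)
{
	/* One lock per page of PTEs, embedded in its struct page. */
	return &pmd_page(*pmd)->ptl;
}

#define lego_pte_lock(mm, pmd)		spin_lock(lego_pte_lockptr(mm, pmd))
#define lego_pte_unlock(mm, pmd)	spin_unlock(lego_pte_lockptr(mm, pmd))
```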

ib_mad_recv_done_handler may corrupt memory, again.

Somehow, during testing of this patch, running MT-Phoenix with 1GB, the P side reported bad pgd entries. I’m using the fork+execve way. The child (phoenix) already exited; this msg is printed when the parent does exit_mm. The pgd table entries should either be 0 or valid pud VAs. Memory corruption happened…

```c
[ 2551.687806] Kernel strace
[ 2551.690715] Task: 21:21 nr_accumulated_threads: 1
[ 2551.696327] % time seconds usecs/call calls errors syscall
[ 2551.703704] ------ -------------- ----------- --------- --------- ----------------
[ 2551.712141] 98.63 66.942660568 66942661 1 0 sys_wait4
[ 2551.719898] 0.45 0.457060789 457061 1 0 sys_clone
[ 2551.727654] 0.20 0.204320071 51081 4 0 sys_brk
[ 2551.735216] 0.40 0.040378189 40379 1 0 sys_mmap
[ 2551.742876] 0.13 0.013682424 4561 3 0 sys_write
[ 2551.750633] 0.10 0.000001039 2 1 0 sys_newfstat
[ 2551.758681] 0.88 0.000000888 1 2 0 sys_rt_sigaction
[ 2551.767114] 0.79 0.000000792 1 2 0 sys_futex
[ 2551.774871] 0.77 0.000000770 1 1 0 sys_rt_sigprocmask
[ 2551.783501] 0.54 0.000000548 1 1 0 sys_arch_prctl
[ 2551.791742] 0.49 0.000000499 1 1 0 sys_newuname
[ 2551.799789] 0.46 0.000000469 1 1 0 sys_getrlimit
[ 2551.807933] 0.19 0.000000195 1 1 0 sys_set_tid_address
[ 2551.816659] 0.19 0.000000190 1 1 0 sys_set_robust_list
[ 2551.825386] 0.18 0.000000181 1 1 0 sys_ioctl
[ 2551.833143] ------ -------------- ----------- --------- --------- ----------------
[ 2551.841577] 100.00 67.658107612 22 0 total
[ 2551.848945]
[ 2551.850591]
[ 2551.852240] Kernel Profile Points
[ 2551.855924] status name total nr avg.ns
[ 2551.865621] ------- -------------------- ---------------- ---------------- ----------------
[ 2551.875317] off flush_tlb_others 0.000204992 58 3535
[ 2551.885014] off __do_kmalloc_node 0.300783843 281501 1069
[ 2551.894709] off __pcache_zerofill 0.009844770 16558 595
[ 2551.904404] off pcache_miss 54.414457906 257869 211016
[ 2551.914100] off pcache_flush 0.000000000 0 0
[ 2551.923795] ------- -------------------- ---------------- ---------------- ----------------
[ 2551.933490]
[ 2552.074985] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e956028(ffffffff81146ca0)
[ 2552.084206] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e956030(ffff88103e956030)
[ 2552.093611] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e956038(ffff88103e956030)
[ 2552.103016] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e956048(ffff88103cc48740)
[ 2552.112421] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e956050(00000000000001c0)
[ 2552.121825] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e956058(ffff88103eea2008)
[ 2552.131230] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e956060(ffff88103eea17d8)
[ 2552.140635] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e956068(ffff88103ee42520)
[ 2552.150040] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e9560e8(000000103e9560f0)
[ 2552.159444] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e956118(0102008081018101)
[ 2552.168849] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e956120(3c010b0012000000)
[ 2552.178254] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e956128(0000000000001100)
[ 2552.187659] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e956138(00000000ffffffff)
[ 2552.197064] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e956158(03078a2402010101)
[ 2552.206467] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e956160(03078a2453946600)
[ 2552.215872] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e956168(03078a2450946600)
[ 2552.225277] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e956170(0310800051946600)
[ 2552.234682] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e956178(c902000100000000)
[ 2552.244088] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e956198(bfd0cc054a122000)
[ 2552.253492] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e9561a0(000000000098b9c8)
[ 2552.262897] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e9561a8(bfe0fe0610914e01)
[ 2552.272302] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e9561b0(000000000050f2c7)
[ 2552.281706] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e9561b8(bfd9a30000ec5100)
[ 2552.291111] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e9561c0(bffc91d40f20f2c7)
[ 2552.300516] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e9561c8(0f20cd054a20f2c7)
[ 2552.309920] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e9561d0(1094edcf0f60edcf)
[ 2552.319325] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e9561d8(0000000000000100)
[ 2552.328730] ./arch/x86/include/asm/pgtable.h:579: bad pgd ffff88103e956218(0000000000005aa5)
[ 2552.338151] nr_pgfault: 26
```

Second run, saw this invalid pointer dereference again! Combined with the above log, I think ib_mad is definitely corrupting memory! I have to take a look.

qp_info = mad_list->mad_queue->qp_info;

Patching the handlers to use tx buffer.

Patched.

One race condition: pcache_handle_miss uses the page itself as the reply buffer. Assume later on it changes to use the nowait reply. While the reply is buffered in the queue and has not been sent, another munmap comes in and invalidates this area; then the page will be freed and the reply data is invalid.

But this case seems abnormal; I guess applications will not do this.
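A minimal sketch of the safer variant, assuming the handler copies into a private buffer that it hands off to the nowait send path (the helper name, its signature, and the ownership contract below are assumptions, not the current code):

```c
/*
 * Sketch: bounce the pcache line into a kmalloc'ed buffer before queueing
 * the nowait reply.  A later munmap can then free the original page without
 * invalidating the in-flight reply.  Assumed contract: the network layer
 * frees @bounce once the send completes.
 */
static int reply_pcache_line_nowait(void *reply_desc, void *page, size_t len)
{
	void *bounce = kmalloc(len, GFP_KERNEL);

	if (!bounce)
		return -ENOMEM;

	memcpy(bounce, page, len);
	return ibapi_reply_message_nowait(reply_desc, bounce, len);
}
```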

Check whether page-table pages, and the pages themselves, are freed in munmap, at both P and M. Need to confirm.

Tonight’s task: think about how to do the VMA replication, and how to combine it with the $ line replication.

04/12 Thur

Patched zerofill. All done.

Testing the new driver fix with Phoenix:

  • 1st run: the mismatched reply is still there. The mmap() replied address is different from the one printed, so a segfault follows. (0412-w15-4)
  • 2nd run, 3rd run: succeed.

0412-w15-9 0412-w14-9: First time testing phoenix with zerofill (no net). Somehow, P has a pcache timeout, but M’s watchdog shows there are no pending requests. This happened once before, I remember…

0412-w15-10. Have not seen this ib mad thing for a long time. Indeed something is wrong somewhere.

```c
[ 297.794969] ib_mad_completion_handler 2344 got successful recv cq op 128 mad_got_one 15
[ 297.803706] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 297.812431] IP: [<ffffffff81058937>] ib_mad_completion_handler+0xc7/0x810 2
```

04/11 Wed

Adding anon first touch opt.

0411-p/m-9: this log indicates M does not have any unhandled requests, but the P side has 1 __pcache_fill timeout. It seems the message is lost somewhere.

0411-p/m-11: caught one with the debug msg Yiying added. She says the M side send queue has 2 reqs, but poll does not return any error. Weird.

Help debugging IB issue.

04/10 Tue

Found. IB stuck. Damn.

```c
[ 2240.294960] RIP: 0010:[<ffffffff8104a6d8>] [<ffffffff8104a6d8>] mlx4_ib_poll_cq+0x378/0x6a0
[ 2242.694733] RIP: 0010:[<ffffffff8104a6d8>] [<ffffffff8104a6d8>] mlx4_ib_poll_cq+0x378/0x6a0
[ 2245.094524] RIP: 0010:[<ffffffff8104a6e3>] [<ffffffff8104a6e3>] mlx4_ib_poll_cq+0x383/0x6a0
[ 2247.494306] RIP: 0010:[<ffffffff8104a6d8>] [<ffffffff8104a6d8>] mlx4_ib_poll_cq+0x378/0x6a0
[ 2249.894088] RIP: 0010:[<ffffffff8104a6d8>] [<ffffffff8104a6d8>] mlx4_ib_poll_cq+0x378/0x6a0
[ 2252.293870] RIP: 0010:[<ffffffff8104a6d8>] [<ffffffff8104a6d8>] mlx4_ib_poll_cq+0x378/0x6a0
[ 2254.693651] RIP: 0010:[<ffffffff8104a6d8>] [<ffffffff8104a6d8>] mlx4_ib_poll_cq+0x378/0x6a0
[ 2257.093431] RIP: 0010:[<ffffffff8104a6e3>] [<ffffffff8104a6e3>] mlx4_ib_poll_cq+0x383/0x6a0
```

04/09 Mon

thpool testing. 4 workers. MT-phoenix:

```c
[root@wuklab05 ys]# cat 0409-p | grep __munmap
[ 227.054974] CPU14 PID22 strace__munmap([0x7fffb0ba9000 - 0x7fffb4000000], 54882304) = 0, 0x0
[ 227.093466] CPU16 PID23 strace__munmap([0x7fffab7ff000 - 0x7fffac000000], 8392704) = 0, 0x0
[ 227.102773] CPU14 PID22 strace__munmap([0x7fffb8000000 - 0x7fffb8ba9000], 12226560) = 0, 0x0
[ 227.141265] CPU18 PID24 strace__munmap([0x7fffa8000000 - 0x7fffac000000], 67108864) = 0, 0x0
[ 227.150669] CPU16 PID23 strace__munmap([0x7fffb0000000 - 0x7fffb37ff000], 58716160) = 0, 0x0
[ 227.218248] CPU22 PID26 strace__munmap([0x7fffa0000000 - 0x7fffa4000000], 67108864) = 0, 0x0
[ 227.285826] CPU2 PID28 strace__munmap([0x7fff98000000 - 0x7fff9c000000], 67108864) = 0, 0x0
[ 227.440567] CPU14 PID31 strace__munmap([0x7fff8a7fd000 - 0x7fff8c000000], 25178112) = 0, 0x0
[ 227.449972] CPU12 PID30 strace__munmap([0x7fff88000000 - 0x7fff8c000000], 67108864) = 0, 0x0
[ 227.459376] CPU14 PID31 strace__munmap([0x7fff90000000 - 0x7fff927fd000], 41930752) = 0, 0x0
[ 227.490109] CPU18 PID33 strace__munmap([0x7fff80000000 - 0x7fff84000000], 67108864) = 0, 0x0
[ 227.723140] word_count-pthr[29]: segfault at 0x4e842010 ip 0000000000420354 sp 00007fffb17f9bc0 error 6 0x4e842010
```

  • Print mmap on M if segfault. Printed: the 0x4e842010 is never a valid address. thpool makes the Memory side SMP, which probably brings some issues.

Found:

```
P
CPU22 PID26 strace__mmap(addr=0x0, len=0xfb000, prot(0x3)=PROT_READ|PROT_WRITE, flags(0x22)=MAP_PRIVATE|MAP_ANONYMOUS, fd=18446744073709551615( ), off=0x0) = 1317351432, 0x4e853008
word_count-pthr[26]: segfault at 0x4e853010 ip 0000000000420354 sp 00007fff972a8bc0 error 6

M
[ 583.120615] 00400000-004d9000 r-xp 00000000 /root/ys/phoenix/phoenix-2.0/tests/word_count/word_count-pthread
[ 583.131578] 006d9000-006dc000 rw-p 000d9000 /root/ys/phoenix/phoenix-2.0/tests/word_count/word_count-pthread
[ 583.142729] 006dc000-00755000 rw-p 00000000 [heap]
[ 583.148254] 7fff529c9000-7fffb93aa000 rw-p 00000000
[ 583.153974] 7fffb93aa000-7ffff7fff000 rw-p 00000000 /root/ys/phoenix/phoenix-2.0/tests/word_count/word_count_datafiles/word_1GB.txt
[ 583.167355] 7ffffffde000-7ffffffff000 rw-p 00000000 [stack]
[ 583.173753] ------------[ cut here ]------------
[ 583.178892] WARNING: CPU: 4 PID: 31 at managers/memory/handle_pcache/fault.c:55 handle_p2m_pcache_miss+0x18e/0x1d0
[ 583.190430] src_nid:0,pid:21,vaddr:0x4e853010
[ 583.195279] CPU: 4 PID: 31 Comm: thpool-worker0 4.0.0-lego-ys+ #90
```

Confirmed. I added a number to the mmap requests and then compared the results of both P and M. The data is wrong. Btw, I’m only running 1 worker thread at M, which makes it single-threaded handling. So, I’m going to: 1) first use kmalloc to get the reply buffer, and 2) revert the IB MAX_OUT config and remove the #ifdef COMP_MEMORY, to see if it is this patch’s issue.

```c
P:
CPU18 PID24 strace__mmap(30 ..) = -1325940736, 0x7fffb0f7c000
CPU22 PID26 strace__mmap(31 ..) = 2144269992, 0x7fcef6a8

M:
…
handle_p2m_mmap(): 30 7fffb0f7c000
handle_p2m_mmap(): 31 7fffb0efe000
…
```

Anyway, this is temporarily fixed by using a kmalloc'ed reply buffer.

Spent the whole afternoon and the whole night. Finally figured out why the timeout happens in P: somewhere in the middle, M has 1 or more requests stuck/unhandled. A deadlock happens in the middle.

Like this one: 5 requests are queued waiting, 1 is being handled, and that 1 handler is stuck. It is handle_pcache_miss. Now I need to find out where it is stuck! thpool-worker0 nr_queued: 5 1

Oh, I really hope we can have some soft/hw lockdep, watchdog stuff. That would make our life much, much easier!
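For what it's worth, a soft watchdog of the kind wished for here is not much code. A rough sketch (every name below, including worker_heartbeat, NR_THPOOL_WORKERS and soft_watchdog_fn, is hypothetical):

```c
/*
 * Rough sketch of a soft watchdog: each handler thread stamps its slot when
 * it picks up a request and clears it when done; a low-priority kthread
 * screams if any stamp is too old.
 */
#define WORKER_STUCK_NS		(10ULL * NSEC_PER_SEC)

struct worker_heartbeat {
	u64	handler_start;	/* sched_clock() at dequeue, 0 when idle */
	void	*current_req;	/* the request being handled, for dumping */
};

static struct worker_heartbeat heartbeats[NR_THPOOL_WORKERS];

static int soft_watchdog_fn(void *unused)
{
	while (!kthread_should_stop()) {
		u64 now = sched_clock();
		int i;

		for (i = 0; i < NR_THPOOL_WORKERS; i++) {
			u64 start = READ_ONCE(heartbeats[i].handler_start);

			if (start && now - start > WORKER_STUCK_NS)
				pr_warn("worker[%d] stuck for %llu ms handling req %p\n",
					i, (now - start) / NSEC_PER_MSEC,
					heartbeats[i].current_req);
		}
		ssleep(5);
	}
	return 0;
}
```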

04/08 Sun

Trying the fit_nowait patch.

  • First, try the fit_nowait patch without any changes to other code. See if this patch can work.
  • Second, modify pcache to use reply_message_nowait. See if this can work and improve performance.
  • Third, if 2) improves perf, move on to modify the thpool patch.

1st run: P fails at ib_mad, during boot:

```c
[ 349.239220] Online CPU: 0,2,4,6,8,10,12,14,16,18,20,22
[ 349.244940] Active CPU: 0,2,6,10,12,14,16,18,20,22
[ 349.250272] [0] Thread[kvictim_flushd:19] pinned at CPU 8
[ 349.256478] [1] Thread[recvpollcq:17] pinned at CPU 4
[ 356.188819] ib_mad_completion_handler 2344 got successful recv cq op 128 mad_got_one 13
[ 356.197545] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 356.206270] IP: [<ffffffff81058287>] ib_mad_completion_handler+0xc7/0x810
```

2nd run: on the P side, config MAX_OUT to 1. Then single-thread phoenix with 1GB data finished. But I forgot to turn on the profile points. Run one more time.

3rd run. Same as the 2nd run setting, but with profile on. A bug shows. Ugh. I still think it is because of ib_mad_handler. It must write to wrong memory locations and corrupt things randomly.

```c
[ 456.237913] do_close_on_exec(): TODO, not implemented.
...
[ 456.263274] BUG: unable to handle kernel paging request at 00000002f4bfbf58
[ 456.270843] IP: [<ffffffff8101bbff>] task_tick_rt+0x1f/0xd0
[ 456.277048] PGD 0
[ 456.279279] Thread overran stack, or stack corrupted
[ 456.284804] Oops: 0000 [#1] SMP PROCESSOR
[ 456.289265] CPU: 10 PID: 20 Comm: kevict_sweepd 4.0.0-lego+ #40
[ 456.295858] RIP: 0010:[<ffffffff8101bbff>] [<ffffffff8101bbff>] task_tick_rt+0x1f/0xd0
```

4th run, succeeded. But it looks like the perf is very bad. Oh, but 99% of the pcache misses are file-backed, which go to storage, so the number is actually doubled.

```c
With fit_nowait patch:
[ 308.660051] Kernel Profile Points
[ 308.663734] status name total nr avg.ns
[ 308.673431] ------- -------------------- ---------------- ---------------- ----------------
[ 308.683128] off flush_tlb_others 0.000130715 53 2467
[ 308.692824] off __do_kmalloc_node 0.097344056 265647 367
[ 308.702521] off pcache_miss 4.504660891 258211 17446
[ 308.712218] off pcache_flush 0.000000000 0 0
[ 308.721914] ------- -------------------- ---------------- ---------------- ----------------
```

5th run. Just ran the large malloc test. Looks better than yesterday’s result. But I’m using 15 as P today, instead of 13. So let me try one more time to see if it is the machine.

```c
With fit_nowait patch:
[ 674.382592] Kernel Profile Points
[ 674.386277] status name total nr avg.ns
[ 674.395974] ------- -------------------- ---------------- ---------------- ----------------
[ 674.405670] off flush_tlb_others 0.000130838 53 2469
[ 674.415366] off __do_kmalloc_node 1.604700641 1584917 1013
[ 674.425062] off pcache_miss 6.467938547 786571 8223
[ 674.434758] off pcache_flush 3.342783614 262225 12748
[ 674.444455] ------- -------------------- ---------------- ---------------- ----------------
[ 674.554497] nr_pgfault: 786513
[ 674.557706] nr_clflush: 262225
[ 674.561099] nr_pgfault_wp: 0
[ 674.564299] nr_pgfault_wp_cow: 0
[ 674.567887] nr_pgfault_wp_reuse: 0
[ 674.571668] nr_pgfault_due_to_concurrent_eviction: 0
[ 674.577195] nr_pcache_fill_from_memory: 786511
[ 674.582139] nr_pcache_fill_from_victim: 2
```

6th run. Looks like the above fit_nowait can give a ~400ns improvement. But how come? I did not even change the pcache handling to use ibapi_nowait!!! Maybe random variation. Let me run more.

```c
Without fit_nowait patches:
[ 428.546738] Kernel Profile Points
[ 428.550424] status name total nr avg.ns
[ 428.560119] ------- -------------------- ---------------- ---------------- ----------------
[ 428.569815] off flush_tlb_others 0.000131140 53 2475
[ 428.579510] off __do_kmalloc_node 1.758704197 1331927 1321
[ 428.589205] off pcache_miss 6.807601189 786575 8655
[ 428.598899] off pcache_flush 3.699044847 262227 14107
[ 428.608594] ------- -------------------- ---------------- ---------------- ----------------
[ 428.618289]
[ 428.718670] nr_pgfault: 786515
[ 428.721878] nr_clflush: 262227
[ 428.725272] nr_pgfault_wp: 0
[ 428.728470] nr_pgfault_wp_cow: 0
[ 428.732058] nr_pgfault_wp_reuse: 0
[ 428.735840] nr_pgfault_due_to_concurrent_eviction: 0
[ 428.741365] nr_pcache_fill_from_memory: 786515
[ 428.746310] nr_pcache_fill_from_victim: 0
```

7th run. without fit_nowait.

```c
[ 901.223090] Kernel Profile Points
[ 901.226775] status name total nr avg.ns
[ 901.236472] ------- -------------------- ---------------- ---------------- ----------------
[ 901.246168] off flush_tlb_others 0.000130802 53 2468
[ 901.255865] off __do_kmalloc_node 1.862575608 1331923 1399
[ 901.265560] off pcache_miss 6.814540477 786572 8664
[ 901.275257] off pcache_flush 3.699187003 262224 14107
[ 901.284953] ------- -------------------- ---------------- ---------------- ----------------
```

8th run. without fit_nowait.

```c
[ 321.514564] Kernel Profile Points
[ 321.518250] status name total nr avg.ns
[ 321.527945] ------- -------------------- ---------------- ---------------- ----------------
[ 321.537639] off flush_tlb_others 0.000130934 53 2471
[ 321.547335] off __do_kmalloc_node 2.216772665 1331939 1665
[ 321.557031] off pcache_miss 6.806060415 786573 8653
[ 321.566726] off pcache_flush 3.725455841 262231 14207
[ 321.576421] ------- -------------------- ---------------- ---------------- ----------------
```

9th run. with fit_nowait.

```c
[ 374.847912] Kernel Profile Points
[ 374.851597] status name total nr avg.ns
[ 374.861293] ------- -------------------- ---------------- ---------------- ----------------
[ 374.870989] off flush_tlb_others 0.000130858 53 2470
[ 374.880684] off __do_kmalloc_node 1.485304454 1331934 1116
[ 374.890381] off pcache_miss 6.615317677 786582 8411
[ 374.900076] off pcache_flush 3.508328900 262234 13379
[ 374.909772] ------- -------------------- ---------------- ---------------- ----------------
```

10th run, with fit_nowait.

```c
[ 225.211058] Kernel Profile Points
[ 225.214743] status name total nr avg.ns
[ 225.224440] ------- -------------------- ---------------- ---------------- ----------------
[ 225.234137] off flush_tlb_others 0.000131029 53 2473
[ 225.243833] off __do_kmalloc_node 1.211421872 1331984 910
[ 225.253529] off pcache_miss 6.583096125 786574 8370
[ 225.263226] off pcache_flush 3.464430818 262227 13212
[ 225.272922] ------- -------------------- ---------------- ---------------- ----------------
```

Sum:

```
with fit_nowait:

[ 225.253529] off pcache_miss 6.583096125 786574 8370
[ 225.263226] off pcache_flush 3.464430818 262227 13212

[ 374.890381] off pcache_miss 6.615317677 786582 8411
[ 374.900076] off pcache_flush 3.508328900 262234 13379

[ 674.425062] off pcache_miss 6.467938547 786571 8223
[ 674.434758] off pcache_flush 3.342783614 262225 12748

Without fit_nowait:

[ 428.589205] off pcache_miss 6.807601189 786575 8655
[ 428.598899] off pcache_flush 3.699044847 262227 14107

[ 901.265560] off pcache_miss 6.814540477 786572 8664
[ 901.275257] off pcache_flush 3.699187003 262224 14107

[ 321.557031] off pcache_miss 6.806060415 786573 8653
[ 321.566726] off pcache_flush 3.725455841 262231 14207
```

04/07 Sat

Well, now we finished all the profiling stuff. Continue on other work.

Now I like listening to jazz while coding. Amazing jazz, really good.

Once again, the ib_mad_completion_handler bug happens, during the application run or even after the application exits.

```c
[ 465.835447] nr_mremap_pset_diff: 0
[ 477.086886] ib_mad_completion_handler 2344 got successful recv cq op 128 mad_got_one 21
[ 477.095620] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 477.104345] IP: [] ib_mad_completion_handler+0xc7/0x810

ib_mad_completion_handler+0xc7/0x808:
ib_mad_recv_done_handler at drivers/infiniband/core/mad.c:1899
(inlined by) ib_mad_completion_handler at drivers/infiniband/core/mad.c:2345
```

After removing net from pcache miss:

```c
[ 465.572131] Kernel Profile Points
[ 465.575815] status name total nr avg.ns
[ 465.585510] ------- -------------------- ---------------- ---------------- ----------------
[ 465.595206] off flush_tlb_others 0.000000000 0 0
[ 465.604901] off __do_kmalloc_node 0.656371295 1762220 373
[ 465.614597] off pcache_miss 7.172572671 786596 9119
[ 465.624291] off pcache_flush 3.698294960 262251 14103
[ 465.633987] ------- -------------------- ---------------- ---------------- ----------------
```

After removing net from pcache flush:

```c
[ 684.984000] Kernel Profile Points
[ 684.987683] status name total nr avg.ns
[ 684.997379] ------- -------------------- ---------------- ---------------- ----------------
[ 685.007074] off flush_tlb_others 0.000000000 0 0
[ 685.016770] off __do_kmalloc_node 0.627372836 1500543 419
[ 685.026464] off pcache_miss 7.128702028 786596 9063
[ 685.036159] off pcache_flush 3.660772506 262251 13960
[ 685.045855] ------- -------------------- ---------------- ---------------- ----------------
```

malloc, miss, and flush are too slow. Especially the flush: how can it take 13.9us?

It must be our handlers! lego_copy_to_user stuff.

04/06 Fri

Well. Now we have in-kernel strace, in-kernel readprofile. Yummy.

04/05 Thur

Discussion with Yilun:

  1. munmap + nr_pgfault figure: count the number of pgfaults between munmaps; it should be an interesting figure.
  2. Track the number of pgfaults: since there is no eviction, any mmaped area at M should have exactly one pcache fetch.
  3. I probably want to use per-cpu counters (see the sketch below).
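Item 3 is straightforward with Linux-style per-cpu primitives. A minimal sketch, with the counter name chosen only for illustration:

```c
/*
 * Per-cpu counter sketch: each CPU bumps its own slot with no shared
 * cacheline, and the values are only folded together when printing.
 */
static DEFINE_PER_CPU(unsigned long, nr_pgfault_between_munmap);

static inline void count_pgfault(void)
{
	this_cpu_inc(nr_pgfault_between_munmap);
}

static unsigned long sum_pgfault(void)
{
	unsigned long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += per_cpu(nr_pgfault_between_munmap, cpu);
	return sum;
}
```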

Anyway, continue strace work first. Finished.

04/04 Wed

STRACE Performance

TF has very bad performance. It is either due to the syscall or pcache. Now I’m adding facilities to track syscall activities, including average latency, total time.

Basic utilities of strace are done. But I somehow need to change the design of multithreaded strace. Previously, I naively made the thread group keep some info and let all other threads use that info for bookkeeping.

But this is really hard and not accurate. We first need to make sure we are running on a non-preemptable kernel, so the per-cpu time tracking will be accurate. Besides, we also need to make sure threads do not migrate because of syscalls such as sched_setaffinity.

Oh well, so I thought I have to use per-thread strace_info. The first design I thought of: accumulate the counters of one thread into its thread group leader when it exits. But this is slightly complex, and it will affect the thread group leader’s runtime.

So the second solution I came up with is to let all threads within a process chain their strace_info together. A normal thread does not need to accumulate the counters; it can just exit. When the thread group leader exits, it walks through the chain to accumulate the counters. This is simpler. Besides, the strace_info of a dead thread is safe; no one will touch it.
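A minimal sketch of that chained design (the struct fields, the task_struct members, and the hook points are illustrative, not the real LegoOS layout):

```c
/*
 * Illustrative sketch of the chained per-thread design.  Each thread owns
 * its strace_info; it is linked onto the group leader at clone time and is
 * only ever walked by the exiting leader, so the hot per-syscall path needs
 * no locking at all.
 */
struct strace_info {
	unsigned long		calls[NR_syscalls];
	u64			ns[NR_syscalls];	/* accumulated latency */
	struct list_head	node;			/* chained on the leader */
};

/* At clone(): give the new thread its own info and chain it on the leader. */
static int strace_attach(struct task_struct *leader, struct task_struct *t)
{
	t->strace = kzalloc(sizeof(*t->strace), GFP_KERNEL);
	if (!t->strace)
		return -ENOMEM;
	list_add_tail(&t->strace->node, &leader->strace_children);
	return 0;
}

/* When the group leader exits, fold every thread's counters into its own. */
static void strace_accumulate(struct task_struct *leader)
{
	struct strace_info *si;
	int nr;

	list_for_each_entry(si, &leader->strace_children, node) {
		for (nr = 0; nr < NR_syscalls; nr++) {
			leader->strace->calls[nr] += si->calls[nr];
			leader->strace->ns[nr]    += si->ns[nr];
		}
	}
}
```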

Yeah! Let us do this tomorrow. We will have a robust kernel-version strace.

SM Heartbeat

Continue run some experiments on yesterday’s case.

One thing we are sure of: SM keeps sending requests to the HCA. And it looks like it does not send at a very deterministic interval: ```c [ 1224.034898] ib_mad_completion_handler 2344 got successful recv cq op 128 mad_got_one 15 [ 1224.130616] ib_mad_completion_handler 2338 got successful send cq op 0 mad_got_one 15 [ 1224.222189] ib_mad_completion_handler 2344 got successful recv cq op 128 mad_got_one 16 [ 1224.417181] ib_mad_completion_handler 2338 got successful send cq op 0 mad_got_one 16

[ 1393.159845] ib_mad_completion_handler 2344 got successful recv cq op 128 mad_got_one 17 [ 1393.255546] ib_mad_completion_handler 2338 got successful send cq op 0 mad_got_one 17 [ 1393.347132] ib_mad_completion_handler 2344 got successful recv cq op 128 mad_got_one 18 [ 1393.538972] ib_mad_completion_handler 2338 got successful send cq op 0 mad_got_one 18

[ 1449.437542] ib_mad_completion_handler 2344 got successful recv cq op 128 mad_got_one 19 [ 1449.533248] ib_mad_completion_handler 2338 got successful send cq op 0 mad_got_one 19 [ 1449.624833] ib_mad_completion_handler 2344 got successful recv cq op 128 mad_got_one 20 [ 1449.722512] ib_mad_completion_handler 2338 got successful send cq op 0 mad_got_one 20

[ 4322.423624] ib_mad_completion_handler 2344 got successful recv cq op 128 mad_got_one 21 [ 4322.519328] ib_mad_completion_handler 2338 got successful send cq op 0 mad_got_one 21 [ 4322.610914] ib_mad_completion_handler 2344 got successful recv cq op 128 mad_got_one 22 [ 4322.708594] ib_mad_completion_handler 2338 got successful send cq op 0 mad_got_one 22 [ 4350.750574] ib_mad_completion_handler 2344 got successful recv cq op 128 mad_got_one 23 [ 4350.846278] ib_mad_completion_handler 2338 got successful send cq op 0 mad_got_one 23 [ 4350.937863] ib_mad_completion_handler 2344 got successful recv cq op 128 mad_got_one 24 [ 4351.035543] ib_mad_completion_handler 2338 got successful send cq op 0 mad_got_one 24

[ 4519.690559] ib_mad_completion_handler 2344 got successful recv cq op 128 mad_got_one 25 [ 4519.786262] ib_mad_completion_handler 2338 got successful send cq op 0 mad_got_one 25 [ 4519.877848] ib_mad_completion_handler 2344 got successful recv cq op 128 mad_got_one 26 [ 4519.975527] ib_mad_completion_handler 2338 got successful send cq op 0 mad_got_one 26

[ 4576.396279] ib_mad_completion_handler 2344 got successful recv cq op 128 mad_got_one 27 [ 4576.491979] ib_mad_completion_handler 2338 got successful send cq op 0 mad_got_one 27 [ 4576.583565] ib_mad_completion_handler 2344 got successful recv cq op 128 mad_got_one 28 [ 4576.681245] ib_mad_completion_handler 2338 got successful send cq op 0 mad_got_one 28

[ 4942.886820] ib_mad_completion_handler 2344 got successful recv cq op 128 mad_got_one 29 [ 4942.982523] ib_mad_completion_handler 2338 got successful send cq op 0 mad_got_one 29 [ 4943.074108] ib_mad_completion_handler 2344 got successful recv cq op 128 mad_got_one 30 [ 4943.171789] ib_mad_completion_handler 2338 got successful send cq op 0 mad_got_one 30 ```

04/03 Tue

BUG BUG BUG

Finished basic replication mechanism last night.

Today merged several patches. And both Yilun and I think there is something wrong with ib_mad_completion_handler. It seems it will break things behind our back.

This is one bug caught today:

ib_mad_completion_handler

```c
At very early stage:

[ 1174.406177] newpid: 20 home:1 replica: 1
[ 1174.452983] p2m_fork(cpu10): I cur:20-exe.o new:21
[ 1177.462795] ib_mad_completion_handler 2324 got successful recv cq op 128 mad_got_one 22
[ 1177.556502] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 1177.650101] IP: [] ib_mad_completion_handler+0xb4/0x8a0

./scripts/faddr2line vmImage ib_mad_completion_handler+0xb4
ib_mad_completion_handler+0xb4/0x899:
ib_mad_recv_done_handler at drivers/infiniband/core/mad.c:1899
(inlined by) ib_mad_completion_handler at drivers/infiniband/core/mad.c:2325

ib_mad_recv_done_handler(): 1899: qp_info = mad_list->mad_queue->qp_info;
```

A scarier one after I changed ib_mad_completion_handler. Note that recvpollcq is the only thread running on cpu4:

```c
[ 863.887705] p2m_fork(cpu10): I cur:20-exe.o new:21
[ 868.478424] p2m_fork(cpu10): O succeed cur:20-exe.o new:21
[ 868.541991] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 868.635569] IP: [] __schedule+0x94/0x1e0
[ 868.701090] PGD 0
[ 868.725010] general protection fault: 0000 [#1] SMP PROCESSOR
[ 868.793651] CPU: 4 PID: 17 Comm: recvpollcq 4.0.0-lego-ys+ #737

Source: clear_tsk_need_resched(prev);
```

Even this one for Phoenix:

```c
[ 763.442043] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[ 763.535636] IP: [<ffffffff81018d6f>] task_curr+0xf/0x30
[ 763.598035] PGD 103e956067 PUD 103e964067 PMD 0
[ 763.653154] Oops: 0000 [#1] SMP PROCESSOR
[ 763.700992] CPU: 12 PID: 21 Comm: word_count-pthr 4.0.0-lego-ys+ #740
[ 763.777950] RIP: 0010:[<ffffffff81018d6f>] [<ffffffff81018d6f>] task_curr+0xf/0x30
```

This NEVER happened before, and this part of the code should be correct; we’ve run a lot of things. I suspect the recent IB merge corrupted things.

fit_poll_cq

Another one: ```c [ 690.401626] stat: /root/ys/phoenix/phoenix-2.0/tests/word_count/word_count_datafiles/word_1GB.txt [ 690.507742] SYSC_close() CPU12 PID:21 [fd: 4] -> [/sys/devices/system/cpu/online] [ 713.899884] ib_mad_completion_handler 2337 got successful recv cq op 128 mad_got_one 21 [ 713.995606] ib_mad_completion_handler 2331 got successful send cq op 0 mad_got_one 21 [ 714.087185] ib_mad_completion_handler 2337 got successful recv cq op 128 mad_got_one 22 [ 714.184871] ib_mad_completion_handler 2331 got successful send cq op 0 mad_got_one 22 [ 742.078102] ib_mad_completion_handler 2337 got successful recv cq op 128 mad_got_one 23 [ 742.173810] ib_mad_completion_handler 2331 got successful send cq op 0 mad_got_one 23 [ 742.265399] ib_mad_completion_handler 2337 got successful recv cq op 128 mad_got_one 24 [ 742.363085] ib_mad_completion_handler 2331 got successful send cq op 0 mad_got_one 24 [ 847.063372] mlx4_ib_handle_error_cqe syndrome 21 [ 847.116511] mlx4_ib_handle_error_cqe syndrome 5 [ 847.170590] send request failed at connection 7 as 12 [ 847.230909] mlx4_ib_handle_error_cqe syndrome 5 [ 847.284988] mlx4_ib_handle_error_cqe syndrome 5 [ 847.339067] mlx4_ib_handle_error_cqe syndrome 5 [ 847.393146] fit_poll_cq: failed status (5) for wr_id 1832 [ 847.457624] fit_poll_cq: failed status (5) for wr_id 1833 [ 847.522103] fit_poll_cq: connection 7 Recv weird event as -1 [ 847.589701] fit_poll_cq: failed status (5) for wr_id 1834 [ 847.654179] fit_poll_cq: connection 7 Recv weird event as -30704 [ 847.725938] fit_poll_cq: failed status (5) for wr_id 1835 [ 847.790416] fit_poll_cq: connection 7 Recv weird event as -30704 [ 847.862174] mlx4_ib_handle_error_cqe syndrome 5 [ 847.916252] mlx4_ib_handle_error_cqe syndrome 5 [ 847.970331] mlx4_ib_handle_error_cqe syndrome 5 [ 848.024410] mlx4_ib_handle_error_cqe syndrome 5 [ 848.078490] fit_poll_cq: failed status (5) for wr_id 1836 [ 848.142967] fit_poll_cq: failed status (5) for wr_id 1837 [ 848.207446] fit_poll_cq: connection 7 Recv weird event as -1 [ 848.275044] fit_poll_cq: failed status (5) for wr_id 1838 [ 848.339523] fit_poll_cq: connection 7 Recv weird event as -30704 [ 848.411281] fit_poll_cq: failed status (5) for wr_id 1839 [ 848.475760] fit_poll_cq: connection 7 Recv weird event as -30704 [ 848.547517] mlx4_ib_handle_error_cqe syndrome 5 [ 848.601596] mlx4_ib_handle_error_cqe syndrome 5 [ 848.655675] mlx4_ib_handle_error_cqe syndrome 5 [ 848.709753] mlx4_ib_handle_error_cqe syndrome 5 [ 848.763832] fit_poll_cq: failed status (5) for wr_id 1840

[ 848.828313] BUG: unable to handle kernel NULL pointer dereference at (null) [ 848.921908] IP: [] fit_poll_cq+0x4ad/0x510 [ 848.989507] PGD 0 [ 849.013426] Oops: 0002 [#1] SMP PROCESSOR [ 849.061265] CPU: 4 PID: 17 Comm: recvpollcq 4.0.0-lego-ys+ #744 [ 849.131983] RIP: 0010:[] [] fit_poll_cq+0x4ad/0x510 [ 849.228700] RSP: 0000:ffff88103e813d88 EFLAGS: 00010246 [ 849.292139] RAX: 0000000000001008 RBX: ffff88103effbad0 RCX: 0000000000000000 [ 849.377418] RDX: 0000000000000000 RSI: ffffffff811d46e0 RDI: ffffffff811dbc08 [ 849.462695] RBP: ffff88103e813ea8 R08: 0000000000000000 R09: 0000000000000000 [ 849.547973] R10: 0000000000000002 R11: 0000000000000004 R12: 0000000000000000 [ 849.633251] R13: ffff88103e801008 R14: 0000000000000004 R15: ffff88103e813da0 [ 849.718529] FS: 0000000000000000(0000) GS:ffff88107fc40000(0000) knlGS:0000000000000000 [ 849.815246] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 849.883884] CR2: 0000000000000000 CR3: 000000000113d000 CR4: 00000000000406a0 [ 849.969163] Stack: [ 849.993082] ffffffff81003299 000001b03e813da0 0000000000000004 0000000000000730 [ 850.080440] 0000008100000005 00001008000000f9 ffff88103eff8c50 002c222040000000 [ 850.167798] 0010004000000002 ffff88107fc20000 0000000000000731 ffffffff00000005 [ 850.255156] ffff8810000000f9 ffff88103eff8c50 0000000000000000 ffff88103e813e38 [ 850.342513] ffffffff81019854 0000000000000732 ffff881000000005 ffff8810000000f9 [ 850.429871] Call Trace: [ 850.458992] [ 850.481870] [] ? native_smp_send_reschedule+0x39/0x50 [ 850.560909] [] ? try_to_wake_up+0xe4/0x1f0 [ 850.628506] [] ? __schedule+0xf8/0x1e0 [ 850.691945] [] ? fit_poll_cq+0x510/0x510 [ 850.757464] [] fit_poll_cq_pass+0x14/0x30 [ 850.824021] [] kthread+0xf6/0x120 [ 850.882260] [] ? __kthread_parkme+0x70/0x70 [ 850.950898] [] ret_from_fork+0x22/0x30

/* handle normal reply */ … memcpy((void *)ctx->reply_ready_indicators[reply_indicator_index], &length, sizeof(int)); … (This is a bad memcpy: reply_indicator_index, ctx, etc should be checked.) ```
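The checks that comment asks for could look roughly like this. A hedged sketch: struct lego_context, MAX_REPLY_INDICATORS and the field layout are assumptions modeled on the log, not the real fit_internal.c definitions.

```c
/*
 * Sketch of the missing validation around the memcpy in fit_poll_cq().
 * All type/field names here are assumptions based on the log above.
 */
static bool fill_reply_indicator(struct lego_context *ctx,
				 const struct ib_wc *wc,
				 int reply_indicator_index, int length)
{
	void *slot;

	/* Error CQEs (e.g. status 5, IB_WC_WR_FLUSH_ERR) carry stale wr_ids. */
	if (wc->status != IB_WC_SUCCESS)
		return false;

	if (!ctx || reply_indicator_index < 0 ||
	    reply_indicator_index >= MAX_REPLY_INDICATORS)
		return false;

	slot = (void *)ctx->reply_ready_indicators[reply_indicator_index];
	if (!slot)
		return false;	/* exactly the NULL seen in this Oops */

	memcpy(slot, &length, sizeof(int));
	return true;
}
```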

IB Spec: QP, CQE, WQE, SEND

The channel adapter detects the WQE posting and accesses the WQE. The channel adapter interprets the command, validates the WQE’s virtual addresses, translates it to physical addresses, and accesses the data. The outgoing message buffer is split into one or more packets. To each packet the channel adapter adds a transport header (sequence numbers, opcode, etc.). If the destination resides on a remote subnet the channel adapter adds a network header (source & destination GIDs). The channel adapter then adds the local route header and calculates both the variant and invariant checksums.

For a Send operation, the QP retrieves the address of the receive buffer from the next WQE on its receive queue, translates it to physical addresses, and accesses memory writing the data. If this is not the last packet of the message, the QP saves the current write location in its context and waits for the next packet at which time it continues writing the receive buffer until it receives a packet that indicates it is the last packet of the operation. It then updates the receive WQE, retires it, and sends an acknowledge message to the originator.

When the originator receives an acknowledgment, it creates a CQE on the CQ and retires the WQE from the send queue.

A QP can have multiple outstanding messages at any one time but the target always acknowledges in the order sent, thus WQEs are retired in the order that they are posted.
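Mapped onto the Linux verbs API (which the FIT layer sits on), that lifecycle looks roughly like the following. This is only an illustration assuming the QP, CQ and a DMA-mapped buffer already exist; it is not LegoOS code.

```c
/*
 * Minimal Linux-verbs illustration of the WQE/CQE lifecycle quoted above:
 * post one signaled SEND WQE, then busy-poll the CQ until its CQE retires.
 */
static int post_one_send_and_wait(struct ib_qp *qp, struct ib_cq *cq,
				  u64 dma_addr, u32 len, u32 lkey, u64 cookie)
{
	struct ib_sge sge = {
		.addr	= dma_addr,
		.length	= len,
		.lkey	= lkey,
	};
	struct ib_send_wr wr = {
		.wr_id	    = cookie,		/* comes back in the CQE */
		.sg_list    = &sge,
		.num_sge    = 1,
		.opcode	    = IB_WR_SEND,
		.send_flags = IB_SEND_SIGNALED,
	}, *bad_wr;
	struct ib_wc wc;
	int ret;

	ret = ib_post_send(qp, &wr, &bad_wr);	/* hand the WQE to the HCA */
	if (ret)
		return ret;

	/* The CQE only shows up once the remote QP has acknowledged the send. */
	while (ib_poll_cq(cq, 1, &wc) == 0)
		cpu_relax();

	return (wc.status == IB_WC_SUCCESS && wc.wr_id == cookie) ? 0 : -EIO;
}
```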

04/02 Mon

Patching storage replica handler, able to finish today.

04/01 Sun

Anyway, summary of the day: replication at M is almost done; only the flush part is left. Storage also needs a handler. But we still need code to recover.

I’m tired. :-( A month to go.

Record a IB error. Using wuklab12 (P) and wuklab14(M+RAMFS), running usr/pcache_conflic.o: ```c P [30801.296160] ibapi_send_reply() CPU:8 PID:19 timeout (30010 ms), caller: clflush_one+0x1c9/0x370 [30938.564843] mlx4_ib_handle_error_cqe syndrome 21 [30938.617988] mlx4_ib_handle_error_cqe syndrome 5 [30938.672068] send request failed at connection 6 as 12 [30938.732389] mlx4_ib_handle_error_cqe syndrome 5 [30938.786470] mlx4_ib_handle_error_cqe syndrome 5 [30938.840551] mlx4_ib_handle_error_cqe syndrome 5 [30938.894632] fit_poll_cq: failed status (5) for wr_id 1584 [30938.959112] fit_poll_cq: failed status (5) for wr_id 1585 [30939.023593] fit_poll_cq: connection 6 Recv weird event as -1 [30939.091194] fit_poll_cq: failed status (5) for wr_id 1586 [30939.155676] fit_poll_cq: connection 6 Recv weird event as -30704 [30939.227436] fit_poll_cq: failed status (5) for wr_id 1587 [30939.291917] fit_poll_cq: connection 6 Recv weird event as -30704 [30939.363678] mlx4_ib_handle_error_cqe syndrome 5 [30939.417759] mlx4_ib_handle_error_cqe syndrome 5 [30939.471839] mlx4_ib_handle_error_cqe syndrome 5 [30939.525921] mlx4_ib_handle_error_cqe syndrome 5 [30939.580002] fit_poll_cq: failed status (5) for wr_id 1588 [30939.644483] BUG: unable to handle kernel NULL pointer dereference at (null) [30939.738083] IP: [] fit_poll_cq+0x4ad/0x510 [30939.805684] PGD 0 [30939.829604] Oops: 0002 [#1] SMP PROCESSOR [30939.877445] CPU: 4 PID: 17 Comm: recvpollcq 4.0.0-lego-ys+ #715 [30939.948166] RIP: 0010:[] [] fit_poll_cq+0x4ad/0x510

fit_poll_cq at net/lego/fit_internal.c:1734 memcpy((void *)ctx->reply_ready_indicators[reply_indicator_index], &length, sizeof(int));

M [30913.642698] mlx4_ib_handle_error_cqe syndrome 21 [30913.695839] mlx4_ib_handle_error_cqe syndrome 5 [30913.749919] send request failed at connection 1 as 12 [30913.810236] mlx4_ib_handle_error_cqe syndrome 5 [30913.864315] mlx4_ib_handle_error_cqe syndrome 5 [30913.918395] mlx4_ib_handle_error_cqe syndrome 5 [30913.972474] fit_poll_cq: failed status (5) for wr_id 305 [30914.035912] fit_poll_cq: failed status (5) for wr_id 306 ```


Last update: May 5, 2018
