1447808361 Q * fstd Remote host closed the connection
1447808373 J * fstd ~fstd@xdsl-81-173-188-54.netcologne.de
1447822303 M * Bertl_oO off to bed now ... have a good one everyone!
1447822307 N * Bertl_oO Bertl_zZ
1447822959 J * yang_ yang@irs.si
1447822959 Q * yang Remote host closed the connection
1447828467 J * thierryp ~thierry@2a01:e35:2e2b:e2c0:a190:7092:7a2b:1f65
1447829006 Q * thierryp Remote host closed the connection
1447832539 J * Ghislain ~aqueos@adsl1.aqueos.com
1447832797 J * thierryp ~thierry@zeta.inria.fr
1447833416 Q * Romster Ping timeout: 480 seconds
1447833691 M * Ghislain daniel_hozac: i am back, the process is the vserver xxx enter one
1447833741 M * Ghislain daniel_hozac: i think you have access to my test machine if you want to try yourself
1447833814 Q * derjohn_mob Ping timeout: 480 seconds
1447837064 J * Romster ~Romster@202.168.100.149.dynamic.rev.eftel.com
1447837452 Q * thierryp Remote host closed the connection
1447837493 J * thierryp ~thierry@zeta.inria.fr
1447837853 J * derjohn_mob ~aj@fw.gkh-setu.de
1447838622 N * yang_ yang
1447839257 M * Ghislain i wonder if signal filtering could be able to tell that the target process is in vcontext 0 AND is a parent so it relax some filtering (of course not the kill or such but the lighter ones)
1447839562 M * Ghislain in signal.c there is + /* FIXME: we shouldn't return ESRCH ever, to avoid loops, maybe ENOENT or EACCES? */
1447839562 M * Ghislain sound like a candidate ;)
1447839834 Q * thierryp Remote host closed the connection
1447841174 M * Ghislain and the comment is not scary in any shape or form ;p
1447841183 A * Ghislain running screaming in fear
1447841488 J * thierryp ~thierry@zeta.inria.fr
1447841524 M * Ghislain as the filter seems to be because of access restriction EACCES seems legit
1447841636 J * Gremble ~Gremble@cpc87151-aztw31-2-0-cust755.18-1.cable.virginm.net
1447843354 Q * thierryp Remote host closed the connection
1447844135 M * Ghislain i see other code use EPERM in the code
1447844656 M * Ghislain replacing by EACCES just make the error ctrl-c possible but the kernel message continue nonetheless (NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [reboot:2732]) here i start and stop a guest , the stop hanged in vwait
1447844677 M * Ghislain ctrl-c recovered the shell and the cpu but the stuck process is still here
1447846061 J * thierryp ~thierry@zeta.inria.fr
1447846147 Q * thierryp Read error: Connection reset by peer
1447846152 J * thierryp ~thierry@zeta.inria.fr
1447846343 P * thierryp
1447847367 Q * derjohn_mob Ping timeout: 480 seconds
1447847914 Q * ensc Ping timeout: 480 seconds
1447847968 J * derjohn_mob ~aj@fw.gkh-setu.de
1447848395 N * Bertl_zZ Bertl
1447848397 M * Bertl morning folks!
1447848606 M * Ghislain hello
1447848922 M * Ghislain in fact the vserver xx stop is now ctrl-c 'able' but the reboot commands is then the one in the guest context that is at 100% cpu
1447849025 M * Ghislain for me this all seems related to rcu , don't know what it means, let see if this appear in the patch
1447849102 M * Bertl you get a reboot with 100% cpu usage?
1447849157 M * Ghislain yes, i do vserver xx stop, the things hangs with
1447849157 M * Ghislain /usr/share/util-vserver/vserver.stop: line 100: 2523 Killed "${IONICE_CMD[@]}" "${NICE_CMD[@]}" "${NETNS_CMD[@]}" "${CHBIND_CMD[@]}" "$_VSPACE" --enter "$S_CONTEXT" "${OPTS_VSPACE[@]}" "${OPTS_VSPACE_SHARED[@]}" -- "$_VTAG" --migrate "${OPTS_VTAG_ENTER[@]}" --silent -- $_VCONTEXT $SILENT_OPT --migrate $OPT_VCONTEXT_CHROOT --xid "$S_CONTEXT" -- "${INITCMD_STOP[@]}"
1447849195 M * Ghislain but then i can ctrl-c it (version where i changed the to -EACCES ) but inside the guest the reboot process is at 100%
1447849223 M * Ghislain the 100% command is "reboot -d -f -i"
1447849328 M * Bertl for a first test I would disable the entire signal permission check
1447849339 M * Bertl i.e. change:
1447849349 M * Bertl if (!vx_check(vx_task_xid(t), VS_WATCH_P | VS_IDENT)) {
1447849357 M * Bertl to
1447849388 M * Bertl if (0 && !vx_check(vx_task_xid(t), VS_WATCH_P | VS_IDENT)) {
1447849419 M * Bertl this will allow the signal to pass unhindered
1447849469 M * Ghislain ok from all those the thing i was able to trace was looping in rcu_something for all the other i was unable to trace anything it just looped
1447851560 Q * fstd Remote host closed the connection
1447851571 J * fstd ~fstd@xdsl-87-78-182-91.netcologne.de
1447854766 M * Ghislain result:
1447854766 M * Ghislain 2540 root 20 0 176 4 0 R 100.0 0.0 0:31.34 vwait
1447854766 M * Ghislain 2753 root 20 0 4080 616 540 R 100.0 0.0 0:31.34 reboot
1447854804 M * Ghislain if i ctrl-c the sudo vserver stop, then only the reboot remains
1447855282 M * Ghislain is there anyway to inspect what those process are doing ?
1447855306 M * Bertl so even if we allow the signal to pass, the shutdown results in a 100% cpu task, yes?
1447855312 M * Ghislain yes
1447855338 M * Bertl which means the signal is not really to blame
1447855357 M * Bertl you can attach to a running process with strace -p
1447855375 M * Ghislain no i was unable to attach to it
1447855383 M * Bertl I suspect it is banging on the kernel to trigger a reboot
1447855411 M * Bertl which usually results in the util-vserver helper being activated and killing off whatever tasks are left
1447855455 M * Ghislain the strace is just doing process xxx attached then nothing
1447855468 M * Bertl there is a Linux-VServer debug entry in vs_reboot()
1447855468 M * Ghislain the console prints NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [reboot:2753]
1447855480 M * Bertl vxdprintk(VXD_CBIT(misc, 5)
1447855504 M * Bertl so if you set bit 5 of the misc debug entry, you should see if it is calling this
1447855561 M * Ghislain could you remind me how i do that ?
1447855656 M * Bertl sysctl -w vserver.debug_misc=$[1<<5]
1447855876 M * Ghislain ok, call trace SYSC_kill system_call_fastpath ..
1447855904 M * Ghislain the console is filled with trace about SYSC_kill
1447855965 M * Bertl we are more interested in 'vs_reboot(...)' messages
1447855989 M * Bertl there should be one when you stop the guest, but I suspect there will be more than one
1447856043 M * Bertl but it might as well be that the "reboot" in the guest just calls the kernel to reboot and then simply spins in a loop
1447856050 M * Ghislain every 22s when NMI watchdog trigger its output
1447856061 M * Ghislain recompiling with debug... this will take some time
1447856067 M * Bertl yeah, that's the soft lockup which is not helpful
1447856094 M * Bertl note that a proper guest using sysv init style should not call reboot anyway
1447856118 M * Bertl i.e. the problem will probably go away if you clean up your runlevel scripts
1447856207 M * Ghislain i will rebuild the server from scratch but anyway even if the config is not 100% the best the locking is sign of something that is not handled right in the kernel somewhere could be mainline, could be vserver and probably interaction between a change in mainline that encounter a bad guest config
1447856270 M * Ghislain the system should be able to recover from that and not a 100% cpu :) i am kind of idealist that think userland should not be able to lock the kernel
1447856295 M * Ghislain lock or put it in infinite loop
1447856311 M * Bertl well, if you run a while (true); loop, the process will consume 100% cpu
1447856342 M * Bertl nevertheless, it should be killable without problems (note that the kill should now work as signals are not blocked anymore)
1447856343 M * Ghislain yes but you do not need to powercycle the server to recover if you restrained the usage with cgroups
1447856352 M * Ghislain yep you got the point :)
1447856362 M * Ghislain oh
1447856368 M * Ghislain let me try the kill by hand
1447856377 M * Bertl if the reboot spins at 100% and is unkillable, then there might be a problem with the kernel itself
1447856379 M * Ghislain oh damn too late already rebooted
1447856394 M * Bertl (i.e. with mainline)
1447856590 M * Bertl off for now ... bbs
1447856592 M * Ghislain yes could perfectly be the case
1447856595 M * Ghislain ok ++
1447856597 N * Bertl Bertl_oO
1447857033 J * thierryp ~thierry@zeta.inria.fr
1447859745 Q * Gremble Quit: I Leave
1447862555 Q * thierryp Remote host closed the connection
1447867777 J * thierryp ~thierry@2a01:e35:2e2b:e2c0:4471:bd74:26b4:b08f
1447870983 Q * derjohn_mob Ping timeout: 480 seconds
1447871521 J * thierryp_ ~thierry@82.226.190.44
1447871807 Q * thierryp Ping timeout: 480 seconds
1447875542 J * sannes ~ace@2a02:fe0:c120:9660:4841:2d7d:140e:acce
1447883002 Q * sannes Quit: Leaving.
1447883372 J * _Shiva_ shiva@whatcha.looking.at
1447889193 Q * Ghislain Quit: Leaving.
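
Note on the debug setting discussed in this log: the command given was `sysctl -w vserver.debug_misc=$[1<<5]`, where `$[...]` is the older bash arithmetic form. A minimal sketch of the same bit arithmetic in portable `$((...))` syntax follows. The variable names `bit` and `mask` are illustrative only, and the script prints the command instead of running it, since it assumes nothing about whether the `vserver.debug_misc` sysctl exists on the machine at hand.

```shell
#!/bin/sh
# Bit number named in the log for the vs_reboot() debug message.
bit=5

# Shift 1 left by the bit number to get the mask with only that bit set.
mask=$((1 << bit))
echo "mask=$mask"

# Print the command rather than executing it: running it needs root
# and a Linux-VServer kernel that exposes this sysctl entry.
echo "sysctl -w vserver.debug_misc=$mask"
```

Writing this value sets only bit 5 and clears any other misc debug bits that were set before.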