1449277143 Q * fstd Remote host closed the connection
1449277334 J * fstd ~fstd@xdsl-84-44-227-80.netcologne.de
1449279746 M * Bertl_oO off to bed now ... have a good one everyone!
1449279749 N * Bertl_oO Bertl_zZ
1449283162 Q * fstd Remote host closed the connection
1449283458 J * fstd ~fstd@xdsl-84-44-227-80.netcologne.de
1449283554 Q * fstd Remote host closed the connection
1449283662 J * fstd ~fstd@xdsl-84-44-227-80.netcologne.de
1449283963 Q * fstd Remote host closed the connection
1449284006 J * fstd ~fstd@xdsl-84-44-227-80.netcologne.de
1449284238 Q * fstd Remote host closed the connection
1449284251 J * derjohn_mobi ~aj@x4db1b06e.dyn.telefonica.de
1449284310 J * fstd ~fstd@xdsl-84-44-227-80.netcologne.de
1449284488 Q * fstd Read error: Connection reset by peer
1449284691 Q * derjohn_mob Ping timeout: 480 seconds
1449284692 J * fstd ~fstd@xdsl-84-44-227-80.netcologne.de
1449284840 Q * fstd Remote host closed the connection
1449284924 J * fstd ~fstd@xdsl-84-44-227-80.netcologne.de
1449285350 Q * fstd Remote host closed the connection
1449285396 J * fstd ~fstd@xdsl-84-44-227-80.netcologne.de
1449285613 J * fstd_ ~fstd@xdsl-84-44-227-80.netcologne.de
1449285613 Q * fstd Read error: Connection reset by peer
1449285629 N * fstd_ fstd
1449287145 Q * fstd Read error: Connection reset by peer
1449287314 J * fstd ~fstd@xdsl-84-44-227-80.netcologne.de
1449288029 Q * fstd Remote host closed the connection
1449288251 J * fstd ~fstd@xdsl-84-44-227-80.netcologne.de
1449288617 Q * fstd Read error: Connection reset by peer
1449288779 J * fstd ~fstd@xdsl-84-44-227-80.netcologne.de
1449300550 J * Ghislain ~aqueos@adsl1.aqueos.com
1449307394 N * Bertl_zZ Bertl
1449307396 M * Bertl morning folks!
1449308974 M * Guy- Bertl: I just managed to hang start-stop-daemon in a vserver guest, leading to a hung task panic
1449308980 M * Guy- Bertl: http://sprunge.us/IQAe
1449308989 M * Guy- this is with 4.1.13+vs2.3.8.3
1449309151 M * Bertl I'm not so sure those are valid issues, the "soft lockup" has been turned off here for quite a while, because for example it triggers happily on "normal" NFS operation and similar
1449309200 M * Bertl IMHO this is either a problem with the soft lockup code or more likely a scheduler issue, i.e. certain cases get scheduled so badly that they trigger the watchdog
1449309202 M * Guy- this was quite certainly not i/o related
1449309234 M * Bertl if you don't want to disable it then at least raise the timeout to something like 5 minutes or so, where you can be somewhat sure that it is a stuck task
1449309270 M * Guy- OK, yes, I already have hung_task_timeout at 7200
1449309281 M * Guy- but this was a "soft lockup", which is something different
1449309315 M * Guy- I don't know how to adjust that timeout
1449309596 M * Guy- hmm, there is some watchdog_thresh
1449309671 M * Bertl http://unix.stackexchange.com/questions/70377/bug-soft-lockup-cpu-stuck-for-x-seconds
1449309676 M * Bertl just as an example :)
1449309676 M * Guy- can't be increased over 60 though
1449309763 M * Guy- I'll see if I can reproduce it
1449309774 M * Guy- it happened under very specific circumstances
1449309801 M * Bertl okay
1449309804 M * Guy- I had nslcd (from libnss-ldapd) running on the host, and I had the directory with its socket bind mounted read-only in a guest
1449309821 M * Guy- and I was installing libnss-ldapd in the guest too, which tried to start nslcd
1449309835 M * Guy- which I assume tried to remove the socket from the read-only bind mount, or kill the running nslcd oslt
1449309862 M * Guy- (so it may have tried to send a signal to a PID that didn't exist in the guest)
1449309883 M * Guy- but now there is some network issue on the box concerned, so I have to fix that first before reproducing this
1449309887 M * Guy- (if it can be)
1449310439 M * Guy- no, couldn't reproduce
1449310726 M * Guy- oh yes I can
1449310751 M * Guy- just hit it again, the same way
1449310767 M * Guy- I'll try to strace to see what happens in userspace, and increase the threshold
1449310801 M * Guy- find_pid_ns() is involved, does that only exist if pid namespaces are enabled?
1449310830 M * Bertl it is probably a dummy without
1449310899 M * daniel_hozac wasn't Ghislain's issue also with signals between contexts?
1449311062 M * Ghislain yep
1449311170 M * Ghislain when I do a reboot -i -d -f it crashes the process at 10%, the only thing we traced was a last call to a signal, Bertl had the image of the guest to do some testing
1449311223 M * Ghislain if I remember it was a sig_cont
1449311263 M * Ghislain it was easily reproducible in my 4.1.x kernel, a simple vserver enter and reboot triggers it
1449311303 M * Ghislain don't know how to trace it more
1449311474 M * Bertl well, it was a blocked (reported by Linux-VServer debugging) signal which was targeted at the wrong context
1449311486 M * Bertl (or to be precise, a task outside the context)
1449312487 M * Guy- [pid 18341] kill(17297, SIG_0
1449312491 M * Guy- this is where it hangs
1449312504 M * Guy- pid 17297 is a host process
1449312524 M * Guy- and if I try to send it a signal from the guest, it hangs
1449312650 M * Guy- and yes, if I just try to kill any host process from inside a guest, the kill goes to 100% cpu and stays there
1449312669 M * Guy- this does look like the same issue Ghislain reported
1449312691 M * Guy- Bertl: is there anything useful I can look at?
1449312788 M * Bertl first you can enable the debugging to verify that it indeed is a blocked signal
1449312798 M * Bertl (check with Ghislain how to do that)
1449312825 M * Bertl then you can try to disable the Linux-VServer check (sending signals to processes outside the context)
1449312841 M * Bertl (again, Ghislain knows how to do that, I hope :)
1449312888 M * Bertl and finally, if that is the case, investigate with daniel_hozac why something in the start/stop procedure ends up getting signalled outside the context
1449313043 M * Guy- I know why it tries to send a signal outside the context
1449313065 M * Guy- it reads a pidfile that was generated outside the context, and sends a signal to the pid found there
1449313110 M * Guy- but this way a buggy guest can DoS the host by sending signals to arbitrary out-of-context PIDs
1449313190 M * Bertl if that is the case, try to change the return code in the check
1449313212 M * Guy- in which check?
1449313265 M * Guy- I can avoid hitting this particular problem by simply removing the pid file before it gets exposed to a guest
1449313281 M * Bertl +error = -ESRCH;
1449313291 M * Bertl in check_kill_permission()
1449313449 M * Bertl -EPERM would be the obvious return value, but it allows a guest process to figure out what pids actually exist
1449313520 M * Guy- why does ESRCH lead to a loop? it's what kill() returns for nonexistent pids
1449313543 M * Bertl yes, that's why we return it at the moment
1449313584 M * Bertl not sure where this comment came from in the first place, it might not be valid anymore
1449313650 M * Bertl fact is, that the userspace process doesn't give up sending the signal over and over again (at least in Ghislain's case)
1449313686 M * Bertl or at least that is how it seems
1449313688 M * Guy- in my case it's looping in kernel space
1449313725 M * Bertl where? do you have a stack trace which contains the signal action?
1449313765 M * Guy- only this:
1449313765 M * Guy- 2015-12-05T11:18:57+01:00 ? find_pid_ns+0x7a/0x7a
1449313766 M * Guy- 2015-12-05T11:18:57+01:00 ? kill_pid_info+0x38/0x89
1449313766 M * Guy- 2015-12-05T11:18:57+01:00 ? pid_nr_ns+0xe/0x3d
1449313766 M * Guy- 2015-12-05T11:18:57+01:00 SYSC_kill+0x84/0x1ba
1449313767 M * Bertl it should write the debug output (Linux-VServer) over and over
1449313776 M * Guy- I haven't enabled that yet
1449313816 M * Guy- but it's certainly not looping in userspace because if it were, it wouldn't trigger the softlockup watchdog
1449313825 M * Bertl also, you need to enable debug info to show the line numbers
1449313850 M * Guy- do you know which CONFIG option that would be?
1449313875 M * Bertl ha, there is the potential loop:
1449313900 M * Bertl kill_pid_info() has a for (;;) which is not broken by -ESRCH
1449313966 M * Bertl I'll prepare a patch to circumvent the ESRCH issue (but not right now, later tonight)
1449313983 M * Guy- I'll see if I can do it myself
1449313986 M * Bertl the basic idea if somebody wants to hack on it earlier is:
1449314017 M * Bertl return -ENOENT instead of -ESRCH, then convert that to -ESRCH in the userspace return path
1449314079 M * Bertl i.e. have check_kill_permission() return -ENOENT in case the pid is outside the context
1449314114 M * Bertl and do a return error == -ENOENT ? -ESRCH : error; whenever the value is returned to userspace
1449314163 M * Guy- yes, I understand
1449314171 M * Guy- I'm trying to find where the latter bit should be added
1449314202 M * Guy- but I can't seem to find where kill() actually is :)
1449314292 M * Bertl that would be sys_kill() :)
1449314309 M * Bertl but I would move that return up as high as possible
1449314333 M * Bertl i.e. just outside the scope of the kill_pid_info() loop
1449314348 M * Guy- SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
1449314354 M * Guy- this, probably
1449314461 M * Guy- OK, I think i made the needed change
1449315154 M * Guy- but it didn't help; it's still looping somewhere
1449315321 M * Guy- could be that kill_something_info() is still returning -ESRCH
1449315456 M * Guy- but looking at the code, it shouldn't
1449315482 M * Guy- well, I'm giving up for now; family wants me
1449316061 M * daniel_hozac Guy-: what was your testcase? just signalling something outside of the guest?
1449317280 M * daniel_hozac Bertl: can you ssh to linux-vserver.org?
1449317315 M * daniel_hozac i'm just getting a connection refused.
1449317414 M * daniel_hozac Guy-: Ghislain: Bertl: http://daniel.hozac.com/stuff/delta-signal-fix05.diff fixes it for me.
1449317702 M * Bertl daniel_hozac: please try again
1449317719 M * Bertl @patch: yep, that's another solution to the same issue
1449317780 M * Bertl but I'm not sure it is smart to check twice
1449317812 M * Bertl ah, sorry misread the patch
1449317817 M * Bertl (please ignore)
1449318790 M * AlexanderS Ghislain: If you want to play with pid namespaces, I have written a little tool to manage it like network namespaces: https://github.com/AlexanderS/pidns/
1449320025 J * BobR odie@IRC.13thfloor.at
1449320286 N * BobR BobR_afk
1449320343 Q * fstd Remote host closed the connection
1449320403 M * Guy- daniel_hozac: yes, signaling a host process from a guest
1449320423 M * Guy- will check the patch later, thanks!
1449320534 J * fstd ~fstd@xdsl-87-78-15-17.netcologne.de
1449320555 Q * fstd Remote host closed the connection
1449320846 J * fstd ~fstd@xdsl-87-78-143-127.netcologne.de
1449321090 N * BobR_afk BobR
1449321121 N * BobR BobR_oO
1449321616 Q * fstd Remote host closed the connection
1449321778 J * fstd ~fstd@xdsl-81-173-191-54.netcologne.de
1449321839 Q * fstd Remote host closed the connection
1449321850 J * fstd ~fstd@xdsl-87-78-142-239.netcologne.de
1449327770 Q * fstd Remote host closed the connection
1449327800 J * fstd ~fstd@xdsl-87-78-142-239.netcologne.de
1449328694 J * fstd_ ~fstd@xdsl-87-78-142-239.netcologne.de
1449328695 Q * fstd Read error: Connection reset by peer
1449328712 N * fstd_ fstd
1449328879 Q * jrklein Remote host closed the connection
1449328890 J * jrklein ~cloud@proxy.dnihost.net
1449329558 Q * fstd Remote host closed the connection
1449329643 J * fstd ~fstd@xdsl-87-78-142-239.netcologne.de
1449331490 Q * eyck Remote host closed the connection
1449342490 M * daniel_hozac Bertl: could we fix the DNS for linux-vserver.org? ns1.linux-vserver.at doesn't seem to respond.
1449344529 Q * Ghislain Quit: Leaving.
1449345454 J * eyck ~eyck@u28n61.nowanet.pl