1320538122 Q * petzsch1 Quit: Leaving.
1320541991 Q * hparker Quit: Quit
1320548904 M * Bertl off to bed now ... have a good one everyone!
1320548909 N * Bertl Bertl_zZ
1320552536 Q * Aiken Quit: Leaving
1320558669 Q * clopez Ping timeout: 480 seconds
1320567444 J * Aiken ~Aiken@2001:44b8:2168:1000:21f:d0ff:fed6:d63f
1320569395 J * fisted_ ~fisted@xdsl-87-78-219-120.netcologne.de
1320569553 Q * fisted Ping timeout: 480 seconds
1320571614 J * bonbons ~bonbons@2001:960:7ab:0:2d7c:a935:ac98:61e
1320572285 J * ghislain ~AQUEOS@adsl2.aqueos.com
1320572595 J * thierryp ~thierry@home.parmentelat.net
1320574721 J * Spl1nt dcec82df@ircip2.mibbit.com
1320574790 Q * Spl1nt Remote host closed the connection
1320575607 Q * thierryp Remote host closed the connection
1320578481 Q * Aiken Quit: Leaving
1320581325 J * hparker ~hparker@2001:470:1f0f:32c:beae:c5ff:fe01:b647
1320581458 N * Bertl_zZ Bertl
1320581462 M * Bertl morning folks!
1320588126 J * mike ~mike@79.133.200.204
1320588255 J * BenG ~bengreen@cpc12-aztw24-2-0-cust146.aztw.cable.virginmedia.com
1320588289 Q * BenG
1320590171 Q * transacid Remote host closed the connection
1320590192 J * transacid ~transacid@transacid.de
1320590222 N * transacid Guest15983
1320592455 J * biz ~biz@baze.de
1320594718 Q * FireEgl Ping timeout: 480 seconds
1320595337 J * FireEgl FireEgl@2001:470:e056:1:c407:36ae:8ef0:bf0e
1320595471 Q * FireEgl Remote host closed the connection
1320599673 Q * mike Quit: leaving
1320601275 M * arekm Bertl: please look at the mnt_is_reachable issue
1320601316 M * Bertl do we have a test case yet?
1320601361 M * arekm no. We have a setup where this happens, pointers on why this might be happening etc
1320601375 M * Bertl yeah, but the locking there looks fine to me
1320601413 M * Bertl on the 'setup where this happens' how easily can it be triggered?
1320601493 M * Bertl I'm asking because we could enable some lock debugging and see what falls out
1320601517 J * pluto ~pluto@89-79-206-92.dynamic.chello.pl
1320601533 M * arekm pluto: hi
1320601538 M * arekm Bertl: it happens on pluto machine
1320601550 M * Bertl hey pluto! how are you?
1320601611 M * pluto very well, i've a cold beer :)
1320601625 M * Bertl that's a good start :)
1320601659 M * Bertl now, please describe your setup and what happens and if you happen to know, when it happens
1320601679 M * Bertl (handwaving and speculations are fine, just let me know what are hard facts and what not)
1320601688 M * pluto basically the deadlock happens on few machines.
1320601709 M * Bertl okay, similar hardware? identical hardware?
1320601735 M * pluto it happens on dual-opteron-numa server, it happens also on single processor core-i3 (quad-core)
1320601745 M * pluto generally - smp
1320601764 M * Bertl okay, that suggests a synchronization/locking issue
1320601804 M * pluto all machines are tortured by auto-test farm (numerical tasks)
1320601831 M * Bertl so that's rather CPU intense, also I/O?
1320601833 M * arekm are all machines using automounter? (aka did it ever happen without automount ?)
1320601854 M * pluto all farm tests read input data from automounted nfs
1320601865 M * Bertl v3 or v4?
1320601868 M * pluto and store result via nfs
1320601895 M * pluto nfs v3
1320601936 M * Bertl okay, does the autofs trigger often? i.e. are the volumes mounted/unmounted periodically?
1320602092 M * pluto hard to say, there're mixed (short (few minutes) and long (few hours)) tests and the autofs has 60-sec unmount timeout.
1320602139 M * pluto afaics on 16-cores the nfs is mounted all the time.
1320602140 M * Bertl so chances are good that the volumes get unmounted during the tests, yes?
1320602159 M * pluto yes, there's a chance
1320602173 M * Bertl okay, so, how often does that 'deadlock' happen
1320602429 M * pluto on opterons it can deadlock in few minutes or few hours. on desktop core-i3 it always deadlocks during weekend tests.
1320602430 M * Bertl and what does 'deadlock' mean, i.e. is the machine dead, or do some parts of the kernel work, e.g. ping, what about the (serial) console? does magic sysrq work?
1320602498 M * Bertl how many cores do the opterons have in total?
1320602519 M * pluto deadlock means there's a flood on console/ipmi - https://lkml.org/lkml/2011/5/23/398
1320602528 M * pluto sysrq doesn't work, ping works
1320602537 M * pluto ssh doesn't work
1320602559 M * daniel_hozac did you enable sysrq before it hung?
1320602563 M * Bertl okay, so interrupts are working, scheduling not
1320602619 M * pluto server has two 8-core opterons.
1320602655 M * pluto daniel_hozac: afaics the sysrq is enabled by default in my distro.
1320602686 M * Bertl IIRC, we did search for the code sequence repeated over and over, do we have a decoded version, i.e. where this sequence is found in your kernel and what line the <8a> corresponds to?
1320602729 M * Bertl pluto: could you test the magic sysrq right now on one of those machines with sysrq-h (help, which is harmless)
1320602760 M * Bertl just to rule out that it might not work for other reasons
1320602889 M * Bertl maybe you could also upload the output of 'dmidecode' for one of each machine type (feel free to anonymize serial numbers and similar)?
1320602896 M * pluto echo h >/proc/sysrq-trigger emits an echo in dmesg.
1320602915 M * Bertl yeah, but you cannot do that when it deadlocks, yes?
1320602928 M * Bertl so you need to send it via console/ipmi/whatever
1320603105 M * pluto yes, it doesn't work after lock. the sysrq and numlock clicking on real console doesn't work after lock, so i assume that kernel locks in weird state.
1320603178 Q * trippeh Remote host closed the connection
1320603275 M * pluto s/yes/no/ - i've forgotten english semantics :)
1320603338 M * daniel_hozac sysrq-trigger works even when sysrq is disabled, IIRC.
1320603359 M * daniel_hozac so an actual sysrq from your console while it's running would be more interesting.
1320603388 M * Bertl shouldn't do any harm as I said, just print the help message as you already tested via /proc
1320603512 M * arekm pluto: tell about commenting out that mnt_is_reachable test
1320603615 M * Bertl I remember that you reported that commenting that out improved things
1320603622 M * pluto so, commenting out mnt_is_reachable makes the kernel so stable as the 2.6.37 was.
1320603667 M * Bertl but if you like to investigate more in this direction, I'd suggest to hammer on /proc/mounts inside the guest for example
1320603690 M * Bertl i.e. start a loop which just does 'cat /proc/mounts'
1320603703 M * Bertl preferably in two or more guests at the same time
1320603726 M * pluto it's funny but i don't have any vserver guests on any machine.
1320603743 M * pluto i have only vserver patchset on distribution kernel.
1320603753 M * Bertl hum, okay, sec, let me check something
1320604015 M * Bertl okay, so then let's hammer on /proc/mounts from separate threads
1320604182 M * Bertl (on the host)
1320604431 J * trippeh ~atomt@cm-84.209.22.46.getinternet.no
1320604545 M * pluto i'll be right back in few minutes, going for next beer :)
1320604838 M * Bertl np
1320605883 M * pluto ok, i've got it :)
1320605929 M * Bertl the beer? the sysrq-trigger? the dmidecode? :)
1320606048 M * pluto beer :) and dmidecode http://pastebin.com/HkZhh387
1320606208 J * Hollow ~Hollow@91-66-255-107-dynip.superkabel.de
1320606237 M * Hollow ?who
1320606239 M * Hollow damn
1320606241 M * Hollow hi uys :)
1320606259 M * Bertl hey Hollow! LTNS!
1320606278 M * Hollow at least in IRC, yes :)
1320606312 M * pluto Bertl: i wonder if http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=5a30d8a2b8ddd5102c440c7e5a7c8e1fd729c818 may help in this issue?
1320606354 M * Hollow i'm currently testing the new experimental releases
1320606447 M * Bertl pluto: maybe ... IMHO it would be great if we could find some kind of quick trigger, i.e. a script or so which triggers it within a few minutes or seconds
1320606486 M * Bertl because if we have that, it should be simple to narrow it down, and of course, to test possible fixes
1320606640 M * Hollow Bertl: are there any known issues with 3.0.7-vs2.3.1? i'm currently experiencing some iowait issues, but the system is completely idle
1320606707 M * Bertl I/O issues are common since 2.6.22 I guess :)
1320606715 M * Hollow :)
1320606776 M * pluto Bertl: i suppose that lock is highly related to i/o peak, e.g. 16-opteron cores are unpacking .7z from remote nfs into local filesystem, run unpacked software and ices in few minutes.
1320606801 M * pluto Bertl: or maybe it ices after i/o peak when auto *un*mount goes in action.
1320606804 M * Hollow well, i'm building some gentoo images and somehow a process (always chmod until now) freezes in D state, and the system is idle (no usage in iostat, iotop, top or whatever), but still the loadavg increases steadily
1320606870 M * Bertl disc activity?
1320606882 M * pluto Bertl: probably i can increase autofs unmount timeout from 60 sec to few days and test...
1320606894 M * Hollow none according to iostat, i cannot physically look at the machine unfortunately
1320606927 M * Bertl pluto: I'd rather go in the other direction and make the umount happen more often ...
1320606950 M * Bertl pluto: does the sysrq trigger now work via console/ipmi?
1320606952 M * Hollow since it seems pretty reproducible, i'm going to try with vanilla 3.0.7 and see if that helps
1320606962 M * Bertl that would be very appreciated!
1320606963 M * Hollow any tools that might be helpful except iostat, top?
1320607010 M * Bertl iotop
1320607799 M * daniel_hozac Hollow: any COW involved there?
1320607816 M * Hollow not, i don't even have any guests created yet
1320607821 M * Hollow just the host system
1320607924 M * daniel_hozac nothing in dmesg?
1320607936 M * Hollow nothing ...
1320608878 M * pluto Bertl: the sysrq works on real console and doesn't work via ipmi (i can't send break ~B - probably crappy ipmi firmware or ipmi-tool bug)
1320608914 M * Bertl okay, so, did you try to send it over real console when the 'deadlock' happens yet?
1320609101 M * pluto Bertl: the real console doesn't work after deadlock. it only floods byte sequences and broken stacktraces :(
1320609133 M * Bertl okay, but you did try to issue a magic sysrq with e.g. 'T' or so, yes?
1320609250 M * pluto yes i did. sysrq terminate/list blocked/etc. doesn't work, numlock clicks doesn't work - total disaster.
1320609274 M * Bertl okay ...
1320609653 Q * fisted_ Ping timeout: 480 seconds
1320610411 J * petzsch ~markus@dslb-092-078-236-063.pools.arcor-ip.net
1320610535 Q * chrissbx Quit: Leaving
1320610781 Q * hparker Quit: Quit
1320611213 J * Aiken ~Aiken@2001:44b8:2168:1000:21f:d0ff:fed6:d63f
1320612641 Q * pluto Quit: leaving
1320612715 J * wiuempe ~wmp@dynamic-78-8-159-154.ssp.dialog.net.pl
1320612719 M * wiuempe hello
1320612748 M * Bertl hello
1320612758 M * wiuempe is possible to keep one vserver files on other path?
1320612787 M * Bertl yep, you can place them anywhere you want
1320612799 M * Bertl just the config needs to point to the proper vdri
1320612801 M * Bertl *vdir
1320612831 M * wiuempe ok
1320612832 M * wiuempe thx
1320612839 M * wiuempe bye and nice day
1320612843 P * wiuempe
1320614357 Q * Hollow Quit: Hollow
1320614890 Q * petzsch Quit: Leaving.
1320615097 J * wifigeek-home ~wifigeek-@5ad939a2.bb.sky.com
1320615099 M * wifigeek-home lo all
1320615231 J * fisted ~fisted@xdsl-87-78-219-120.netcologne.de
1320615673 J * fisted_ ~fisted@xdsl-87-78-211-242.netcologne.de
1320615839 Q * sannes1 Remote host closed the connection
1320616053 Q * fisted Ping timeout: 480 seconds
1320616192 Q * bonbons Quit: Leaving
1320616625 J * hparker ~hparker@2001:470:1f0f:32c:beae:c5ff:fe01:b647
1320617653 Q * ghislain Ping timeout: 480 seconds
1320617881 J * FireEgl ~FireEgl@173-16-9-169.client.mchsi.com
1320617938 Q * hparker Quit: Quit
1320618461 J * ghislain ~AQUEOS@adsl2.aqueos.com
1320621205 Q * jeroen__ Ping timeout: 480 seconds
1320621725 J * jeroen__ ~jeroen@095-097-051-172.static.chello.nl
1320622000 J * black ~black@027cf407.bb.sky.com
1320622451 J * clopez ~clopez@238.10.117.91.dynamic.mundo-r.com
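
Note on the log above: Bertl's suggestion to "hammer on /proc/mounts from separate threads (on the host)" might be sketched roughly as follows. This is only an illustration, not something anyone in the channel posted. The function names `hammer` and `hammer_mounts`, the thread count, and the duration are assumptions made up for the example, and nothing here is known to reproduce the reported lockup.

```python
import threading
import time


def hammer(path, stop, counts, idx):
    """Read `path` in a loop (like 'cat /proc/mounts') until `stop` is set."""
    while True:
        with open(path, "rb") as f:
            f.read()              # discard the contents, only the read matters
        counts[idx] += 1          # each thread writes only its own slot
        if stop.is_set():
            break


def hammer_mounts(path="/proc/mounts", threads=4, seconds=1.0):
    """Run `threads` concurrent readers of `path` for about `seconds` seconds.

    Returns the number of completed reads per thread. The defaults are
    arbitrary values chosen for this sketch.
    """
    stop = threading.Event()
    counts = [0] * threads
    workers = [
        threading.Thread(target=hammer, args=(path, stop, counts, i))
        for i in range(threads)
    ]
    for w in workers:
        w.start()
    time.sleep(seconds)
    stop.set()
    for w in workers:
        w.join()
    return counts
```

On an affected host one could call, for example, `hammer_mounts(threads=16, seconds=600)` while the autofs mounts come and go. Separate processes (several shell loops running `cat /proc/mounts`) would be an equally plausible way to do the same thing.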