1264119039 M * Bertl okay, I'll have to dig into that deeper ... i.e. there must be at least another cause for the lockup, which I do not see atm 1264119071 M * fzylogic ok 1264119118 M * fzylogic if kbad's on and I'm not, he can help with debugging too 1264119132 M * Bertl okay, thanks a lot so far 1264119141 M * fzylogic sure thing 1264119147 Q * SpComb Ping timeout: 480 seconds 1264119162 M * Bertl if you really manage to pinpoint the kernel change/version when it started, I'd be happy to hear 1264119179 M * fzylogic definitely 1264119296 M * fzylogic hey, I just found something 1264119307 M * Bertl let's hear ... 1264119330 M * fzylogic if instead of reading from /dev/urandom you dd say 20MB into a file from urandom and have the program read from there, it apparently doesn't lock 1264119339 M * fzylogic maybe it's not closing the device properly? 1264119377 M * Bertl hmm, hmm, so you are saying it is urandom related? 1264119386 M * fzylogic I think it may be, yes 1264119393 M * fzylogic 4 runs now with no hangs 1264119397 M * fzylogic just a normal kill 1264119413 M * Bertl okay, let's check with /dev/one (if you have that) 1264119432 M * fzylogic I don't 1264119488 M * Bertl yeah, not really unexpected :) 1264119571 M * Bertl let's read from /dev/zero then 1264119597 M * Mr_Smoke How about /dev/forty-two ? 1264119602 A * Mr_Smoke hides in a corner 1264119638 M * Bertl actually there was a proposal for /dev/repeat some time ago 1264119676 M * Mr_Smoke Heh. Humour is essential to IT I believe, for sanity purposes. Go avian carrier ! 1264119717 M * fzylogic zero hang it too 1264119720 Q * SubZero 1264119721 M * fzylogic hangs* 1264119737 M * Bertl but a file is fine, yes? 
1264119741 M * fzylogic correct 1264119768 M * fzylogic ran the file one in a loop for a short while with nothing but normal OOM kills 1264119775 M * fzylogic any single run on a device kills it 1264119807 M * Bertl try with /dev/random please 1264119827 M * Bertl both, urandom and null are quite fast virtual devices 1264119845 M * Bertl random actually takes some more time, as expected from a file pagein 1264119947 M * fzylogic too bad I only have one crypto accelerator and it's in use :) 1264120407 Q * opuk Ping timeout: 480 seconds 1264120616 M * fzylogic set up rngd to feed it crappy entropy from /dev/urandom 1264120618 M * fzylogic that sped things up 1264120628 M * fzylogic still running though 1264120702 M * fzylogic oh, I also noticed that the second fread() on /dev/ is necessary for the crash to work so perhaps it's a refcount error? 1264120724 M * fzylogic except it's only one open, so probably not...? 1264120746 M * Bertl my theory currently goes like this: 1264120777 M * Bertl an OOM occurs, sets special task properties (memfault exit) and 'signals' the task 1264120827 M * Bertl for whatever reason, the task doesn't want to die, and, on the next scheduling/signalling/etc, gets kind of resurrected, and starts running again, hitting the OOM again, closing the circle 1264120867 M * fzylogic any idea why it only happens if the second fread (fread(rnd,1,chunk,file);) gets called? 1264120904 M * Bertl no idea atm 1264120914 M * fzylogic read from /dev/random just crashed it 1264122011 Q * dowdle Remote host closed the connection 1264122121 Q * orzel Remote host closed the connection 1264122185 M * fzylogic just simplified the code significantly 1264122200 M * fzylogic just need to fread() the right sized chunk to trigger an oom 1264122213 M * fzylogic the rest just makes the window for that chunk size smaller 1264122251 M * Bertl but it doesn't work with a file, right? 
1264122257 M * fzylogic for some reason, the malloc() succeeds, but the fread pushes it over the limit 1264122258 M * fzylogic right 1264122274 M * Bertl that's not really unexpected actually 1264122289 M * Bertl the malloc is basically a dummy, it's just reserving the virtual memory 1264122304 M * Bertl the fread() is the one which actually instantiates the memory pages 1264122321 M * fzylogic should it not still fail if the memory's unusable? 1264122343 M * daniel_hozac it doesn't know that yet. 1264122358 M * fzylogic ah, ok 1264122360 M * Bertl with overcommitment enabled in the kernel, the rss limit will not be applied unless the pages are actually instantiated (i.e. filled) 1264122984 M * fzylogic think I just triggered the bug in mainline :-/ 1264123005 M * Bertl that would be excellent! 1264123023 M * fzylogic doesn't have the same debug options in the kernel, but I sure crashed it hard (as a regular user, no less) 1264123048 M * fzylogic I'll try getting a similar kernel on it so I can see if it's the same problem 1264123122 M * Bertl yeah, please test with a comparable set of options and the same kernel version first 1264123144 M * Bertl then, if you can recreate it, try with a recent one 1264123218 M * fzylogic it _is_ a 2.6.31.5 so it's quite similar to the vserver kernel 1264123235 M * Bertl good then 1264123918 Q * hparker Quit: Quit 1264124626 M * Bertl off to bed for today, please keep us updated ... 1264124631 M * fzylogic will do 1264124636 M * fzylogic so far looks like the same crash 1264124640 M * fzylogic also in 2.6.32.2 1264124664 M * Bertl okay, wouldn't hurt to report it on lkml 1264124676 M * Bertl have a good one everyone! cya! 
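A minimal sketch of the reserve-versus-instantiate point made above, written in Python rather than the C of the original test program, and assuming a Linux host with /proc mounted. The helper names (resident_pages, reserve_then_touch) are hypothetical and only for illustration. An anonymous private mapping stands in for malloc(), and writing one byte per page stands in for the fread() that fills the buffer.

```python
import mmap

PAGE = mmap.PAGESIZE


def resident_pages():
    """Return this process's resident set size in pages (Linux /proc only)."""
    with open("/proc/self/statm") as f:
        # Second field of statm is the resident page count.
        return int(f.read().split()[1])


def reserve_then_touch(size_bytes):
    """Map anonymous memory, then write one byte to every page.

    Returns (rss_before_touch, rss_after_touch), both in pages.
    """
    # Like a large malloc(): this only reserves address space.
    buf = mmap.mmap(-1, size_bytes,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    before = resident_pages()
    for offset in range(0, size_bytes, PAGE):
        buf[offset] = 1  # the first write faults the page in
    after = resident_pages()
    buf.close()
    return before, after


if __name__ == "__main__":
    before, after = reserve_then_touch(16 * 1024 * 1024)
    print("resident pages before touch:", before)
    print("resident pages after touch: ", after)
```

The resident count should grow only during the touch loop, which is why a memory limit that counts resident pages would be hit by the read into the buffer and not by the allocation itself.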
1264124681 M * fzylogic later 1264124685 N * Bertl Bertl_zZ 1264126881 Q * jrklein Quit: jrklein 1264126960 J * jrklein ~jrklein@2001:0:53aa:64c:0:408d:b4d8:690 1264128043 Q * jrklein Ping timeout: 480 seconds 1264128135 J * jrklein ~jrklein@2001:0:53aa:64c:0:408d:b4d8:690 1264131036 J * ktwilight__ ~keliew@146.167-247-81.adsl-dyn.isp.belgacom.be 1264131315 Q * ktwilight_ Ping timeout: 480 seconds 1264132837 J * SauLus_ ~SauLus@c207202.adsl.hansenet.de 1264133245 Q * SauLus Ping timeout: 480 seconds 1264133245 N * SauLus_ SauLus 1264137281 J * sharkjaw ~gab@90.149.121.45 1264138695 J * ghislain ~AQUEOS@81.56.195.31 1264139039 Q * niki Quit: Leaving 1264141252 N * [vortex7] vortex7 1264142053 J * opuk ~kupo@pipe.intertubez.net 1264142661 Q * sharkjaw Remote host closed the connection 1264143203 N * vortex7 [vortex7] 1264143511 J * sharkjaw ~gab@90.149.121.45 1264143597 N * [vortex7] vortex7 1264144101 Q * sharkjaw Ping timeout: 480 seconds 1264144133 Q * balbir Read error: Connection reset by peer 1264144267 Q * sardyno Ping timeout: 480 seconds 1264144732 J * sardyno ~me@pool-173-75-5-88.pitbpa.fios.verizon.net 1264145105 Q * derjohn_mob Ping timeout: 480 seconds 1264145179 J * sharkjaw ~gab@90.149.121.45 1264145744 J * balbir ~balbir@122.172.109.237 1264145962 Q * sardyno Ping timeout: 480 seconds 1264146204 J * ktwilight_ ~keliew@91.179.203.173 1264146436 J * niki ~niki@cpe.fe4-0-120.0x50a6de52.kdnxd4.customer.tele.dk 1264146499 Q * ktwilight__ Ping timeout: 480 seconds 1264146598 J * ktwilight__ ~keliew@86.88-240-81.adsl-dyn.isp.belgacom.be 1264146620 Q * ktwilight_ Read error: Connection reset by peer 1264146738 J * derjohn_mob ~aj@213.238.45.2 1264147743 J * sardyno ~me@pool-173-75-5-88.pitbpa.fios.verizon.net 1264147840 Q * sharkjaw Ping timeout: 480 seconds 1264147979 Q * nenolod Read error: Connection reset by peer 1264148021 J * nenolod ~nenolod@petrie.dereferenced.org 1264149158 J * BenG ~bengreen@cpc2-aztw22-2-0-cust521.aztw.cable.virginmedia.com 
1264149756 Q * BenG Quit: I Leave 1264152081 J * ghislain1 ~AQUEOS@adsl2.aqueos.com 1264152467 Q * ghislain Ping timeout: 480 seconds 1264152640 J * BenG ~bengreen@cpc2-aztw22-2-0-cust521.aztw.cable.virginmedia.com 1264153469 J * barismetin ~barismeti@zanzibar.inria.fr 1264154524 Q * BenG Quit: I Leave 1264155024 J * kir ~kir@swsoft-msk-nat.sw.ru 1264155357 P * kir Leaving. 1264155590 J * kir ~kir@swsoft-msk-nat.sw.ru 1264157421 N * vortex7 [vortex7] 1264158198 Q * ktwilight__ Remote host closed the connection 1264158964 N * [vortex7] vortex7 1264159748 J * Loki|muh ~loki@satanix.de 1264160117 Q * vortex7 Remote host closed the connection 1264164294 J * ktwilight ~keliew@86.88-240-81.adsl-dyn.isp.belgacom.be 1264164853 Q * jrklein Ping timeout: 480 seconds 1264164932 J * bzed_ ~bzed@devel.recluse.de 1264165092 J * sid4windr luser@bastard-operator.from-hell.be 1264165123 J * ensc|w_ ~ensc@www.sigma-chemnitz.de 1264165147 Q * kezar charon.oftc.net galapagos.oftc.net 1264165147 Q * cehteh charon.oftc.net galapagos.oftc.net 1264165147 Q * bzed charon.oftc.net galapagos.oftc.net 1264165147 Q * ensc|w charon.oftc.net galapagos.oftc.net 1264165147 Q * PowerKe charon.oftc.net galapagos.oftc.net 1264165147 Q * hijacker charon.oftc.net galapagos.oftc.net 1264165147 Q * Chlorek charon.oftc.net galapagos.oftc.net 1264165147 Q * weasel charon.oftc.net galapagos.oftc.net 1264165147 Q * wibble charon.oftc.net galapagos.oftc.net 1264165147 Q * sid3windr charon.oftc.net galapagos.oftc.net 1264165147 Q * pmjdebruijn charon.oftc.net galapagos.oftc.net 1264165152 J * weasel ~weasel@anguilla.debian.or.at 1264165332 Q * bzed_ charon.oftc.net plasma.oftc.net 1264165374 Q * niki charon.oftc.net magnet.oftc.net 1264165374 Q * fback charon.oftc.net magnet.oftc.net 1264165374 Q * geos_one charon.oftc.net magnet.oftc.net 1264165374 Q * raceme charon.oftc.net magnet.oftc.net 1264165374 Q * grant_ charon.oftc.net magnet.oftc.net 1264165374 Q * snooze charon.oftc.net magnet.oftc.net 1264165374 Q 
* blathijs charon.oftc.net magnet.oftc.net 1264165374 Q * Mr_Smoke charon.oftc.net magnet.oftc.net 1264165374 Q * Hunger charon.oftc.net magnet.oftc.net 1264165374 Q * kolorafa charon.oftc.net magnet.oftc.net 1264165413 J * PowerKe_ ~tom@94-224-78-36.access.telenet.be 1264165413 J * pmjdebru1jn pascal@jester.pcode.nl 1264165413 J * wibble_ wibble@vortex.ukshells.co.uk 1264165413 J * niki ~niki@cpe.fe4-0-120.0x50a6de52.kdnxd4.customer.tele.dk 1264165413 J * fback fback@red.fback.net 1264165413 J * geos_one ~chatzilla@chello084115149052.4.graz.surfer.at 1264165413 J * raceme ~tof@ombos.raceme.org 1264165413 J * grant_ mep@87-98-246-129.ovh.net 1264165413 J * snooze ~o@1-1-4-40a.gkp.gbg.bostream.se 1264165413 J * blathijs ~matthijs@drsnuggles.stderr.nl 1264165413 J * Mr_Smoke smokey@layla.lecoyote.org 1264165413 J * Hunger ~Hunger@Hunger.hu 1264165413 J * kolorafa ~kolorafa@irc.kolorafa.dlk.pl 1264165508 J * bzed ~bzed@devel.recluse.de 1264165775 J * kezar ~kezar@rb178-1-88-163-25-248.fbx.proxad.net 1264165796 J * hijacker ~hijacker@213.91.163.5 1264165811 J * cehteh ~ct@pipapo.org 1264167441 J * thierryp ~thierry@zankai.inria.fr 1264168163 Q * derjohn_mob Ping timeout: 480 seconds 1264171169 J * SubZero ~SubZero@chello089076140236.chello.pl 1264171369 Q * vserverUser 1264171865 Q * niki Quit: Leaving 1264171865 J * kbad ~kyle@ip-66-33-206-8.dreamhost.com 1264171930 M * kbad Bertl_zZ: Jeremy was able to reproduce the issue on 2.6.32.2 vanilla w/ vserver or 2.6.32.2 vanilla w/ grsec 1264174241 J * swen ~quassel@217.72.66.253 1264176578 J * dowdle ~dowdle@scott.coe.montana.edu 1264176960 Q * thierryp Ping timeout: 480 seconds 1264178279 Q * SubZero 1264178697 N * Bertl_zZ Bertl 1264178701 M * Bertl morning folks! 1264178802 M * Bertl kbad: hmm, which means? 1264179042 J * SubZero ~SubZero@chello089076140236.chello.pl 1264179556 M * kbad he's pretty sure it's not mainline 1264179710 M * Bertl how's that? I mean what exactly did he test? 
1264179838 M * Bertl (means: your sentence above doesn't make sense to me :) 1264179926 Q * swen Remote host closed the connection 1264179954 M * kbad he was able to reproduce it using the program he wrote on vanilla 2.6.32.2+vserver and vanilla 2.6.32.2+grsec but not 2.6.32.2 vanilla alone 1264179986 M * kbad Jeremy being fzylogic 1264179987 M * kbad sorry 1264180026 M * Bertl okay, but it doesn't really make sense: if it happens with two disjoint patches, then it would need to be an issue caused by common code 1264180040 M * Bertl but AFAIK, there is no common code between grsec and Linux-VServer 1264180089 M * kbad yeah, that raised my eyebrow too but he wanted me to pass it along in case he wasn't able to catch you later when he was on 1264180176 M * Bertl okay, so I'd assume that it is harder to trigger on mainline then, which might explain why it wasn't observed before 1264180258 M * kbad seems logical 1264180763 J * derjohn_mob ~aj@tmo-101-25.customers.d1-online.com 1264181235 J * niki ~niki@0x5553169c.adsl.cybercity.dk 1264181943 J * thierryp ~thierry@home.parmentelat.net 1264181948 Q * thierryp Remote host closed the connection 1264181965 J * thierryp ~thierry@home.parmentelat.net 1264182003 Q * derjohn_mob Ping timeout: 480 seconds 1264182805 J * hparker ~hparker@linux.homershut.net 1264182957 M * fzylogic back 1264182970 M * Bertl wb 1264182971 M * fzylogic I'm not entirely certain it's the same bug between grsec and vserver 1264182982 M * fzylogic it can't get the soft lockup error out of it 1264182996 M * kbad out of grsec? 1264183001 M * fzylogic correct 1264183013 M * fzylogic the machine does become unresponsive, but not at all in the same way as vserver kernels 1264183033 M * kbad anything useful I can show to brad or pipacs? 
1264183038 M * fzylogic it immediately locks up all active terminals 1264183048 M * fzylogic not yet, but hopefully by the end of the day 1264183060 M * kbad hmm 1264183061 Q * barismetin Remote host closed the connection 1264183064 M * kbad some sort of tty lockup? 1264183065 M * Bertl interesting .. but no luck with mainline yet, I presume? 1264183076 M * fzylogic no, still haven't been able to trigger anything on mainline 1264183169 Q * hparker Quit: Quit 1264183212 M * fzylogic vserver does let me be more sloppy with my malloc() calls than mainline/grsec so it's easier to trigger. could just be that I'm not able to pick the correct value. 1264183374 M * Bertl I think, the main problem on mainline is that you actually need to go over limit, i.e. have a real OOM situation 1264183395 M * Bertl and with real, I mean full host OOM 1264183396 M * fzylogic I haven't been having a problem triggering oom 1264183425 M * fzylogic without grsec, my process gets killed and everything's happy 1264183462 M * fzylogic with grsec, it looks like it's picking random other processes and maybe looping. 
my allocating process never seems to get picked (still need more testing, though) 1264183509 M * Bertl you could try to adjust odds there 1264183534 M * Bertl first there is an oom adjust value, which allows you to fine tune the probability somewhat 1264183546 M * Bertl second, there is a kernel mode which _always_ kills the offender 1264183562 M * fzylogic yeah...there's a completely different bug I suspect lies in that code :) 1264183991 M * fzylogic also, while I can trigger the bug in a vserver kernel from within a guest, I can't trigger it outside a guest 1264184008 M * fzylogic even though I can trigger the OOM killer every time 1264184035 Q * balbir Read error: Operation timed out 1264184558 Q * thierryp Remote host closed the connection 1264184727 J * barismetin ~barismeti@jua06-1-82-242-159-114.fbx.proxad.net 1264184783 J * hparker ~hparker@2001:470:1f0f:32c:203:dff:fe14:cc01 1264184837 J * balbir ~balbir@122.172.48.17 1264184875 M * fzylogic huh. apparently you don't have to be reading from a device 1264184883 M * fzylogic any file too large to fit into memory will do 1264186193 J * hijacker_ ~hijacker@87-126-142-51.btc-net.bg 1264186421 J * hparker_lappie ~hparker@linux.homershut.net 1264186519 M * Bertl interesting ... 1264186796 M * fzylogic out of curiosity, have you been testing with a 32-bit or 64-bit kernel? 1264186804 M * fzylogic all of mine thus far has been 64 1264187069 M * fzylogic it also looks like running the crasher app via strace will allow you to control-c it after the machine locks up and allow everything to recover 1264187262 J * thierryp ~thierry@home.parmentelat.net 1264187310 Q * ensc|w_ Ping timeout: 480 seconds 1264187420 J * ensc|w ~ensc@www.sigma-chemnitz.de 1264187565 Q * kir Quit: Leaving. 
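A small sketch for inspecting the two OOM knobs mentioned above, assuming a Linux host with /proc mounted. The function names (read_int, oom_settings) are hypothetical. The "oom adjust value" is assumed here to be /proc/<pid>/oom_adj, with /proc/<pid>/oom_score_adj as its replacement on newer kernels, and the "always kill the offender" mode is assumed to be the vm.oom_kill_allocating_task sysctl that the log names later. Files that do not exist on a given kernel are reported as None.

```python
def read_int(path):
    """Return the integer stored in a /proc file, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def oom_settings(pid="self"):
    """Collect the OOM-related settings for one process (read-only)."""
    return {
        # 1 means the OOM killer targets the task that triggered the OOM.
        "oom_kill_allocating_task":
            read_int("/proc/sys/vm/oom_kill_allocating_task"),
        # Current badness score the kernel assigns to the process.
        "oom_score": read_int(f"/proc/{pid}/oom_score"),
        # Older per-process adjustment knob (assumed for the 2.6.32 era).
        "oom_adj": read_int(f"/proc/{pid}/oom_adj"),
        # Newer per-process adjustment knob.
        "oom_score_adj": read_int(f"/proc/{pid}/oom_score_adj"),
    }


if __name__ == "__main__":
    for name, value in oom_settings().items():
        print(f"{name}: {value}")
```

The sketch only reads values. Changing them needs root (or ownership of the process for the per-process files) and is deliberately left out.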
1264187643 J * BenG ~bengreen@cpc2-aztw22-2-0-cust521.aztw.cable.virginmedia.com 1264188122 Q * thierryp Quit: ciao folks 1264188261 M * Bertl 64bit here, but I guess 32bit will expose the same 1264188290 M * Bertl regarding the strace, that kind of supports the signal theory, as strace will allow you to reset the signal handlers 1264188351 M * fzylogic yep... 1264188363 M * fzylogic rebooting with some extra debugging options 1264188499 Q * hparker Quit: Read error: 104 (Peer reset by connection) 1264188664 N * hparker_lappie hparker 1264188739 Q * hparker Quit: Quit 1264188739 J * hparker ~hparker@linux.homershut.net 1264189439 Q * ensc|w Remote host closed the connection 1264189805 M * fzylogic oom_kill_allocating_task results in an infinite loop 1264189824 M * Bertl on what kernel? 1264189845 M * fzylogic only tested on 2.6.32.2-vserver so far 1264189874 M * fzylogic looks like the pagefault OOM handler keeps triggering the killer on the allocating process while it's trying to clean up? 1264189889 M * fzylogic this just scrolls by ad infinitum: 1264189889 M * fzylogic Out of memory (oom_kill_allocating_task): kill process mem(11779:#20059) score 0 or a child 1264189890 M * fzylogic Killed process mem(11779:#20059) 1264189927 M * fzylogic with the following trace thrown in occasionally: 1264189930 M * fzylogic Pid: 11779, comm: mem Not tainted 2.6.32.2-vs2.3.0.36.28 #2 1264189930 M * fzylogic Call Trace: 1264189930 M * fzylogic [] ? _spin_unlock+0x2b/0x40 1264189931 M * fzylogic [] oom_kill_process+0x151/0x280 1264189931 M * fzylogic [] __out_of_memory+0xb3/0xe0 1264189932 M * fzylogic [] ? _read_lock+0x65/0x70 1264189932 M * fzylogic [] pagefault_out_of_memory+0x63/0xa0 1264189934 M * fzylogic [] mm_fault_error+0x45/0xc0 1264189934 M * fzylogic [] do_page_fault+0x281/0x290 1264189936 M * fzylogic [] page_fault+0x1f/0x30 1264189936 M * fzylogic [] ? __clear_user+0x3d/0x70 1264189938 M * fzylogic [] ? 
__clear_user+0x21/0x70 1264189938 M * fzylogic [] read_zero+0x97/0x110 1264189940 M * fzylogic [] vfs_read+0xc9/0x1a0 1264189940 M * fzylogic [] sys_read+0x55/0x90 1264189942 M * fzylogic [] ia32_sysret+0x0/0x5 1264189989 J * ensc|w ~ensc@www.sigma-chemnitz.de 1264190293 Q * BenG Quit: I Leave 1264192264 J * watsonian ~jwatson@204.14.155.152 1264192302 Q * watsonian 1264192323 J * watsonian ~jwatson@204.14.155.152 1264192736 M * Bertl fzylogic: what does addr2line report on ffffffff810c9f41? 1264193163 M * fzylogic mm/oom_kill.c:427 1264193169 J * bonbons ~bonbons@2001:960:7ab:0:2c0:9fff:fe2d:39d 1264193171 M * Bertl tx 1264195254 Q * barismetin Remote host closed the connection 1264196449 M * fzylogic don't know if it'll help, but here's a list of locks being held by the process when it hangs: 1264196449 M * fzylogic 6 locks held by mem/4590: 1264196449 M * fzylogic #0: (tasklist_lock){.+.+..}, at: [] pagefault_out_of_memory+0x58/0xa0 1264196450 M * fzylogic #1: (rcu_read_lock){.+.+..}, at: [] thread_group_cputime+0x0/0xf0 1264196450 M * fzylogic #2: (&i->lock){-.-...}, at: [] serial8250_interrupt+0x32/0x110 1264196450 M * fzylogic #3: (&port_lock_key){-.-...}, at: [] serial8250_handle_port+0x1e/0x320 1264196450 M * fzylogic #4: (sysrq_key_table_lock){-.....}, at: [] __handle_sysrq+0x2b/0x150 1264196452 M * fzylogic #5: (tasklist_lock){.+.+..}, at: [] debug_show_all_locks+0x39/0x190 1264196456 M * fzylogic bbiab, lunch time 1264196484 Q * hparker Quit: Quit 1264196565 M * fzylogic and a slightly different stack trace: 1264196565 M * fzylogic http://karategerbil.com/kernel_debug/new_stack_trace.txt 1264196572 M * Bertl k 1264197113 M * kbad Bertl: did you see Greg's post about 2.6.32 being the next stable tree? 
1264197167 M * kbad well, long-term stable (2-3 years) 1264197182 M * Bertl not sure that is the best idea though 1264197197 M * Bertl I've been seeing a lot of regressions with 2.6.32 so far 1264197656 Q * SubZero 1264198079 M * kbad :/ 1264198092 Q * hijacker_ Quit: Leaving 1264198331 M * kbad given the number of security problems recently plaguing the bleeding edge I thought it was comforting news so it's a bummer to hear that 1264199463 M * Bertl I haven't tried the latest 2.6.32 yet, but I switched back a few machines from 2.6.32.2 to 2.6.31 when they showed strange memory issues and unexpected reboots 1264199546 M * kbad hmm, did you see this? 1264199548 M * kbad http://seclists.org/oss-sec/2010/q1/34 1264201601 M * Bertl not in detail ... anyway, kind of tired today .. so I'm off to bed for now ... maybe back later ... 1264201607 N * Bertl Bertl_zZ 1264201620 M * kbad take care! 1264201926 Q * kbad Quit: Leaving. 1264202008 Q * ghislain1 Ping timeout: 480 seconds 1264203465 J * harobed ~sklein@arl57-1-82-231-110-14.fbx.proxad.net
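For the endless "Killed process mem(11779:#20059)" messages quoted earlier in the log, a rough sketch of how such a loop could be spotted in captured kernel output. The regular expression matches only the "name(pid:#id)" form that appears in this log, where the number after "#" is assumed to be the Linux-VServer context id. Mainline kernels print the line differently, so the pattern would need adjusting there. The function name repeated_kills and its threshold parameter are hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "Killed process mem(11779:#20059)" as quoted in the log.
KILL_RE = re.compile(
    r"Killed process (?P<comm>[^\s(]+)\((?P<pid>\d+):#(?P<ctx>\d+)\)"
)


def repeated_kills(lines, threshold=2):
    """Return {(comm, pid): count} for processes 'killed' at least threshold times.

    A process that is reported killed more than once never actually exited,
    which is the symptom described in the log.
    """
    counts = Counter()
    for line in lines:
        match = KILL_RE.search(line)
        if match:
            counts[(match["comm"], int(match["pid"]))] += 1
    return {key: n for key, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    sample = [
        "Out of memory (oom_kill_allocating_task): kill process "
        "mem(11779:#20059) score 0 or a child",
        "Killed process mem(11779:#20059)",
        "Killed process mem(11779:#20059)",
    ]
    print(repeated_kills(sample))
```

Feeding it the output of dmesg or a serial console capture, one line per list element, would give a quick count of how often the same pid was reported as killed.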