1453594267 J * webhat_ ~quassel@31.25.99.5
1453594319 J * marker_ marker@server6.org
1453594329 Q * webhat Quit: No Ping reply in 180 seconds.
1453594330 Q * marker Write error: connection closed
1453604045 J * derjohn_mobi ~aj@x4db29997.dyn.telefonica.de
1453604483 Q * derjohn_mob Ping timeout: 480 seconds
1453608879 M * gnarface hmm. i think i'm forgetting something important about this debugging process... am i supposed to load some watchdog module of some sort?
1453609643 M * gnarface maybe it's called a hangcheck timer?
1453609648 M * gnarface i'm not clear on the difference even
1453611034 P * undefined
1453611440 J * undefined ~undefined@00011a48.user.oftc.net
1453611971 M * gnarface alright i guess i'm gonna take a shot in the dark and load hangcheck_timer with these options: hangcheck_tick=30 hangcheck_margin=60 hangcheck_reboot=0 hangcheck_dump_tasks=1 (but i'd appreciate someone telling me whether i'm approaching this in the right fashion or not)
1453613762 N * Bertl_zZ Bertl
1453613768 M * Bertl morning folks!
1453613802 M * Bertl gnarface: are you on a bughunt or are you just configuring a kernel?
1453613826 M * gnarface Bertl: still trying to diagnose a deadlock
1453613862 M * gnarface the machine randomly stops responding within 1-3 days
1453613869 M * gnarface that's all i've got so far
1453613876 M * Bertl does magic sysrq work?
1453613880 M * gnarface nope
1453613887 M * gnarface that was the first thing i tried
1453613890 M * Bertl then it is most likely a hardware issue
1453613893 M * gnarface it's right here
1453613903 M * gnarface so i physically reboot it
1453613909 M * gnarface seems fine
1453613918 M * gnarface it had almost 400 days of uptime though
1453613922 M * Bertl have you stress tested the machine?
1453613946 M * gnarface well i installed munin and munin-node
1453613958 M * Bertl i.e. memtest for 72 hours, cpu hogs, high network rate ... etc
1453613960 M * gnarface either that or adding some extra maps to the quake server seems to be what destabilized it
1453613978 M * gnarface oh, yea it passes memtest, even the optional tests
1453613999 M * Bertl what about 'bios' sensors?
1453614006 M * gnarface temperatures all look normal
1453614022 M * Bertl i.e. you might be banging on smt-bios sensors with munin
1453614024 M * gnarface the munin graphs show everything looking normal right up to the lockup, each time
1453614047 M * gnarface i might be using the wrong temperature sensor driver... there is some question about that
1453614071 M * Bertl this can easily lock up a system hard if the smbus stalls
1453614075 M * gnarface hmm.
1453614083 M * gnarface so checking the temperature might be wrecking it?
1453614091 M * gnarface basically?
1453614095 M * Bertl yep
1453614105 M * gnarface hmm
1453614110 M * gnarface any way to be sure?
1453614128 P * undefined
1453614136 M * Bertl well, if you can reproduce the issue (i.e. if you have a trigger or it happens on a regular basis)
1453614157 M * Bertl you can simply unload the driver and test/wait
1453614157 M * gnarface well it's happened on its own 4 times or so now in the last weekish
1453614165 M * gnarface before that... ~400 days uptime
1453614178 M * gnarface hmm
1453614187 M * Bertl you can also do the opposite, and start hammering on the temperature readings with a script and see if that increases the lockup rate
1453614203 M * gnarface hmm. ah
1453614233 M * gnarface this might be relevant: i'd blacklisted k10temp much earlier, because it was spitting errors
1453614258 M * gnarface and it refused to actually work unless i loaded it with force=1
1453614260 M * Bertl but as this is a statistical process and your kernel stops you from banging too hard on "slow" sensors it might be tricky to see a difference
1453614334 M * gnarface could it be possible that if i use a different driver for my sensors, it would also change the stability?
1453614353 M * Bertl of course
1453614400 M * gnarface i notice something weird: sensors-detect says 'it87' should be my driver, but when i try to load it from /etc/modules, during boot you can see it fails because the device is busy. but whatever is already holding the device isn't k10temp, because that is blacklisted
1453614428 M * Bertl most likely it is the bios which blocks the I/O region
1453614454 M * Bertl check dmesg for details about that
1453614506 M * gnarface so, none of these modules are a conflicting sensor driver? http://paste.debian.net/367471/
1453614528 M * gnarface asus_atk0110 looks suspicious to me
1453614572 M * Bertl the problem is that many system health sensors are "watched" by the smbios
1453614583 M * gnarface hmm
1453614598 M * Bertl if you use a kernel driver for that sensor, it will compete
1453614599 M * gnarface is that something i might be able to disable in the bios?
1453614635 M * Bertl this will usually work, until both ends try to access the sensor at the same time
1453614650 M * gnarface hmm. sounds plausible
1453614652 M * Bertl then the bus might simply lock up
1453614657 M * gnarface that would explain the random timings
1453614668 M * gnarface of the lockups
1453614677 M * gnarface ~1-3 days but no specific indicator
1453614678 M * Bertl but I'm not saying that's the case here, it is just a possibility
1453614713 M * gnarface well if i go into the bios and check "ignore" on all those readouts for the fans/temperatures, that would in theory leave them all to the kernel, right?
1453614725 M * gnarface i might try that
1453614752 M * gnarface hmm. that sounds like it might be dangerous to disable though
1453615379 M * Bertl btw, I stopped using asus boards a century ago, because of the regular stability issues when they age
1453615416 M * Bertl but maybe that has gotten better since
1453615507 M * gnarface hmm. noted
1453615516 M * gnarface they've always been really good for me
1453615527 M * gnarface i've got asus boards 15 years old that still work right
1453615560 M * gnarface but it's possible there's something wrong with this board
1453615598 M * gnarface this is the same one i was having random lockups a year+ ago with until i disabled the C1E feature in the bios
1453615608 M * gnarface so the CPU is also suspect
1453615693 M * gnarface that time however, we were able to get something to show up in dmesg right at the time of the hangs
1453615699 M * gnarface i just don't remember how
1453615871 M * Bertl well, usually when the CPU is working, the magic sysrq works as well
1453615914 M * Bertl so, a lockup where it doesn't work usually means that the cpu(s) are stalled somehow
1453615965 M * Bertl (given that you are not using a USB keyboard or something foolish like that :)
1453616269 M * gnarface nope, it's a ps/2 keyboard
1453616313 M * gnarface the one thing is though, i thought i enabled the sensors graphs in munin to make sure the lockup problem wasn't overheating issues...
1453616334 M * gnarface suggesting something else is causing the lockup
1453616365 M * gnarface i'll try to disable them again though to see what happens
1453616379 M * gnarface since there doesn't seem to be any real threat of overheating
1453616430 M * Bertl overheating should be detected during stress testing a system
1453616621 M * gnarface well it didn't overheat when i compiled this kernel on it
1453616645 M * gnarface and anyway it's not overheating that seems to be causing the lockup, because i'd expect some sign of it in the munin graphs
1453616701 M * gnarface the only other thing i added was i added sshd to some of the guests, and i added a couple new guests
1453616744 M * gnarface one for ioquake3, one for F.E.A.R. community edition server
1453616770 M * gnarface the chances of one of those gameservers being able to lock up the host through a kernel glitch are pretty unlikely, right?
1453616828 M * gnarface i'm not sure i'd really know what to look for as far as evidence of that happening though
1453616860 M * gnarface so i deleted the /etc/munin/sensors_* symlinks, and restarted munin-node
1453616891 M * gnarface i guess if the best course of action now is to just be patient, i will
1453618493 Q * jrklein Ping timeout: 480 seconds
1453618628 J * undefined ~undefined@00011a48.user.oftc.net
1453620548 M * Bertl off for now ... bbl
1453620549 N * Bertl Bertl_oO
1453621218 J * jrklein ~cloud@proxy.dnihost.net
1453623695 J * Blit ~miadmmam@host86-139-50-171.range86-139.btcentralplus.com
1453623736 Q * Blit Quit: Leaving
1453627200 J * Aiken ~Aiken@d63f.h.jbmb.net
1453629687 Q * AndrewLee Remote host closed the connection
1453629718 J * AndrewLee ~andrew@210.240.39.201
1453632793 J * bonbons ~bonbons@2001:a18:200:b01:69c4:a7cb:5654:a6f9
1453633835 Q * AlexanderS Quit: WeeChat 1.0.1
1453638504 J * AlexanderS ~Alexander@home.zedat.fu-berlin.de
1453664276 Q * opuk_ Ping timeout: 480 seconds
1453664867 Q * sannes Remote host closed the connection
1453666310 Q * derjohn_mobi resistance.oftc.net larich.oftc.net
1453666310 Q * Sirenia resistance.oftc.net larich.oftc.net
1453666622 J * Sirenia ~sirenia@454028b1.test.dnsbl.oftc.net
1453667187 J * derjohn_mobi ~aj@x4db29997.dyn.telefonica.de
1453671393 Q * bonbons Quit: Leaving
1453673198 Q * transacid Remote host closed the connection
1453673552 J * transacid ~transacid@transacid.de
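
[editor's note] Bertl's suggestion in the log (hammering on the temperature readings with a script to see whether the lockup rate goes up) could be sketched roughly as below. This is only an illustration, not something anyone in the channel ran. It assumes the sensor driver exposes readings through the standard hwmon sysfs files (`/sys/class/hwmon/hwmon*/temp*_input`, values in millidegrees Celsius); on some kernels the files sit under a `device/` subdirectory instead, in which case the glob pattern would need adjusting. The function names `read_temps` and `hammer` are made up for this example.

```python
import glob
import os
import time


def read_temps(hwmon_root="/sys/class/hwmon"):
    """Return {path: millidegrees C} for every temp*_input file found.

    Assumes the standard hwmon sysfs layout; hwmon_root is a parameter
    so the function can be pointed at a test directory.
    """
    readings = {}
    pattern = os.path.join(hwmon_root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                readings[path] = int(f.read().strip())
        except (OSError, ValueError):
            # A sensor that errors out or returns junk is skipped,
            # so one bad channel does not abort the whole run.
            continue
    return readings


def hammer(iterations, delay=0.0, hwmon_root="/sys/class/hwmon"):
    """Poll every sensor `iterations` times; return the count of good reads."""
    total = 0
    for _ in range(iterations):
        total += len(read_temps(hwmon_root))
        if delay:
            time.sleep(delay)
    return total
```

Something like `hammer(100000, delay=0.01)` left running for a day would be one way to try it. As Bertl points out, drivers may rate-limit or cache reads from slow sensors, so a tight loop does not necessarily mean more bus traffic, and any change in lockup rate would be statistical at best.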