1524705995 M * Bertl_oO off to bed now ... have fun!
1524705997 N * Bertl_oO Bertl_zZ
1524708571 J * guerby_ ~guerby@ip165.tetaneutral.net
1524708574 J * clopez_ ~tau@neutrino.es
1524708606 J * padde_ ~padde@patrick-nagel.net
1524708678 Q * guerby magnet.oftc.net dacia.oftc.net
1524708678 Q * Ghislain magnet.oftc.net dacia.oftc.net
1524708678 Q * padde magnet.oftc.net dacia.oftc.net
1524708678 Q * DelTree magnet.oftc.net dacia.oftc.net
1524708678 Q * Rockj magnet.oftc.net dacia.oftc.net
1524708678 Q * clopez magnet.oftc.net dacia.oftc.net
1524708678 Q * tokkee magnet.oftc.net dacia.oftc.net
1524708678 Q * Hunger magnet.oftc.net dacia.oftc.net
1524708681 N * padde_ padde
1524708687 J * DelTree ~deplagne@2a00:c70:1:213:246:56:18:2
1524708782 J * guerby ~guerby@ip165.tetaneutral.net
1524708782 J * Ghislain ~ghislain@81.56.195.31
1524708782 J * clopez ~tau@neutrino.es
1524708782 J * Rockj rockj@rockj.net
1524708782 J * tokkee ~tokkee@osprey.tokkee.org
1524708782 J * Hunger ~Hunger@zer0days.com
1524708792 Q * clopez Max SendQ exceeded
1524708793 Q * tokkee Ping timeout: 482 seconds
1524708802 Q * Ghislain Ping timeout: 482 seconds
1524708827 Q * guerby Ping timeout: 482 seconds
1524708841 J * Ghislain ~ghislain@81.56.195.31
1524708993 J * tokkee ~tokkee@osprey.tokkee.org
1524710611 Q * guerby_ Remote host closed the connection
1524710631 J * guerby ~guerby@ip165.tetaneutral.net
1524723602 M * arekm Guy-: good that I'm not alone ;)
1524724670 M * Guy- arekm: did you experiment further?
1524724677 M * Guy- (like with slightly older 4.9 kernels?)
1524724738 M * arekm Guy-: no (t, yet)
1524727445 J * nikolay ~nikolay@149.235.255.3
1524732705 N * Bertl_zZ Bertl
1524732707 M * Bertl morning folks!
1524736069 M * arekm Guy-: dmesg|grep XSAVE on problematic matchine?
1524736391 M * Guy- arekm: nothing
1524736616 M * arekm or rather dmesg from 4.9 booted kernel there
1524736637 J * _pa ~pav@2-245.dsl.iskon.hr
1524736853 Q * _pa
1524739316 M * arekm Guy-: dmesg | grep NUMA
1524739336 A * arekm has two machines on the same mainboard but with different cpu, one works, one fails on 4.9
1524739732 M * arekm https://pastebin.com/iQkD57Qp thebe is problematic one, ymir works fine
1524740048 M * Bertl weird idea, are the CPUs patched for Spectre/Meltdown?
1524740078 M * Guy- mine are very old, so unless the intel microcode loaded from the initramfs patches them, they're not patched
1524740094 M * arekm afaik intel didn't release microcode update for these cpus
1524740097 M * Guy- arekm: no hits for NUMA either (when booted from 4.9)
1524740103 M * Bertl which is very likely ... the main question is the microcode version
1524740127 M * Guy- I don't actually have it installed on this server
1524740135 M * Guy- (the intel-microcode package, I mean)
1524740144 M * Bertl i.e. is it _before_ spectre/meltdown or right after (which is known to be buggy) or most recent (which is supposed to be stable)
1524740175 M * Guy- it's the 8-year-old microcode the CPU shipped with
1524740185 M * Guy- it's two Xeon X5650 CPUs
1524740200 M * Guy- I can retry with the intel-microcode package installed
1524740212 M * Bertl no, just run the spectre/meltdown check
1524740228 M * Bertl https://github.com/speed47/spectre-meltdown-checker
1524740236 M * Bertl it provides all the necessary information
1524740391 M * Guy- Bertl: OK, it gave quite a bit of output -- should I pastebin it all, or is there a specific bit you're interested in?
1524740441 M * Bertl not interested in the vulnerabilities, only in the stability comments (if there are any)
1524740453 M * Guy- * CPU microcode is known to cause stability problems: NO (model 44 stepping 2 ucode 0x14 cpuid 0x206c2)
1524740470 M * Bertl so this one should be fine in this regard then
1524740520 A * arekm has the same results on both machines
1524740546 M * Bertl good, so we can rule that out then ... I guess
1524740580 M * Bertl the only difference I see from your paste (thebe vs ymir) is in the memory
1524740586 M * arekm the only difference I can see is NUMA where one just reports no numa while seconds makes fake node
1524740639 M * Bertl both are non NUMA and both fake the nodes
1524740642 M * arekm thebe (problematic) has E5462, ymir (ok) has X5482
1524740646 M * arekm cpu
1524740661 M * Guy- oh, sorry, I do have hits for NUMA
1524740666 M * Guy- looked in the wrong place
1524740709 M * arekm and XSAVE fpu feature which thebe doesn't have
1524740764 M * Guy- my box doesn't log anything about XSAVE though
1524740767 M * Bertl yes, but that doesn't look like a good reason
1524740782 M * Guy- arekm: do you run the same guests on both?
1524740798 M * Bertl arekm: do you have any recent stack traces from thebe?
1524740809 M * arekm Guy-: no but with the same flags, vserver features etc
1524740819 A * arekm checking boot with noxsave on ymir
1524740844 M * Guy- arekm: do you run Samba on either?
1524740851 M * arekm Guy-: no
1524740857 M * Guy- too bad
1524740860 M * arekm Bertl: nothing new
1524740907 M * Bertl you are both using xfs, yes?
1524740997 Q * Aiken Remote host closed the connection
1524741062 M * Guy- yes
1524741224 M * arekm noxsave changes nothing. Yes both use xfs and ext2/3 (via ext4 module)
1524741375 M * arekm numa_balancing=disable also nothing (previously had a typo - disabled instead of disable)
1524741424 M * arekm TSC is another different thing here
1524742175 M * Bertl we'll have to wait till we get lucky with a 'better' stack trace then I guess ...
1524742183 M * Bertl off for now ... bbl
1524742187 N * Bertl Bertl_oO
1524743883 Q * nikolay Remote host closed the connection
1524743902 J * nikolay ~nikolay@149.235.255.3
1524744674 Q * nikolay Remote host closed the connection
1524745006 J * nikolay ~nikolay@149.235.255.3
1524746946 M * arekm simple, almost empty guest didn't blow
1524747543 M * arekm but it did blow after adding IP and using poldek to install external rpm package inside
1524747576 M * arekm https://pastebin.com/t9vBkrd5
1524747776 M * Bertl_oO can you get the line numbers for the three addresses?
1524747808 M * Bertl_oO and the ^Ac is really strange ....
1524749259 M * arekm $ eu-addr2line -f -e x86_64-gcc-7.3.0/vmlinux ffffffff9e6c4747
1524749259 M * arekm ??
1524749259 M * arekm ??:0
1524749263 M * arekm ;/
1524749299 M * Bertl_oO missing debug information?
1524749315 M * arekm x86_64-gcc-7.3.0/vmlinux: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=433d1e10ace1ba8f81cad9b0f72f9644ca8bb28c, with debug_info, not stripped
1524749432 M * Bertl_oO try or even [] instead
1524749492 M * arekm eu-addr2line: cannot find symbol ''
1524749657 M * arekm starting very basic guest (bash, minilogd) and doing "wget wp.pl" in it is enough to trigger. wget doesn't get far, only to Resolving wp.pl (wp.pl)...
1524749681 M * Bertl_oO can you try with chroot instead of guest start/enter?
1524749748 M * arekm but now a bit more in traces
1524749751 M * arekm https://pastebin.com/fWxSQhyu
1524749782 M * Bertl_oO seccomp is active?
1524749812 M * arekm built in in kernel but not specially activated in any way
1524749818 M * Bertl_oO funny part the stack trace shows that nmi watchdog was disabled :)
1524749851 M * Bertl_oO but hey, it is a very late dump, you need to find the first one
1524750138 M * arekm chroot to guest dir + wget works fine
1524750197 M * Bertl_oO okay, now let's try with network context
1524750210 M * Bertl_oO and if that works fine as well, with process context only
1524750286 M * Bertl_oO i.e. with ncontext and vcontext
1524750665 M * arekm [root@thebe ~]# ncontext --create --nid 999 /bin/bash
1524750665 M * arekm New network context is 999
1524750665 M * arekm [root@thebe ~]# wget wp.pl
1524750665 M * arekm --2018-04-26 15:50:47-- http://wp.pl/
1524750665 M * arekm Resolving wp.pl (wp.pl)... failed: Temporary failure in name resolution.
1524750668 M * arekm wget: unable to resolve host address ‘wp.pl’
1524750797 M * arekm hm, did naddress --add --nid 999 --ip xx/yy in separate session, it says added but don't see IP inside "/bin/bash" via "ip a"
1524750847 M * arekm how to get ip into that network context ?
1524751025 M * arekm doh, first ip needs to be added to network stack
1524751142 M * arekm and crashed
1524751236 M * arekm have to go now
1524751250 M * Bertl_oO ah, so the network context is enough ... interesting
1524751265 M * Bertl_oO please when you find some time, try the process context only as well
1524751271 M * Bertl_oO it might confirm my theory
1524755029 Q * dustinm` Quit: Leaving
1524755372 J * dustinm` ~dustinm`@68.ip-149-56-14.net
1524755900 Q * nikolay Remote host closed the connection
1524758030 J * nikolay ~nikolay@external.oldum.net
1524759575 Q * nikolay Quit: Leaving
1524761396 M * arekm Bertl_oO: vcontext --create --xid 999 /bin/bash and wget/ping there works without lockup
1524761643 M * Bertl_oO okay great! so there is a problem in the network isolation triggering this issue
1524761660 M * Bertl_oO most likely some kind of lookup without proper locking
1524763209 M * arekm I'm trying to figure out what's different between machine
1524763237 M * arekm both use ipv6 (but thebe uses static ipv6 routes while ymir uses router advertisement delivered routes if that matters)
1524763290 M * arekm hm but only on host, not in guests
1524765630 Q * any0n Remote host closed the connection
1524765642 J * any0n ~k@7YZAAA9KB.tor-irc.dnsbl.oftc.net
1524768883 J * Aiken ~Aiken@2001:44b8:2168:1000:b26e:bfff:fe2a:b951
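
Note on reading this log: each record appears to follow the shape `<epoch seconds> <type letter> * <nick> <text>`. The sketch below is one possible way to split a single record into fields and turn the timestamp into UTC. The meanings given to the type letters (M = message, A = action, J = join, Q = quit, N = nick change) are inferred from the entries above and are an assumption, and the names `Event`, `KINDS` and `parse_record` are made up for this illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Assumed meaning of the one-letter record types (inferred from the log itself).
KINDS = {"M": "message", "A": "action", "J": "join", "Q": "quit", "N": "nick"}


@dataclass
class Event:
    when: datetime  # timestamp converted to an aware UTC datetime
    kind: str       # long name from KINDS, or the raw letter if unknown
    nick: str
    text: str       # remainder of the record; empty for a bare quit


def parse_record(record: str) -> Optional[Event]:
    """Parse one '<epoch> <type> * <nick> <text>' record, or return None."""
    parts = record.rstrip("\n").split(" ", 4)
    if len(parts) < 4 or not parts[0].isdigit() or parts[2] != "*":
        return None
    stamp, code, _, nick = parts[:4]
    text = parts[4] if len(parts) == 5 else ""
    when = datetime.fromtimestamp(int(stamp), tz=timezone.utc)
    return Event(when, KINDS.get(code, code), nick, text)


if __name__ == "__main__":
    ev = parse_record("1524750665 M * arekm [root@thebe ~]# wget wp.pl")
    print(ev.when.isoformat(), ev.kind, ev.nick, ev.text)
```

For the record used in the demo, the timestamp converts to 13:51:05 UTC on 2018-04-26, which lines up with the 15:50:47 local time printed by wget if the host runs two hours ahead of UTC (an assumption about that machine's timezone).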