1582698016 J * Ghislain ~Ghislain@adsl2.aqueos.com
1582698832 M * Bertl_oO off to bed now ... have a good one everyone!
1582698835 N * Bertl_oO Bertl_zZ
1582701402 J * Ghislain1 ~Ghislain@lfbn-rei-1-57-182.w90-54.abo.wanadoo.fr
1582702546 Q * Ghislain1
1582702908 Q * Ghislain Remote host closed the connection
1582703411 M * Guy- Bertl_zZ: https://pastebin.com/G2P1Zmqm <- stacktrace for the null pointer dereference I triggered by stracing a dozen 32-bit processes from a 64-bit host using chcontext --xid 1
1582703417 M * Guy- captured using netconsole
1582704607 J * Ghislain ~Ghislain@lfbn-rei-1-57-182.w90-54.abo.wanadoo.fr
1582705475 J * hijacker ~nikolay@213.161.83.122
1582705491 Q * Ghislain Ping timeout: 480 seconds
1582705532 J * Ghislain ~Ghislain@adsl2.aqueos.com
1582706532 Q * Ghislain
1582707153 J * Ghislain ~Ghislain@adsl2.aqueos.com
1582708076 Q * hijacker Ping timeout: 480 seconds
1582708087 J * hijacker ~nikolay@149.235.255.3
1582708922 Q * Ghislain
1582709499 J * Ghislain ~Ghislain@adsl2.aqueos.com
1582720594 Q * Aiken Remote host closed the connection
1582722915 N * Bertl_zZ Bertl
1582722917 M * Bertl morning folks!
1582722964 M * Bertl Guy-: tx!
1582723067 M * Bertl what's the proprietary driver? (tainted)
1582723291 M * Guy- Bertl: zfsonlinux
1582723324 M * Bertl ah, yes, I see
1582723342 M * Guy- it's not proprietary, just not GPL
1582723422 M * Bertl can you run the [<>] through addr2line for me?
1582723543 M * Guy- Bertl: how do I make it use the kernel? specify bzImage as the executable?
1582723556 M * Bertl vmlinux
1582723564 M * Bertl (from the build tree)
1582723595 M * Guy- I don't have the build tree anymore; can I reconstruct vmlinux from the bzImage?
1582723630 M * Bertl maybe ...
I don't know what your kernel build process looks like and what information you strip away
1582723645 M * Guy- hang on, I do have the build tree in a zfs snapshot
1582723682 M * Guy- addr2line -e vmlinux ffffffff944ce8fe :(
1582723686 M * Guy- ??:0
1582723688 M * Guy- I'm probably doing it wrong?
1582723705 M * Bertl might be part of a module
1582723711 Q * AlexanderS Ping timeout: 480 seconds
1582723721 M * Bertl (in which case you need to find the module first)
1582723835 M * Bertl you might be better off with gdb in this case
1582723836 M * Guy- I grepped through the modules and only sctp_diag.ko matches inet_diag_msg_common_fill
1582723847 M * Guy- but when I run addr2line against that module, the output is the same
1582723871 M * Bertl gdb "$(modinfo -n the_module)"
1582723907 M * Bertl list *(inet_diag_msg_common_fill+0x8e)
1582723924 M * Guy- No symbol table is loaded. Use the "file" command.
1582723990 M * Bertl there is an eu-addr2line tool in elfutils
1582724005 M * Bertl try that one first, syntax would be:
1582724023 M * Bertl eu-addr2line -f -e -j
1582724029 M * Bertl (according to google :)
1582724054 M * Bertl If the debug information for a kernel module is stored in separate file (this is often the case for the kernels provided by the major Linux distros), the path to that file should be used as
1582724104 M * Guy- Bertl: hang on, sctp_diag wasn't even loaded
1582724185 M * Bertl indeed
1582724212 M * Guy- vmlinux also matches inet_diag_msg_common_fill
1582724228 M * Guy- as well as net/built-in.o
1582724275 J * AlexanderS ~Alexander@home.zedat.fu-berlin.de
1582724291 M * Guy- I can give you the entire build tree if that's any help :)
1582724455 M * Guy- maybe I need to rebuild the kernel with debug info? but won't that change the offsets?
1582724571 M * Bertl if it is easy to trigger, that shouldn't be a big problem
1582724637 M * Bertl I presume ipv6 is enabled, yes?
1582724676 M * Guy- yes
1582724764 M * Bertl run objdump -d net/ipv4/inet_diag.o on the kernel build tree and upload that please
1582724926 M * Guy- Bertl: https://pastebin.com/sAVQhqzE
1582725088 M * Guy- 6e0: f6 42 23 04 testb $0x4,0x23(%rdx)
1582725094 M * Guy- this line, if I count right...
1582725215 M * Bertl for some reason your kernel stack trace is missing the PC byte sequence
1582725245 Q * hijacker Read error: Connection reset by peer
1582725255 J * hijacker ~nikolay@149.235.255.3
1582725347 M * Bertl they are normally there where you have '' :)
1582725366 M * Bertl something like: Code: 8b 46 18 85 c0 0f 85 b1 02 00 00 c7 44 24 04 9c 86 3a c0 8d
1582725412 M * Guy- well, those lines are empty :(
1582725422 M * Guy- netconsole doesn't always capture everything
1582725434 M * Bertl you should definitely get a proper serial console
1582725454 M * Bertl you know, it's not that expensive anymore :)
1582725474 M * Guy- in terms of hardware, no :)
1582725506 M * Guy- this box even has ipmi -- if I had been logged on with SoL, I might have seen the stacktrace
1582725540 M * Bertl so, maybe rebuild the kernel with debug info, and trigger the issue while watching the serial console via IPMI 2.0?
1582725591 M * Bertl btw, the MoBo also has two serial ports, one on the I/O shield
1582725609 M * Bertl so it is basically just a matter of plugging a cable in I guess
1582725852 M * Bertl btw, BIOS update is probably necessary as well, 3.2a is current, you are on 1.0b from 2013
1582725870 M * Bertl (note: you can do that from linux :)
1582725966 M * Guy- yes, well, this is a box we don't like to reboot too often :)
1582726044 M * Guy- it'd be a shame if it didn't boot after a bios upgrade, for example
1582726299 M * Bertl just saying ...
probably your current BMC/ME is vulnerable to a bunch of exploits
1582726359 M * Guy- I think it has separate firmware from the BIOS, though
1582726413 M * Guy- but yeah, I treat all BMCs as vulnerable by default, updated or not
1582726755 M * Bertl so, the problem seems to be Linux-VServer related as far as I can 'reverse' your stack trace
1582726782 M * Bertl it looks to me like the nxi entry on the socket in question is not initialized
1582726805 M * Bertl i.e. it is not NULL, but it doesn't point to a valid nx_info either
1582726904 M * Bertl testb $0x4,0x23(%rdx)
1582726924 M * Bertl is very likely from nx_info_flags(nxi, NXF_HIDE_LBACK, 0)
1582727038 M * Bertl i.e. from this line:
1582727041 M * Bertl r->id.idiag_src[0] = nx_map_sock_lback(sk->sk_nx_info, sk->sk_rcv_saddr);
1582727335 M * Bertl so looking at the further stack trace, the socket is from sk_prot_alloc()
1582727400 M * Bertl which interestingly has a badly applied patch regarding init
1582727452 M * Bertl which might just be the problem
1582727483 M * Guy- it can return sk without calling sock_vx_init(), is that it?
1582727505 M * Bertl not really
1582727519 M * Bertl but it can init sockets which failed allocation
1582727535 M * Bertl (which can result in all kinds of problems)
1582727607 M * Guy- ah yes
1582727628 M * Guy- the closing } should be below sock_nx_init(sk), right?
1582727639 M * Bertl yup
1582727665 M * Guy- could this cause the "arekm bug"?
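The misplaced closing brace discussed above can be illustrated with a small user-space sketch. This is not the kernel's sk_prot_alloc() and not the Linux-VServer patch; struct demo_sock, demo_sock_init(), alloc_buggy(), alloc_fixed() and demo_count_bad_inits() are hypothetical names made up for illustration. The only thing it tries to show is the shape of the bug: when the init call sits below the closing brace of the NULL check, it also runs for allocations that failed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a socket object; not the kernel's struct sock. */
struct demo_sock {
    int nx_initialized;
};

/* Counts how often the init hook was handed a NULL pointer. */
static int init_on_null_calls = 0;

/* Stand-in for a per-socket context init. Real code would simply
 * dereference sk; the NULL guard exists only so this demo stays safe. */
static void demo_sock_init(struct demo_sock *sk)
{
    if (sk == NULL) {
        init_on_null_calls++;
        return;
    }
    sk->nx_initialized = 1;
}

/* Buggy shape: the closing brace is above the init call, so init
 * also runs when the allocation failed. */
static struct demo_sock *alloc_buggy(int fail)
{
    struct demo_sock *sk = fail ? NULL : calloc(1, sizeof(*sk));
    if (sk != NULL) {
        /* other per-object setup would go here */
    }
    demo_sock_init(sk);
    return sk;
}

/* Fixed shape: the closing brace is below the init call, so init
 * only runs for objects that were actually allocated. */
static struct demo_sock *alloc_fixed(int fail)
{
    struct demo_sock *sk = fail ? NULL : calloc(1, sizeof(*sk));
    if (sk != NULL) {
        demo_sock_init(sk);
    }
    return sk;
}

/* Forces one failed allocation through each variant and returns how
 * many times init was called on a NULL object. */
static int demo_count_bad_inits(void)
{
    struct demo_sock *a;
    struct demo_sock *b;

    init_on_null_calls = 0;
    a = alloc_buggy(1);
    b = alloc_fixed(1);
    assert(a == NULL && b == NULL);
    return init_on_null_calls;
}
```

Calling demo_count_bad_inits() from a main() exercises both variants with a forced allocation failure; only the buggy one hands NULL to the init hook.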
1582727696 M * Bertl a lot is possible here, in any case, it should be fixed soon
1582727723 M * Guy- the 4.4 patch is also affected
1582727749 M * Bertl yeah, I guess it happened some time ago and went unnoticed
1582727878 M * Guy- anyway, it's probably not the root cause of the arekm bug because it's also present in 4.4
1582728298 M * Bertl http://vserver.13thfloor.at/Experimental/delta-skproto-fix01.diff
1582728313 M * Bertl please check if that makes the issue go away
1582728349 M * Guy- it'll be a while (weeks)
1582728370 M * Bertl well, it has been there for weeks now ...
1582728377 M * Guy- quite
1582728394 M * Bertl so just let me know when you get there ...
1582728402 M * Guy- will do :)
1582728500 M * Bertl updated patches should be available by end of the week btw
1582728647 M * Guy- updated as in incorporating this fix, or updated as in applying cleanly to the latest 4.9 and 4.4?
1582728694 M * Guy- I think I was able to port the 4.4 patch to 4.4.214 by finding where the code moved to, but it'll sure be nice to have a clean patch
1582728799 M * Bertl both
1582728887 M * Guy- \o/
1582731545 M * arekm that patch is for "my" problem?
1582731914 M * Bertl no, but it might have caused all kinds of issues
1582731948 M * Bertl (or rather fixed :)
1582731998 M * Bertl off for now ... bbl
1582732003 N * Bertl Bertl_oO
1582734194 M * Guy- arekm: no, as the same problem was present in the 4.4 and 4.9 patch as well, and the "arekm bug" only manifests with 4.9
1582734846 M * Bertl_oO well, it might still be related though
1582735250 M * arekm it didn't fix my problem. cpu stuck
1582735469 M * Bertl_oO got a new trace to look at .. maybe I HAVE
1582735493 M * Bertl_oO *have an idea today ...
1582735545 M * arekm no trace unfortunately
1582735767 M * Bertl_oO so what is your current test routine, as it seems you readily patched and tested the kernel?
1582737003 M * arekm I'm just booting new kernel on my server where this problem was found
1582737121 M * Bertl_oO and it doesn't require any tricks or similar, just hangs?
1582737148 Q * hijacker
1582737158 M * Bertl_oO it might be worth attaching gdb to the kernel if possible
1582743241 M * arekm heh, I can lockup in small VM in qemu... but then even sysrq doesn't work and no backtrace
1582743268 M * arekm (well I got one trace, not directly from cpu lockups but some issue that lockup caused)
1582744393 J * fstd_ ~fstd@xdsl-87-78-201-170.nc.de
1582744451 M * arekm Bertl_oO: not sure if this will help http://ixion.pld-linux.org/~arekm/PLD%20Test/IMG_9964.jpg
1582744858 Q * fstd Ping timeout: 480 seconds
1582746253 M * arekm Bertl_oO: another one http://ixion.pld-linux.org/~arekm/PLD%20Test/stack1.txt
1582748198 M * arekm Bertl_oO: also http://ixion.pld-linux.org/~arekm/PLD%20Test/stack2.txt
1582750071 J * Aiken ~Aiken@b951.h.jbmb.net