1376009005 J * Romster ~romster@202.168.100.149.dynamic.rev.eftel.com
1376012369 M * SteeleNivenson hello. I have a condition on a vserver host that I cannot diagnose. It appears that one guest is consuming an unbalanced number of open files, which prevents other guests from doing things like opening ssh connections.
1376012406 M * SteeleNivenson when I get what I expected to be the max open files limit, it's much higher than the total number of open files on all guests.
1376012455 M * SteeleNivenson I got this value from the host: `cat /proc/sys/fs/file-nr`
1376012470 M * SteeleNivenson 2885056 0 2888436
1376012536 M * SteeleNivenson but running lsof on all guests results in a much lower aggregate number, even during the time when guests began complaining about "too many open files" errors.
1376012546 M * SteeleNivenson lower as in an order of magnitude lower.
1376012585 M * Bertl what kernel/patch are we talking about?
1376012628 M * SteeleNivenson this is where the story gets a little dark :)
1376012630 M * SteeleNivenson 2.6.31.6-vs2.3.0.36.24 #1 SMP Thu Nov 12 07:17:30 EST 2009
1376012678 M * SteeleNivenson I guess it's not that old, but the OS is CentOS 5.4
1376012688 M * Bertl okay, you know what the /proc/sys/fs/file-nr values mean?
1376012741 M * SteeleNivenson the doc I'm reading describes it as "The three values in file-nr denote the number of allocated file handles, the number of used file handles, and the maximum number of file handles"
1376012766 M * Bertl correct, so in your case, no file handles are currently used
1376012788 M * SteeleNivenson the "0" value in the middle seems incorrect, since there are used file handles. there are 20 active guests on the host.
1376012805 M * Bertl but you are looking on the host, right?
1376012807 M * SteeleNivenson I can assure you there are more than zero open file handles
1376012813 M * SteeleNivenson yes, on the host.
1376012842 M * Bertl so how does it look inside a guest, for example?
1376012857 M * SteeleNivenson 2884992 0 2888436
1376012870 M * SteeleNivenson this guest has more than zero files open too.
1376012893 M * SteeleNivenson 614 according to lsof.
1376012912 M * Bertl files != file handles, but I agree, it seems to not show anything on that kernel
1376012956 M * Bertl so, what are your limits on the file handles per guest?
1376012966 M * SteeleNivenson a tcp/ip socket counts as a file handle, right?
1376013039 M * SteeleNivenson is the limit on file handles per guest stored in /etc/vservers/guestname/rlimits ?
1376013092 M * Bertl yes, as nofile
1376013107 M * SteeleNivenson 8192
1376013132 M * Bertl so none of your guests should be able to have more than that in use
1376013153 M * SteeleNivenson it's the same for all 20 guests, which adds up to 163840, which is less than 2888436
1376013189 M * SteeleNivenson unfortunately, I don't have a time series on open files per guest.
1376013197 M * SteeleNivenson that's in development :)
1376013206 M * Bertl what does the current usage show per guest?
1376013252 M * SteeleNivenson using lsof, it looks like this... (one sec, prints out as a table)
1376013276 M * Bertl (for FILES and OFD in the /proc/virtual/<xid>/limits)
1376013287 M * Bertl *limit
1376013345 M * SteeleNivenson here's something weird, no guests have a /proc/virtual directory
1376013349 M * SteeleNivenson ls
1376013352 M * SteeleNivenson ww
1376013365 M * Bertl that's on the host :)
1376013385 M * SteeleNivenson oh yeah, it's on the host. does that get the information for each guest?
1376013396 M * Bertl yep, per xid
1376013409 M * SteeleNivenson ah. that explains a lot.
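[Editor's note: a minimal sketch of the host-vs-guests comparison discussed above. It assumes util-vserver's `vserver <name> exec` syntax, that lsof is installed in every guest, and that the guest name is the last column of vserver-stat output; lsof also lists mapped files, so the counts are only approximate.]

  # Host-wide counters from file-nr: allocated, unused (reads 0 on
  # these kernels, as discussed below), and maximum file handles.
  read allocated unused maximum < /proc/sys/fs/file-nr
  echo "host: allocated=$allocated unused=$unused max=$maximum"

  # Approximate open files per running guest; skip the header row and
  # the host context 0. The NAME column position is an assumption.
  for guest in $(vserver-stat | awk 'NR > 1 && $1 != 0 { print $NF }'); do
      echo "$guest: $(vserver "$guest" exec lsof 2>/dev/null | wc -l)"
  done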
1376013426 M * Bertl I just looked through the kernel code, and it seems that the sys entry file-nr doesn't contain anything useful in column 2 anymore
1376013452 A * SteeleNivenson confirms that from the other end :)
1376013528 M * SteeleNivenson cd ..
1376013532 M * SteeleNivenson sorry.
1376013759 M * SteeleNivenson one sec, my awk-foo is rusty
1376013855 M * SteeleNivenson all guests are using 6987 OFD right now
1376013884 M * SteeleNivenson and 12916 FILES
1376013904 M * Bertl so your open file descriptors should be fine, no?
1376013935 M * SteeleNivenson anecdotally, it takes about a week for someone to complain about this. I just rebooted the guest I suspect is the culprit this morning.
1376013967 M * SteeleNivenson what I'm doing now is working on a proof that the guest I suspect is in fact the culprit.
1376013993 M * SteeleNivenson so yeah, it's fine now... but it won't be next thursday
1376014009 M * Bertl I'd suggest collecting those two values (and maybe even more :) from /proc/virtual/<xid>/limit for each guest
1376014025 M * Bertl should be easy to pin down the offender
1376014121 M * SteeleNivenson agreed. in the meantime, is there a way to set this limit so one guest is isolated from the whole system? Perhaps I can halve the nofile value on the one I suspect?
1376014155 M * SteeleNivenson I'm confused because the aggregate seems like it should always be much lower than ~2.8 million.
1376014161 M * Bertl in general, the max set for any of those values (if enforced) will be limiting
1376014191 M * SteeleNivenson what do you mean by "if enforced"?
1376014229 M * Bertl well, it depends on the kernel if the check is enforced or not
1376014243 M * Bertl s/check/limit/
1376014244 M * SteeleNivenson I'm new to vserver resource limiting. I've inherited this system.
1376014292 M * SteeleNivenson oh. is there a place I can RTFM on the subject. I'm happy to dig through the kernel source.
1376014295 M * Bertl but it should be easy to test this, just create a test guest, check how much of a certain resource is used by default
1376014301 M * SteeleNivenson s/./?/
1376014342 M * Bertl then put a limit slightly over the defaul usage, and ssh into the guest just to go over the limit (and see what happens :)
1376014347 M * Bertl *default
1376014361 A * Bertl has clumsy fingers tonight ...
1376014391 M * SteeleNivenson good idea. I'll try that. thanks for your help! I gotta step out for a minute.
1376014401 M * Bertl you're welcome! np
1376025967 M * Bertl off to bed now ... have a good one everyone!
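[Editor's note: a minimal sketch of the per-guest time series suggested above, assuming the FILES and OFD rows of /proc/virtual/<xid>/limit carry the current usage in their second column; the exact column layout of the limit file depends on the kernel, so the awk fields may need adjusting. Run it from cron, e.g. every minute, then grep the log for the guest whose counts climb toward its limit around the time "too many open files" appears.]

  ts=$(date +%s)
  for limit in /proc/virtual/*/limit; do
      # Extract the xid from the path /proc/virtual/<xid>/limit.
      xid=${limit#/proc/virtual/}; xid=${xid%/limit}
      awk -v ts="$ts" -v xid="$xid" \
          '$1 == "FILES:" || $1 == "OFD:" { print ts, xid, $1, $2 }' "$limit"
  done >> /var/log/vserver-limits.log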
1376025972 N * Bertl Bertl_zZ
1376026701 Q * quasisane Ping timeout: 480 seconds
1376028016 J * quasisane ~sanep@0001267b.user.oftc.net
1376029946 Q * distemper Ping timeout: 480 seconds
1376030651 J * distemper ~user@2001:470:6f:404:4e3c:da8d:532c:2849
1376041310 P * Fog_Watch
1376044550 J * arfego ~arfego@94.26.11.6
1376045154 Q * arfego Ping timeout: 480 seconds
1376045951 Q * ircuser-1 Ping timeout: 480 seconds
1376047046 J * FireEgl FireEgl@2001:470:e5ad:1:b9ca:621a:e581:2d67
1376048494 J * ircuser-1 ~ircuser-1@35.222-62-69.ftth.swbr.surewest.net
1376048801 Q * FireEgl Ping timeout: 480 seconds
1376049335 Q * renihs Ping timeout: 480 seconds
1376049587 J * FireEgl ~FireEgl@173-25-83-57.client.mchsi.com
1376050066 J * tanjix tanjix@a.clients.kiwiirc.com
1376050088 M * tanjix hiho - when I'm doing a "vserver-stat | wc -l" I get: open(memory.stat): No such file or directory
1376050117 M * tanjix I'm running 3.9.5-vs2.3.6.5 with utils 0.30.216-pre3038
1376052846 Q * eyck Read error: Connection reset by peer
1376054245 J * eyck ~eyck@nat08.nowanet.pl
1376054683 N * Bertl_zZ Bertl
1376054686 M * Bertl morning folks!
1376054704 M * Bertl tanjix: you're sure about the kernel/tool version?
1376055201 Q * distemper Ping timeout: 480 seconds
1376055690 J * distemper ~user@2001:470:6f:404:4e3c:da8d:532c:2849
1376056279 M * tanjix Bertl: Yes, I am
1376056319 M * tanjix I'm not getting that error when just doing "vserver-stat"... piping it to another command like "wc" gives the error
1376056374 Q * distemper Quit: bye
1376056423 J * distemper ~user@2001:470:6f:404:4e3c:da8d:532c:2849
1376056622 M * Bertl well, I do not get it here with 3.9.7 and pre3038
1376056809 M * tanjix so it's a kernel issue?
1376057044 M * Bertl maybe, but it sounds more like a userspace issue
1376057053 M * Bertl specifically like an older util-vserver
1376057187 Q * hparker Quit: reboot
1376057343 J * hparker ~hparker@2001:470:1f0f:32c:beae:c5ff:fe01:b600
1376057380 M * tanjix but they are not older :-)
1376057381 M * tanjix node4:~# vserver-info
1376057382 M * tanjix Versions:
1376057382 M * tanjix Kernel: 3.9.5-vs2.3.6.5-beng
1376057382 M * tanjix VS-API: 0x00020308
1376057382 M * tanjix VCI: 0x0000000013003f11
1376057383 M * tanjix util-vserver: 0.30.216-pre3038; Oct 2 2012, 20:18:00
1376057958 J * renihs ~arf@83-65-34-34.arsenal.xdsl-line.inode.at
1376058838 Q * SteeleNivenson Quit: Leaving
1376060865 Q * distemper Ping timeout: 480 seconds
1376061981 N * l0kit Guest2784
1376061989 J * l0kit ~1oxT@0001b54e.user.oftc.net
1376062256 M * Bertl tanjix: maybe check with strace -fF and upload the output?
1376062389 Q * Guest2784 Ping timeout: 480 seconds
1376062521 Q * HarrYPotter Remote host closed the connection
1376062556 M * tanjix Bertl: http://paste.ubuntu.com/5966775/
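[Editor's note: one way to produce the trace Bertl asked for. Since the error only appears when vserver-stat writes to a pipe, the pipeline is wrapped in sh -c so strace sees the same condition; -f follows child processes and -o writes the full trace to a file suitable for a paste site.]

  strace -f -o /tmp/vserver-stat.trace sh -c 'vserver-stat | wc -l'
  # Locate the failing open of memory.stat in the trace:
  grep memory.stat /tmp/vserver-stat.trace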
1376062684 M * Bertl what does 'cat /proc/mounts' show?
1376062751 M * tanjix http://paste.ubuntu.com/5966785/
1376062792 M * Bertl seems you do not have cgroups mounted properly
1376062803 M * Bertl i.e. you did not execute the util-vserver runlevel scripts
1376062821 M * Bertl (or more precisely, your distro didn't :)
1376062855 M * Bertl which in turn means that your guests are not completely isolated
1376062869 M * tanjix hmm, which one should this be?
1376062869 M * Bertl (at least not resource-wise)
1376062892 M * Bertl /etc/init.d/util-vserver
1376062931 M * tanjix I ran it now and tried it again - the same result
1376062956 M * Bertl verify that you get a cgroup mount in /proc/mounts
1376062973 M * tanjix node4:~# cat /proc/mounts |grep cgroup
1376062973 M * Bertl and also, you need to restart all the guests for this to take effect
1376062974 M * tanjix vserver /dev/cgroup cgroup rw,relatime,perf_event,blkio,net_cls,freezer,devices,memory,cpuacct,cpu,cpuset 0 0
1376062974 M * tanjix node4:~#
1376062985 M * Bertl yep, that looks good
1376063004 M * tanjix oh, ok. then I will restart them later this evening.
1376063149 M * tanjix thank you ;)
1376063157 M * Bertl you're welcome!
1376063187 M * Bertl check that the util-vserver script is executed at host startup, otherwise you'll end up with the same incomplete guest setup next time
1376063205 M * Bertl (same goes for the vprocunhide script)
1376064513 J * bonbons ~bonbons@2001:a18:20b:a301:4032:61f8:8ba6:536c
1376064665 Q * tanjix Quit: http://www.kiwiirc.com/ - A hand crafted IRC client
1376067850 J * hijacker_ ~hijacker@cable-84-43-134-121.mnet.bg
1376073663 J * distemper ~user@2a01:198:2ee:dcb4:6085:d437:5768:a048
1376073967 N * theocrite theocrite_
1376073978 N * theocrite_ theocrite
1376075033 J * distemper_ ~user@2a01:198:2ee:dcb4:6085:d437:5768:a048
1376075033 Q * distemper Read error: Connection reset by peer
1376080277 Q * hijacker_ Quit: Leaving
1376083448 Q * jrklein Remote host closed the connection
1376083453 J * jrklein ~osx@proxy.dnihost.net
1376084188 Q * imcsk8 Remote host closed the connection
1376084438 J * imcsk8 ~ichavero@148.229.1.11
1376086499 J * SteeleNivenson ~SteeleNiv@pool-96-224-241-140.nycmny.fios.verizon.net
1376090850 J * tolkor ~rj@tdream.lly.earlham.edu
1376091041 J * ncopa_ ~test@ti0143a340-0315.bb.online.no
1376091488 Q * bonbons Quit: Leaving
1376091500 Q * ncopa Ping timeout: 480 seconds
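[Editor's note: a sketch of the startup check Bertl recommends at the end of the exchange above. Whether the host uses update-rc.d (Debian-style) or chkconfig (RHEL-style) is an assumption about the distro; use whichever applies.]

  # Run the setup scripts now: util-vserver mounts the cgroup hierarchy,
  # vprocunhide fixes /proc visibility inside guests.
  /etc/init.d/util-vserver start
  /etc/init.d/vprocunhide start
  # Enable both at boot (Debian-style):
  update-rc.d util-vserver defaults && update-rc.d vprocunhide defaults
  # (RHEL-style: chkconfig util-vserver on; chkconfig vprocunhide on)
  # Verify the cgroup mount is present:
  grep cgroup /proc/mounts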