1486686819 J * thierryp ~thierry@82.226.190.44
1486687300 Q * thierryp Ping timeout: 480 seconds
1486690515 J * thierryp ~thierry@2a01:e35:2e2b:e2c0:e458:7309:c879:a970
1486691043 Q * geb Remote host closed the connection
1486691069 Q * thierryp Ping timeout: 480 seconds
1486691111 J * geb ~geb@mars.gebura.eu.org
1486692796 Q * derjohn_mob Ping timeout: 480 seconds
1486693338 J * derjohn_mob ~aj@p2003008E6C1ACE0089D804A68EE7845F.dip0.t-ipconnect.de
1486694486 J * xos4who ~xos4who@ipservice-092-210-053-169.092.210.pools.vodafone-ip.de
1486695389 J * fstd_ ~fstd@x4db5f11d.dyn.telefonica.de
1486695842 Q * fstd Ping timeout: 480 seconds
1486695842 N * fstd_ fstd
1486697849 J * thierryp ~thierry@82.226.190.44
1486698295 Q * xos4who Quit: Leaving
1486698332 Q * thierryp Ping timeout: 480 seconds
1486705125 J * thierryp ~thierry@82.226.190.44
1486705609 Q * thierryp Ping timeout: 480 seconds
1486708815 J * thierryp ~thierry@82.226.190.44
1486709300 Q * thierryp Ping timeout: 480 seconds
1486711372 M * Bertl off to bed now ... have a good one everyone!
1486711373 N * Bertl Bertl_zZ
1486712067 J * thierryp ~thierry@2a01:e35:2e2b:e2c0:e458:7309:c879:a970
1486714361 Q * Ghislain Quit: Leaving.
1486714788 Q * thierryp Remote host closed the connection
1486715745 J * nikolay ~nikolay@HOST.255.3.ixos.de
1486716291 Q * derjohn_mob Ping timeout: 480 seconds
1486717852 J * derjohn_mob ~aj@b2b-94-79-172-98.unitymedia.biz
1486718219 J * thierryp ~thierry@zeta.inria.fr
1486718980 M * Guy- on one host with vservers, when running backups, load skyrockets to 260+, with no i/o for several seconds and many processes stuck in wait_on_page_bit_killable()
1486718983 M * Guy- what could cause this?
1486718987 M * Guy- I don't see it on other hosts
1486719295 Q * _Shiva_ Quit: Operator halted - Coffee not found
1486719328 J * _Shiva_ shiva@whatcha.looking.at
1486724959 Q * Defaultti Quit: WeeChat .
1486725029 J * Defaultti defaultti@lakka.kapsi.fi
1486732747 N * Bertl_zZ Bertl
1486732751 M * Bertl morning folks!
1486732799 M * Bertl Guy-: a bad disk or something which causes high I/O delays could cause that
1486732844 M * Guy- Bertl: but I see zero %util in iostat
1486732870 M * Guy- all counters in iostat are zero, for several seconds
1486733229 M * Bertl yes, that is what a bad disk would look like
1486733249 M * Bertl double check the SMART values of all disks involved
1486733296 M * Bertl if the disk decides to try to recover a bad sector, there won't be any I/O to or from the disk, yet everything in the kernel will wait for the access to complete
1486736049 Q * thierryp Ping timeout: 480 seconds
1486736532 J * thierryp ~thierry@zeta.inria.fr
1486736589 Q * thierryp Remote host closed the connection
1486737260 J * thierryp ~thierry@zeta.inria.fr
1486737864 Q * nikolay Quit: Leaving
1486738309 Q * thierryp Remote host closed the connection
1486738326 J * thierryp ~thierry@zeta.inria.fr
1486738347 Q * thierryp Remote host closed the connection
1486741805 Q * derjohn_mob Ping timeout: 480 seconds
1486744106 J * derjohn_mob ~aj@p2003008E6C1ACE00D15142769F1D3DA2.dip0.t-ipconnect.de
1486749439 M * Bertl off for now ... bbl
1486749441 N * Bertl Bertl_oO
1486750411 J * thierryp ~thierry@2a01:e35:2e2b:e2c0:b867:5976:4a6d:1aca
1486756479 M * Guy- Bertl_oO: yes, but then I'd see 100% %util, and long await
1486756499 M * Guy- Bertl_oO: to me it looks more like no I/O is issued at all, and instead the kernel is doing... something
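
[A minimal sketch of the checks discussed above; the device names /dev/sda and /dev/sdb are placeholders for the SSDs in question, not taken from the log:

    # Overall SMART verdict, full attribute table and the device error log;
    # for SSDs, watch reallocated/pending sectors and media-wearout attributes.
    smartctl -H /dev/sda
    smartctl -A /dev/sda
    smartctl -l error /dev/sdb
    # Extended per-device statistics once per second while the load spike is
    # underway; %util pinned at 0 with load still climbing points away from
    # the block devices themselves.
    iostat -x 1
]
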
1486756545 M * Guy- I monitor the SMART values; the disks in question are even SSDs (which doesn't mean that they can't go bad, but I wouldn't expect the same kind of lengthy recovery attempts that spinning disks do)
1486756550 M * Bertl_oO so is there rising cpu usage when this happens or 'just' rising load?
1486756592 M * Bertl_oO SSDs do a lot more error correction ... it is basically an SSD's second nature to correct bad blocks
1486756610 M * Bertl_oO (but it also happens a lot faster :)
1486756707 M * Guy- Bertl_oO: CPU usage increases somewhat, but the CPU is not saturated
1486756752 M * Bertl_oO load rises continually till everything normalizes again?
1486756847 M * Guy- I'd say so
1486756856 M * Guy- the processes are not stalled completely
1486756866 M * Bertl_oO and that happens when you run the backup?
1486756869 M * Guy- they do get some work done, but for example the irc server on the box keeps splitting
1486756876 M * Guy- yes
1486756895 M * Guy- interactive shells are usable
1486756896 M * Bertl_oO did you try running the backup with low I/O priority?
1486756917 M * Guy- I'm not sure I even use the CFQ scheduler
1486756920 M * Guy- let me check
1486756943 M * Guy- scheduler on the ssds is set to none
1486756953 M * Guy- I suppose I could try with cfq
1486756974 M * Guy- but I don't understand how that would help
1486756975 M * Bertl_oO well, you might be starving the SSDs with requests
1486756983 M * Guy- but I'm not, according to iostat
1486757003 J * Gremble ~Gremble@cpc87179-aztw31-2-0-cust6.18-1.cable.virginm.net
1486757017 M * Guy- according to munin, max. %util for the two ssds is about 10%
1486757022 M * Bertl_oO it might also be the filesystem cache in the kernel if you do extensive filesystem searches/lookups
1486757045 Q * Gremble
1486757050 M * Bertl_oO i.e. if walking inode tables, etc. is required a lot
1486757071 M * Bertl_oO usually doesn't result in much I/O but keeps the memory busy
1486757108 M * Guy- well, I use rsync for backups, so that's certainly a factor
1486757109 M * Bertl_oO personally I would run some 'stress' tests and see what causes the same or similar behaviour
1486757137 M * Guy- I just started find / -ls >/dev/null
1486757145 M * Guy- that should do it
1486757165 M * Guy- sure enough, load is rising; we're at 10
1486757186 M * Guy- but it doesn't go higher apparently
1486757253 M * Bertl_oO try a find with a -printf which queries file-specific data like blocks, size, etc.
1486757280 M * Guy- -ls does that
1486757353 M * Guy- it calls newfstatat() on every entry it finds
1486757362 M * Bertl_oO okay
1486757383 M * Bertl_oO what happens if you run two find tasks in parallel?
1486757425 M * Guy- I would expect nothing much to happen - the SSDs can take it and there's even cache
1486757437 M * Guy- I can try, but I just started the backup job to be able to give more specific data
1486757728 M * Guy- well, this is unexpected...
1486757732 M * Guy- Vorführeffekt ('demo effect': it refuses to act up while you're watching)
1486757745 M * Guy- it isn't acting up now
1486757756 M * Guy- colour me puzzled
1486757767 M * Guy- I could reproduce it even this morning...
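
[A rough way to act on the two suggestions above, i.e. a metadata-heavy find stress test and running the backup at low I/O priority. The paths, the second parallel find and the rsync destination are placeholder assumptions, and the idle I/O class only takes effect under CFQ/BFQ, not under the 'none' scheduler mentioned above:

    # Roughly what 'find / -ls' does: stat every entry (newfstatat() per file)
    # and print path, size, block count and inode number.
    find / -xdev -printf '%p %s %b %i\n' > /dev/null
    # Two find tasks in parallel, as suggested:
    find / -xdev -printf '%s %b\n' > /dev/null &
    find /var -xdev -printf '%s %b\n' > /dev/null &
    wait
    # Backup at idle I/O priority and low CPU priority (source and destination
    # are placeholders; with the 'none' scheduler, ionice changes nothing).
    ionice -c3 nice -n 19 rsync -aH /srv/ backup-host:/backup/srv/
]
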
1486757778 M * Bertl_oO could be that the inode cache is now 'hot'
1486757801 M * Bertl_oO makes quite a difference
1486757808 M * Guy- based on the munin graphs, it's not hotter than normal
1486757823 M * Guy- well, my find(1) sure helped prime it, though
1486757852 M * Guy- backup job finished, no load spike
1486757980 M * Guy- I wonder how the real backup jobs that run at night will fare
1486758388 M * Bertl_oO well, we will know tomorrow, no?
1486758547 M * Guy- quite :)
1486761314 Q * derjohn_mob Remote host closed the connection
1486762531 J * derjohn_mob ~aj@p2003008E6C1ACE00B0B5952052C45A60.dip0.t-ipconnect.de
1486765031 Q * derjohn_mob Ping timeout: 480 seconds
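
[If Bertl_oO's 'hot inode cache' explanation is right, evicting the caches before re-running the backup or the find test should bring the stall back. A sketch, assuming root on the host and accepting the temporary performance hit of cold caches:

    # Write out dirty data, then drop the page cache plus dentry/inode slabs
    # (echoing 3 drops both).
    sync
    echo 3 > /proc/sys/vm/drop_caches
    # Watch the reclaimable slab (dentry/inode) memory while the test re-runs.
    grep -E 'Slab|SReclaimable' /proc/meminfo
]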