1486686819 J * thierryp ~thierry@82.226.190.44
1486687300 Q * thierryp Ping timeout: 480 seconds
1486690515 J * thierryp ~thierry@2a01:e35:2e2b:e2c0:e458:7309:c879:a970
1486691043 Q * geb Remote host closed the connection
1486691069 Q * thierryp Ping timeout: 480 seconds
1486691111 J * geb ~geb@mars.gebura.eu.org
1486692796 Q * derjohn_mob Ping timeout: 480 seconds
1486693338 J * derjohn_mob ~aj@p2003008E6C1ACE0089D804A68EE7845F.dip0.t-ipconnect.de
1486694486 J * xos4who ~xos4who@ipservice-092-210-053-169.092.210.pools.vodafone-ip.de
1486695389 J * fstd_ ~fstd@x4db5f11d.dyn.telefonica.de
1486695842 Q * fstd Ping timeout: 480 seconds
1486695842 N * fstd_ fstd
1486697849 J * thierryp ~thierry@82.226.190.44
1486698295 Q * xos4who Quit: Leaving
1486698332 Q * thierryp Ping timeout: 480 seconds
1486705125 J * thierryp ~thierry@82.226.190.44
1486705609 Q * thierryp Ping timeout: 480 seconds
1486708815 J * thierryp ~thierry@82.226.190.44
1486709300 Q * thierryp Ping timeout: 480 seconds
1486711372 M * Bertl off to bed now ... have a good one everyone!
1486711373 N * Bertl Bertl_zZ
1486712067 J * thierryp ~thierry@2a01:e35:2e2b:e2c0:e458:7309:c879:a970
1486714361 Q * Ghislain Quit: Leaving.
1486714788 Q * thierryp Remote host closed the connection
1486715745 J * nikolay ~nikolay@HOST.255.3.ixos.de
1486716291 Q * derjohn_mob Ping timeout: 480 seconds
1486717852 J * derjohn_mob ~aj@b2b-94-79-172-98.unitymedia.biz
1486718219 J * thierryp ~thierry@zeta.inria.fr
1486718980 M * Guy- on one host with vservers, when running backups, load skyrockets to 260+, with no i/o for several seconds and many processes stuck in wait_on_page_bit_killable()
1486718983 M * Guy- what could cause this?
1486718987 M * Guy- I don't see it on other hosts
1486719295 Q * _Shiva_ Quit: Operator halted - Coffee not found
1486719328 J * _Shiva_ shiva@whatcha.looking.at
1486724959 Q * Defaultti Quit: WeeChat .
1486725029 J * Defaultti defaultti@lakka.kapsi.fi
1486732747 N * Bertl_zZ Bertl
1486732751 M * Bertl morning folks!
1486732799 M * Bertl Guy-: a bad disk or something which causes high I/O delays could cause that
1486732844 M * Guy- Bertl: but I see zero %util in iostat
1486732870 M * Guy- all counters in iostat are zero, for several seconds
1486733229 M * Bertl yes, that is what a bad disk would look like
1486733249 M * Bertl double check the SMART values of all disks involved
1486733296 M * Bertl if the disk decides to try to recover a bad sector, there won't be any I/O to or from the disk, yet everything in the kernel will wait for the access to complete
1486736049 Q * thierryp Ping timeout: 480 seconds
1486736532 J * thierryp ~thierry@zeta.inria.fr
1486736589 Q * thierryp Remote host closed the connection
1486737260 J * thierryp ~thierry@zeta.inria.fr
1486737864 Q * nikolay Quit: Leaving
1486738309 Q * thierryp Remote host closed the connection
1486738326 J * thierryp ~thierry@zeta.inria.fr
1486738347 Q * thierryp Remote host closed the connection
1486741805 Q * derjohn_mob Ping timeout: 480 seconds
1486744106 J * derjohn_mob ~aj@p2003008E6C1ACE00D15142769F1D3DA2.dip0.t-ipconnect.de
1486749439 M * Bertl off for now ... bbl
1486749441 N * Bertl Bertl_oO
1486750411 J * thierryp ~thierry@2a01:e35:2e2b:e2c0:b867:5976:4a6d:1aca
1486756479 M * Guy- Bertl_oO: yes, but then I'd see 100% %util, and long await
1486756499 M * Guy- Bertl_oO: to me it looks more like no I/O is issued at all, and instead the kernel is doing... something
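
[A minimal sketch of the checks discussed above; the device names /dev/sda and /dev/sdb are placeholders for the SSDs in question, not taken from the log:

    # Overall SMART verdict, full attribute table and the device error log;
    # for SSDs, watch reallocated/pending sectors and media-wearout attributes.
    smartctl -H /dev/sda
    smartctl -A /dev/sda
    smartctl -l error /dev/sdb
    # Extended per-device statistics once per second while the load spike is
    # underway; %util pinned at 0 with load still climbing points away from
    # the block devices themselves.
    iostat -x 1
]
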
1486756545 M * Guy- I monitor the SMART values; the disks in question are even SSDs (which doesn't mean that they can't go bad, but I wouldn't expect the same kind of lengthy recovery attempts that spinning disks do)
1486756550 M * Bertl_oO so is there rising cpu usage when this happens or 'just' rising load?
1486756592 M * Bertl_oO SSDs do a lot more error correction ... it is basically an SSD's second nature to correct bad blocks
1486756610 M * Bertl_oO (but it also happens a lot faster :)
1486756707 M * Guy- Bertl_oO: CPU usage increases somewhat, but the CPU is not saturated
1486756752 M * Bertl_oO load rises continually till everything normalizes again?
1486756847 M * Guy- I'd say so
1486756856 M * Guy- the processes are not stalled completely
1486756866 M * Bertl_oO and that happens when you run the backup?
1486756869 M * Guy- they do get some work done, but for example the irc server on the box keeps splitting
1486756876 M * Guy- yes
1486756895 M * Guy- interactive shells are usable
1486756896 M * Bertl_oO did you try running the backup with low I/O priority?
1486756917 M * Guy- I'm not sure I even use the CFQ scheduler
1486756920 M * Guy- let me check
1486756943 M * Guy- scheduler on the ssds is set to none
1486756953 M * Guy- I suppose I could try with cfq
1486756974 M * Guy- but I don't understand how that would help
1486756975 M * Bertl_oO well, you might be starving the SSDs with requests
1486756983 M * Guy- but I'm not, according to iostat
1486757003 J * Gremble ~Gremble@cpc87179-aztw31-2-0-cust6.18-1.cable.virginm.net
1486757017 M * Guy- according to munin, max. %util for the two ssds is about 10%
1486757022 M * Bertl_oO it might also be the filesystem cache in the kernel if you do extensive filesystem searches/lookups
1486757045 Q * Gremble
1486757050 M * Bertl_oO i.e. if walking inode tables, etc. is required a lot
1486757071 M * Bertl_oO usually doesn't result in much I/O but keeps the memory busy
1486757108 M * Guy- well, I use rsync for backups, so that's certainly a factor
1486757109 M * Bertl_oO personally I would run some 'stress' tests and see what causes the same or similar behaviour
1486757137 M * Guy- I just started find / -ls >/dev/null
1486757145 M * Guy- that should do it
1486757165 M * Guy- sure enough, load is rising; we're at 10
1486757186 M * Guy- but it doesn't go higher apparently
1486757253 M * Bertl_oO try a find with a -printf which queries file-specific data like blocks, size, etc.
1486757280 M * Guy- -ls does that
1486757353 M * Guy- it calls newfstatat() on every entry it finds
1486757362 M * Bertl_oO okay
1486757383 M * Bertl_oO what happens if you run two find tasks in parallel?
1486757425 M * Guy- I would expect nothing much to happen - the SSDs can take it and there's even cache
1486757437 M * Guy- I can try, but I just started the backup job to be able to give more specific data
1486757728 M * Guy- well, this is unexpected...
1486757732 M * Guy- Vorführeffekt ('demo effect': it refuses to act up while you're watching)
1486757745 M * Guy- it isn't acting up now
1486757756 M * Guy- colour me puzzled
1486757767 M * Guy- I could reproduce it even this morning...
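
[A rough way to act on the two suggestions above, i.e. a metadata-heavy find stress test and running the backup at low I/O priority. The paths, the second parallel find and the rsync destination are placeholder assumptions, and the idle I/O class only takes effect under CFQ/BFQ, not under the 'none' scheduler mentioned above:

    # Roughly what 'find / -ls' does: stat every entry (newfstatat() per file)
    # and print path, size, block count and inode number.
    find / -xdev -printf '%p %s %b %i\n' > /dev/null
    # Two find tasks in parallel, as suggested:
    find / -xdev -printf '%s %b\n' > /dev/null &
    find /var -xdev -printf '%s %b\n' > /dev/null &
    wait
    # Backup at idle I/O priority and low CPU priority (source and destination
    # are placeholders; with the 'none' scheduler, ionice changes nothing).
    ionice -c3 nice -n 19 rsync -aH /srv/ backup-host:/backup/srv/
]
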
1486757778 M * Bertl_oO could be that the inode cache is now 'hot'
1486757801 M * Bertl_oO makes quite a difference
1486757808 M * Guy- based on the munin graphs, it's not hotter than normal
1486757823 M * Guy- well, my find(1) sure helped prime it, though
1486757852 M * Guy- backup job finished, no load spike
1486757980 M * Guy- I wonder how the real backup jobs that run at night will fare
1486758388 M * Bertl_oO well, we will know tomorrow, no?
1486758547 M * Guy- quite :)
1486761314 Q * derjohn_mob Remote host closed the connection
1486762531 J * derjohn_mob ~aj@p2003008E6C1ACE00B0B5952052C45A60.dip0.t-ipconnect.de
1486765031 Q * derjohn_mob Ping timeout: 480 seconds
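
[If Bertl_oO's 'hot inode cache' explanation is right, evicting the caches before re-running the backup or the find test should bring the stall back. A sketch, assuming root on the host and accepting the temporary performance hit of cold caches:

    # Write out dirty data, then drop the page cache plus dentry/inode slabs
    # (echoing 3 drops both).
    sync
    echo 3 > /proc/sys/vm/drop_caches
    # Watch the reclaimable slab (dentry/inode) memory while the test re-runs.
    grep -E 'Slab|SReclaimable' /proc/meminfo
]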