[Savannah-hackers-public] Re: [gnu.org #649274] colonialone ailing
From: Sylvain Beucler
Subject: [Savannah-hackers-public] Re: [gnu.org #649274] colonialone ailing
Date: Mon, 20 Dec 2010 22:49:06 +0100
User-agent: Mutt/1.5.20 (2009-06-14)
Hi,
I don't have a recommendation for the hardware issues.
However:
- Note that you back up savannah-backup.gnu.org (a 1TB VM that I think
  runs on cloud9). That VM in turn backs up colonialone at 12:00 UTC
  daily, and that backup usually takes between 30 minutes and 1 hour.
  So the time at which monolith starts is not relevant to the load on
  colonialone's disks.
- ward recommended using smaller RAID arrays as separate LVM PVs
  combined into one bigger VG, in order to minimize the amount of data
  each disk resync has to cover; if there's a way to do that during the
  downtime, please do :)
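To put rough numbers on that recommendation, here is a sketch. The block
count comes from the md3 resync log quoted below; the sustained resync rate
is an assumption (loosely calibrated against the one resync that completed),
not a measured value:

```python
# Rough arithmetic behind "smaller RAID arrays as separate LVM PVs":
# when a resync is interrupted, only the affected array starts over.
# BLOCKS is taken from the md3 log below; RATE_KIB_PER_S is assumed.

BLOCKS = 955_000_000        # 1 KiB blocks in the current single md3 array
RATE_KIB_PER_S = 70_000     # assumed sustained resync rate (~70 MB/s)

def resync_hours(blocks, rate=RATE_KIB_PER_S):
    """Hours needed to resync `blocks` 1 KiB blocks at `rate` KiB/s."""
    return blocks / rate / 3600

one_big = resync_hours(BLOCKS)          # redo everything after an interruption
one_of_four = resync_hours(BLOCKS // 4) # redo only the affected quarter

print(f"single array: {one_big:.2f} h per (re)start")
print(f"one of four smaller arrays: {one_of_four:.2f} h per (re)start")
```

With the hung tasks forcing a reboot every few hours, the smaller arrays
would have a much better chance of finishing a resync between interruptions.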
- Sylvain
On Mon, Dec 20, 2010 at 01:21:12PM -0500, Peter Olson via RT wrote:
> This is a timeline of the savannah disk problem.
>
> 2010-12-18T04:25 /dev/sdd was added to /dev/md0,1,2,3
>
> The first three were small and resynched very quickly.
>
> 2010-12-18T04:31 /dev/md3 starts to resynch /dev/sdd6 (955 million blocks).
>
> Everything is fine for a while.
>
> Dec 18 06:52:45 colonialone mdadm[5026]: Rebuild20 event detected on md device /dev/md3
>
> This means the resynch is 20% complete.
>
> I'm not sure what this is all about:
>
> Dec 18 07:00:30 colonialone kernel: [1106742.561695] EXT3-fs: INFO: recovery required on readonly filesystem.
> Dec 18 07:00:30 colonialone kernel: [1106742.568646] EXT3-fs: write access will be enabled during recovery.
> Dec 18 07:00:53 colonialone kernel: [1106765.725031] kjournald starting.  Commit interval 5 seconds
> Dec 18 07:00:53 colonialone kernel: [1106765.732416] EXT3-fs: recovery complete.
> Dec 18 07:00:53 colonialone kernel: [1106765.897655] EXT3-fs: mounted filesystem with ordered data mode.
>
> I had erroneously thought that the monolith backup starts at 07:20, but it
> actually starts at 20:07 and typically runs until midnight or 1 AM.
> Sometimes it runs longer; for example, it ran until 1 in the afternoon on
> Monday, Dec 13. Perhaps this is caused by log rotation or some other
> massive change in the file system. We are considering standardizing on
> datestamped logs for our other machines, so the log filename doesn't
> change at every rotation. This morning the backup ran until 8 AM:
>
> rsync of savannah-backup.gnu.org started at Sun Dec 12 20:07:01 EST 2010
> rsync of savannah-backup.gnu.org finished at Mon Dec 13 13:00:55 EST 2010
> rsync of savannah-backup.gnu.org started at Mon Dec 13 20:07:01 EST 2010
> rsync of savannah-backup.gnu.org finished at Mon Dec 13 22:07:03 EST 2010
> rsync of savannah-backup.gnu.org started at Tue Dec 14 20:07:01 EST 2010
> rsync of savannah-backup.gnu.org finished at Wed Dec 15 00:36:57 EST 2010
> rsync of savannah-backup.gnu.org started at Wed Dec 15 20:07:01 EST 2010
> rsync of savannah-backup.gnu.org finished at Thu Dec 16 00:59:14 EST 2010
> rsync of savannah-backup.gnu.org started at Thu Dec 16 20:07:01 EST 2010
> rsync of savannah-backup.gnu.org finished at Fri Dec 17 00:36:40 EST 2010
> rsync of savannah-backup.gnu.org started at Fri Dec 17 20:07:01 EST 2010
> rsync of savannah-backup.gnu.org finished at Sat Dec 18 00:04:56 EST 2010
> rsync of savannah-backup.gnu.org started at Sat Dec 18 20:07:01 EST 2010
> rsync of savannah-backup.gnu.org finished at Sun Dec 19 00:01:14 EST 2010
> rsync of savannah-backup.gnu.org started at Sun Dec 19 20:07:01 EST 2010
> rsync of savannah-backup.gnu.org finished at Mon Dec 20 08:17:39 EST 2010
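A quick way to spot the runs that overran is to compute each duration from
those started/finished pairs; a minimal sketch (the two sample lines are
copied verbatim from the log above):

```python
# Compute a backup duration from paired "started"/"finished" rsync log lines.
from datetime import datetime

def parse(line):
    # The timestamp is the trailing "Sun Dec 12 20:07:01 EST 2010"; drop the
    # timezone token so strptime does not have to interpret "EST".
    words = line.split()
    stamp = " ".join(words[-6:-2] + [words[-1]])
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")

def duration_hours(start_line, finish_line):
    return (parse(finish_line) - parse(start_line)).total_seconds() / 3600

d = duration_hours(
    "rsync of savannah-backup.gnu.org started at Sun Dec 12 20:07:01 EST 2010",
    "rsync of savannah-backup.gnu.org finished at Mon Dec 13 13:00:55 EST 2010",
)
print(f"Dec 12 run: {d:.1f} h")  # the run that overran into Monday afternoon
```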
>
> On Saturday the hung-task notifications started around 07:20 AM, but my
> theory of why this happened is faulty (it's not because of load from
> monolith).
>
> Dec 18 07:21:50 colonialone kernel: [1108026.400926] INFO: task blkback.5.sda2:5001 blocked for more than 120 seconds.
> Dec 18 07:21:51 colonialone kernel: [1108026.406840] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 18 07:21:51 colonialone kernel: [1108026.418761] blkback.5.sda D ffffffff8044af00 0 5001 2
> Dec 18 07:21:51 colonialone kernel: [1108026.424213] ffff8802bbc439c0 0000000000000246 0000000000000000 ffff8805df47a740
> Dec 18 07:21:51 colonialone kernel: [1108026.431214] ffff88053cfad440 ffffffff804ff460 ffff88053cfad6c0 0000000080266ba3
> Dec 18 07:21:51 colonialone kernel: [1108026.441373] 00000000ffffffff 0000000000000000 0000000000000010 ffff8805e25d4c40
> Dec 18 07:21:51 colonialone kernel: [1108026.447733] Call Trace:
> Dec 18 07:21:51 colonialone kernel: [1108026.450819] [<ffffffffa00ef395>] :raid1:wait_barrier+0xb8/0x12c
> Dec 18 07:21:51 colonialone kernel: [1108026.457985] [<ffffffff80224c8a>] default_wake_function+0x0/0xe
> Dec 18 07:21:51 colonialone kernel: [1108026.464714] [<ffffffffa00f83f8>] :dm_mod:__split_bio+0x363/0x374
> Dec 18 07:21:51 colonialone kernel: [1108026.471151] [<ffffffffa00f0ff9>] :raid1:make_request+0x73/0x5b6
> Dec 18 07:21:51 colonialone kernel: [1108026.477958] [<ffffffff802fdcc1>] elv_merged_request+0x30/0x39
> Dec 18 07:21:51 colonialone kernel: [1108026.484758] [<ffffffff80311b13>] __up_read+0x13/0x8a
> Dec 18 07:21:51 colonialone kernel: [1108026.490338] [<ffffffff802ffb1e>] generic_make_request+0x2fe/0x339
> Dec 18 07:21:51 colonialone kernel: [1108026.497154] [<ffffffff80266ba3>] mempool_alloc+0x24/0xda
> Dec 18 07:21:51 colonialone kernel: [1108026.503219] [<ffffffff80300df8>] submit_bio+0xdb/0xe2
> Dec 18 07:21:51 colonialone kernel: [1108026.509646] [<ffffffff80387254>] dispatch_rw_block_io+0x5b9/0x65d
> Dec 18 07:21:51 colonialone kernel: [1108026.515434] [<ffffffff802ffbb3>] blk_unplug+0x5a/0x60
> Dec 18 07:21:51 colonialone kernel: [1108026.522210] [<ffffffffa00f9b19>] :dm_mod:dm_table_unplug_all+0x2a/0x3d
> Dec 18 07:21:51 colonialone kernel: [1108026.529606] [<ffffffff80387747>] blkif_schedule+0x3c7/0x4dc
> Dec 18 07:21:51 colonialone kernel: [1108026.535638] [<ffffffff80387380>] blkif_schedule+0x0/0x4dc
> Dec 18 07:21:51 colonialone kernel: [1108026.541776] [<ffffffff8023f57f>] kthread+0x47/0x74
> Dec 18 07:21:51 colonialone kernel: [1108026.547366] [<ffffffff802282ec>] schedule_tail+0x27/0x5c
> Dec 18 07:21:51 colonialone kernel: [1108026.553793] [<ffffffff8020be28>] child_rip+0xa/0x12
> Dec 18 07:21:51 colonialone kernel: [1108026.559343] [<ffffffff8023f538>] kthread+0x0/0x74
> Dec 18 07:21:51 colonialone kernel: [1108026.564794] [<ffffffff8020be1e>] child_rip+0x0/0x12
>
> Note that this may not be a fatal problem; it may just be that the
> machine is very overloaded.
>
> Practically speaking it is a show-stopper, because the resynchronization
> process makes very little progress once this happens.
>
> Dec 18 07:21:51 colonialone kernel: [1108026.572483] INFO: task md3_resync:713 blocked for more than 120 seconds.
> Dec 18 07:21:51 colonialone kernel: [1108026.580017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 18 07:21:51 colonialone kernel: [1108026.588702] md3_resync D ffffffff8044af00 0 713 2
> Dec 18 07:21:51 colonialone kernel: [1108026.595326] ffff8802bbfffce0 0000000000000246 0000000000000000 ffff8805df963ee0
> Dec 18 07:21:51 colonialone kernel: [1108026.603863] ffff8805e25e7040 ffff8805e245e040 ffff8805e25e72c0 0000000200000000
> Dec 18 07:21:51 colonialone kernel: [1108026.612292] 00000000ffffffff ffffffff8023f6b6 ffff8805dfa705c0 ffffffff80221555
>
> For the DomUs it causes functional problems, such as sshd not being able
> to complete the protocol negotiation for a successful login to a VM.
>
> This may have failed on the Dom0 as well, since I wound up using servent
> to contact colonialone (perhaps because I could not ssh to it).
>
> At 9 AM Bernie rebooted colonialone:
>
> Dec 18 09:03:54 colonialone kernel: [1114157.641661] md: cannot remove active disk sdd6 from md3 ...
> Dec 18 09:04:00 colonialone kernel: [1114163.556083] raid1: Disk failure on sdd6, disabling device.
> Dec 18 09:04:00 colonialone kernel: [1114163.556086] raid1: Operation continuing on 2 devices.
> Dec 18 09:04:00 colonialone mdadm[5026]: FailSpare event detected on md device /dev/md3, component device /dev/sdd6
> Dec 18 09:04:08 colonialone shutdown[8193]: shutting down for system reboot
> Dec 18 09:04:08 colonialone init: Switching to runlevel: 6
> Dec 18 09:04:08 colonialone rpc.statd[2395]: Caught signal 15, un-registering and exiting.
> Dec 18 09:04:08 colonialone snmpd[2839]: Received TERM or STOP signal... shutting down...
>
>
> Dec 18 09:10:10 colonialone kernel: [1114537.764182] [<ffffffff8020b528>] system_call+0x68/0x6d
> Dec 18 09:10:10 colonialone kernel: [1114537.770695] [<ffffffff8020b4c0>] system_call+0x0/0x6d
> Dec 18 09:10:10 colonialone kernel: [1114537.776042]
>
> Dec 18 09:12:00 colonialone kernel: imklog 3.18.6, log source = /proc/kmsg started.
> Dec 18 09:12:00 colonialone rsyslogd: [origin software="rsyslogd" swVersion="3.18.6" x-pid="2509" x-info="http://www.rsyslog.com"] restart
> Dec 18 09:12:00 colonialone kernel: [    0.000000] Initializing cgroup subsys cpuset
> Dec 18 09:12:00 colonialone kernel: [    0.000000] Initializing cgroup subsys cpu
> Dec 18 09:12:00 colonialone kernel: [    0.000000] Linux version 2.6.26-2-xen-amd64 (Debian 2.6.26-26lenny1) (address@hidden) (gcc version 4.1.3 20080704 (prerelease) (Debian 4.1.2-25)) #1 SMP Thu Nov 25 06:39:26 UTC 2010
> Dec 18 09:12:00 colonialone kernel: [    0.000000] Command line: root=/dev/md2 ro console=ttyS0,115200n8 console=tty0 acpi=off
>
> Here's where /dev/sda6 and /dev/sdd6 are rejected:
>
> Dec 18 09:12:00 colonialone kernel: [ 10.978740] md: md3 stopped.
> Dec 18 09:12:00 colonialone kernel: [ 11.050788] md: bind<sda6>
> Dec 18 09:12:00 colonialone kernel: [ 11.053846] md: bind<sdb6>
> Dec 18 09:12:00 colonialone kernel: [ 11.057885] md: bind<sdd6>
> Dec 18 09:12:00 colonialone kernel: [ 11.061587] md: bind<sdc6>
> Dec 18 09:12:00 colonialone kernel: [   11.064449] md: kicking non-fresh sdd6 from array!
> Dec 18 09:12:00 colonialone kernel: [   11.069346] md: unbind<sdd6>
> Dec 18 09:12:00 colonialone kernel: [   11.072410] md: export_rdev(sdd6)
> Dec 18 09:12:00 colonialone kernel: [   11.075384] md: kicking non-fresh sda6 from array!
> Dec 18 09:12:00 colonialone kernel: [   11.080268] md: unbind<sda6>
> Dec 18 09:12:00 colonialone kernel: [   11.083244] md: export_rdev(sda6)
> Dec 18 09:12:00 colonialone kernel: [   11.086655] md: md3: raid array is not clean -- starting background reconstruction
> Dec 18 09:12:00 colonialone kernel: [   11.106510] raid1: raid set md3 active with 2 out of 3 mirrors
>
> Here is some information from SMART (Bernie started smartd Dec 18 19:46,
> so we have no data prior to that to compare):
>
> SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 111 to 112 sdb
> SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 111 to 112 sdc
> SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 112 to 114 sdc
> SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 112 to 115 sdb
> SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 114 to 115 sdc
> SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 115 to 116 sdb
> SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 115 to 116 sdc
> SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 116 to 117 sdb
> SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 116 to 117 sdc
> SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 118 sdb
> SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 119 to 111 sdc
> SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 70 to 71 sdd
> SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 71 to 70 sdd
> SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 71 to 72 sdc
> SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 72 to 71 sdc
> SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 72 to 73 sda
> SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 73 to 72 sda
> SMART Usage Attribute: 194 Temperature_Celsius changed from 27 to 28 sda
> SMART Usage Attribute: 194 Temperature_Celsius changed from 28 to 27 sda
> SMART Usage Attribute: 194 Temperature_Celsius changed from 28 to 29 sdc
> SMART Usage Attribute: 194 Temperature_Celsius changed from 29 to 28 sdc
> SMART Usage Attribute: 194 Temperature_Celsius changed from 29 to 30 sdd
> SMART Usage Attribute: 194 Temperature_Celsius changed from 30 to 29 sdd
> SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 41 to 48 sdc
> SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 43 to 44 sdc
> SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 44 to 43 sdc
> SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 45 to 44 sdc
> SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 48 to 45 sdc
> SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 48 to 49 sdb
> SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 49 to 50 sdb
> SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 50 to 49 sdb
> SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 50 to 51 sdb
> SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 51 to 50 sdb
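Those change lines are easier to compare per device; a minimal parsing
sketch (the sample lines are copied from the list above, truncated to three
for brevity):

```python
# Tally which SMART attributes changed on which device.
from collections import Counter

lines = [
    "SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 111 to 112 sdb",
    "SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 41 to 48 sdc",
    "SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 43 to 44 sdc",
]

counts = Counter()
for line in lines:
    words = line.split()
    attr, dev = words[4], words[-1]  # attribute name, device name
    counts[(dev, attr)] += 1

for (dev, attr), n in counts.most_common():
    print(f"{dev} {attr}: {n} change(s)")
```

Run over the full list, this makes it obvious that the Hardware_ECC_Recovered
churn is concentrated on sdb and sdc.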
>
> Saturday evening we power cycled at 18:28:
>
> Dec 18 18:28:05 colonialone kernel: [26770.798179] [<ffffffff8023f538>] kthread+0x0/0x74
> Dec 18 18:28:05 colonialone kernel: [26770.803143] [<ffffffff8020be1e>] child_rip+0x0/0x12
> Dec 18 18:28:05 colonialone kernel: [26770.811011]
> Dec 18 18:38:00 colonialone kernel: imklog 3.18.6, log source = /proc/kmsg started.
> Dec 18 18:38:00 colonialone rsyslogd: [origin software="rsyslogd" swVersion="3.18.6" x-pid="2562" x-info="http://www.rsyslog.com"] restart
> Dec 18 18:38:00 colonialone kernel: [    0.000000] Initializing cgroup subsys cpuset
> Dec 18 18:38:00 colonialone kernel: [    0.000000] Initializing cgroup subsys cpu
>
> Sunday morning, the resynchronization reached 81% before things went sour.
>
> Dec 19 07:16:36 colonialone smartd[7514]: Device: /dev/sdb, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 102 to 113
> Dec 19 07:16:36 colonialone smartd[7514]: Device: /dev/sdb, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 51 to 52
> Dec 19 07:16:36 colonialone smartd[7514]: Device: /dev/sdc, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 120
> Dec 19 07:16:36 colonialone smartd[7514]: Device: /dev/sdc, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 47 to 48
>
> Dec 19 07:17:01 colonialone /USR/SBIN/CRON[3261]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
> Dec 19 07:17:19 colonialone kernel: [45888.504962] INFO: task md3_resync:3307 blocked for more than 120 seconds.
> Dec 19 07:17:19 colonialone kernel: [45888.510214] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 19 07:17:19 colonialone kernel: [45888.519215] md3_resync D ffff8805bef5be9c 0 3307 2
> Dec 19 07:17:19 colonialone kernel: [45888.527748] ffff8805bef5bce0 0000000000000246 0000000000000001 ffff8805df67dee0
> Dec 19 07:17:19 colonialone kernel: [45888.535827] ffff8805e25a2480 ffff8805e2704e80 ffff8805e25a2700 0000000300000000
> Dec 19 07:17:19 colonialone kernel: [45888.543885] ffff8805bef5bca0 ffffffff8023f6b6 ffff8805df50a140 ffffffff80221555
>
> (It is this event, starting at 07:17, which made me realize my theory
> about the monolith backups was wrong.)
>
> Perhaps something else happens shortly after 7 AM each day? But what
> about the timing of the other failures?
>
> For planning the resynchronization of two more drives into the arrays,
> here is the timing of events:
>
> Dec 18 06:52:45 colonialone mdadm[5026]: Rebuild20 event detected on md device /dev/md3
> hung tasks at 07:21
> rebooted at 9 AM due to overload
> hung tasks at 09:41
> Bernie takes down the VMs at 10:21
> rebooted at 11 AM
>
> Dec 18 13:34:39 colonialone mdadm[4980]: Rebuild20 event detected on md device /dev/md3
> Dec 18 16:15:38 colonialone mdadm[4980]: Rebuild40 event detected on md device /dev/md3
> hung tasks at 18:13
> rebooted at 18:38
>
> Dec 18 23:29:03 colonialone mdadm[4987]: Rebuild20 event detected on md device /dev/md3
> Dec 19 01:59:03 colonialone mdadm[4987]: Rebuild40 event detected on md device /dev/md3
> Dec 19 04:08:03 colonialone mdadm[4987]: Rebuild60 event detected on md device /dev/md3
> Dec 19 06:48:04 colonialone mdadm[4987]: Rebuild80 event detected on md device /dev/md3
> hung tasks at 07:17
> rebooted at 07:30
>
> Dec 19 07:36:26 colonialone mdadm[5098]: RebuildStarted event detected on md device /dev/md3
> Dec 19 08:14:26 colonialone mdadm[5098]: Rebuild20 event detected on md device /dev/md3
> Dec 19 08:55:26 colonialone mdadm[5098]: Rebuild40 event detected on md device /dev/md3
> Dec 19 09:38:26 colonialone mdadm[5098]: Rebuild60 event detected on md device /dev/md3
> Dec 19 10:27:26 colonialone mdadm[5098]: Rebuild80 event detected on md device /dev/md3
> Dec 19 11:19:24 colonialone mdadm[5098]: RebuildFinished event detected on md device /dev/md3
> rebooted at 11:22 as a countermeasure in case some corruption had
> accumulated (no theory about this), and also to prove that the RAID
> would be synchronized after the boot
> no more hung tasks since
>
> So it appears that if we accept 4 hours of downtime (no VMs running) we
> can add one disk to the RAID. Adding two disks would take longer, but
> probably not a lot longer. Trying to add disks without an explanation of
> why the overloads take place seems risky.
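The roughly 4-hour figure matches the one resync that did complete (Dec 19);
computing it from the mdadm timestamps above:

```python
# Check the "4 hours of downtime" figure against the one resync that
# completed (timestamps from the Dec 19 mdadm events above).
from datetime import datetime

started = datetime(2010, 12, 19, 7, 36, 26)    # RebuildStarted
finished = datetime(2010, 12, 19, 11, 19, 24)  # RebuildFinished

hours = (finished - started).total_seconds() / 3600
print(f"successful resync took {hours:.2f} h")
```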
>
> Peter Olson
> FSF Senior Systems Administrator