From: Sylvain Beucler
Subject: [Savannah-hackers-public] Re: [gnu.org #498996] Hard-disk failures on colonialone
Date: Sat, 31 Oct 2009 11:13:51 +0100
User-agent: Mutt/1.5.20 (2009-06-14)
> On Thu, Oct 29, 2009 at 01:20:55PM -0400, Daniel Clark via RT wrote:
> > Ah I see, I was waiting for comments on this - should be able to go out
> > this weekend to do replacements / reshuffles / etc, but I need to know
> > if savannah-hackers has a strong opinion on how to proceed:
> >
> > (1) Do we keep the 1TB disks?
> > > - Now that the cause of the failure is known to be a software failure,
> > > do we forget about this, or still pursue the plan to remove 1.0TB
> > > disks that are used nowhere else at the FSF?
> >
> > That was mostly a "this makes no sense, but that's the only thing
> > that's different about that system" type of response; it is true they
> > are not used elsewhere, but if they are actually working fine I am fine
> > with doing whatever savannah-hackers wants to do.
> >
> > (2) Do we keep the 2 eSATA drives connected?
> > > - If not, do you recommend moving everything (but '/') on the 1.5TB
> > > disks?
> >
> > Again if they are working fine it's your call; however the bigger issue
> > is if you want to keep the 2 eSATA / external drives connected, since
> > that is a legitimate extra point of failure, and there are some cases
> > where errors in the external enclosure can bring a system down
> > (although it's been up and running fine for several months now).
> >
> > (3) Do we make the switch to UUIDs now?
> > > - About UUIDs, everything in fstab is using mdX, which I'd rather
> > >   not mess with.
> >
> > IMHO it would be better to mess with this when the system is less
> > critical; not using UUIDs everywhere tends to screw you during recovery
> > from hardware failures.
> >
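For later reference, the UUID switch itself should be a small change:
look up each filesystem's UUID with blkid and substitute it for the
/dev/mdX name in /etc/fstab - roughly, for each entry:

  blkid /dev/mdX
  # then replace "/dev/mdX" in the matching fstab line with
  # "UUID=<value printed by blkid>", keeping mount point, filesystem
  # type and options unchanged
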
> > And BTW totally off-topic, but eth1 on colonialone is now connected via
> > crossover ethernet cable to eth1 on savannah (and colonialone is no
> > longer on fsf 10. management network, which I believe we confirmed no
> > one cared about)
> >
> > (4) We need to change to some technique that will give us RAID1
> > redundancy even if one drive dies. I think the safest solution would be
> > to not use eSATA, and use 4 1.5TB drives all inside the computer in a
> > 1.5TB quad RAID1 array, so all 4 drives would need to fail to bring
> > savannah down. The other option would be 2 triple RAID1s using eSATA,
> > each with 2 disks inside the computer and the 3rd disk in the external
> > enclosure.
On Thu, Oct 29, 2009 at 07:29:50PM +0100, Sylvain Beucler wrote:
> Hi,
>
> As far as the hardware is concerned, I think it is best that we do
> what the FSF sysadmins think is best.
>
> We don't have access to the computer, don't really know anything about
> what it's made of, and don't understand the eSATA/internal differences.
> We're even using Xen, as you do, to ease this kind of interaction. In
> short, you're more often than not in a better position to judge the
> hardware issues.
>
>
> So:
>
> If you think it's safer to use 4x1.5TB RAID-1, then let's do that.
>
> Only, we need to discuss how to migrate the current data, since
> colonialone is already in production.
>
> In particular, fixing the DNS issues I reported would help if
> temporary relocation is needed.
I see that there are currently 4x 1.5TB disks (plus the two 1TB ones):
sda 1TB inside
sdb 1TB inside
sdc 1.5TB inside?
sdd 1.5TB inside?
sde 1.5TB external/eSATA?
sdf 1.5TB external/eSATA?
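(I'm not 100% sure which of sdc-sdf are internal and which are in the
eSATA enclosure; matching serial numbers - e.g. from 'smartctl -i
/dev/sdX' or 'ls -l /dev/disk/by-id/' - against the physical drives
when you're at the colo would remove the guesswork.)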
Here's what I started doing:
- recreated 4 partitions on sdc and sdd (but 2 of them in an extended
  partition)
- added sdc and sdd to the current RAID-1 arrays:
mdadm /dev/md0 --add /dev/sdc1
mdadm /dev/md0 --add /dev/sdd1
mdadm /dev/md1 --add /dev/sdc2
mdadm /dev/md1 --add /dev/sdd2
mdadm /dev/md2 --add /dev/sdc5
mdadm /dev/md2 --add /dev/sdd5
mdadm /dev/md3 --add /dev/sdc6
mdadm /dev/md3 --add /dev/sdd6
mdadm /dev/md0 --grow -n 4
mdadm /dev/md1 --grow -n 4
mdadm /dev/md2 --grow -n 4
mdadm /dev/md3 --grow -n 4
colonialone:~# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sdd6[4] sdc6[5] sdb4[1] sda4[0]
      955128384 blocks [4/2] [UU__]
      [>....................] recovery = 0.0% (43520/955128384) finish=730.1min speed=21760K/sec

md2 : active raid1 sdc5[2] sdd5[3] sdb3[1] sda3[0]
      19534976 blocks [4/4] [UUUU]

md1 : active raid1 sdd2[2] sdc2[3] sda2[0] sdb2[1]
      2000000 blocks [4/4] [UUUU]

md0 : active raid1 sdd1[2] sdc1[3] sda1[0] sdb1[1]
      96256 blocks [4/4] [UUUU]
- install GRUB on sdc and sdd
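The GRUB part boils down to something like this, assuming the
grub-install already shipped with the system:

  grub-install /dev/sdc
  grub-install /dev/sdd

so the machine can still boot once sda/sdb are removed.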
With this setup, the data is both on the 1TB and the 1.5TB disks.
If you confirm that this is OK, we can:
* extend this to sde and sdf,
* unplug sda+sdb and plug all the 1.5TB disks internally,
* reboot while you are at the colo, and ensure that there's no device
  renaming mess,
* add the #7 partitions in sdc/d/e/f as a new RAID device / LVM
  Physical Volume and get the remaining 500GB (see the sketch below).
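That last step would look roughly like this (/dev/md4 and the volume
group name "vg0" are placeholders to adapt to the actual setup):

  mdadm --create /dev/md4 --level=1 --raid-devices=4 \
        /dev/sdc7 /dev/sdd7 /dev/sde7 /dev/sdf7
  pvcreate /dev/md4
  vgextend vg0 /dev/md4   # or vgcreate, if we prefer a separate VG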
Can you let me know if this sounds reasonable?
--
Sylvain