From: Sylvain Beucler
Subject: [Savannah-hackers-public] Re: [gnu.org #498996] Hard-disk failures on colonialone
Date: Sat, 31 Oct 2009 11:13:51 +0100
User-agent: Mutt/1.5.20 (2009-06-14)
> On Thu, Oct 29, 2009 at 01:20:55PM -0400, Daniel Clark via RT wrote:
> > Ah I see, I was waiting for comments on this - should be able to go out
> > this weekend to do replacements / reshuffles / etc, but I need to know
> > if savannah-hackers has a strong opinion on how to proceed:
> >
> > (1) Do we keep the 1TB disks?
> > > - Now that the cause of the failure is known to be a software failure,
> > > do we forget about this, or still pursue the plan to remove 1.0TB
> > > disks that are used nowhere else at the FSF?
> >
> > That was mostly a "this makes no sense, but that's the only thing
> > that's different about that system" type of response; it is true they
> > are not used elsewhere, but if they are actually working fine I am fine
> > with doing whatever savannah-hackers wants to do.
> >
> > (2) Do we keep the 2 eSATA drives connected?
> > > - If not, do you recommend moving everything (but '/') on the 1.5TB
> > > disks?
> >
> > Again if they are working fine it's your call; however the bigger issue
> > is if you want to keep the 2 eSATA / external drives connected, since
> > that is a legitimate extra point of failure, and there are some cases
> > where errors in the external enclosure can bring a system down
> > (although it's been up and running fine for several months now).
> >
> > (3) Do we make the switch to UUIDs now?
> > > - About UUIDs, everything in fstab is using mdX, which I'd rather
> > >   not mess with.
> >
> > IMHO it would be better to mess with this when the system is less
> > critical; not using UUIDs everywhere tends to screw you during recovery
> > from hardware failures.
> >
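For later reference, the UUID switch itself should be a small change:
look up each filesystem's UUID with blkid and substitute it for the
/dev/mdX name in /etc/fstab - roughly, for each entry:

  blkid /dev/mdX
  # then replace "/dev/mdX" in the matching fstab line with
  # "UUID=<value printed by blkid>", keeping mount point, filesystem
  # type and options unchanged
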
> > And BTW totally off-topic, but eth1 on colonialone is now connected via
> > crossover ethernet cable to eth1 on savannah (and colonialone is no
> > longer on fsf 10. management network, which I believe we confirmed no
> > one cared about)
> >
> > (4) We need to change to some technique that will give us RAID1
> > redundancy even if one drive dies. I think the safest solution would be
> > to not use eSATA, and use 4 1.5TB drives all inside the computer in a
> > 1.5TB quad RAID1 array, so all 4 drives would need to fail to bring
> > savannah down. The other option would be 2 triple RAID1s using eSATA,
> > each with 2 disks inside the computer and the 3rd disk in the external
> > enclosure.
On Thu, Oct 29, 2009 at 07:29:50PM +0100, Sylvain Beucler wrote:
> Hi,
>
> As far as the hardware is concerned, I think it is best that we do
> what the FSF sysadmins think is best.
>
> We don't have access to the computer, don't really know anything about
> what it's made of, and don't understand the eSATA/internal differences.
> We're even using Xen, as you do, to ease this kind of interaction. In
> short, you're more often than not in a better position to judge the
> hardware issues.
>
>
> So:
>
> If you think it's safer to use 4x1.5TB RAID-1, then let's do that.
>
> Only, we need to discuss how to migrate the current data, since
> colonialone is already in production.
>
> In particular, fixing the DNS issues I reported would help if
> temporary relocation is needed.
I see that there are currently 4x 1.5TB disks (plus the two 1TB ones):
sda 1TB inside
sdb 1TB inside
sdc 1.5TB inside?
sdd 1.5TB inside?
sde 1.5TB external/eSATA?
sdf 1.5TB external/eSATA?
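(I'm not 100% sure which of sdc-sdf are internal and which are in the
eSATA enclosure; matching serial numbers - e.g. from 'smartctl -i
/dev/sdX' or 'ls -l /dev/disk/by-id/' - against the physical drives
when you're at the colo would remove the guesswork.)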
Here's what I started doing:
- recreated 4 partitions on sdc and sdd (but 2 of them in an extended
  partition)
- added sdc and sdd to the current RAID-1 arrays:
mdadm /dev/md0 --add /dev/sdc1
mdadm /dev/md0 --add /dev/sdd1
mdadm /dev/md1 --add /dev/sdc2
mdadm /dev/md1 --add /dev/sdd2
mdadm /dev/md2 --add /dev/sdc5
mdadm /dev/md2 --add /dev/sdd5
mdadm /dev/md3 --add /dev/sdc6
mdadm /dev/md3 --add /dev/sdd6
mdadm /dev/md0 --grow -n 4
mdadm /dev/md1 --grow -n 4
mdadm /dev/md2 --grow -n 4
mdadm /dev/md3 --grow -n 4
colonialone:~# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sdd6[4] sdc6[5] sdb4[1] sda4[0]
      955128384 blocks [4/2] [UU__]
      [>....................] recovery = 0.0% (43520/955128384) finish=730.1min speed=21760K/sec

md2 : active raid1 sdc5[2] sdd5[3] sdb3[1] sda3[0]
      19534976 blocks [4/4] [UUUU]

md1 : active raid1 sdd2[2] sdc2[3] sda2[0] sdb2[1]
      2000000 blocks [4/4] [UUUU]

md0 : active raid1 sdd1[2] sdc1[3] sda1[0] sdb1[1]
      96256 blocks [4/4] [UUUU]
- install GRUB on sdc and sdd
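The GRUB part boils down to something like this, assuming the
grub-install already shipped with the system:

  grub-install /dev/sdc
  grub-install /dev/sdd

so the machine can still boot once sda/sdb are removed.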
With this setup, the data is both on the 1TB and the 1.5TB disks.
If you confirm that this is OK, we can:
* extend this to sde and sdf,
* unplug sda+sdb and plug all the 1.5TB disks internally,
* reboot while you are at the colo, and ensure that there's no device
  renaming mess,
* add the #7 partitions in sdc/d/e/f as a new RAID device / LVM
  Physical Volume and get the remaining 500GB (see the sketch below).
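That last step would look roughly like this (/dev/md4 and the volume
group name "vg0" are placeholders to adapt to the actual setup):

  mdadm --create /dev/md4 --level=1 --raid-devices=4 \
        /dev/sdc7 /dev/sdd7 /dev/sde7 /dev/sdf7
  pvcreate /dev/md4
  vgextend vg0 /dev/md4   # or vgcreate, if we prefer a separate VG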
Can you let me know if this sounds reasonable?
--
Sylvain