[ILUG] Can someone explain a little about linux-raid ?
John P. Looney
valen at tuatha.org
Thu Dec 5 09:13:02 GMT 2002
I got badly bitten yesterday because of my lack of knowledge of Linux
RAID (and some assumptions that it would work like Solaris' Disksuite).
I'd a machine that I'd set rootfs mirroring up on, and it had been
working fine for months. I'd to take it over to the UK, and install it in
a rack.
Just before I went, I set it up to use the serial port as a console, and
messed up LILO somehow (I think I used an old lilo.conf). I booted off a
tomsrtbt disk, mounted /dev/hda5 (one half of the rootfs mirror), ran lilo
with the right config file, and the machine booted fine. I powered down,
and brought it over with me to the UK.
I brought the machine back up, changed IPs, did some other configuration.
I rebooted, scanning the boot output to make sure it was OK. One line that
made me go cold was 'Error reading /etc/mtab, I/O error'. To me, that
usually means FS corruption.
I brought the box down to single user mode, remounted / as ro, and fscked
it. I got *hundreds* of errors. Two hours left till the airplane left. I
can fix it, I thought. Did a second fsck, and everything was fine. I
rebooted, same problem...only the corruption was getting worse. I did this
once more, and suddenly the machine didn't reboot. I got 'LIL-' for a boot
prompt.
I rebooted with my handy tomsrtbt disk, and ran LILO. Because I couldn't
mount md0, I mounted hda5, and did an fsck of that. Loads of errors, all
fixable. Cool.
I rebooted without the floppy, and got massive corruption. This time
worse than before. fsck fixed it, but again I got 'LIL-'. Re-ran LILO from
tomsrtbt, rebooted, and this time the machine had a corrupted inittab. My
heart sank.
I rebooted from tomsrtbt, and noticed that /dev/hda5 was fine. /dev/hdd1
(the other half of the mirror) was screwed. So, I changed /etc/raidtab to
set hdd1 as a "failed-disk", did a "raidstop" on md0, and changed / to be
/dev/hda5 and rebooted. More filesystem corruption.
It took about two reboots before I copped on that raidstop wasn't
persistent across reboots, like it is on Solaris. Because the partition
types were set to "RAID Autodetect", every boot it was making an md0, and
even when I wasn't mounting it, it was syncing the two halves of the
mirror. Worse yet, it didn't sync from "last mounted" to "other disk", it
was always picking the corrupted disk, and syncing that to hda5, which I
had mounted, read-write, as root.
Once I changed both disks' partition types back to 83 (Linux fs), and did
an fsck, there was no more corruption. Alas, it had deleted files like
/etc/sysconfig/network and many others. I am not impressed.
Could someone with a bit more RAID knowledge than I have tell me what the
"right way to do things" was? I've a feeling it probably incorporates
"Wait till both halves of the RAID mirror sync before you reboot" or some
such...as you don't have to do this in Solaris, I didn't bother...and that
could be what caused the massive corruption.
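If waiting for the sync does turn out to be part of the answer, a check along
these lines could be run before rebooting. This is only a sketch in Python,
not a tested procedure: it assumes /proc/mdstat marks an in-progress rebuild
with the word "resync" or "recovery" on the lines following the array name,
and the function names are made up for illustration.

```python
import re


def md_resync_status(mdstat_text):
    """Map each md array name to True if a resync/recovery appears in progress.

    Assumption: a rebuild shows up as a line containing "resync" or
    "recovery" somewhere after the "mdN : ..." line for that array.
    """
    status = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            # Start of a new array stanza, e.g. "md0 : active raid1 ..."
            current = header.group(1)
            status[current] = False
        elif current and re.search(r"\b(resync|recovery)\b", line):
            status[current] = True
    return status


def safe_to_reboot(mdstat_text):
    """True only if no array reports a resync or recovery."""
    return not any(md_resync_status(mdstat_text).values())


# Usage on a live box (hypothetical):
#   with open("/proc/mdstat") as f:
#       print(safe_to_reboot(f.read()))
```

Note that this only answers "is a sync still running?". It says nothing about
which half of the mirror is being copied over which, and that was the part
that actually bit me.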
John