[ILUG] Why RAID

Rick Moen rick at linuxmafia.com
Tue Jul 13 16:45:39 IST 2004


Quoting Timothy Murphy (tim at birdsnest.maths.tcd.ie):

> Personally I just rsync to an ancient (PII) machine with a large disk.

That's backup, not redundancy.  More on that below.

> The chances of a total disk failure are negligible in my experience
> (especially with SCSI disks).
> I'd actually be more worried about 2 disks on the same machine
> being struck by lightning, pee-ed on by the cat, etc.

See, people are getting confused between three different but similar 
concepts:

o  redundancy
o  backup
o  archival storage

These protect against different threat models.  Hasn't ILUG had this
discussion before?

Here's a relevant anecdote, cross-posted from a similar discussion
elsewhere:

 From rick Tue Jul  6 12:43:27 2004
 Date: Tue, 6 Jul 2004 12:43:27 -0700
 To: luv-main at luv.asn.au
 Subject: Re: A small workgroup server
 X-Mas: Bah humbug.
 User-Agent: Mutt/1.5.5.1+cvs20040105i

[Skipping most of this discussion.  There are too many points that would
need to be covered, to do it right.]

Quoting Russell Coker (russell at coker.com.au):

> Anyone who doesn't want a single disk failure to lose all their data needs 
> RAID or a good backup.

Er...

Here's a story for you:  I was sent out to a network-consulting client,
an architecture firm in San Francisco.  Customer had delayed acting on
my urgent advice about installing proper ventilation into an area that
was being converted into a network closet, and relied on a sign on the
door saying to never close it.

On a Friday, someone closed that door, shutting off all ventilation for
the impromptu server room.  Monday afternoon, customer realised he had a 
degraded RAID1 pair, and called in the firm I was working for, to deal
with it.

First thing, upon assessing the server, I checked the Friday backup
tapes (they seemed OK), and then fetched my own personal spare hard
drive to remirror the remaining drive onto it, for safety's sake.  As I
feared would happen, the customer's surviving drive failed completely
during the remirror operation.  I was obliged to do a fresh OS install
and restore the Friday backup, as the next best thing.  Customer CEO was
extremely upset about losing an entire day's work, and complained to my
firm.  I replied that he was damned lucky to lose only that much, and
should thank me for ensuring that he had a well-tested backup regimen as
a safety net protecting the firm against fallout from management's bad
IT decisions.

Moral:  The threat model that takes out one drive of your RAID set might
very well take out the other drives, at the same time:  heat buildup,
power spikes, catastrophically failing disk controllers, PDUs committing
seppuku, fire/smoke damage, etc.  Therefore, _never ever_ rely only on 
RAID to protect data sets.
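
To put rough numbers on that moral, here is a minimal Python sketch.
The probabilities are made-up illustrative figures, not measurements,
and the function name p_lose_mirror is hypothetical.  It compares the
chance of losing both halves of a two-disk mirror when failures are
assumed independent against the chance when a shared event (heat, power
spike, controller) can kill both drives at once.

```python
def p_lose_mirror(p_independent, p_common):
    """Chance that both drives of a two-disk mirror fail in the same period.

    p_independent: chance one drive fails on its own (illustrative figure)
    p_common:      chance of a shared event that takes out both drives
                   together (illustrative figure)
    """
    both_independently = p_independent ** 2
    # Either the shared event happens, or it does not and both drives
    # happen to fail independently in the same period.
    return p_common + (1 - p_common) * both_independently


# Assumed figures: 3% per-drive failure chance, 1% shared-event chance.
naive = p_lose_mirror(0.03, 0.0)       # pretends the drives fail independently
realistic = p_lose_mirror(0.03, 0.01)  # allows for a common cause

print(f"independent failures only: {naive:.4%}")
print(f"with a common cause:       {realistic:.4%}")
```

Under these assumed figures, even a small common-cause probability
dominates the result, which is why the mirror alone is not enough and
the backup has to sit outside the shared failure domain.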

> Which brings me to one of the biggest problems with SCSI, almost no-one 
> terminates it properly!

Well, I've seldom heard of such an easily fixable problem.  Yes, dimwit
VAR hardware installers are very common, doubly so among white-box
vendors.  But it's really pretty easy to study how to do termination
right, and then check their work.

For the same reason, when I have to work with ethernet cabling
contractors, I quality-check every single run, and make them redo the
ones they stuff up, as many times as required until they get it right.

The ex-telco guys are the worst, because they're convinced they learned
everything worth knowing, thirty years ago.

-- 
Cheers,      "Transported to a surreal landscape, a young girl kills the first
Rick Moen     woman she meets, and then teams up with three complete strangers
rick at linuxmafia.com       to kill again."  -- Rick Polito's That TV Guy column,
              describing the movie _The Wizard of Oz_


