[ILUG] Production email server, corrupt ext3 fs - need advice
junk_mail at iol.ie
Fri May 19 17:21:05 IST 2006
On Fri, 2006-05-19 at 16:25 +0100, John Molohan wrote:
> a little background first.
> We're thinking of the following approach.
> 1. Take the box down.
> 2. In the scsi host util run verify media on all disks to identify &
> mark any bad sectors and make them unavailable.
> 3. Reboot & remount /var ro
> 4. Rsync a new backup.
> 5. Run smartctl see if it identifies any issues.
> 6. Format /var?
> 7. Recreate /var from backup.
> Any suggestions/additions, other approaches?
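Before step 4 it is worth confirming that /var really did come back read-only, so the rsync does not race against a mail server still writing to a damaged filesystem. A minimal sketch in Python, assuming a Linux box where /proc/mounts is available; the function names are made up for illustration:

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point, or None if it is not mounted.

    mounts_text is the content of /proc/mounts, whose fields are:
    device, mount point, fs type, options, dump, pass.
    The last matching entry wins, since later mounts shadow earlier ones.
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")
    return options


def is_read_only(mounts_text, mount_point="/var"):
    """True only if mount_point is mounted and carries the 'ro' option."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "ro" in options


# Typical use on the server itself:
#     with open("/proc/mounts") as f:
#         print(is_read_only(f.read(), "/var"))
```

If this returns False, do not start the backup until the remount has been sorted out.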
Don't let us see you posted this to so many mailing lists ;-)
Is S.M.A.R.T. enabled? It should work on all disks, and it can warn you of
the gradual corruption you seem afraid of before a disk fails outright.
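If you want to check the health status from a script rather than by eye, something like the following might do. It is only a sketch: it assumes smartmontools is installed, that `smartctl -H` prints either the ATA-style "overall-health self-assessment test result" line or the SCSI-style "SMART Health Status" line, and the function names and device path are made up for illustration.

```python
import subprocess


def health_from_output(text):
    """Classify `smartctl -H` output as 'ok', 'failing' or 'unknown'."""
    for line in text.splitlines():
        lowered = line.strip().lower()
        # ATA disks: "SMART overall-health self-assessment test result: PASSED"
        if "overall-health self-assessment test result" in lowered:
            return "ok" if lowered.endswith("passed") else "failing"
        # SCSI disks: "SMART Health Status: OK"
        if lowered.startswith("smart health status"):
            status = lowered.partition(":")[2].strip()
            return "ok" if status == "ok" else "failing"
    # No recognised line: SMART may be disabled or unsupported.
    return "unknown"


def check_disk(device):
    """Run smartctl on one device (needs root) and classify the result."""
    result = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True
    )
    return health_from_output(result.stdout)


# Typical use, with a hypothetical device name:
#     print(check_disk("/dev/sda"))
```

An "unknown" result usually means SMART is switched off or the controller hides the disks, which is worth knowing in itself.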
> Some questions:
> 1. Do you think we could continue to trust these disks or should we just
> forget it and replace them?
Unless they are pretty new, replace them if you're a serious outfit
(e.g. a business). I don't know what the site is. If you are students,
or software heads gifted a server by some company, then sure, look again.
If a disk is damaged, there are often more damaged sectors near the ones
that fail to read, either on the same track or on nearby tracks.
So you fix those, and more go down tomorrow.
> 2. Does anyone have any hints from the admittedly little information as to
> whether this might be just filesystem corruption or dead disks?
The obvious problems you haven't thought of are variations in
temperature in the server room, and dirt under the heat sinks. People
put fans on top of CPUs and think they will stay cool. Dust often builds up
between the fins of the heat sink, and then the heat problems start.
> 3. There were a lot of servers in the server room which all experienced
> this slow cooking but none have shown any obvious problems so far.
> Should we be doing something as a precaution for them?
See the answer to 2. Lift the fans & check.
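Lifting the fans tells you about the CPUs. For the disks, most drives report their own temperature through smartctl, so you can compare the other servers without opening them. A rough sketch, assuming `smartctl -A` output in one of two common layouts (the SCSI "Current Drive Temperature" line, or the ATA attribute table row with ID 194); layouts vary between drives and smartctl versions, and the function name is made up for illustration:

```python
import re


def drive_temperature_c(text):
    """Extract a temperature in Celsius from `smartctl -A` output, or None."""
    for line in text.splitlines():
        # SCSI layout: "Current Drive Temperature:     38 C"
        match = re.match(r"\s*Current Drive Temperature:\s+(\d+)\s*C", line)
        if match:
            return int(match.group(1))
        # ATA layout: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED
        # WHEN_FAILED RAW_VALUE; the raw value of attribute 194 is the reading.
        fields = line.split()
        if len(fields) >= 10 and fields[0] == "194" and fields[9].isdigit():
            return int(fields[9])
    # Nothing recognised: the drive may not expose a temperature.
    return None
```

Run it against each server over a day or two; a disk that sits much hotter than its neighbours is the one to look at first.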
> 4. Is it safe to assume that this failure is probably a direct result of
> the heat?
It is likely. But it doesn't matter - it's a failure.
With Best Regards,