[ILUG] Production email server, corrupt ext3 fs - need advice
John Molohan
john.molohan at gcd.ie
Tue May 30 16:11:36 IST 2006
Declan Moriarty wrote:
> On Fri, 2006-05-19 at 16:25 +0100, John Molohan wrote:
>
>> A little background first.
>>
>
>
>> We're thinking of the following approach:
>>
>> 1. Take the box down.
>> 2. In the SCSI host utility, run Verify Media on all disks to identify
>> and mark any bad sectors so that they are taken out of use.
>> 3. Reboot and remount /var read-only.
>> 4. Rsync a new backup.
>> 5. Run smartctl to see if it identifies any issues.
>> 6. Format /var?
>> 7. Recreate /var from backup.
>>
>> Any suggestions/additions, other approaches?
>>
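Steps 3 to 5 of the plan above could be sketched as a small dry-run script. This is only an illustration under stated assumptions: the backup destination, the device name and the particular rsync and smartctl options are not from the original post, and disks sitting behind a hardware RAID controller may not be directly visible to smartctl at all.

```python
import shlex
import subprocess

# Hypothetical values for illustration; adjust for the real machine.
BACKUP_DEST = "backuphost:/srv/backup/var/"   # assumed rsync target
DISK_DEVICE = "/dev/sda"                      # assumed device name


def build_plan(dest=BACKUP_DEST, device=DISK_DEVICE):
    """Return the commands for steps 3-5 as argument lists, in order."""
    return [
        # Step 3: remount /var read-only so nothing writes during the copy.
        ["mount", "-o", "remount,ro", "/var"],
        # Step 4: archive mode, keep hard links and numeric uid/gid.
        ["rsync", "-aH", "--numeric-ids", "/var/", dest],
        # Step 5: overall health verdict plus all SMART information.
        ["smartctl", "-H", "-a", device],
    ]


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in plan:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_plan(build_plan())  # dry run: prints the commands only
```

Keeping the default as a dry run means the commands can be reviewed before anything touches the production box; passing dry_run=False would actually execute them, one after another.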
>
> Don't let us see you posted this to so many mailing lists ;-)
>
>
> Is S.M.A.R.T. enabled? It should work on all disks and give early
> warning of the gradual degradation you seem afraid of.
>
>
>> Some questions:
>> 1. Do you think we could continue to trust these disks or should we just
>> forget it and replace them?
>>
>
> Unless they are pretty new, replace them if you're a serious outfit
> (e.g. a business). I don't know what the site is. If you are students,
> or software heads gifted a server by some company then sure, look again
> at them.
>
> If the disk is damaged, there are often more damaged sectors near the
> ones that actually don't read, either on the same track or on nearby
> tracks. So you fix those, and more go down tomorrow.
>
>
>> 2. Does anyone have any hints from the admittedly little information as to
>> whether this might be just filesystem corruption or dead disks?
>>
>
> The obvious problems you haven't thought of are variations in
> temperature in the server room, and dirt under the heat sinks. People
> put fans on top of CPUs and think they will remain cold. Dust often
> builds up between the fins of the heat sink, and then heat problems
> start.
>
>
>> 3. There were a lot of servers in the server room which all experienced
>> this slow cooking but none have shown any obvious problems so far.
>> Should we be doing something as a precaution for them?
>>
>
> See the answer to 2. Lift the fans & check.
>
>
>> 4. Is it safe to assume that this failure is probably a direct result of
>> the heat?
>>
>>
> It is likely. But it doesn't matter - it's a failure.
>
>
>
Just a quick update. It seems that the root of our problems may actually
be a buggy aacraid driver. We switched over to a backup server last
week, only to experience the exact same error, which was nice. It is
also a Dell 2650 with the same controller, so that made sense. Anyway, it
seems that if you have a Dell PERC 3/Di (an Adaptec-based controller) and
are using the 1.1.2 driver, you can trigger this bug with heavy disk I/O.
We've upgraded to 1.1.5 and have been testing since the weekend without a
repeat. I'll give an update when we know for certain.
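
For anyone wanting to check other boxes with the same controller, a rough sketch of a driver version check follows. It assumes the loaded module exposes its version string in /sys/module/aacraid/version and that the string starts with three numbers; the function names are made up for this example. Only 1.1.2 is reported above as showing the problem, so treating everything older than 1.1.5 as suspect is an assumption.

```python
import re
from pathlib import Path

# Assumed location of the loaded module's version string.
VERSION_FILE = Path("/sys/module/aacraid/version")
# Version reported above as running without a repeat of the error so far.
TESTED_VERSION = (1, 1, 5)


def parse_version(text):
    """Take the first three integers in the string as (major, minor, patch)."""
    numbers = re.findall(r"\d+", text)
    if len(numbers) < 3:
        raise ValueError(f"cannot parse driver version: {text!r}")
    return tuple(int(n) for n in numbers[:3])


def is_older_than_tested(text, tested=TESTED_VERSION):
    """True if the version is older than the tested one."""
    return parse_version(text) < tested


if __name__ == "__main__":
    if VERSION_FILE.exists():
        version = VERSION_FILE.read_text().strip()
        print(version, "older than tested:", is_older_than_tested(version))
    else:
        print("aacraid version not exposed; is the module loaded?")
```

The parser deliberately ignores whatever separators the version string uses, since the exact format may differ between kernel and vendor builds of the driver.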