[ILUG] Systems crashing on disk activity
Niall O Broin
niall at linux.ie
Sun Apr 18 14:34:50 IST 2004
I have a nasty problem with which some of you are already familiar but
I'm throwing it open to the wider community to see if anyone has any
ideas.
I admin a couple of servers which are hosted by Rackspace. Recently, we
have upgraded those boxes. The new hardware has an AMD Athlon XP 2600+
with a VIA chipset, 1GB of RAM and 2x36GB SCSI disks in RAID-1 on a
MegaRAID controller.
We migrated one important client to one of these servers and the bloody
box crashed with a SCSI timeout error on the console. It was rebooted
and worked away until it crashed again with the same symptoms - we could
ping it, but it wasn't serving pages, and we couldn't ssh to it.
At that point, I asked that Rackspace replace it with new hardware which
they did. My mother having reared no idiots, I proceeded to beat on this
box's disks and it seemed fine. A few hours later, I decided to run an
overnight test which lasted about 5 minutes before it died again. You
can guess how thrilled I was.
On Friday I had a telephone conference with our Rackspace account manager
and a senior Rackspace technician. They were very concerned because they
really found one hardware failure unlikely, and couldn't conceive of
there being two in a row. However, the problem was clearly hardware
related, and not a product of anything I was doing. So, they agreed to
deploy a THIRD server and I would test it and the second box over the
weekend.
As it happened, all three boxes (serv31, serv32, serv33) were still
online (Rackspace has a LOT of hardware - years ago we migrated a server
and the old one was still online months later, lost and forgotten in a
rack somewhere) so I decided to run a little test on all three. The test
was this script
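#!/bin/sh
# Disk hammer: copy ~2GB onto the root filesystem, delete it, repeat.
while true
do
    rsync -a web /                  # copy ./web to /web
    sleep 60
    rm -fr /web                     # remove the copy again
    date >> /root/hammer.count      # log one line per completed iteration
done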
web being a directory with about 2GB of data in it.
This test generates somewhat more disk I/O than the box would normally
see, but nonetheless a solid combination of hardware, kernel and drivers
should keep running that script on RAID-1 disks until both disk drives
died of old age.
However, serv31 and serv32 died after a short time (I don't know how long,
as I deliberately did NOT have them rebooted by the ops staff until the
senior tech people at Rackspace could take a look). As I went to bed
last night, serv33 appeared OK, having carried out 20 iterations of the
test (which takes about 10 minutes to complete).
First thing this morning when I got up, I tried to ssh to serv33. It was
dead :-( I opened a ticket with Rackspace to have it rebooted and found
that it died sometime after completing 39 iterations of the test.
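For anyone repeating the test, here is a minimal sketch of how the iteration count can be read back from the log after a reboot. The function name summarise_hammer is made up for illustration. It assumes the log holds exactly one `date` line per completed iteration, as the script above writes it.

```shell
#!/bin/sh
# Hypothetical helper: summarise the hammer log after a crash and reboot.
# Assumes one `date` line per completed iteration in the log file.
summarise_hammer() {
    log="${1:-/root/hammer.count}"   # default is the path used by the test script
    if [ ! -s "$log" ]; then
        echo "no completed iterations recorded in $log"
        return 1
    fi
    # wc -l may pad its output with spaces on some systems; strip them
    count=$(wc -l < "$log" | tr -d ' ')
    echo "iterations completed: $count"
    echo "first finished: $(head -n 1 "$log")"
    echo "last finished:  $(tail -n 1 "$log")"
}
```

Calling summarise_hammer with no argument reads /root/hammer.count. The last timestamp is only a lower bound on the time of death: the box may have survived part of a further iteration, and a log line still sitting in the page cache at the moment of the crash may never have reached the disk.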
So, tomorrow I'll be having a rather fraught (I imagine) telecon with
the people from Rackspace and I'm wondering what to say to them.
It would seem that the chance of getting 3 servers deployed, all of
which have a similar hardware fault, is very small (of course, I could be
after stumbling on a bad motherboard batch - I'm assuming that these
boxes have mobo integrated RAID controllers). That leaves a kernel
problem. The kernel is 2.4.21-9.0.1.EL which hopefully means that we'll
be able to utilise Red Hat's support to help investigate the kernel if
that becomes necessary.
Do any of you have any ideas about this, or have encountered anything
remotely similar?
Niall