[ILUG] Power outage in Esat's web farm

Niall O Broin niall at magicgoeshere.com
Tue Aug 29 21:57:33 IST 2000


John McWade doesn't know me, nor I him, and I've nothing against him
personally, but I felt I had to reply to this as I'm afraid it has a lot of
holes from an engineering point of view. I had already requested a response
from Esat which I await with interest.

On Tue, Aug 29, 2000 at 04:28:18PM +0100, John McWade wrote:

> This basically is what happened to the power at the weekend....Thunder &
> Lightning had resulted in short outages in ESB Power Supply. The Standby
> Generator kicked in and provided power to all units after the designated
> time (5 secs).At the time of these outages, the UPS supporting the Web Farm
> was in Alarm mode, which means that the Battery Unit was bypassed and
> therefore the Web Farm Units were in effect being fed by raw ESB
> Mains.

So there was NO UPS power and nobody noticed ?

> Following an inspection yesterday by the suppliers of the UPS, it was
> found that since the UPS was installed, larger (32A as opposed to the
> 16A)servers had been located in the server room. 

I don't understand this statement. What is a 32A server, or for that matter
a 16A server ? The server I'm responsible for in the web farm is connected
to a standard 13A socket, as I imagine are most of the boxes there. If there
are any which need more unusual power arrangements, Esat must surely know about
them, as they organised/approved/provisioned for their installation. And in
any event, the power requirements of any one server should have no effect on
the aggregate situation.

> This shouldn't in itself cause problems as the UPS is sized, with diversity,
> to cope with uneven loads.

What does "sized, with diversity, to cope with uneven loads" mean ? What,
for that matter, is an uneven load ? If that's to be taken to mean a load
with varying power requirements, then the average server is hardly such a
thing (ignoring startup loads, which can be considerably more than normal
due to disk drive motor start currents). And besides, in a properly run colo
facility, the UPS should never encounter a lot of simultaneous start loads,
and I wouldn't expect it to be sized to handle the total start load.
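
To put rough numbers on the start load point, here is a minimal Python
sketch. The per-server figures are invented for illustration and the model
is deliberately crude: it assumes each batch of servers has finished
spinning up, and dropped back to its running load, before the next batch is
switched on. The function name and parameters are mine, not anything Esat
or a UPS vendor uses.

```python
def peak_load_va(n_servers, run_va, start_va, batch_size):
    """Worst-case load in VA while servers are started batch_size at a time.

    Assumes (hypothetically) that every server in a batch has settled to
    run_va before the next batch is switched on.
    """
    peak = 0.0
    running = 0
    while running < n_servers:
        batch = min(batch_size, n_servers - running)
        # Already-running servers draw run_va, the starting batch draws start_va.
        peak = max(peak, running * run_va + batch * start_va)
        running += batch
    # Once everything is up, the steady-state load may itself be the peak.
    return max(peak, n_servers * run_va)


# Made-up example: 100 servers, 200 VA running, 500 VA while starting.
print(peak_load_va(100, 200, 500, batch_size=100))  # all started at once
print(peak_load_va(100, 200, 500, batch_size=5))    # staggered start
```

With figures like these, a staggered start keeps the peak close to the
steady running load, which is why I wouldn't expect the UPS to be sized for
everything starting at once.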

A UPS is sized in VA (volt-amperes) which for the purposes of this discussion
the electrically challenged can think of as watts. A small domestic UPS for
1 PC might be 500VA, medium sized commercial use might be several kVA, and
organisations such as Esat for web farms etc. will have multi kVA systems.
The systems will be designed to provide the rated VA for a certain period of
time, and will provide less for a longer time, but they won't usually
provide a whole lot more than their rating (this leads to overload, or the
alarm mode mentioned).

It would seem to me that it's not terribly difficult to size a UPS for a web
farm, and as the farm is filled, it's trivial to check on your sizing - just
measure the current load. As the technical representative for a customer of
Esat's web farm, I'd like Esat to tell me what the rated capacity of the web
farm's UPS is, and what is the current total load on that UPS.
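
The check I have in mind is simple enough to sketch in a few lines of
Python. Everything numeric here is an assumption: the 230 V nominal supply,
the measured currents and the 20 kVA rating are all made up, since I don't
know Esat's real figures (which is rather the point of the question).

```python
MAINS_VOLTS = 230.0  # assumed nominal single-phase supply voltage


def load_va(currents_amps, volts=MAINS_VOLTS):
    """Apparent power in VA drawn by a set of measured circuit currents."""
    return sum(currents_amps) * volts


def headroom(rated_va, currents_amps, volts=MAINS_VOLTS):
    """Fraction of the UPS rating still unused (negative means overload)."""
    return 1.0 - load_va(currents_amps, volts) / rated_va


# Hypothetical clamp-meter readings, one per distribution circuit, in amps.
measured = [12.5, 9.0, 14.2, 7.8]
print(load_va(measured))           # total load in VA
print(headroom(20_000, measured))  # share of an assumed 20 kVA rating left
```

If the headroom figure is near zero or negative, the UPS is undersized for
the load it is carrying, and nobody should need a thunderstorm to find that
out.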

> However, it seems that occasionally, when too many of the servers run
> together, including the larger ones, the UPS goes into overload or alarm
> mode. 

WTF does that mean ? This is a web farm - one takes it as a given that all
of the servers run together, all of the time. Perhaps John meant when too
many of the servers start together ? (I'll be back to this)

> This means that the UPS is supplying power but is not backed up by the
> batteries.

In which situation, the UPS is NOT supplying power - the ESB is supplying
power, and the current is passing through some relay / thyristor / whatever
in the UPS.


> When the power surge, caused by the servers rebooting simultaneously, is
> over the UPS can be reset and therefore run on the battery again.

But why did the servers reboot simultaneously ? This can only happen in a
colo facility when power to said servers fails simultaneously (barring a
vanishingly unlikely conspiracy by the customers) in which case the UPS (or
more correctly, the backup power system as a whole) has already failed.

> It is likely that either the storms 

That, if true, is just NOT good enough. A backup power system of this size
must have adequate protection against lightning induced surges (says I, who
hadn't, and apparently lost a modem and a telephone/digital answering
machine in my house to the same storms)

> or the rebooting of the larger servers caused the UPS to overload and
> therefore bypass the batteries.

The rebooting of the larger servers caused the UPS to overload ? ? ? This
has got to be on a par with fighting for peace, or making love (I was nearly
rude) for virginity. The servers wouldn't have been rebooting unless the UPS
(or again more correctly, the backup power system as a whole) had already
failed.

> Because of the Generator, the servers would have lost power for
> just 5 seconds.

Which for most servers is 5 seconds too long (though it's an impressive
cut-in time for a diesel generator). And many servers (all those based on PC
ATX mainboards, including the one I'm responsible for, many Sun boxes, and
I'm sure many others) will not restart after a power outage. This is by design.
One good reason is that very often an unintentional power outage will not
simply be an off - delay - on situation but there may be transient failures
for a while. It's not at all good for the equipment to have it attempt to
start and then stop again several times. In our case, the server was down
for a day (it was the weekend) before someone at my customers noticed and
called the web farm, where somebody went and pressed the on switch on the
box.
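
For anyone who does want automatic restart, the usual way to avoid the
start-stop-start problem is a hold-off: only power on once the mains has
stayed up for some minimum time. A minimal sketch in Python, assuming the
mains state is available as a list of timestamped changes (the function
name, the event format and the 30 second hold-off are all my own
inventions for illustration):

```python
def restart_time(events, holdoff_s):
    """Return the time at which it is safe to power on, or None.

    events: list of (timestamp_s, mains_ok) tuples in time order, each
    marking a change in mains state. Power-on is allowed only once the
    mains has stayed up for holdoff_s seconds without another change.
    The final state in the list is assumed to persist indefinitely.
    """
    for i, (t, ok) in enumerate(events):
        if not ok:
            continue
        # Time of the next state change, or "never" if this is the last one.
        next_change = events[i + 1][0] if i + 1 < len(events) else float("inf")
        if next_change - t >= holdoff_s:
            return t + holdoff_s
    return None


# Made-up storm: mains drops, returns briefly, drops again, then stays up.
storm = [(0, False), (5, True), (8, False), (12, True)]
print(restart_time(storm, holdoff_s=30))
```

The brief return of power at 5 seconds is ignored because it doesn't last
for the hold-off period, so the box is spared a pointless start and stop.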

> We are looking into a scenario whereby the alarm condition of the UPS is
> relayed back to our NMC, thus giving us the chance to reset the UPS before
> any further losses occur.

You're "looking into a scenario" ? This has got to be the most inept part of
the whole sorry story. Here we have an absolutely critical failure mode
(needing of course the further failure of the mains power supply to actually
trigger a disaster) and evidently the only indication is a front panel LED
and/or a beeper or some such, none of which are observable from the NMC.
What idiot thought that up ?
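
Relaying the alarm is not a hard problem. Here is a minimal Python sketch of
the idea, with the loud caveat that I have no idea what interface Esat's UPS
actually offers: read_status and notify are hypothetical stand-ins for
whatever the unit provides (alarm contacts, a serial port, SNMP) and
whatever the NMC uses for alerts, and the status strings are invented.

```python
NORMAL = "online"  # hypothetical status string for normal operation


def poll_once(read_status, notify, previous=NORMAL):
    """Read the UPS status once and report any change since the last poll.

    read_status: callable returning the current status as a string.
    notify: callable taking one message string (pager, email, NMC console).
    Returns the status just read, to be passed back in as `previous`.
    """
    state = read_status()
    if state != previous:
        notify(f"UPS status changed: {previous} -> {state}")
    return state


# Simulated run with canned readings in place of a real UPS.
readings = iter(["online", "bypass", "bypass", "online"])
alerts = []
state = NORMAL
for _ in range(4):
    state = poll_once(lambda: next(readings), alerts.append, state)
print(alerts)
```

Run something like this every minute and a UPS sitting in bypass would have
been noticed long before the mains failed.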

I'd be delighted if my customer decided to sue Esat for loss of earnings
over this, purely because I'd find it highly amusing to see someone stand up
in a court and try to defend these design decisions. I don't think the
plaintiff would even need to call an expert witness, as the defendant's
ineptness would be clear to anybody.

A little caveat - I am an electrical engineer by profession, but I was
seduced by the soft side of the force, so I'm rusty-ish on hardware and
power systems (although I did have to do a little bit of work on the latter
in the space agency, where reliable UPS is of course critical). Please feel
free to correspond with me off list (or flame me on list, if you prefer) if
you disagree technically with anything I have said here.

And my apologies again to John McWade. Please be assured that I'm shooting
the message and NOT the messenger.

And to those who think this is OT - you're probably right, although this did
affect several people here, and a lot of the boxes concerned were running
Linux. And I'm mad as hell because the fsckers demolished my uptime of 285+
days. This box needs some more disks, and I've been shuffling things for a
while to try to put off the upgrade until I had uptime of a year, which is
something I've never had before in all my years at this game. So, as the
vulture says on the T shirt - "Patience my arse - I'm going to kill
something !"




Kindest regards (though not for Esat (with the exception of Dave Rynne :-) )),




Niall  O Broin



