Dear valued customers:

Below is a letter describing the outage at our server facility, which is provided by Superb Internet. It explains the cause of the failure. We are sorry for the inconvenience.

On the morning of Friday, December 11th, at approximately 7:30am EST, we lost the utility power feeding our DCA2 data centre due to a blown fuse in a feeding transformer (as per Dominion Power). Our three UPS systems functioned normally, and their batteries carried the load while our two generators started automatically. The generators continued running normally throughout the whole process.

 

Our generator #1 and UPS unit #1, which service rooms C, D, E and H, worked flawlessly throughout this process. Unfortunately, a series of unforeseen events involving our second generator and power system caused a disruption to our hosting infrastructure. Generator #2 carried on normally, and two of the three Automatic Transfer Switches (ATS) that feed off generator #2 transferred to generator power normally, but ATS-3 did not. We have since determined that this was caused by a loose cable on the generator power feed into ATS-3, which was re-tightened and tested yesterday. ATS-3 is the last ATS in the succession to turn on (the three ATSes are timed to increase the load on the generator gradually, with the aim of increased reliability). The two Liebert UPSes feeding customer servers switch on first and second, and the less critical but still important load, the A/C systems, goes on third. Because ATS-3 did not transfer, the A/C systems it feeds did not receive generator power. As a result, PDU-2 in room G/H overheated and its breaker tripped at around 8:20am, causing some customer servers in room G (those fed off PDU-2) to lose power. Only a small portion of customers were affected by this.

Furthermore, at 8:40am, as utility power was restored, UPS #2 (Liebert 400kVA) went into bypass mode. This is an issue that Liebert (Emerson) technicians had repaired around a year earlier by replacing essentially all of the internal parts (the "brain" of the UPS), yet it has apparently come back. The switch into bypass mode caused a power surge that impacted the advanced Liebert Static Transfer Switch (STS), which supplies redundant power from both of the primary UPS systems (APC and Liebert) to the core routers. The STS is designed for ultimate reliability: it is meant to ensure that the core network layer at DCA2 remains operational even with one or both generators down, no utility power, and one of the two UPSes down. Nevertheless, it failed, and both of its input power breakers were reset. There is no reason why this should have happened, as there were no issues whatsoever with its feed from the APC UPS. The result was approximately 40 minutes of network inaccessibility for DCA2.

Everything was corrected by around 9:20am. A fault was subsequently found in the ATS-3 wiring, and it has been repaired. The PDU-2 (room G/H) issue is still being investigated, as is the UPS #2 issue. The STS is scheduled to be diagnosed next Friday evening at 10:00pm EST, as that issue is of the gravest concern. Before that maintenance window, alternate power circuits (not fed from PDU-5, which feeds off the STS) will be brought in for the DCA2 Core routers and DistA switches, and one of their two redundant power supplies will be switched over to those circuits, so that the STS maintenance has no impact on customers. In addition, the recurring issue of UPS #2 (Liebert 400kVA) automatically entering bypass mode will be thoroughly examined next week by senior Liebert (Emerson) technicians.

Please rest assured that this series of events is being investigated and worked on as quickly as possible. Our goal is to determine all of the root causes and perform the necessary repairs by the end of next week, to ensure that such an outage cannot recur. The DCA2 data centre is designed with the maximum feasible redundancy in place, and this series of events was not foreseen in any worst-case scenario by any of the electrical engineers engaged in designing, upgrading and optimizing the systems. As such, full-scale testing and evaluation of all the systems is being launched next week.



Saturday, December 12, 2009
