July 24th 2007 Power Incident
San Francisco Data Center

As I reflect on the last week, I’d like to begin by extending our sincere apologies to our San Francisco customers who were impacted by the power incident on July 24th, 2007. Because we strive each day to deliver our customers the world’s finest data centers, we are taking this event very seriously.

Addressing customer concerns is our top priority. In the days since the incident occurred, we identified and corrected the root source of the problem and are taking steps to prevent this type of problem from happening again. We are also making our comprehensive findings available to other data centers to try to prevent the same problem from recurring elsewhere.

The Q&A below should help answer questions you may have surrounding the incident.  If you would like to schedule time with me or anyone on my executive team in the coming days and weeks, please do not hesitate to contact me directly.

Sincerely,


Christopher M. Dolan
President and CEO
365 Main Inc.

FINAL INCIDENT FAQ
Published August 1, 2007

1.    What happened at 365 Main’s San Francisco data center on July 24th?
  • At 1:47 p.m. on Tuesday, July 24, 365 Main’s San Francisco data center was impacted by a power surge caused when transformer breakers at a local PG&E power station unexpectedly opened.  PG&E has still not determined what caused the breakers to open.
  • Following the surge, three of 365 Main’s ten back-up generators, manufactured by Hitec, failed to complete their start sequence, causing over 40% of 365 Main’s San Francisco customers to lose power to their equipment for as long as 45 minutes. The rigorously maintained and tested back-up generators within 365 Main’s San Francisco facility had successfully handled multiple power surges without incident during the facility’s five years of operations. A complete investigation of the incident began immediately.
2.    Why did back-up power fail? Why didn’t the uninterruptible power supply keep everything up and running during the outage?
  • An international team of specialists was deployed to the San Francisco site within hours of the incident to join on-site technicians and begin systematically testing the generators in search of a root cause. After four days of thorough testing around the clock, the team discovered a weakness in a small but essential component in the back-up system known as a DDEC (Detroit Diesel Electronic Controller).
  • The team discovered a setting in the DDEC that was not allowing the component to correctly reset its memory.  Erroneous data left in the DDEC’s memory subsequently caused misfiring or engine start failures on the next diesel engine call to start.
  • The investigation team discovered DDEC issues on each of the failed Hitec units and was able to successfully simulate the failure. A fix was introduced by altering the timing of a command to the DDEC component, allowing more time between the engine shut-down command and the DDEC reset command. Once this fix was introduced, the Hitec generators successfully passed more than 50 consecutive start-up sequence tests without incident. The fix was immediately applied to all 10 Hitec units.
  • This testing methodology was performed by Hitec specialists along with 365 Main’s Chief Engineer and staff. Specialists from Cupertino Electric were present during all testing and EYP Mission Critical Facilities provided independent verification of the findings. Click here for the EYP Independent Verification Report.
  • During the testing process, 365 Main published daily updates directly from the investigation team meeting minutes, allowing customers and the public at large to track progress.  A complete archive of these updates is kept here:  http://www.365main.com/status_update.html
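The timing fix described in this answer can be pictured with a toy model. The sketch below is purely illustrative and is not the DDEC's actual logic: the class `ToyController`, the `settle_seconds` parameter and every number in it are hypothetical. It only assumes what the findings state, namely that a reset issued too soon after the shut-down command leaves stale data in memory, and that stale data makes the next start fail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyController:
    """Hypothetical stand-in for an engine controller; not the real DDEC."""
    settle_seconds: float                 # assumed time needed after shut-down before a reset takes effect
    stale_data: bool = False
    shutdown_at: Optional[float] = None

    def shutdown(self, now: float) -> None:
        # Stopping the engine leaves run-time data in memory until a reset clears it.
        self.stale_data = True
        self.shutdown_at = now

    def reset(self, now: float) -> None:
        # The reset only clears memory if enough time has passed since shut-down.
        if self.shutdown_at is not None and now - self.shutdown_at >= self.settle_seconds:
            self.stale_data = False

    def start(self) -> bool:
        # A start succeeds only when no stale data is left over.
        return not self.stale_data


def stop_then_start(reset_delay: float, settle_seconds: float = 2.0) -> bool:
    """Shut down at t=0, reset after reset_delay seconds, then try to start."""
    controller = ToyController(settle_seconds=settle_seconds)
    controller.shutdown(now=0.0)
    controller.reset(now=reset_delay)
    return controller.start()
```

In this model, `stop_then_start(0.5)` fails to start while `stop_then_start(3.0)` succeeds, which mirrors the effect of allowing more time between the two commands. The real delay values are not given in the findings.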
3.    Are other data centers currently at risk?
  • Yes; however, not all Hitec customers are at risk, only customers with units containing specific versions of the DDEC. Hitec is contacting customers and arranging for adjustments to their equipment.
  • In addition to implementing the fix in San Francisco, 365 Main has already implemented the DDEC fix in its El Segundo facility.  El Segundo is the only other 365 Main facility with Hitec generators containing DDECs.  All other facilities feature other brands of generators or have different models of Hitecs.
  • 365 Main is sharing the discoveries of our investigation with other Hitec customers and we are also making these findings publicly available today.  In addition, Hitec has already expanded its preventative maintenance procedures as a direct result of discoveries made during the 365 Main investigation.  
4.    How can you keep the brand promise of “world’s finest data centers” after such an incident?
  • Maximum uptime and high reliability are essential to our customers. While no data center operator can guarantee 100% uptime, 365 Main continually strives to deliver a premium data center experience.
  • Since its inception over five years ago, 365 Main has delivered 99.9967% uptime across its five-data-center portfolio.  This includes the 45-minute outage experienced in San Francisco last week.  The San Francisco facility alone is 99.9942%, inclusive of last week’s outage.
  • In addition, we hope that the level of transparency we’ve provided during this difficult experience sets a standard for communications in our industry. We know our customers appreciate swift, factual and candid communications, which we’ve delivered throughout this investigation. Unfortunately, we are not the first data center to experience an outage, and we will certainly not be the last.
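For readers who want to relate an uptime percentage to minutes of downtime, the conversion can be sketched as follows. The FAQ does not say how its percentages were measured (for example per facility, per room or per customer, or which earlier events are counted), so this sketch should not be expected to reproduce them. The function names and the 365.25-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumed average year length


def uptime_percent(downtime_minutes: float, years: float) -> float:
    """Uptime as a percentage of the whole period."""
    total_minutes = years * MINUTES_PER_YEAR
    return 100.0 * (1.0 - downtime_minutes / total_minutes)


def downtime_minutes(uptime_pct: float, years: float) -> float:
    """Inverse conversion: minutes of downtime implied by an uptime percentage."""
    total_minutes = years * MINUTES_PER_YEAR
    return (1.0 - uptime_pct / 100.0) * total_minutes
```

For example, `uptime_percent(45, 5)` gives the uptime of a single point of service that was down for 45 minutes in five years, and `downtime_minutes(99.9942, 5)` gives the downtime that a stated percentage would imply over the same period.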
5.    What type of service level agreements (SLAs) does 365 Main have with its customers?
  • As part of their Service Level Agreements with 365 Main, 365 Main customers receive rent abatements (refunds) in the event that electrical power is dropped in the section(s) of the data center where their servers are located. 365 Main is honoring all Service Level Agreements with affected customers. 
6.    What is the back-up design in 365 Main’s San Francisco data center?
  • The San Francisco facility has ten 2.1 megawatt (MW) back-up generators to be used in the event of a loss of utility. The electrical design is N+2, meaning eight primary generators (numbered 1-8) can successfully power the building, with two generators available on a stand-by, parallel bus (Back-up 1 and Back-up 2) in case there are any failures among the primary eight. The paralleling of the back-up units is meant to carry the load in the event of two simultaneous Hitec failures and to allow for a minimum N+1 redundancy during any maintenance procedures.
  • Each primary generator backs up a corresponding colocation room, with generator 1 backing up colocation room 1, generator 2 backing up colocation room 2, and so on.
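The N+2 arrangement can be expressed as a simple count of spare units. The sketch below is a simplification that counts units only and ignores how much load each room actually draws; the function names are illustrative and are not taken from any 365 Main system.

```python
PRIMARY_UNITS = 8   # generators 1-8, one per colocation room
STANDBY_UNITS = 2   # Back-up 1 and Back-up 2


def spare_units_remaining(failed_units: int) -> int:
    """Stand-by units left after covering failures; negative means a shortfall."""
    return STANDBY_UNITS - failed_units


def redundancy_label(failed_units: int) -> str:
    """Describe the remaining redundancy in N+x terms."""
    spare = spare_units_remaining(failed_units)
    if spare < 0:
        return "insufficient"
    return f"N+{spare}" if spare > 0 else "N"
```

With no failures this returns "N+2", with one unit out for maintenance it returns "N+1", and with more than two units unavailable at the same time it returns "insufficient".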
7.    What was the exact sequence of electrical events that took place following the power surge on July 24?
  • When the initial surge was detected at 1:47 p.m., the building’s electrical system attempted to roll all colocation rooms to back-up diesel generator power.
  • Generator 1 detected a problem in its start sequence and shut itself down within 8-10 seconds.  Generator 1 was started manually by on-site engineers and reestablished stable diesel power by 2:24 p.m.
  • After initial failure, Generator 1 passed its 732 kilowatt (kW) load to the parallel back-up bus. 
  • Generator 3 started up and ran for 30 seconds before it detected a problem in the start sequence and passed its 780 kW load to the parallel back-up bus.
  • Generator 4 started up and ran for 2 seconds before it detected a problem in the start sequence and passed its 900 kW load on to the parallel back-up bus. Generator 4 was manually started and brought back into operations at 2:22 p.m.
  • Back-up 1 detected a problem in its start sequence.
  • Back-up 2 failed on an overload condition (the unit absorbed over 2.4 MW, exceeding its 2.1 MW capacity and causing it to fail).
  • Generators 2, 5, 6, 7 and 8 all operated as designed and carried their respective loads appropriately.  Back-up 2 started as designed but eventually failed when overloaded.
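The overload on Back-up 2 follows from adding up the loads passed to the back-up bus. The sketch below uses the kilowatt figures from the sequence above; the function name and the assumption that healthy back-up units share the bus load evenly are illustrative simplifications.

```python
BACKUP_UNIT_CAPACITY_KW = 2100  # 2.1 MW per back-up unit, from the FAQ


def bus_history(transfers_kw, healthy_backups):
    """Running load on the back-up bus and whether it exceeds available capacity."""
    capacity_kw = BACKUP_UNIT_CAPACITY_KW * healthy_backups
    running_kw = 0
    history = []
    for load_kw in transfers_kw:
        running_kw += load_kw
        history.append((running_kw, running_kw > capacity_kw))
    return history


JULY_24_TRANSFERS_KW = [732, 780, 900]  # loads passed on by Generators 1, 3 and 4

# Only Back-up 2 was running, so one unit had to absorb all three loads.
one_backup = bus_history(JULY_24_TRANSFERS_KW, healthy_backups=1)

# Hypothetical comparison: the same loads with both back-up units available.
two_backups = bus_history(JULY_24_TRANSFERS_KW, healthy_backups=2)
```

With one healthy back-up unit, the running total reaches 2,412 kW on the third transfer, which is above the 2,100 kW rating. Under the even-sharing assumption, the same total stays within the combined 4,200 kW of two units.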
8.    Why did 365 Main choose Hitec’s Continuous Power Systems over traditional generators used in combination with batteries?
  • Hitec promises power that is at least as reliable as a traditional battery-plus-generator combination, in a smaller footprint. 365 Main is aware there were battery-based back-up systems in other San Francisco data centers that handled the 7/24 surge without incident, but we also know that our Hitecs have flawlessly performed during previous power events when battery-based systems failed. Unfortunately, there is no guaranteed back-up; both options still carry risk.
  • While we are disappointed with the discovery of the DDEC issue in our Hitecs in San Francisco and El Segundo, we have been pleased overall with Hitec’s reliability. We will continue to work closely with all levels of their organization so we are best prepared to meet the expectations of our customers.
9.    Were the Hitec units regularly maintained?
  • Yes.  Complete maintenance records are kept for each generator and these are available for customer review.


UPDATE: 4:30 P.M., Tuesday, July 31, 2007

SUMMARY

  • Generator investigation
    • DDEC was confirmed as root cause on each affected generator.  All units have been fixed and returned to normal operation.
  • PDU investigation in colo 7
    • The 490ms outage in colo 7 occurred at the PDU, not at the generator/fly-wheel UPS (Hitec).
    • The PDU has two sources of power: primary (Source 1) and back-up (Source 2). During the power event, a surge of over 11% on the primary (Source 1) exceeded the allowable setting of the PDU. The PDU tried to switch to the redundant (Source 2) supply, but that supply was trying to accept the load from three other colo rooms, reached an overload condition and did not allow the transfer. The PDU then switched back to the primary (Source 1) supply. In all, 490ms passed before the loads on the colo 7 PDUs were put on the unit 7 generator/fly-wheel UPS (Hitec).
    • To correct this issue, we have set the over-voltage/under-voltage parameters to (+/-) 20% for all units in the building. PDUs performed normally following the change.  365 Main is implementing this change on all PDUs in all facilities.
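The PDU behavior described above can be sketched as a small decision function. This is a simplified model and not the PDU's firmware: the function name is illustrative, the previous tolerance setting is not stated in this update, so `ASSUMED_OLD_TOLERANCE` is a hypothetical placeholder, and 11.5% merely stands in for "over 11%".

```python
def pdu_action(deviation_percent: float,
               tolerance_percent: float,
               source2_overloaded: bool) -> str:
    """Return the action a simplified dual-source PDU would take."""
    # Inside the tolerance band the PDU simply stays on its primary feed.
    if abs(deviation_percent) <= tolerance_percent:
        return "stay on Source 1"
    # Outside the band it asks for the back-up feed, which may refuse the load.
    if source2_overloaded:
        return "transfer refused, return to Source 1"
    return "transfer to Source 2"


ASSUMED_OLD_TOLERANCE = 10.0  # hypothetical; the earlier setting is not given
NEW_TOLERANCE = 20.0          # the (+/-) 20% setting from this update
SURGE = 11.5                  # illustrative value for a surge "over 11%"

before = pdu_action(SURGE, ASSUMED_OLD_TOLERANCE, source2_overloaded=True)
after = pdu_action(SURGE, NEW_TOLERANCE, source2_overloaded=True)
```

Under these assumptions, `before` is the refused transfer that produced the brief interruption, while `after` shows the PDU riding through the same surge on Source 1.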

CURRENT GENERATOR STATUS

  • Following thorough testing and the successful implementation of the DDEC fix across all units, all generators are currently operating normally.  The overall power system redundancy has returned to N+2.

OTHER DISCOVERIES

  • None

NEXT STEPS

  • On Wednesday, August 1, 2007, 365 Main will publish a final press release detailing the findings of the investigation.

UPDATE: 4:30 P.M., Monday, July 30, 2007

SUMMARY

  • The investigation team further pinpointed the digital controller of the generator unit as the probable root cause of failure in Unit 1.  After altering the suspect DDEC component, the investigation team was able to successfully start Unit 1 over 60 times without incident.
  • Hitec technicians performed a series of tests that confirmed the timing sequence for the DDEC controller was the probable cause of the 7/24 failure-to-start event.
  • Having corrected root cause on Unit 1, the team successfully returned Unit 1 to utility without incident and turned testing focus to Unit 3.  Unit 3 was first powered down for inspection of the clutch oil and proactive replacement of the clutch oil feed grommet (the unit had been running in diesel mode for 6 days). When the inspection and repairs were completed, the team was able to fail the unit and observed the same error in the DDEC component.  Technicians have implemented the DDEC fix on Unit 3 and are in the process of verifying this was the root cause of start sequence failure on Unit 3.

CURRENT GENERATOR STATUS

  • Operational status has changed since the last update. Unit 1 has finished testing and repairs and has been returned to Normal operation supporting customer load. Customer loads have been transferred from Unit 3 which had been operating in Diesel mode since Tuesday. All other units continue to support customer loads in the Normal mode. The overall power system redundancy remains at N+1.

OTHER DISCOVERIES

  • None

NEXT STEPS

  • Complete root cause analysis and implement fixes on all affected generators.  Return building to normal operations once testing is complete and stability is proven.
  • Publish complete details of the investigation for 365 Main customers.

UPDATE: 4:30 P.M., Sunday, July 29, 2007

SUMMARY
  • Comprehensive testing of speed control and regulation components and their relationship to the startup sequences of the generators continues.
  • Unit 1 continues to be the focus of initial testing. During over 100 start/stop tests late Saturday night on Unit 1, the investigation team was able to simulate failure. The digital controller for the diesel engine (known as a DDEC) has proven erratic and a spare DDEC is en route. While this component is the focus of the investigation, the team continues start/stop testing to rule out other potential contributors to failure.

CURRENT GENERATOR STATUS

  • Operational status is unchanged.  The overall facility continues to operate at N+1 power system redundancy. Unit 1 remains offline, pending some additional testing. Unit 3 continues to support customer load in Diesel mode. All other units are supporting customer load in Normal mode.

OTHER DISCOVERIES

  • None

NEXT STEPS

  • Determine exact cause of diesel engine synchronization failure and PDU issue.
  • Continue to run generator 3 on diesel power until diesel engine synchronization failure root cause is corrected.
  • Continue to update customers with details of the ongoing investigation.  Reports will be posted each day at 4:30 p.m. until root cause is determined.

UPDATE: 4:30 P.M., Saturday, July 28, 2007

SUMMARY

  • Comprehensive testing of speed control and regulation components and their relationship to the startup sequences of the generators continues.  Conclusive evidence of the root cause of start sequence failures has not yet been discovered.  The team has narrowed its investigation after collecting detailed event logs produced by the affected machines as well as observational data collected by specialists on-site since Tuesday 7/24.
  • A senior member of the Hitec R&D team in Holland arrived at the facility today to join the other specialists.
  • A longstanding member of the Hitec Board of Directors is arriving later tonight and will be onsite tomorrow (Sunday) to participate in all investigation activities.
  • The team continues to meet 3 times each day to communicate status and findings of technicians and specialists.

CURRENT GENERATOR STATUS

  • Operational status is unchanged.  The overall facility continues to operate at N+1 power system redundancy. Unit 1 remains offline, pending some additional testing. Unit 3 continues to support customer load in Diesel mode. All other units are supporting customer load in Normal mode.

OTHER DISCOVERIES

  • None

NEXT STEPS

  • Determine exact cause of diesel engine synchronization failure and PDU issue.
  • Continue to run generator 3 on diesel power until diesel engine synchronization failure root cause is corrected.
  • Continue to update customers with details of the ongoing investigation.  Reports will be posted each day at 4:30 p.m. until root cause is determined.

UPDATE: 4:30 P.M., Friday, July 27, 2007

SUMMARY

  • A root cause investigation continues around the clock on-site at 365 Main.  Once root cause is discovered, we will be introducing a tested fix across all facilities that feature Hitec generators.
  • 365 Main will be making the findings of this investigation available to the public.
  • The cross-functional investigation team is comprised of:
    • Senior management and technical staff from 365 Main
    • Senior management from Cupertino Electric
    • Senior Management and Senior Engineers from Hitec in the US (Senior Engineering specialists are en route from Holland and arrive at the data center on Saturday, July 28)
    • Board Level Management and Senior Engineering and Technical Personnel from Hitec in Holland
  • Standing conference calls are set each day for the investigation team at 9:00 AM, 12:00 PM and 4:00 PM Pacific Time to communicate status and findings of technicians and specialists. 
  • The initial focus of the investigation is directed toward the speed control and regulation components and their relationship to the startup sequences of the generators. The initial investigation of Unit 1 has revealed that specific signals directed to the regulation board of its UPS control regulator are not being acknowledged by that board. This board determines whether the diesel engine speed meets certain parameters set for operation of the UPS system and sends commands to the remainder of the control logic either to continue operating as directed or to shut the diesel engine down, depending on whether those parameters are met. We are investigating whether this is a board calibration issue or a sensing issue.

CURRENT GENERATOR STATUS

  • All customer loads are being supported at N+1 redundancy. All additional repair and testing operations will be done in such a way as to preserve N+1 redundancy at all times.
  • All units with exception of Unit 1 are supporting customer loads. Unit 1 is stopped for testing and repairs; Unit 3 is connected and operating in diesel mode; all other units are operating in Normal mode. 
  • During the night, repairs were completed on Unit 4. The unit was tested in normal mode with 1200 kW of load bank load on the no break side and mechanical loads on utility on the short break side. Everything checked OK. The short break loads were locked onto utility (to prevent transfer to diesel) and the Q1 of Unit 4 was opened to verify proper diesel start and transfer of No Break loads. All worked normally.
  • The diesel technician adjusted the idle speed of all engines to 1750 rpm.
  • The diesel technician also replaced an exhaust system part on Unit 4. The diesel is capable of supporting approximately 1 megawatt of load.
  • After the exhaust system repairs, load was transferred to Unit 4. It continues to operate normally.
  • Because of the peculiar sequence of events documented in the error logs, we will perform testing of the circuit cards and Hitec machines.
  • Hitec technicians have discovered some readings on magnetic pickups that appear to be at the very outer limits of specifications. At this writing, we are trying comparative measurements to determine if this variation might have affected the signals to the programmable logic controller.

OTHER DISCOVERIES

  • 5 full years of preventative maintenance logs on the Hitec generators are currently available for customer review.   All generators in San Francisco pass weekly start tests and monthly load tests where diesels are started and run at full load for 2 hours.  Both of these tests simulate a loss of utility and the auto start function is accurately tested.
  • In March of 2007, California Data Center Design Group (CDCDG) completed an Operational Risk Assessment of 365 Main’s San Francisco data center.  The comprehensive report audited 365 Main’s operations including all policies and procedures, and rated operations “extremely positive”.  This report is available for customer review.
  • In its 5 years of operations, 365 Main’s San Francisco facility had successfully handled dozens of surges and utility failures without incident.  The most recent successfully managed surge took place earlier this month.  The event on 7/24 was unique in that it delivered 4-6 repetitive surges to the facility within a short period of time.  We are working with PG&E and the investigation team to develop a test to mimic the 7/24 electrical event.

NEXT STEPS

  • Determine exact cause of diesel engine synchronization failure and PDU issue.
  • Continue to run generator 3 on diesel power until diesel engine synchronization failure root cause is corrected.
  • Continue to update customers with details of the ongoing investigation.  Reports will be posted each day at 4:30 p.m. until root cause is determined.

UPDATE: 4:30 P.M., Wednesday, July 25, 2007

A complete investigation of the power incident continues with several specialists and 365 Main employees working around the clock to address the incident.

Generator/Electrical Design Overview
The San Francisco facility has ten 2.1 MW back-up generators to be used in the event of a loss of utility. The electrical design is N+2, meaning 8 primary generators can successfully power the building (labeled 1-8), with 2 generators available on stand-by (labeled Back-up 1 and Back-up 2) in case there are any failures with the primary 8.

Each primary generator backs up a corresponding colocation room, with generator 1 backing up colocation room 1, generator 2 backing up colocation room 2, and so on.

Series of Electrical Events

  • The following is a description of the electrical events that took place in the San Francisco facility following the power surge on July 24, 2007:
    • When the initial surge was detected at 1:47 p.m., the building’s electrical system attempted to roll all colocation rooms to diesel generator power.
    • Generator 1 detected a problem in its start sequence and shut itself down within 8-10 seconds.  The cause of the start-up failure is still under investigation though engineers have narrowed the list of suspected components to 2-3 items.  We are testing each of these suspected components to determine if service or replacement is the best option.  Generator 1 was started manually by on-site engineers and reestablished stable diesel power by 2:24 p.m.
    • After initial failure, Generator 1 attempted to pass its 732 kW load to Back-up 1, which also detected a problem in its start sequence.  The exact cause of the Back-up 1 start sequence failure is also under investigation. 
    • After Generator 1 and Back-up 1 failed to carry the 732 kW, the load was transferred to Back-up 2 which correctly accepted the load as designed. 
    • Generator 3 started up and ran for 30 seconds before it too detected a problem in the start sequence and passed an additional 780 kW to Back-up 2 as designed.
    • Generator 4 started up and ran for 2 seconds before detecting a problem in the start sequence, passing its 900 kW load on to Back-up 2. This 900 kW brought the total load on Back-up 2 to over 2.4 MW, ultimately overloading the 2.1 MW Back-up 2 unit and causing it to fail. Generator 4 was manually started and brought back into operations at 2:22 p.m. Generator 4 was switched to utility operations at 7:05 a.m. on 7/25 to address an exhaust leak but is operational and available in the event of another outage.
    • Generators 2, 5, 6, 7 and 8 all operated as designed and carried their respective loads appropriately.
    • By 1:30 p.m. on Wednesday, July 25, after assurance from PG&E officials that utility power had been stable for more than 18 continuous hours, 365 Main placed diesel engines back in standby and switched generators 2, 5, 6, 7 and 8 to utility power.
  • Customers in colocation rooms 2, 4, 5, 6, 7 & 8 are once again powered by utility, and are backed up in an N+1 configuration with Back-up 2 generator available.
  • Generators that had failed during the start-up sequence but were performing normally after manual start (1 & 3) continue to operate on diesel and will not be switched back to utility until the root causes of their respective failures are corrected.

Other Discoveries

  • In addition to previously known affected colocation rooms 1, 3 and 4, we have discovered that several customers in colo room 7 were affected by a 490 millisecond outage caused when the dual power input PDUs in colo 7 experienced open circuits on both sources.  A dedicated team of engineers is currently investigating the PDU issue.

Next Steps

  • Determine exact cause of generator start-up failure and PDU issues through comprehensive testing methodology.
  • Replacements for all suspected components have been ordered and are en route.   
  • Continue to run generators 1 & 3 on diesel power until automatic start-up failure root cause is corrected.
  • Continue to update customers with details of the ongoing investigation.

UPDATE: 5:15 A.M., Wednesday, July 25, 2007

SUMMARY

  • At 1:49 p.m. on Tuesday, July 24, 365 Main’s San Francisco data center was affected by a power surge caused when a PG&E transformer failed in a manhole under 560 Mission St. While back-up electrical infrastructure is installed in the facility to defend against power surges, an initial investigation has revealed that certain 365 Main back-up generators did not start when the initial power surge hit the building.  On-site facility engineers responded and manually started affected generators allowing stable power to be restored at approximately 2:34 pm across the entire facility.
  • As a result of the incident, continuous power was interrupted for up to 45 minutes for certain customers.  We’re certain 3 of the 8 colocation rooms were directly affected, and impact on other colocation rooms is still being investigated.  Due to the complexity and specialization of data center electrical systems, we are currently working with Hitec, Cupertino Electric and PG&E to further investigate the incident and determine the root cause of why certain generators did not start.  All generators will continue to operate on diesel fuel until the root cause of the event has been identified and corrected.  Generators are currently filled with over 4 days of fuel and additional fuel has already been ordered.
  • We will apply knowledge gained in this investigation to all 365 Main facilities to help prevent this type of incident from happening again.


We sincerely apologize for the impact this has had on our customers’ operations. We understand the seriousness of this issue and will provide more details as they become available.

FAQ

  1. When did 365 Main’s back-up generators kick in and resume power to 365 Main customers?
    1. For affected customers, stable power was restored at approximately 2:34 p.m., approximately 45 minutes after power went down.
    2. Not all 365 Main customers were impacted by the incident.
  2. How many 365 Main customers were affected by the outage?
    1. At this time we are not certain of the exact number of San Francisco customers that were affected by the outage; initial estimates are between 20% and 40%.
  3. Which companies’ Web sites were affected by the 365 Main outage?
    1. We have confidentiality agreements with our customers not to disclose information specific to their companies without their prior approval.
  4. How long were their Web sites down? Did all customer Web sites resume service when power resumed, or was there an additional delay in the resumption of the Web sites’ service? If so, why?
    1. At this time we are only certain of the amount of time Web sites were without power: roughly 45 minutes for affected customers.
    2. Once power is restored in affected areas, customer equipment must be configured to operate on-line. This configuration process is managed by our customers and can take minutes or hours depending on the complexity of a site’s operations.
  5. Why did back-up power fail? Why didn’t the uninterruptible power supply keep everything up and running during the outage?
    1. We are still investigating the root cause of why certain generators did not start when called on for back-up power.
  6. How is the back-up power system supposed to work?
    1. Power anomalies (surges) in the San Francisco data center have previously been managed as designed. In all previous cases, the N+2 electrical system automatically switched to live back-up generators when a surge was detected. This switch to back-up power triggered the immediate start of the 3000 horsepower diesel generators to carry the building until utility power was stabilized.
  7. What is 365 Main doing to prevent this type of problem from occurring again? In the immediate future? In the long-term?
    1. In the short-term, all generators in the San Francisco 365 Main facility, which are currently operating normally, will continue to operate on diesel until the root cause of the event has been identified and corrected. We will not switch back to utility power until we are assured by PG&E that the utility is stable. Generators are currently fueled with over 4 days of fuel and additional fuel has already been ordered.
    2. In the long-term, we will apply knowledge gained in this investigation to all 365 Main facilities to do all that we can to prevent this type of incident from happening again.