Overview of when our ‘worst case scenario’ actually happened after 12 years of smooth running.
The Cornwallis Data Centre (CDC) houses most of the University’s servers, which are used to deliver the majority of IT services to the University. On the 11th June, this room suffered a complete power failure during major maintenance work.
What were you doing?
Replacing the uninterruptible power supply (UPS) and some air conditioning units.
Why were you doing that?
The CDC is a highly resilient data centre designed to deal with many potential disaster scenarios. This includes systems to maintain power. While the room normally runs from mains electrical power, we have an interlinked set of three UPS (Uninterruptible Power Supplies – essentially three very large batteries) to run from in the event of a power cut. These are expected to be able to run for 30 – 60 minutes, which is considered more than enough time for our diesel generator to automatically start. This then has sufficient capacity to run for 18-24 hours before requiring more fuel. Over the years, we have had several power cuts and this combination has maintained service throughout.
The CDC was built in 2008 and 12 years later, equipment such as the UPS and air conditioning units was becoming too old to be reliable and was approaching a stage where it would no longer be supported by the manufacturer. In order that we could be confident it could continue to run, we started an Estates-led capital programme of replacing equipment.
Why did you do this work when you did?
Due to the current Covid crisis, scheduling this work proved quite difficult. At the point where the engineering company (Future-tech – who we use for all our data centre work) confirmed they were legally able to do the work, their availability was either to start almost immediately (8th June) or to delay a number of weeks, pushing the work closer to Clearing. While we would have preferred the work to start a week later, out of term time, we opted to secure the opportunity to ensure work was completed as soon as possible.
The UPS work was expected to take one week while the air conditioning work should take up to four weeks.
Who was involved?
Information Services (IS) are the key users of the data centre, but there is equipment from other University schools and departments, and also some which is owned and managed by our network provider and regional partners.
Estates manage the contract for the environmental (power, temperature, security) state of the room and, as this is a highly specialist area, contract a specialist company called Future-tech for most of the support and maintenance work.
Future-tech is a specialist Data Centre provider, highly accredited and the recipient of numerous awards. They have a significant customer base including numerous other Universities, councils, blue chip commercial entities and the NHS. Future-tech were part of the design and commissioning of the room, have been part of the maintenance process since and know it intimately. The CDC electrical supply had worked flawlessly for over a decade, covering many periods of maintenance, numerous electrical supply cuts over the years and the whole campus power cut of February 2013.
What was supposed to happen?
As the problem centred around power, the air conditioning work will not be included in the remainder of this report. Air conditioning work was suspended to remove any further complications.
The data centre has three UPS units, in what is known as an N+1 design. Two units can handle the power load of the data centre; the third allows one unit to be shut down for maintenance, or provides spare capacity in the event of a fault.
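To make the N+1 idea concrete, here is a minimal sketch of the arithmetic in Python. It uses the two figures quoted later in this report (60 kW per UPS unit and a recent total draw of 90 kW). Treating every unit as identical, and the function name itself, are assumptions made purely for illustration.

```python
import math


def spare_units(load_kw: float, unit_capacity_kw: float, healthy_units: int) -> int:
    """Return how many healthy units exist beyond the minimum needed for the load.

    1 or more means redundancy (N+1 or better), 0 means capacity with no
    resilience, and a negative number means the load cannot be carried.
    """
    needed = math.ceil(load_kw / unit_capacity_kw)  # the "N" in N+1
    return healthy_units - needed


LOAD_KW = 90.0  # recent total draw quoted in this report
UNIT_KW = 60.0  # single-unit capacity quoted in this report (assumed equal for all units)

for units in (3, 2, 1):
    print(units, "healthy units -> spare:", spare_units(LOAD_KW, UNIT_KW, units))
```

With three units there is one spare, with two there is none, and with one the load exceeds capacity. These are the three states the room passed through during the work.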
Our maintenance company planned to move one of our old UPS units to a physically different location in the data centre, power it up and use that to offer extra resilience while a new UPS unit could be brought in. The remaining two units would then be swapped out in turn, maintaining power.
What actually happened?
A number of things did not quite go to plan once the work was started.
After moving our old UPS and attempting to power it up again, it failed. This was always a risk as the equipment was so old. It had been feared that the old units would not survive an extended power cut, which vindicated the decision to replace them. However, this resulted in a change of plan. On the Tuesday, it was decided to follow the advice of Future-tech and take the slightly riskier option of running on generator for an extended period and installing two new units together, before returning to mains power. This was successful and left us with sufficient UPS capacity, but no UPS resilience. In the event of a mains supply power failure, we would still have reverted to UPS followed by generator. However, we did not have the capacity to cope with the failure of a single UPS unit.
When installing the third UPS unit on the Wednesday, the engineers discovered one of the internal controller boards was damaged, possibly in transit or as a manufacturing fault. A replacement was immediately ordered for urgent delivery from the manufacturer. This left us running on two UPS units for longer than expected, and as part of the N+1 design two units are sufficient as a minimum. With the old units dead, this was undesirable for a long period as it meant CDC was missing the “+1” aspect of the N+1 design but there were no alternatives.
At 12:21 on Thursday 11th June 2020, one of the two new UPS units suffered a communications failure with the other, went into an error state and shut down. This transferred all power to a single UPS unit, which could not cope with the load. That unit immediately shut down for protection, before the generator could be engaged. This cut off all power to the data centre, powering off all servers, network and disk storage. Around 10 seconds later our electrical switch gear switched us directly to mains power, bypassing the UPS units. A loud click was reported while this automatically failed over.
A combination of the three issues together resulted in the power outage.
You have plans for disasters such as this, right?
Yes. Information Services have a set of Disaster Recovery documentation designed to cover the management of disasters. In addition, we have specific plans for a Cornwallis Data Centre outage, including a ‘power up’ schedule. This disaster has always been considered one of the worst scenarios, and has never happened before.
In addition, we have a certain level of redundancy to an alternate data centre, described in more detail below.
The general plan for such a disaster involved bringing as many staff to site as soon as possible to work collaboratively on a solution. Due to Covid restrictions, this plan was not possible. Before the maintenance work was started, we produced a new plan involving two teams, a small on-site team dealing with physical issues and a larger off-site team managing services as they became available.
What did you do next?
For other unrelated work, the IT Operations Team had three team members working on site, one of whom was in the data centre at the time of the power cut. The IT Operations Manager was called to site while the two other members immediately flipped power breakers to racks of servers, in order to give us control of how and when servers returned to service. This was part of the ‘power on from cold’ plan, though it was usually expected we would have more than 10 seconds to react.
The IS Major Incident process was swiftly initiated, with the Head of IT Support taking the role of Major Incident manager.
Meanwhile, a message was quickly sent to core infrastructure staff asking them not to come to site but to be ready for disaster management. Our Head of Systems, who lives locally, happened to be walking through campus and diverted himself to Cornwallis to help manage the situation. The news was first relayed on our Teams channel, after a problem with the website had been reported, as “CDC power shutdown. Everybody please standby”. Onsite, the team started powering up services in our planned order, attempting to restore network connections before moving on to disk storage and servers. As the telephone and Wi-Fi services depend on the IT infrastructure and there is virtually no mobile phone signal in the Cornwallis building, the two teams had no method of communication between them other than relaying messages through two people taking phone calls. It appears that connections from off-site were restored before network connections in the Operations office could be established, giving the off-site team advance insight into the issues.
At around 13:40, the on-site team established network connections and the two teams were able to share information about the situation. At that point many of our systems were booting and two new problems were discovered.
With one UPS device having already failed, we were advised not to go above 60 kW of power – the capacity of a single UPS unit. Recently the total power draw had been 90 kW. This left us unable to power on all services, and we needed to examine what could be shut down again.
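The kind of decision this cap forces can be sketched as a simple priority-ordered budget. The Python below is an illustration only: the service names, priorities and per-service power draws are hypothetical (chosen so that they sum to the 90 kW quoted above), and the greedy rule is an assumption rather than the process IS actually followed.

```python
from typing import NamedTuple


class Service(NamedTuple):
    name: str
    priority: int    # 1 = most important (hypothetical ranking)
    draw_kw: float   # hypothetical power draw


def plan_power_on(services, cap_kw):
    """Walk services in priority order, keeping each one that still fits under the cap."""
    running, deferred, total = [], [], 0.0
    for svc in sorted(services, key=lambda s: s.priority):
        if total + svc.draw_kw <= cap_kw:
            running.append(svc.name)
            total += svc.draw_kw
        else:
            deferred.append(svc.name)
    return running, deferred, total


# Hypothetical inventory; none of these figures come from the report.
SERVICES = [
    Service("network core", 1, 10.0),
    Service("storage array", 2, 15.0),
    Service("virtual server farm", 3, 25.0),
    Service("database farm", 4, 12.0),
    Service("research cluster", 5, 20.0),
    Service("mirror service", 6, 8.0),
]

running, deferred, total = plan_power_on(SERVICES, cap_kw=60.0)
print("running:", running)
print("deferred:", deferred)
print("total kW:", total)
```

Note that this greedy rule lets a small low-priority service in if it still fits after a larger one has been deferred. A stricter policy would stop at the first service that does not fit.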
In addition, we found that our disk storage had not finished booting before our large VMware servers started to boot. Rather than have many individual servers we use virtual servers. We buy large VM Servers, with lots of memory and processor, then in a virtual environment, divide these up to make virtual servers – the servers people connect to. The storage is separate in a large bank of disks known as a storage array and is part of a network called the Storage Area Network (SAN).
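The dependency described here (network first, then storage, then the VMware hosts, then the virtual servers on them) can be expressed as a small graph and sorted into boot "waves". This is a sketch using Python's standard `graphlib` module. The component names and the exact dependency edges are simplified assumptions based on the description above, not the actual power-up schedule.

```python
from graphlib import TopologicalSorter

# Each component maps to the set of components that must be up before it.
# Names and edges are illustrative assumptions.
DEPENDS_ON = {
    "network": set(),
    "storage_array": {"network"},
    "vm_hosts": {"network", "storage_array"},
    "virtual_servers": {"vm_hosts"},
}


def boot_waves(depends_on):
    """Group components into waves; everything in a wave can start together."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies loop
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # components whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(boot_waves(DEPENDS_ON), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

In the incident, the hosts started before the storage wave had finished, which is the situation an ordering like this is meant to prevent.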
For resilience, we have a smaller disk array connected to the SAN in a secondary off-site data centre. We synchronously mirror data for core services to the secondary data centre and have the capacity to bring up virtual servers there and restore key services. There is insufficient capacity at this secondary location to run all virtual servers and a plan was in place to decide what to boot in a disaster depending on current business needs.
As power was quickly restored to Cornwallis, the decision was taken not to bring up key services in the secondary data centre (a time consuming and difficult to reverse decision – something we would only do in cases where Cornwallis was likely to be unavailable for considerable time).
As power had been restored quickly, many systems in Cornwallis started to boot automatically. The Cornwallis disk array takes far longer to boot (around 30 minutes). As soon as the CDC array lost power, the data on its synced disks became out of date compared to the secondary array. The array then automatically blocked the paths to the CDC array, so the secondary array in the DR data centre took over as designed. The arrays then remain in this state, with the CDC disks blocked, until the data can be manually re-synced.
Any systems that automatically booted and had replicated storage started up and began running out of the secondary array – this is transparent to the servers, and apart from having a smaller number of network paths to their disks, they are not aware the disks they are using are actually the other side of Canterbury. Any systems that did not have replicated storage failed to boot because, until the primary array came back online, they had no disks to boot from.
As systems booted, they started to put load on the limited storage we have at the secondary data centre. There were concerns that this would start to cause other problems if it hit its maximum, and we were also aware that a significant VMware farm (the one that hosts most of our Microsoft SQL Server databases) had not recovered, for reasons which at this point remain unknown. In addition, storage link capacity is required to resynchronise storage back to our primary, considerably higher performance, disk array. The decision was taken that we needed to shut down less important services in order to free the I/O capacity to boot services off storage in Cornwallis until the picture became clearer, and we could access all of the storage arrays and networking to assess the load.
Our disaster plans concentrated on restoring as many services as quickly as possible. We did not expect to be working to shut them down again. However, most user facing services had booted and were operational before IT staff could log in to check. To many users this gave an interruption in services of less than an hour. While some services were shut down to reduce I/O load, we aimed to maintain key services where possible.
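The behaviour in the last two paragraphs can be summarised as a small decision rule. The Python sketch below is one reading of the description, not the storage vendor's logic: the class, field and function names are hypothetical, and it assumes that non-replicated disks become usable as soon as the primary array is back online.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArrayState:
    primary_online: bool           # has the Cornwallis array finished booting?
    primary_in_sync: bool          # mirrored disks re-synced, so paths are unblocked?
    secondary_online: bool = True  # the smaller off-site array


def boot_source(replicated: bool, state: ArrayState) -> Optional[str]:
    """Return which array a server's disks would come from, or None if it cannot boot."""
    if replicated:
        # Mirrored disks on the primary stay blocked until a manual re-sync.
        if state.primary_online and state.primary_in_sync:
            return "primary"
        return "secondary" if state.secondary_online else None
    # Non-replicated disks exist only on the primary array (assumption).
    return "primary" if state.primary_online else None


just_after_power_loss = ArrayState(primary_online=False, primary_in_sync=False)
primary_booted = ArrayState(primary_online=True, primary_in_sync=False)
after_resync = ArrayState(primary_online=True, primary_in_sync=True)

for label, state in [
    ("just after power loss", just_after_power_loss),
    ("primary booted, not re-synced", primary_booted),
    ("after re-sync", after_resync),
]:
    print(label, "| replicated:", boot_source(True, state),
          "| non-replicated:", boot_source(False, state))
```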
Working until very late on the Thursday evening, one member of the Server Infrastructure Team was able to re-synchronise the disks back to Cornwallis and start returning services to their normal array.
Friday
On the Friday, the third UPS arrived, but until it was installed we still had to keep power under 60 kW. However, with the storage problem solved, we were able to systematically run through our list of virtual servers, with the majority of services restored by 2pm, just under 26 hours from an event we always thought would take a week or more to recover from.
UPS diagnostics to establish the root cause had been started on Thursday soon after the issue occurred. As the on-site team found no clear evidence of a specific issue, this was escalated and a UPS engineer from the manufacturer was called to site to investigate what went wrong and verify our UPS units were in a good and healthy state.
The initial investigation suggested a problem with the communication bus on the UPS. The manufacturer head office recommended replacing the entire communication hardware for the first two UPS units and immediately dispatched parts and an engineer from their head office. Unfortunately, due to a couple of delays, they did not arrive until 11:30pm. Work was finally completed and signed off by 4am on Saturday when IS staff involved could finally go home.
Saturday
An IS team arrived on-site on Saturday morning to power on the remaining servers, restoring most services to full working order by 10am. Given the power “all-clear” from the UPS manufacturer and Future-tech we were also able to increase the power load within the room. At this point additional equipment was powered on, restoring other services including for Computer Science, the UK Mirror Service and for our High Performance Computing cluster users.
What actually went wrong?
While there were a number of factors and we still await the official report from the UPS manufacturer and installation company, it appears the cause was faulty hardware in the second UPS device during a state where we did not have the resilience in place to prevent a power outage. There were always going to be periods of higher risk where we did not have resilience against certain failures. Although the devices passed all tests before going into service, they still failed. The risk period was extended after the third unit, which would have added the resilience, was found to be damaged on arrival.
What went right?
While the negative event of losing power will always overshadow this work, it is important to recognise a number of things that did go well. Replacing a vital component supplying power to the data centre is always going to be a difficult and risky process and the initial plans (which we rejected) involved a prolonged period with all power cut. The engineers were able to produce a plan which would have maintained power throughout the upgrade, if it were not for an unexpected hardware failure at the wrong time.
From the IS point of view, the biggest success story is how quickly we were able to recover. An unexpected power off has been planned for and discussed a number of times. Systems have been built with resilience and recovery in mind, but a disaster on this scale has never been tested or rehearsed. IS staff can take a great deal of pride in how quickly we were able to recover from this, especially during a time when the majority of staff were working in isolation remotely. What we thought could take days took a few hours or in some cases just a few minutes.
What have you learned for the future?
It is difficult to know what could have been done differently to avoid the failure, though we have asked for suggestions from our maintenance company in their official report. However, it is gratifying to reflect how quickly many core services were restored. Systems had been built with resilience and recovery in mind but never put to the test in such a way.
Although we would never have chosen to boot all VMware servers together, believing the dependency chain to be critical, doing this quickly restored services, leaving us to fix those that did not boot into an operational state.