Here's an update on yesterday's unexpected outage.
The issue affected servers in the North side of our Maidenhead Data Centre, Spectrum House.
At ~08:55, we became aware of a network issue affecting some servers in the North side of our Maidenhead Data Centre. Approximately half of these servers were experiencing connectivity problems, ranging from packet loss to total loss of connectivity. The other servers were unaffected and were responding as normal. Our network monitoring server was amongst those fully affected by this problem, and it therefore reported a total outage, including for servers that are hosted at other data centres and were not affected at all. We are in the process of clearing this misleading monitoring data.
The issue we detected was affecting both the primary and secondary Cisco 6500 network systems, which are configured as a VSS-1440 redundant cluster. We ran through our emergency procedures to identify the problem, but all tests returned results within normal parameters.
Having completed our emergency procedures without identifying a specific problem, we raised a case with Cisco TAC at ~10:10. A Cisco engineer then logged into our routers to try to identify the problem. After 3 hours, the Cisco engineer had been unable to provide a resolution; our understanding was that the problem was either a software bug within the routers or a hardware fault.
We took matters into our own hands at ~13:20 and decided to reboot both routers. This affected all servers on the North data floor, as the routers take about 15-20 minutes to reload. During the reload, the primary router failed to boot normally. The secondary router booted normally, and our monitoring showed that service was restored as a result.
Our conclusion is that the failure of the primary Cisco 6500 to boot indicates a hardware problem. We take full responsibility for all the infrastructure required to provide you with a reliable service, and we therefore asked Cisco to answer the following questions:
1) Why were Cisco unable to diagnose a hardware fault within a 3-hour time frame?
2) Why did traffic not automatically fail over to the secondary 6500, as designed?
Cisco commented that they do not know for sure whether this is a hardware problem, and so were unable to provide a specific response to these two questions. Clearly these are very important questions that need to be answered, and we will continue to work with Cisco until we have a full and adequate response to them.