Cluster 1 problems
Hello,
At approximately 15:30 a problem was detected with a storage device connected to Cluster 1. The problem was rectified at 16:15, but a few clients' VMs went into a locked state. We are currently working on the remaining three VMs.
The problem itself was the result of a new cluster (Cluster 5) being configured and connected to the storage clusters. For a reason we are still investigating, this caused the connectivity between Cluster 1 and what we refer to as EVS01 (a storage device) to go down. Once the new cluster was taken offline, the storage device became available for the VMware hosts to connect to again.
This was a rather unusual incident, apparently caused by a configuration error on our new Cluster 5. We are still investigating why it behaved the way it did, especially since our other storage devices didn't suffer the same failure.
I'm aware this raises questions about redundancy. Our storage devices do act as a cluster, and they have a separate network, configured on a different subnet from the one the VMware hosts connect to. Because this separate network showed everything as working properly, the storage pools didn't switch over to another device as they should have.
In simple terms, EVS02 was happily talking to EVS01 via our management cluster and it had no idea EVS01 wasn't able to talk with the VMware host servers. It therefore didn't switch the storage pools from EVS01 to EVS02.
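For those who prefer to see the logic spelled out, here is a minimal illustrative sketch of the difference between a failover decision that only looks at the link between the storage devices and one that also checks whether the VMware hosts can reach the device. This is not our vendor's actual logic; the class, field, and function names are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class DeviceStatus:
    """Hypothetical health snapshot of a storage device."""
    name: str
    heartbeat_ok: bool   # peer device can reach it over the separate storage network
    host_path_ok: bool   # VMware hosts can reach it over their own network


def should_fail_over_heartbeat_only(active: DeviceStatus) -> bool:
    # Fails over only when the peer loses contact with the active device.
    return not active.heartbeat_ok


def should_fail_over_with_host_check(active: DeviceStatus) -> bool:
    # Fails over when either the peer link or the host-facing path is down.
    return not (active.heartbeat_ok and active.host_path_ok)


# The situation described above: EVS02 could still talk to EVS01,
# but the VMware hosts could not.
evs01 = DeviceStatus("EVS01", heartbeat_ok=True, host_path_ok=False)

print(should_fail_over_heartbeat_only(evs01))
print(should_fail_over_with_host_check(evs01))
```

With a heartbeat-only check, the state we saw during the incident does not trigger a switch-over, whereas a check that also covers the host-facing path would.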
We are in contact with our storage system vendor to ascertain whether there are any configuration changes we can make so that storage pools are available via all devices, all the time (not just when a failure is detected), and this kind of issue can't happen again.
We sincerely apologise for the inconvenience. Downtime is a very stressful thing to go through, especially when you have purchased something which shouldn't have downtime. We do take our commitment seriously.
Currently a number of VMs need to go through an FSCK. We are manually checking each one now.