When the cloud bursts
For an enterprise, the cloud holds great potential, but it does not solve your problems: if you have an operational problem now, you will have it in the cloud as well. If anything, the cloud makes operations remote, and you need to up your game to leverage it effectively.
The first place to start is crisis management. If the enterprise does not have a robust major incident process in place, operating in a distributed environment becomes more difficult. This is relevant when considering the list of incidents that have occurred at data centres and operators in South Africa, which is available on LinkedIn.
Many, if not all, of those incidents have poor causation, meaning that the root causes are not clearly stated or not communicated. Bear in mind that your infrastructure and operations are now in the cloud and that you need to manage the cloud provider(s). These vendors will suggest that a crisis never happens and that they have sufficient mitigation actions in place. Ignore the probability of major incidents at your peril, because when the cloud bursts you do not want to be hit by a cascading technology tsunami.
Beyond the shores of South Africa the problem is potentially worse. Read about major failures on the UK's The Register site.
One of the first steps in dealing with a major incident is to put eyes on the problem. In a remote data centre, that is more complex than if the equipment were in the basement of your own building. It is always desirable to view a problem physically, as eyeballing the situation is a key part of finding the path to quick resolution. When that is not possible, one needs to be a bit more inventive to achieve the same result. The first option is to use technologies like iLO (Integrated Lights-Out), which allow access to a server even if the operating system or application has crashed. Additionally, there are data centre tools and monitoring devices that use sensors to obtain a view of the situation. This might even include remote IP-based cameras in the data centre where your infrastructure is co-located. Referring to the list of incidents mentioned earlier, a large number of countermeasures are required to ensure effective management of operations. These include being able to handle component failures and issues related to power or temperature abnormalities.
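As a rough illustration of how sensor data can stand in for eyes on the problem, the sketch below flags readings that fall outside an acceptable range. It is not tied to any particular monitoring product: the sensor names, values and thresholds are all hypothetical, and a real deployment would pull readings from whatever tooling the facility provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    sensor: str   # hypothetical sensor name
    value: float  # latest value reported by the sensor
    low: float    # lowest acceptable value (illustrative threshold)
    high: float   # highest acceptable value (illustrative threshold)


def abnormal(readings):
    """Return the readings that fall outside their acceptable range."""
    return [r for r in readings if not (r.low <= r.value <= r.high)]


# Invented sample data for one rack in a remote facility.
readings = [
    Reading("rack12-inlet-temp-c", 24.0, 18.0, 27.0),
    Reading("rack12-feed-a-volts", 198.0, 207.0, 253.0),
    Reading("rack12-humidity-pct", 45.0, 20.0, 80.0),
]

alerts = abnormal(readings)
for r in alerts:
    print(f"ABNORMAL {r.sensor}: {r.value} (expected {r.low} to {r.high})")
```

With the sample data only the power feed reading is out of range, so that is the one item the team would need to investigate first.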
Many facilities offer a remote hands service, where assistance is provided to an enterprise when physical intervention is required. This is especially important when the data centre and the enterprise are in different cities or even countries. If the remote hands are not skilled, they can still use a smartphone to take pictures of the situation and transmit them back for analysis. Regardless of skill level, the practice is a good way to record and document events.
The major incident process should embrace modern social networking communications. If your business is in the cloud, you will not be able to call all participants into a single room for a crisis meeting to discuss the events and the actions required. As your infrastructure is distributed into the cloud, so the people you rely on will be dispersed across various locations. Communications technologies like WhatsApp, Facebook Messenger, Skype and Telegram https://telegram.org/ therefore become critical tools for notifications and escalations within an enterprise's Information Technology (IT) teams. Key members of the tiger team, https://en.wikipedia.org/wiki/Tiger_team the team established within IT to deal with major incidents, need to be in constant contact using these tools.
These tools are an excellent out-of-band mechanism for maintaining communications. YOU CANNOT RELY ON YOUR OWN IN-BAND COMMUNICATIONS (e.g. email, which may not be accessible during the incident)! The uptime and reliability of social media platforms have become rock solid and can be relied upon. A WhatsApp message is end-to-end encrypted, which arguably makes its transmission path more secure than something as archaic as ordinary email. A good cloud-based alternative is PureCloud.
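The idea of not depending on a single communication path can be sketched as a simple fallback loop: try each channel in order and stop at the first one that accepts the message. This is only a sketch under stated assumptions. The channel names and the sender functions are hypothetical placeholders, not the API of WhatsApp, Telegram or any other service; in practice each would wrap whatever integration the enterprise actually uses.

```python
from typing import Callable, Iterable, Optional, Tuple

Sender = Callable[[str], None]


def notify(message: str, channels: Iterable[Tuple[str, Sender]]) -> Optional[str]:
    """Try each channel in order; return the name of the first that succeeds."""
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception:
            # This channel is unavailable, so move on to the next one.
            continue
    return None


# Hypothetical senders that simulate channel behaviour during an incident.
def send_primary_chat(message: str) -> None:
    raise ConnectionError("primary chat service unreachable")


delivered = []


def send_backup_chat(message: str) -> None:
    delivered.append(message)


def send_email(message: str) -> None:
    raise ConnectionError("mail server is part of the outage")


channels = [
    ("primary-chat", send_primary_chat),  # out-of-band, tried first
    ("backup-chat", send_backup_chat),    # out-of-band fallback
    ("email", send_email),                # in-band, last resort only
]

used = notify("MAJOR INCIDENT: tiger team to join the group chat", channels)
print(f"Delivered via: {used}")
```

Placing the in-band channel last reflects the point above: it may be the very thing that has failed, so it should never be the only route to the tiger team.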
The important aspect of the process, however, is that it requires structure, as noted in a previous article, "Most important process of them all". This structure improves response times when dealing with the events triggered by the major incident. Social media chats can be unstructured. It is not beneficial in a crisis for team members to be twiddling their thumbs, waiting for a moment of magic to resolve the problem. Some of these chats also generate noise, as poor structure leads to inappropriate or less critical actions being performed. An important first action when diagnosing a disaster associated with a major incident is therefore to complete a diagnostic checklist. This checklist is predetermined and puts the most common causes and checks first, so that the diagnosis is completed as efficiently as possible. If the checklist is completed via a group chat on WhatsApp, then all tiger team members are immediately aware of the situation and of the relevant feedback. They can apply their minds to the situation in a structured manner.
There are many more structures that need to be applied to a major incident process, and Dee Smith and Associates is able to assist with either consulting or training on the process. This will enable an enterprise to deal with the eventual cloud burst in the way one deals with a sudden but expected Highveld storm, and not be washed downriver in a flash flood.
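A predetermined, prioritised checklist of this kind could be represented as follows. This is a minimal sketch: the questions and their priorities are invented for illustration and are not taken from any actual major incident process. The summary text is the sort of thing that could be pasted into the group chat after each update.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Check:
    priority: int                  # 1 = most common cause, checked first
    question: str
    passed: Optional[bool] = None  # None until a result is reported


class Checklist:
    def __init__(self, checks):
        # Keep the checks in priority order regardless of how they were supplied.
        self.checks = sorted(checks, key=lambda c: c.priority)

    def next_open(self):
        """Return the highest-priority check with no result yet, or None."""
        for c in self.checks:
            if c.passed is None:
                return c
        return None

    def record(self, question, passed):
        """Store the result reported for one check."""
        for c in self.checks:
            if c.question == question:
                c.passed = passed
                return
        raise KeyError(question)

    def summary(self):
        """Render the checklist as plain text suitable for a chat message."""
        marks = {None: "[ ]", True: "[ok]", False: "[FAIL]"}
        return "\n".join(
            f"{marks[c.passed]} {c.priority}. {c.question}" for c in self.checks
        )


# Hypothetical checks, deliberately supplied out of order.
checklist = Checklist([
    Check(2, "Is the network uplink to the facility up?"),
    Check(1, "Is power present on both feeds to the rack?"),
    Check(3, "Does the server respond via lights-out management?"),
])

checklist.record("Is power present on both feeds to the rack?", True)
print(checklist.summary())
print("Next:", checklist.next_open().question)
```

Because the list is sorted by priority and each result is recorded against a named check, every team member sees the same state and knows which check comes next.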