Amazon data center crashes

Early in the morning of April 21, Amazon's EC2 data center in Virginia crashed, bringing down many popular websites, small businesses and social networking sites.

The strange fact is that the outage still ensures that the 99.55% availability as defined in the SLA (Service Level Agreement) is not breached. Let us put aside the other aspects and focus on Cloud services and the new generation of programmers and business who use these services. Though the SLA leads to quite an interesting debate, we will leave that to the legal experts.

More often than not, when we discuss building applications in the Cloud, the basic assumption is that of 24x7 service availability. While Cloud service providers strive to live up to this expectation, the onus of designing a system resilient to failures is on the application architects. On the other hand, SLA driven approaches are very reactive in nature. In purest sense, SLA's are just a means of trust between the user and the service provider. The fact is that SLA's can never repay for losses. It is up to an Architect and CIO to build systems that tolerates such risks (Cloud system failures, connectivity failures, SLA’s, etc).

With Cloud infrastructure, we end up building traditional systems that are so tightly coupled and hosted without taking advantages of the availability factor. These shortcomings maybe part and parcel of software world where functionality takes precedence over all other aspects, but such tolerance cannot be expected in the Cloud paradigm. A failure on part of the Cloud service provider will bring down the business and getting back the data becomes a nightmare when all the affected businesses are trying to do the same.

Accommodating and managing these factors are the business risks, which need to be identified. Businesses that do not envision these risks are sure to suffer large scale losses. The truth is that building such resilient systems is not very complex task. The basics of all software principles have remained same whether they are built for Cloud or enterprise-owned hardware. Mitigating as many risks as possible requires that several basic designs and business decisions be made – while considering the software provider – such as:

  • Loosely couple the application
  • Make sure the application follows “Separation of Concerns”
  • Distribute the applications
  • Backup application & user data
  • Setup DR sites with a different Cloud service provider

These decisions involve software that follows these basic designs and business decision managers who identify various service providers to mitigate such risks. Cloud service will enforce a thinking among the business managers that availability should not and cannot be taken for granted.

These failures will not stop the adoption to Cloud but will make the customers aware of the potential risks and mitigation plans. The Cloud failure will have serious impact on the CTO/ CIO and the operations head. In a non-Cloud model, a CIO’s role has been noted as very limited. The interaction of the CIO with a CTO in the everyday business is much less. These two executives need to work more closely to protect the business and reduce risk.

The best practices for the Cloud application builders are:

  • Build Cloud applications, not applications in the Cloud
  • Design fault tolerant systems, wherein nothing fails
  • Design for scalability
  • Loosely couple application stacks (IOC)
  • Design for dynamism
  • Design distributed
  • Build security into every component

The best practices are necessary for all the architects who build Cloud applications. Do not simply port a traditional application to the Cloud. They are architecturally different and will not take advantage of the underlying services – and most often – will result in failure.

Remember "Everything fails, all the time.” It is time to think and manage risks and not let the SLA stare at you when you are losing business. Be proactive; build Cloud-friendly applications.

The new world on Cloud looks more promising than ever. However, failures can make us realize that functionality without proper foundation and thought process can have serious repercussions. It is essential for every business to review their risks and redefine their new perimeter in the Cloud.

For more information on how Team Aujas is assisting clients with security risk in the Cloud, please contact Karl Kispert, our Vice President of Sales. He can be reached at karl.kispert@aujas.com or 201.633.4745.