@CloudExpo: Article

What You Can Do When Thunder Strikes Your Cloud

Best practices for mitigating cloud application outages

In spite of the hype that a cloud system or application will never fail, we still see cases of cloud system failures. A recent example is the lightning strike in Dublin that took Amazon and Microsoft cloud services down for a while. While this may cause some fear, uncertainty and doubt about the cloud, the underlying fact remains that moving an application to the cloud is not just a matter of flipping a switch. It takes considerably more planning, and the best practices proven in the traditional data center are still valid. The following are some best practices for preventing cloud outages. They go beyond the basic disaster recovery provisions offered by most cloud providers.

Avoiding a Single Point of Failure Across Tenants: Most cloud applications tend to be multi-tenant in nature. As explained in my other articles, multi-tenancy within an enterprise means serving different geographical regions, business units, or acquired and merged entities. Load balancing, web server scaling, application server routing and database partitioning should be designed so that the failure of a single cloud component, such as a database or an application server, does not bring down all the tenants. The database partitioning strategy plays an important role here.

Suppose an enterprise ERP application is hosted on the cloud and is logically separated by plants or warehouses. Ensure that the failure of a single virtual machine or data store shuts down only specific plants, not all of them. All load balancing, routing and data partitioning schemes should adhere to the principle of avoiding total failure when a few virtual machines are down.
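As a rough illustration of the partitioning principle above, the following Python sketch maps plants to data stores and computes which plants are affected when a store fails. The plant names, store names and function names are hypothetical and chosen only for this example; a real ERP deployment would derive the mapping from its own routing and partitioning configuration.

```python
# Hypothetical mapping of plants (tenants) to data stores; all names are illustrative.
PLANT_TO_STORE = {
    "plant-chennai": "store-a",
    "plant-detroit": "store-a",
    "plant-munich": "store-b",
    "plant-osaka": "store-c",
}


def affected_plants(failed_stores, plant_to_store=PLANT_TO_STORE):
    """Return the sorted list of plants that lose service when the given stores fail."""
    failed = set(failed_stores)
    return sorted(plant for plant, store in plant_to_store.items() if store in failed)


def blast_radius(failed_stores, plant_to_store=PLANT_TO_STORE):
    """Fraction of all plants taken down by the failure (0.0 to 1.0)."""
    return len(affected_plants(failed_stores, plant_to_store)) / len(plant_to_store)
```

A check like `blast_radius(["store-a"])` can be run against every single-store failure during design reviews. If any single failure yields 1.0, the partitioning scheme still has a single point of failure across tenants.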

Utilizing the Out-of-the-Box Features of the Vendor for Availability: Most cloud providers offer multiple choices for weathering disaster and outage scenarios. It is up to the enterprises to evaluate them and choose the ones best suited to their needs. Some of the typical options given by various vendors are:

  • Multiple data centers across regions: Most providers have data centers on several continents or in major locations across the world. Scaling the application out across these locations is a good way to ensure that the failure of a single location does not result in a total outage of your application.
  • Availability Zones: Though this is Amazon EC2 terminology, the concept is about isolating certain servers and networks from failures in other parts of the same geographical region. Carefully analyzing this feature and scaling out the application and data across Availability Zones is a viable option.
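The two options above can be reasoned about with a small routing sketch. The Python below is a minimal illustration, not any vendor's API: the `Endpoint` type, the region and zone labels, and both function names are assumptions made for this example. It prefers a healthy endpoint in the caller's region and fails over to another region only when none is available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """A deployed copy of the application; field values are illustrative labels."""
    name: str
    region: str
    zone: str
    healthy: bool = True


def pick_endpoint(endpoints, preferred_region):
    """Prefer a healthy endpoint in the preferred region, else any healthy endpoint."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        raise RuntimeError("no healthy endpoint in any zone or region")
    local = [e for e in healthy if e.region == preferred_region]
    return (local or healthy)[0]


def survives_zone_failure(endpoints, zone):
    """True if at least one healthy endpoint remains when `zone` is lost."""
    return any(e.healthy and e.zone != zone for e in endpoints)
```

Running `survives_zone_failure` for every zone in a deployment plan is a quick way to confirm that the application is really spread across zones rather than concentrated in one.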

Utilizing the Out-of-the-Box Backup Features: Most vendors provide multiple choices for backing up data automatically. However, it is up to the enterprises to choose the ones that fit their needs.

For example, when we use Windows Azure Storage, all content stored on Windows Azure is replicated three times. No matter which storage service you use, your data is replicated across different fault domains, making it much more fault tolerant. Similarly, SQL Azure makes automatic backups of the database.

Similarly, EBS storage volumes in Amazon are automatically replicated across multiple servers within an Availability Zone, and options like S3 provide backup across Availability Zones.

Building a Custom Storage Backup Strategy: One of the major reasons for application outages is that the applications rely fully on the vendor-provided automatic backup options. If everything else fails, application owners have no option but to wait for the vendor to restore their services.

Vendor (cloud provider) backup options also do not protect against application-level failures such as data corruption or accidental or deliberate deletion of data. Hence a custom, application-specific backup strategy is needed.
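One possible shape for such an application-specific strategy is to keep several checksummed snapshots and restore from the newest one that still verifies. The Python sketch below is only an illustration under that assumption. The function names and the in-memory snapshot format are hypothetical, and a real system would write the snapshots to storage outside the primary cloud service.

```python
import hashlib
import json


def take_snapshot(records):
    """Serialize application records and store a SHA-256 checksum next to them."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return {"payload": payload, "sha256": hashlib.sha256(payload).hexdigest()}


def restore_snapshot(snapshot):
    """Return the records from a snapshot, refusing one whose checksum does not match."""
    digest = hashlib.sha256(snapshot["payload"]).hexdigest()
    if digest != snapshot["sha256"]:
        raise ValueError("snapshot is corrupted")
    return json.loads(snapshot["payload"])


def restore_latest_good(snapshots):
    """Walk the snapshots newest-first and return the first one that verifies."""
    for snapshot in reversed(snapshots):
        try:
            return restore_snapshot(snapshot)
        except ValueError:
            continue
    raise RuntimeError("no intact snapshot available")
```

Keeping more than one generation matters here: a replica that faithfully copies a corrupted or deleted record is of no help, whereas an older verified snapshot still is.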

Most cloud services provide many custom options too. For example, if you use a cloud database such as Oracle on Amazon RDS, options like the recycle bin and flashback features can help restore database content to a specific point in time.

Another simple option that has always worked effectively is to use features like database triggers or message queues to replicate transactions to a different server or region. This ensures that all the important transactions are backed up, and it also makes restoring easier.
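The trigger-based variant can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration rather than a production design: the `orders` and `order_log` tables, the trigger name and the helper functions are all assumptions, and two in-memory SQLite databases stand in for the primary server and the replica in another region.

```python
import sqlite3

# Illustrative schema: every insert into `orders` is also appended to `order_log`.
PRIMARY_SCHEMA = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, plant TEXT NOT NULL, amount REAL NOT NULL);
CREATE TABLE order_log (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    order_id INTEGER NOT NULL,
    plant TEXT NOT NULL,
    amount REAL NOT NULL
);
CREATE TRIGGER orders_after_insert AFTER INSERT ON orders
BEGIN
    INSERT INTO order_log (order_id, plant, amount)
    VALUES (NEW.id, NEW.plant, NEW.amount);
END;
"""


def make_primary():
    """Create the primary database with the logging trigger installed."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(PRIMARY_SCHEMA)
    return conn


def make_replica():
    """Create the replica, which holds only the replicated orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, plant TEXT NOT NULL, amount REAL NOT NULL)"
    )
    return conn


def replicate(primary, replica, last_seq=0):
    """Copy log entries newer than last_seq to the replica; return the new high-water mark."""
    rows = primary.execute(
        "SELECT seq, order_id, plant, amount FROM order_log WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()
    for seq, order_id, plant, amount in rows:
        replica.execute(
            "INSERT OR REPLACE INTO orders (id, plant, amount) VALUES (?, ?, ?)",
            (order_id, plant, amount),
        )
        last_seq = seq
    replica.commit()
    return last_seq
```

The caller keeps the returned sequence number and passes it to the next `replicate` call, so each run ships only the transactions recorded since the previous one. A message queue would play the same role as the `order_log` table in this sketch.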

Creating a Copy Back in the Data Center: No current enterprise is going to fully relinquish its data centers and do all its business on the cloud. Rather, there will be a hybrid delivery model combining the data center, private clouds and public clouds. In that scenario, keeping a local copy of the most critical data is always a better option. Most cloud providers support such a scenario too.

For example, with SQL Azure Data Sync we can replicate data from the cloud back to the data center.

SQL Azure Data Sync Scenarios:

  • Cloud to cloud synchronization
  • Enterprise (on-premise) to cloud
  • Cloud to on-premise
  • Bi-directional or sync-to-hub or sync-from-hub synchronization
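To make the one-way and bi-directional scenarios concrete, here is a toy Python sketch. It does not reflect how SQL Azure Data Sync is implemented. The row format (key mapped to a version number and a value), the "highest version wins" rule and the function names are assumptions made purely for illustration.

```python
def sync_one_way(source, target):
    """Copy rows that are missing or older in target; rows map key -> (version, value).

    Returns the number of rows written to target.
    """
    changed = 0
    for key, (version, value) in source.items():
        if key not in target or target[key][0] < version:
            target[key] = (version, value)
            changed += 1
    return changed


def sync_bidirectional(a, b):
    """Sync in both directions so both sides end up with the newest version of each row."""
    return sync_one_way(a, b) + sync_one_way(b, a)
```

Calling `sync_one_way(cloud, on_premise)` corresponds to the cloud to on-premise scenario, and swapping the arguments gives the on-premise to cloud direction.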

Summary: The cloud has far-reaching potential to let enterprises concentrate on business capability needs rather than operational and maintenance needs. The cloud also opens up new areas like high performance computing and platforms and solutions as a service. A few initial outages should not create fear, uncertainty and doubt in the minds of enterprises.

It is all about the SLA needs of the individual applications and how we plan the cloud deployment. For example, it is almost impossible for today's enterprises to suddenly provision a data center on a different continent and use it for disaster recovery. However, most cloud providers allow for such a scenario through simple self-service provisioning.

It is up to the enterprises to evaluate the out-of-the-box as well as custom features against their SLA needs and come up with an appropriate strategy. This will make the enterprises' cloud journey more fruitful.

More Stories By Srinivasan Sundara Rajan

Highly passionate about utilizing Digital Technologies to enable next generation enterprise. Believes in enterprise transformation through the Natives (Cloud Native & Mobile Native).
