@CloudExpo: Blog Post

High Availability, Fault Tolerance and Disaster Recovery in AWS

Introducing elastic cloud computing for Disaster Recovery, Fault Tolerance and High Availability

Amazon Web Services – Disaster Recovery, High Availability and Fault Tolerance

Abbreviations used:

  • AWS - Amazon Web Services
  • AMI - Amazon Machine Image
  • DR - Disaster Recovery
  • FT - Fault Tolerance
  • HA - High Availability

Non-technical introduction.
High Availability and Fault Tolerance – the requirement that a computer application be seamlessly available to users without interruption; literally, “no (or very little) fault will be tolerated”. In simple terms, this means that I am able to use a computer application even though in the background there may be outages, for example hardware failure, network congestion or maximum CPU utilization. The application is “highly available”: it is available “(almost) all the time.”

Disaster Recovery – what it takes for your organization to recover from a computer disaster and be operational again. A simple example: suppose a hard drive fails. Do you need a duplicate hard drive with an identical copy of the data available within microseconds? Or can your application be unavailable to end users while a new hard drive is installed and the data is restored from a backup, knowing that any data written between the last backup and the failure will be missing?

Organizations such as an “online auction company” and a “major search engine” cannot tolerate downtime or data loss. Your personal computer, though, can likely tolerate some downtime and even data loss. Sometimes DR is automatic: when the primary hardware or software fails, the secondary automatically takes over. Other times DR requires manual intervention. Which one you configure depends on your tolerance for downtime. When considering FT and DR, there are two important terms:

  • Recovery Time Objective (RTO): the maximum time your organization can afford to take to recover from an outage.
  • Recovery Point Objective (RPO): how much data, measured as the time window before the failure, you can afford to lose.
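
The two objectives can be checked against a backup plan with simple arithmetic. The following is a minimal sketch, not a method from the AWS white papers: the `RecoveryPlan` type, its field names and the sample figures (a nightly backup that takes eight hours to restore) are hypothetical and chosen only for illustration. It assumes the worst case, where the failure happens just before the next backup would have run.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryPlan:
    """Hypothetical description of a backup-and-restore plan."""
    backup_interval: timedelta   # time between successive backups
    restore_duration: timedelta  # time to replace hardware and restore the data


def worst_case_data_loss(plan: RecoveryPlan) -> timedelta:
    # A failure just before the next backup loses one full interval of data.
    return plan.backup_interval


def meets_objectives(plan: RecoveryPlan, rto: timedelta, rpo: timedelta) -> bool:
    # The plan is acceptable only if both objectives are satisfied.
    return plan.restore_duration <= rto and worst_case_data_loss(plan) <= rpo


# Hypothetical plan: nightly backup, eight hours to restore.
nightly_backup = RecoveryPlan(timedelta(hours=24), timedelta(hours=8))

# A strict requirement (4-hour RTO, 1-hour RPO) is not met by this plan.
print(meets_objectives(nightly_backup, rto=timedelta(hours=4), rpo=timedelta(hours=1)))

# A relaxed requirement (12-hour RTO, 24-hour RPO) is met.
print(meets_objectives(nightly_backup, rto=timedelta(hours=12), rpo=timedelta(hours=24)))
```

The tighter the two objectives, the more often data must be copied and the more standby capacity must be ready, which is what drives the cost discussed next.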

What does Fault Tolerance, High Availability and Disaster Recovery cost?
Organizations that require almost 100% availability and little, if any, data loss build redundant datacenters: entire datacenters with hardware, software and identical copies of the application and data. Sometimes these extra datacenters are used to offload traffic from the primary datacenters during times of peak usage. For example, during the November/December shopping season or the March/April tax season, companies like an “online auction company” and an “online tax company” will offload some traffic from their primary servers to secondary servers. Other times, these extra datacenters simply sit idle, waiting for a disaster to occur at the primary datacenter. Hardware, software, electric power, air conditioning, building rent and physical security are all paid for “just in case” or for “when we need it”. The investment in the secondary datacenters is an up-front capital expenditure. Secondary datacenters are expensive for computing power that is often idle!

Introducing elastic cloud computing for Disaster Recovery, Fault Tolerance and High Availability.
What is an elastic band? A thin strip of rubber that stretches or contracts. If I need to bind three DVDs together before loaning them to a friend, the elastic band may stretch to perhaps 50% of capacity. If I am loaning six or seven DVDs, the elastic band may stretch to maximum capacity. Either way I only need one elastic band; I don’t need to buy two. This concept illustrates the elasticity of cloud computing: use and pay for computing services only when you need them.

In the example above, when tax season begins, “online tax company” can request and purchase computing services from a provider, such as Amazon Web Services, and then at the end of tax season stop using and paying for those computing services. Thus “online tax company” can provide an HA tax processing service to its customers, because its primary datacenters will not be overloaded. “Online tax company” will automatically scale out to a cloud provider such as Amazon as needed. Secondly, the extra computing services provide FT in case there is a disaster at its physical site during peak tax season.

High Availability, Fault Tolerance and Disaster Recovery in the Cloud
What about an organization that does not own a physical datacenter, but instead runs its entire operation in the cloud? How does its cloud provider provide FT, HA and DR? How does this compare to HA, FT and DR in physical datacenters?

What follows is a précis of two white papers from Amazon Web Services: Building Fault Tolerant Applications on AWS and Using AWS for Disaster Recovery.

This table is a technical overview of AWS’s FT, HA and DR features compared to a physical environment:

| Failure or fault | Physical server | Amazon Web Services |
| --- | --- | --- |
| Server failure | To provide fault tolerance I need to have a second server on standby, with an identical copy of the application. I also need to redirect traffic to the second server by changing the IP address of the second server to the IP address of the primary server. | I launch a replacement instance from my AMI (the machine image that serves as a template) using a script, an API call or the web console. I then remap my Elastic IP address to the new instance. AWS CloudFormation allows me to create a collection of instances and related resources. |
| Backups for FT | Physical backups often use tape or another medium for duplication and are often stored offsite. Backup and restore can be time consuming. | AWS backups are snapshots (of an EBS volume, or an AMI created from an instance) that can be restored using the command line or the web console, far more quickly than a restore from tape. |
| Storage | If a physical server uses Network Attached Storage, the data is preserved if the server fails. If the server uses an internal drive, the drive is unavailable and potentially lost if the server fails. | Elastic Block Store (EBS) volumes are separate from the instance and can persist even if the instance does not. EBS is built on highly redundant storage that has an annual failure rate of 0.1 to 0.5%, compared to 4% for a standard hard drive. |
| IP addressing | An IP address is bound to a physical server, and manual configuration is required to modify the IP address of a server. | Elastic IP addresses are bound to an AWS account rather than to an instance. They can be dynamically associated with or disassociated from an instance. |
| Scale/Growth for HA | To extend the capacity of an application running on physical servers while maintaining High Availability, I have to purchase more hardware, rack space, electrical power and cooling. When I have excess capacity, these servers sit idle, consume space and power, and waste money. | To extend the capacity of an application running in the AWS cloud, I can use Auto Scaling, which automatically adds instances to my capacity based on rules. Similarly, I can terminate instances when they are no longer needed. This also allows me to refresh instances if they degrade. |
| Load Balancing for HA and FT | A physical load balancer balances traffic across known physical servers. It can detect that a server is overloaded and direct traffic to other, less utilized servers. | Elastic Load Balancing distributes traffic across instances, detects which instances may be less responsive and redirects traffic away from them until they are restored. |
| Multiple Geographies for HA | Physical servers are housed in datacenters around the world to provide HA. | AWS is distributed across geographic regions, with Availability Zones within each region, to provide HA. |
| Guaranteed Failover and FT | An organization that purchases its own physical servers is guaranteed failover to those servers. | An AWS customer can purchase Reserved Instances that are guaranteed to the customer, regardless of what overall load AWS may experience. |
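
The load balancing and scaling behaviour in the comparison above can be sketched as a toy model. This is plain Python and makes no AWS API calls. The `Instance` type, the function names, the round-robin policy and the CPU thresholds (scale out above 70%, scale in below 30%) are all assumptions made for illustration, not the actual algorithms or defaults of Elastic Load Balancing or Auto Scaling.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for a running server instance."""
    name: str
    healthy: bool = True


def route(instances, request_number):
    """Round-robin over healthy instances only, skipping unresponsive ones."""
    healthy = [i for i in instances if i.healthy]
    if not healthy:
        raise RuntimeError("no healthy instances available")
    return healthy[request_number % len(healthy)]


def desired_capacity(current, avg_cpu, scale_out_above=70.0, scale_in_below=30.0,
                     minimum=2, maximum=10):
    """Rule-based scaling: add one instance under load, remove one when idle."""
    if avg_cpu > scale_out_above:
        return min(current + 1, maximum)
    if avg_cpu < scale_in_below:
        return max(current - 1, minimum)
    return current


fleet = [Instance("web-1"), Instance("web-2"), Instance("web-3")]
fleet[1].healthy = False  # simulate a failed health check on web-2

# Traffic is spread over the two remaining healthy instances.
served_by = [route(fleet, n).name for n in range(4)]
print(served_by)  # ['web-1', 'web-3', 'web-1', 'web-3']

print(desired_capacity(current=3, avg_cpu=85.0))  # 4 (scale out under load)
print(desired_capacity(current=3, avg_cpu=10.0))  # 2 (scale in when idle)
```

The point of the sketch is that both decisions are driven by measured health and load rather than by a fixed list of servers, which is what lets capacity follow demand.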

I can maintain a physical datacenter and use AWS as needed. There are four configuration options.

  1. AWS as backup - I can back up my physical environment to AWS using AWS Direct Connect or AWS Import/Export, thus using AWS for backup only.
  2. Minimal AWS - In this scenario I have core services permanently running in AWS, for example copies of my data/databases. In case of failure of my physical datacenter, I need to start up instances from AMIs that contain my applications, connect them to my redundant datastores in AWS and modify DNS settings to route traffic to AWS.
  3. Partial AWS - In this scenario I have a complete duplicate of my services permanently running in AWS, but on a minimal number of instances. In case of failure of my physical datacenter, I scale up the configuration of my instances to cope with the increased load.
  4. Complete AWS - In this scenario I have a complete duplicate of my physical datacenter's configuration in AWS. I can use the weighted DNS routing of Amazon Route 53 to redirect traffic, application logic to use the AWS datastores and EC2 Auto Scaling to grow capacity within AWS.
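
The weighted DNS routing in the last option can be illustrated with a short sketch. With weighted routing, each endpoint receives a share of traffic in proportion to its weight divided by the sum of all weights. The code below models only that arithmetic in plain Python and does not call Route 53. The endpoint names `datacenter` and `aws` and the 90/10 split are hypothetical.

```python
import random


def traffic_share(weights):
    """Return the fraction of DNS responses each endpoint receives."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one endpoint needs a non-zero weight")
    return {name: weight / total for name, weight in weights.items()}


def resolve(weights, rng=random):
    """Pick one endpoint for a single lookup, in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Normal operation: most traffic stays in the physical datacenter.
normal = {"datacenter": 90, "aws": 10}
print(traffic_share(normal))  # {'datacenter': 0.9, 'aws': 0.1}

# Disaster at the physical site: set its weight to zero so all traffic goes to AWS.
failover = {**normal, "datacenter": 0}
print(traffic_share(failover))  # {'datacenter': 0.0, 'aws': 1.0}
print(resolve(failover))        # aws
```

Failing over becomes a change to one weight rather than a change to server IP addresses, although clients may keep using cached DNS answers until the record's TTL expires.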

More Stories By Jonathan Gershater

Jonathan Gershater has lived and worked in Silicon Valley since 1996, primarily doing system and sales engineering, specializing in Web Applications, Identity and Security. At Red Hat, he provides Technical Marketing for Virtualization and Cloud. Prior to joining Red Hat, Jonathan worked at 3Com, Entrust (by acquisition), two startups, Sun Microsystems and Trend Micro.

(The views expressed in this blog are entirely mine and do not represent my employer - Jonathan).
