Amazon's Outage: Winners and Losers

There were three startling revelations from this failure

In case you haven’t heard, last week Amazon Web Services had an extended outage that affected many cloud users and created a big stir in the cloud computing community. Here is my take on the outage, what it means, and how it affected both us and our customers.

First, the outage – Amazon’s Elastic Block Store (EBS) service failed in one “small” part of Amazon’s mammoth cloud. The easiest way to relate to this failure is to think of it as the hard drive in your laptop or home computer dying. Almost everyone has experienced this frustrating failure, and it is always a frightening event. Your hard disk contains all of the data and programs you know and love (as well as all of the complex configurations that few really understand) that give your computer its identity and let it run properly. When it fails, you get that sinking feeling – what have I lost, do I have any backups, how long is it going to take me to get up and running again? If you are lucky, it’s actually something else that failed, and you can remove the disk drive and plug it into a new computer, or at least recover the data from it.

Well, that is what happened in Amazon – the service that provides disk drives to the virtual machines in the cloud had a failure. Thousands (or perhaps tens of thousands) of computers suddenly had their hard drives “die,” and the big question (repeated over and over) was – what have I lost, do I have any backups, how long is it going to take me to get running again? At a more pressing level, the question becomes: did Amazon lose our data, or just connectivity to that data? In some circles there is no distinction between the two, but for most people it’s just like that laptop – can I get my stuff back, or do I have to start from scratch? The good news is that it appears that most, if not all, of the data was recovered from this failure – indicating that the failure was in connectivity, or that the data protection scheme Amazon has in place was good enough to recover the data from the failed systems (or both).

There were three startling revelations from this failure:

  1. Cloud storage can fail – OK, so this should not be startling, but we had not seen a failure like this in Amazon before, and we took it for granted that Amazon had “protected” the data and systems.  There are nice features like “snapshot to S3 storage” that let users make copies of their disks into Amazon’s well-known and respected Simple Storage Service (S3); a minimal sketch of this snapshot mechanism follows this list.  This feature made people feel safe about their backups – right up to the point that they could not access the snapshots during the failure.
  2. People were using this storage without knowing it – Amazon is using its own infrastructure to deliver its other services, as evidenced by other features being degraded or unavailable when the EBS system went down.  Some have speculated that, in addition to the RDS service and the new Elastic Beanstalk, some core networking functions could have been affected.
  3. This storage failure apparently jumped across “data centers” – OK, so this is a big one.  Amazon encouraged us to build applications for failure, and after an outage in early 2008 they introduced the notion of Availability Zones.  These zones were designed to be independent data centers (or at least on separate power supplies and different networks, and not sharing core services).  This would allow companies deploying into Amazon to place servers and applications into different zones (or data centers) to account for the inevitable faults in one zone.  The fact that this issue “spread” to more than one zone is a big deal – those who designed for failure using different zones in Amazon’s east region were surprised to find that they could not “recover” from this failure.
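
For readers who want to see what the “snapshot to S3 storage” protection in item 1 actually looks like in practice, here is a minimal sketch using the boto3 Python SDK. This is purely illustrative and not from the original post; the region and volume ID are hypothetical placeholders, and in a real deployment something like this would run on a schedule.

# Minimal illustrative sketch (assumption: boto3 SDK; the region and volume ID
# are hypothetical). Creates an EBS snapshot -- the "snapshot to S3" feature
# referenced above -- and waits until the snapshot has been fully stored.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")       # hypothetical region

snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",                     # hypothetical volume ID
    Description="Nightly protection snapshot",
)

# Block until Amazon reports the snapshot as completed.
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])
print("Snapshot complete:", snapshot["SnapshotId"])

The catch, as item 1 notes, is that the snapshot still lives inside the same provider – during this outage some users could not reach their snapshots at all, which is part of what made the failure so startling.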

This outage is a major event in cloud computing – the leader in cloud computing had a failure, a service went down for an extended period of time, and lots of companies were impacted by the fault.  Now everyone is looking for the winners and losers – those who survived this outage and those who didn’t.  Those who continued operation with little or no disruption fall into three groups: 1) those who were lucky, 2) those who were not using the affected services, and 3) those who had designed for this level of failure.

Based on our experience and much of what I have read, the majority of the success stories during this failure came down to luck or to not using the service at all.  Keep in mind that Amazon is huge, with “regions” all over the world, including the East and West Coasts of the US, Singapore, Tokyo, and Ireland.  Each of these regions has at least two availability zones.  The failure was primarily focused on one zone within one region.  This means that everything running in other zones and other regions remained up during this outage, and thus the majority of deployments worldwide were unaffected.  I have also read a few blog posts from major users stating that they don’t use the EBS service and thus had little or no trouble during this outage.

So what is required to survive this kind of failure?  Many would say new architectures and designs are required to deal with the inherent unreliability of the cloud.  I believe that customers can keep the same techniques, architectures, and designs that have been developed over the last 30 or more years, and this belief is one of the cornerstones of the CloudSwitch strategy.  We believe it should be your choice where to use new features and solutions and where to use your traditional systems and processes, and it should be easy to blend the two.

To that end, some of our customers use a technique that extends their environments into the cloud, using the cloud to pick up additional load on their systems.  In this failure, they were able to rely on their internal systems to continue operations.  In other cases, customers want to use their existing backup systems to create an independent copy of their critical data (either in a different region or in their existing data center).  With the cloud, they can bring up new systems from their backups and continue operations.  The CloudSwitch system allows them to bring up systems in different regions, or even different clouds, in response to outages; our tight integration with data center tools allows them to use their existing monitoring systems and adjust for problems encountered in the cloud through automation.
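
To make the “bring up systems in a different region” idea concrete, here is an illustrative sketch of one common recovery pattern, again using the boto3 Python SDK. This is a generic example, not a description of how the CloudSwitch product works internally; the snapshot ID, regions, availability zone, AMI, and instance type are all hypothetical placeholders.

# Illustrative sketch only (assumption: boto3 SDK; all IDs, regions, and the
# instance type are hypothetical). Step 1 runs routinely before any outage;
# step 2 runs when the primary region is in trouble.
import boto3

SOURCE_REGION = "us-east-1"        # hypothetical primary region
RECOVERY_REGION = "us-west-1"      # hypothetical recovery region
RECOVERY_ZONE = "us-west-1a"       # hypothetical availability zone

recovery = boto3.client("ec2", region_name=RECOVERY_REGION)

# Step 1: copy the latest protection snapshot into the recovery region so the
# data already lives outside the failed region before anything goes wrong.
copied = recovery.copy_snapshot(
    SourceRegion=SOURCE_REGION,
    SourceSnapshotId="snap-0123456789abcdef0",            # hypothetical snapshot
    Description="Cross-region copy for disaster recovery",
)
recovery.get_waiter("snapshot_completed").wait(SnapshotIds=[copied["SnapshotId"]])

# Step 2: during an outage, restore a volume from the copied snapshot and
# attach it to a freshly launched replacement server in the recovery region.
volume = recovery.create_volume(
    SnapshotId=copied["SnapshotId"],
    AvailabilityZone=RECOVERY_ZONE,
)
instance = recovery.run_instances(
    ImageId="ami-0123456789abcdef0",                      # hypothetical AMI
    InstanceType="t3.micro",                              # hypothetical type
    MinCount=1,
    MaxCount=1,
    Placement={"AvailabilityZone": RECOVERY_ZONE},
)["Instances"][0]

recovery.get_waiter("instance_running").wait(InstanceIds=[instance["InstanceId"]])
recovery.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
recovery.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId=instance["InstanceId"],
    Device="/dev/sdf",
)

The key design choice here is that the cross-region copy happens continuously as part of normal operations, so recovery never depends on being able to reach the region that just failed.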

How did we do?  We’re very heavy users of the cloud, and many of our servers in Amazon were not impacted.  Of the few that were affected by the outage, a few key systems were “switched” back to our data center, and, unfortunately, a few went down.  On the servers that went down, we had decided to use Amazon’s snapshot feature as the data protection mechanism; we felt this was sufficient for these applications, and therefore did not bother to run more traditional backups (or data replication).  Given what we have learned from this experience, and from observing how the community dealt with this outage, we will now review those decisions.  In the end, we’ll have a few more traditionally protected systems, and a few fewer that rely solely on the cloud provider’s infrastructure for data protection.

The outage at Amazon severely impacted many businesses and has caused many others to question the wisdom of clouds. The reality is that public and private clouds are now a fact of the computing landscape; the only question is how we ensure that we have adequate protection. The answer lies in the experience we have gained over the past couple of decades in building robust systems – in other words: what’s old is new.


More Stories By John Considine

John Considine is Co-Founder & CTO of CloudSwitch. He brings two decades of technology vision and proven experience in complex enterprise system development, integration and product delivery to CloudSwitch. Before founding CloudSwitch, he was Director of the Platform Products Group at Sun Microsystems, where he was responsible for the 69xx virtualized block storage system, the 53xx NAS products, the 5800 Object Archive system, as well as the next-generation NAS portfolio.

Considine came to Sun through the acquisition of Pirus Networks, where he was part of the early engineering team responsible for the development and release of the Pirus NAS product, including advanced development of parallel NAS functions and the Segmented File System. He has started and bootstrapped a number of start-ups with breakthrough technology in high-performance distributed systems and image processing. He has been granted patents for RAID and distributed file system technology. He began his career as an engineer at Raytheon Missile Systems, and holds a BS in Electrical Engineering from Rensselaer Polytechnic Institute.
