Amazon's Outage: Winners and Losers

There were three startling revelations from this failure

In case you haven’t heard, last week Amazon Web Services had an extended outage that affected a large number of cloud users and created a big stir in the cloud computing community. Here is my take on the outage, what it means, and how it affected both us and our customers.

First, the outage – Amazon’s Elastic Block Store (EBS) system failed in one “small” part of Amazon’s mammoth cloud. The easiest way to identify with this failure is to think of it as the hard drive in your laptop or home computer dying. Almost everyone has experienced this frustrating failure, and it is always a frightening event. Your hard disk contains all of the data and programs you know and love (as well as all of the complex configurations that few really understand) that give your computer its identity and let it run properly. When it fails, you get that sinking feeling – what have I lost, do I have any backups, how long is it going to take me to get up and running again? If you are lucky, it’s actually something else that failed, and you can remove the disk drive and plug it into a new computer, or at least recover the data from it.

Well, that is what happened at Amazon – the service that provides disk drives to the virtual machines in the cloud had a failure. Thousands (or perhaps tens of thousands) of computers suddenly had their hard drives “die”, and the big question (repeated over and over) was – what have I lost, do I have any backups, how long is it going to take me to get running again? At a more pressing level, the question becomes: did Amazon lose our data, or just connectivity to that data? In some circles there is no distinction between the two, but for most people it’s just like that laptop – can I get my stuff back, or do I have to start from scratch? The good news is that it appears that most, if not all, of the data was recovered from this failure – indicating that the failure was in connectivity, or that the data protection scheme Amazon has in place was good enough to recover the data from the failed systems (or both).

There were three startling revelations from this failure:

  1. Cloud storage can fail – OK, so this should not be startling, but we had not seen a failure like this in Amazon before, and we took it for granted that Amazon had “protected” the data and systems.  There are nice features like “snapshot to S3 storage” that let users make copies of their disks into Amazon’s well-known and respected Simple Storage Service (S3).  This feature made people feel safe about their backups – right up to the point that they could not access the snapshots during the failure.
  2. People were using this storage without knowing it – Amazon uses its own infrastructure to deliver its other services, as evidenced by other features being degraded or unavailable when the EBS system went down.  Some have speculated that, in addition to the RDS service and the new Elastic Beanstalk, some core networking functions may have been affected.
  3. This storage failure apparently jumped across “data centers” – OK, so this is a big one.  Amazon encouraged us to build applications for failure, and after an outage early in 2008 they introduced the notion of Availability Zones.  These zones were designed to be independent data centers (or at least on separate power supplies and different networks, and not sharing core services).  This would allow companies deploying into Amazon to place servers and applications into different zones (or data centers) to account for the inevitable faults in any one zone.  The fact that this issue “spread” to more than one zone is a big deal – those who designed for failures using different zones in Amazon’s east region were surprised to find that they could not “recover” from this failure.
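
The multi-zone design described in point 3 can be sketched in a few lines. This is a purely illustrative Python sketch (not Amazon tooling – the zone names and server IDs are made up for the example): spread servers round-robin across availability zones so that losing any single zone still leaves most of the fleet running.

```python
# Hypothetical sketch of multi-zone placement. In a real deployment the
# zones would come from the provider and the servers would be launched
# via its API; here we only model the placement logic.

def place_servers(server_ids, zones):
    """Round-robin servers across zones; returns {zone: [servers]}."""
    placement = {z: [] for z in zones}
    for i, server in enumerate(server_ids):
        placement[zones[i % len(zones)]].append(server)
    return placement

def survivors(placement, failed_zone):
    """Servers still running if one entire zone goes down."""
    return [s for zone, servers in placement.items()
            if zone != failed_zone for s in servers]

zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
plan = place_servers([f"web-{n}" for n in range(6)], zones)
# Losing any one of three zones leaves two thirds of the fleet running.
print(len(survivors(plan, "us-east-1a")))  # → 4
```

The caveat of this outage, of course, is that the design only helps if the zones really do fail independently – which is exactly the assumption that was violated.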

This outage is a major event in cloud computing – the leader in cloud computing had a failure, a service went down for an extended period of time, and many companies were impacted by the fault.  Now everyone is looking for the winners and losers – those who survived this outage and those who didn’t.  Those who continued operation with little or no disruption fall into three groups: 1) those who were lucky, 2) those who were not using the affected services, and 3) those who had designed for this level of failure.

Based on our experience and much of what I have read, the majority of the success stories during this failure came down to luck, or to not using the service.  Keep in mind that Amazon is huge, with “regions” all over the world, including the East and West coasts of the US, Singapore, Tokyo, and Ireland.  Each of these regions has at least two availability zones.  The failure was primarily focused on one zone within one region.  This means that everything running in other zones and other regions remained up during this outage, and thus the majority of deployments worldwide were unaffected.  I have read a few blogs from major users stating that they don’t use the EBS service and thus had little or no trouble during this outage.

So what is required to survive this kind of failure?  Many would say new architectures and designs are required to deal with the inherent unreliability of the cloud.  I believe that customers can keep the same techniques, architectures, and designs that have been developed over the last 30 or more years – this belief is one of the cornerstones of the CloudSwitch strategy.  We believe that it should be your choice where to use new features and solutions and where to use your traditional systems and processes, and that it should be easy to blend the two.

To that end, some of our customers use a technique that extends their environments into the cloud, using the cloud to pick up additional load on their systems.  In this failure case, they were able to rely on their internal systems to continue their operations.  In other cases, customers want to use their existing backup systems to create an independent copy of their critical data (either in a different region, or in their existing data center).  With the cloud, they can bring up new systems from those backups and continue operations.  The CloudSwitch system allows them to bring up systems in different regions, or even different clouds, in response to outages; our tight integration with data center tools lets them use their existing monitoring systems and adjust for problems encountered in the cloud through automation.
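
The failover pattern above – fall back to another region, another cloud, or the internal data center when monitoring reports an outage – reduces to a simple preference-ordered selection. This is an illustrative sketch only (it is not CloudSwitch’s actual API; the region names and health states are invented for the example):

```python
# Hypothetical recovery-target selection: walk an ordered preference list
# and pick the first target that monitoring currently reports as healthy.

def pick_recovery_region(preferred, region_health):
    """Return the first preferred region reported healthy, else None."""
    for region in preferred:
        if region_health.get(region) == "healthy":
            return region
    return None  # nothing healthy – escalate to operators

# Example: primary region is degraded, so we fail over to the next choice.
health = {"us-east-1": "degraded", "us-west-1": "healthy", "internal-dc": "healthy"}
print(pick_recovery_region(["us-east-1", "us-west-1", "internal-dc"], health))
# → us-west-1
```

The hard part in practice is not this selection but keeping an independent, restorable copy of the data in each candidate target – which is exactly what the backup-based approach above provides.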

How did we do?  We’re very heavy users of the cloud, and many of our servers in Amazon were not impacted.  Of the few that were affected by the outage, some key systems were “switched” back to our data center, and unfortunately a few went down.  On the servers that went down, we had decided to use Amazon’s snapshot feature as the data protection mechanism; we felt this was sufficient for those applications, and therefore did not bother to run more traditional backups (or data replication).  Given what we have learned from this experience, and from observing how the community dealt with the outage, we will now review those decisions.  In the end, we’ll have a few more traditionally protected systems, and a few fewer that rely solely on the cloud provider’s infrastructure for data protection.
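
The review described above amounts to a per-application data-protection policy. As a hypothetical sketch (the tier names and mechanisms are made up – this is not our actual tooling): every system keeps provider-side snapshots, and systems classified as critical additionally get an independent, off-cloud backup.

```python
# Hypothetical protection policy: provider snapshots for everything,
# plus an independent off-cloud backup for critical applications.

def protection_plan(app, critical_apps):
    """Return the list of protection mechanisms for an application."""
    plan = ["ebs-snapshot"]            # provider-side protection
    if app in critical_apps:
        plan.append("offsite-backup")  # independent copy in our data center
    return plan

critical = {"billing", "customer-db"}
print(protection_plan("billing", critical))   # → ['ebs-snapshot', 'offsite-backup']
print(protection_plan("dev-wiki", critical))  # → ['ebs-snapshot']
```

The lesson of the outage is in the second branch: for anything you truly cannot lose, at least one copy must live outside the failing provider’s infrastructure.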

The outage at Amazon severely impacted many businesses and has caused many others to question the wisdom of clouds. The reality is that public and private clouds are now a fact of the compute landscape; the only question is how we ensure that we have adequate protection. The answer lies in the experience we have gained over the past couple of decades in building robust systems – in other words: what’s old is new.


More Stories By John Considine

John Considine is Co-Founder & CTO of Cloudswitch. He brings two decades of technology vision and proven experience in complex enterprise system development, integration and product delivery to CloudSwitch. Before founding CloudSwitch, he was Director of the Platform Products Group at Sun Microsystems, where he was responsible for the 69xx virtualized block storage system, 53xx NAS products, the 5800 Object Archive system, as well as the next generation NAS portfolio.

Considine came to Sun through the acquisition of Pirus Networks, where he was part of the early engineering team responsible for the development and release of the Pirus NAS product, including advanced development of parallel NAS functions and the Segmented File System. He has started and bootstrapped a number of startups with breakthrough technology in high-performance distributed systems and image processing. He has been granted patents for RAID and distributed file system technology. He began his career as an engineer at Raytheon Missile Systems, and holds a BS in Electrical Engineering from Rensselaer Polytechnic Institute.
