@CloudExpo Authors: Yeshim Deniz, Elizabeth White, Pat Romanski, Liz McMillan, Zakia Bouachraoui

Related Topics: @CloudExpo, Java IoT, Microservices Expo, Machine Learning

@CloudExpo: Blog Feed Post

Failure as a Service

The conversation stems from two power outages on May 4 and an extended power loss early on Saturday, May 8

A recent seven hour outage at Amazon Web Services on Saturday has renewed the discussion about cloud failures and whether the customer or the provider of the services should be held responsible. The conversation stems from two power outages on May 4 and an extended power loss early on Saturday, May 8. Saturday’s outage began at about 12:20 a.m. and lasted until 7:20 a.m., and affected a “set of racks,” according to Amazon, which said the bulk of customers in its U.S. East availability zone remained unaffected.

In one of the most direct posts, Amazon EBS sucks I just lost all my data, Dave Dopson said "they [AWS] promise redundancy, it is BS." Going on to point to AWS's statement " EBS volumes are designed to be highly available and reliable. Amazon EBS volume data is replicated across multiple servers in an Availability Zone to prevent the loss of data from the failure of any single component. The durability of your volume depends both on the size of your volume and the percentage of the data that has changed since your last snapshot. As an example, volumes that operate with 20 GB or lessof modified data since their most recent Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% –0.5%, where failure refers to a complete loss of the volume. This compares with commodity hard disks that will typically fail with an AFR of around 4%, making EBS volumes 10 times more reliable than typical commodity disk drives."

Like many new users to cloud computing, he assumed that he could just use the service and upon failure AWS's redundancy would automatically fix any problems, because they do (sort of) say that they prevent data loss. What Amazon actually states is a little different in that "the durability of your volume depends both on the size of your volume and the percentage of the data that has changed since your last snapshot" placing the responsibility on the customer. On one hand they say that they prevent data loss, but only if you use the AWS cloud correctly, otherwise you're SOL. The reality is that AWS for most users requires significant failure planning -- in this case the use of EBS's snap shot capability. The problem is that most [new] users have a hard time learning the rules of the road. A quick search for AWS failure planning on the AWS forums resulted in little additional insights and really appears to mostly about trial and error.

In the case of hardware failures Amazon expects you to design your architecture correctly for these kinds of events by use of redundancy, for example use mutliple VM's etc. They expect a certain level of knowledge of both system administration as well as how AWS itself has been designed to be used. Newbies need not apply or should use at you're own risk. Which isn't all that clear to a new user, who hears that cloud computing is safe and the answer to all your problems. Which I admit should be a red flag in itself. The problem is two fold, an over hyped technology and unclear failure models which combine to create a perfect storm. You need the late adopters for the real revenue opportunities, but these same late adopters require a different more gentle kind of cloud service, probably one a little more platform than infrastructure focused. As IaaS matures it is becoming obvious that the "Über Geek" developers who first adopted the service is not where the long tail revenue opportunities are. To make IaaS viable to a broader market, AWS and other IaaS vendors need to mature their platforms for a lesser type of user. (A lower or least common denominator) One who is smart enough to be dangerous, otherwise they're doomed to be limited to the only for experts only segment.

The bigger question is should a cloud user have to worry about hardware failures or should these types of failures be the sole responsiblity of the service provider? My opinion is deploying to the cloud should reduce complexity, not increase it. The user should be responsible for what they have access to, so in the case of AWS, they should be responsible for failures that are brought about by the applications and related components they build and deploy, not by the hardware. If hardware fails (which it will) this should be the responsibility of those who manage and provide it. Making things worst is promising to be highly available, reliable and redundant, but with the fine print of "if you are smart enough to use all our services in the proper way" which isn't fair. If EBS is automatically replicated why did Dave lose all his data?

In a optimal cloud environment any single server failures shouldn't matter. But it appears at AWS it does.

More Stories By Reuven Cohen

An instigator, part time provocateur, bootstrapper, amateur cloud lexicographer, and purveyor of random thoughts, 140 characters at a time.

Reuven is an early innovator in the cloud computing space as the founder of Enomaly in 2004 (Acquired by Virtustream in February 2012). Enomaly was among the first to develop a self service infrastructure as a service (IaaS) platform (ECP) circa 2005. As well as SpotCloud (2011) the first commodity style cloud computing Spot Market.

Reuven is also the co-creator of CloudCamp (100+ Cities around the Globe) CloudCamp is an unconference where early adopters of Cloud Computing technologies exchange ideas and is the largest of the ‘barcamp’ style of events.

CloudEXPO Stories
Sanjeev Sharma Joins November 11-13, 2018 @DevOpsSummit at @CloudEXPO New York Faculty. Sanjeev Sharma is an internationally known DevOps and Cloud Transformation thought leader, technology executive, and author. Sanjeev's industry experience includes tenures as CTO, Technical Sales leader, and Cloud Architect leader. As an IBM Distinguished Engineer, Sanjeev is recognized at the highest levels of IBM's core of technical leaders.
DXWorldEXPO LLC announced today that Kevin Jackson joined the faculty of CloudEXPO's "10-Year Anniversary Event" which will take place on November 11-13, 2018 in New York City. Kevin L. Jackson is a globally recognized cloud computing expert and Founder/Author of the award winning "Cloud Musings" blog. Mr. Jackson has also been recognized as a "Top 100 Cybersecurity Influencer and Brand" by Onalytica (2015), a Huffington Post "Top 100 Cloud Computing Experts on Twitter" (2013) and a "Top 50 Cloud Computing Blogger for IT Integrators" by CRN (2015). Mr. Jackson's professional career includes service in the US Navy Space Systems Command, Vice President J.P. Morgan Chase, Worldwide Sales Executive for IBM and NJVC Vice President, Cloud Services. He is currently part of a team responsible for onboarding mission applications to the US Intelligence Community cloud computing environment (IC ...
Charles Araujo is an industry analyst, internationally recognized authority on the Digital Enterprise and author of The Quantum Age of IT: Why Everything You Know About IT is About to Change. As Principal Analyst with Intellyx, he writes, speaks and advises organizations on how to navigate through this time of disruption. He is also the founder of The Institute for Digital Transformation and a sought after keynote speaker. He has been a regular contributor to both InformationWeek and CIO Insight and has been quoted or published in Time, CIO, Computerworld, USA Today and Forbes.
When talking IoT we often focus on the devices, the sensors, the hardware itself. The new smart appliances, the new smart or self-driving cars (which are amalgamations of many ‘things'). When we are looking at the world of IoT, we should take a step back, look at the big picture. What value are these devices providing. IoT is not about the devices, its about the data consumed and generated. The devices are tools, mechanisms, conduits. This paper discusses the considerations when dealing with the massive amount of information associated with these devices. Ed presented sought out sessions at CloudEXPO Silicon Valley 2017 and CloudEXPO New York 2017. He is a regular contributor to Cloud Computing Journal.
Digital Transformation is much more than a buzzword. The radical shift to digital mechanisms for almost every process is evident across all industries and verticals. This is often especially true in financial services, where the legacy environment is many times unable to keep up with the rapidly shifting demands of the consumer. The constant pressure to provide complete, omnichannel delivery of customer-facing solutions to meet both regulatory and customer demands is putting enormous pressure on organizations of all sizes and in every line of business. Fintech is a constant battleground for this technology expanding trend and the lessons learned here can be applied anywhere. Digital transformation isn't going to go away and the need for greater understanding and skills around managing, guiding, and understanding the greater landscape of change is required for effective transformations.