Welcome!

@CloudExpo Authors: Carmen Gonzalez, Liz McMillan, Pat Romanski, Elizabeth White, Harry Trott

Related Topics: Containers Expo Blog, @CloudExpo

Containers Expo Blog: Article

Cloud Computing and Reliability

The Cloud offers some enticing advantages with respect to reliability

Eric Novikoff's Blog

IT managers and pundits speak of the reliability of a system in "nines." Two nines is the same as 99%, which comes to (100%-99%)*365 or 3.65 days of downtime per year, which is typical for non-redundant hardware if you include the time to reload the operating system and restore backups (if you have them) after a failure. Three nines is about 8 hours of downtime, four nines is about 52 minutes and the holy grail of 5 nines is 7 minutes.

From a users' point of view, downtime is downtime, but for a provider/vendor/web site manager, downtime is divided into planned and unplanned. Cloud computing can offer some benefits for planned downtime, but the place that it can have the largest effect on a business is in reducing unplanned downtime.

Planned downtime is usually the result of having to do some sort of software maintenance or release process, which is usually outside the domain of the cloud vendor, unless that vendor also offers IT operations services. Other sources of planned downtime are upgrades or scheduled equipment repairs. Most cloud vendors have some planned downtime, but because their business is based on providing high uptime, scheduled downtimes are kept to a minimum.

Unplanned downtime is where cloud vendors have the most to offer, and also the most to lose. Recent large outages at Amazon and Google have shown that even the largest cloud vendors can still have glitches that take considerable time to repair and give potential cloud customers a scare (perhaps it is because they didn't take some planned downtime??) On the other hand, cloud vendors have the experienced staff and proven processes that should produce overall hardware and network reliability that meets or exceeds that of the average corporate data center, and far exceeds anything you can achieve with colocated or self-managed servers.

However, despite claims of reliablity, few cloud vendors have tight SLAs (service level agreements) that promise controlled downtime or offer rebates for excess downtime. Amazon goes the opposite direction and doesn't offer any uptime guarantees, even cautioning users that their instance (or server) can disappear at any time and that they should plan accordingly. AppLogic-based clouds, provided by companies such as ENKI, are capable of offering better guarantees of uptime because of its inherent self-healing capabilities that can enable 3-4 nines of uptime. (The exact number depends on how the AppLogic system is set up and administered, which affects the time needed for the system to heal itself.) However, any cloud computing system, even even those based on AppLogic or similar technologies, can experience unplanned downtime for a variety of reasons, including the common culprit of human error. While I believe it is possible to produce a cloud computing service that exceeds 4-9s of uptime, the costs would be so high that few would buy it when they compared the price to the average cloud offering.


When you're purchasing cloud computing, it makes sense to look at the SLA of the vendor as well as the reliability of the underlying technology. But if your needs for uptime exceed that which the vendors and their technology can offer, there are time-honored techniques for improving it, most of which involve doubling the amount of computing nodes in your application. There's an old adage that each additional "9" of uptime you get doubles your cost, and that's because you need backup systems that are in place to take over if the primaries fail. This involves creating a system architecture for your application that allows for either active/passive failover (meaning that the backup nodes are running but not doing anything) or active/active failover (meaning that the backup nodes are normally providing application computing capability).

These solutions can be implemented in any cloud technology but they always require extra design and configuration effort for your application, and they should be tested rigorously to make sure they will work when the chips are down. Failover solutions are generally less expensive to implement in the Cloud because of the on-demand or pay-as-you go nature of cloud services, which means that you can easily size the backup server nodes to meet your needs and save on computing resources.

An important component of reliability is a good backup strategy. With cloud computing systems like AppLogic offering highly reliable storage as part of the package, many customers are tempted to skip backup. But data loss and the resulting unplanned downtime can result not just from failures in the cloud platform, but also software bugs, human error, or malfeasance such as hacking. If you don't have a backup, you'll be down a long time - and this applies equally to cloud and non-cloud solutions. The advantages of cloud solutions is that there is usually an inexpensive and large storage facility coupled with the cloud computing offering which gives you a convenient place to store your backups.

For the truly fanatical, backing up your data from one cloud vendor to another provides that extra measure of security. It pays to think through your backup strategy because most of today's backup software packages or remote backup services were designed for physical servers and not virtual environments having many virtual servers such as you might find in the cloud. This can mean very high software costs for doing backup if your backup software charges on a "per server" basis and your application is spread across many instances. If your cloud vendor has a backup offering, usually they have found a way to make backup affordable even if your application consists of many compute instances.

Another aspect of reliability that often escapes cloud computing customers new to the world of computing services is monitoring. It's very hard to react to unplanned downtime if you don't know your system is down. It's also hard to avoid unplanned downtime if you don't know you're about to run out of disk space or memory, or perhaps your application is complaining about data corruption. A remote monitoring service can scan your servers in the cloud on a regular basis for faults, application problems, or even measure the performance of your application (like how long it takes to buy a widget in your web store) and report to you if anything is out of the ordinary. I say "service" because if you were to install your own monitoring server into your cloud and the cloud went down, so would your monitoring! At ENKI, we solve this problem by having our monitoring service hosted in a separate data center and under a different software environment than our primary cloud hosting service.

The last aspect of reliability is security. However, that would require another entire article to cover, since security in the cloud is a complex and relatively new topic.

To sum up, the Cloud offers some enticing advantages with respect to reliability, perhaps the largest of which is that you can give your data center operations responsibility to someone who theoretically can do a much better job at a lower cost than you can. However, to get very good reliability, you must still apply traditional approaches of redundancy and observability that have been used in physical data centers for decades - or, you have to find a cloud computing services provider that can implement them for you.

More Stories By Eric Novikoff

Eric Novikoff is COO of ENKI, A Cloud Services Vendor. He has over 20 years of experience in the electronics and software industries, over a range of positions from integrated circuit designer to software/hardware project manager, to Director of Development at an Internet Software As A Service startup, Netsuite.com. His technical, project, and financial management skills have been honed in multiple positions at Hewlett-Packard and Agilent Technologies on a variety of product lines, including managing the development and roll-out of a worldwide CRM and sales automation application for Agilent's $350 million Automatic Test Equipment business. Novikoff also has a strong interest in SME (Small/Medium Size Enterprise) management, process development, and operations as a consequence of working at a web based ERP service startup serving SMEs, and through his small-business ERP consulting work.

Comments (1) View Comments

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


Most Recent Comments
faseidl 09/10/08 11:45:52 AM EDT

Despite what many pundits have to say, reliability issues will not be the downfall of cloud computing. Using cloud computing does not mean neglecting to architect solutions that meet their business requirements, including reliability requirements.

I wrote more about this idea here:

Cloud Computing and Reliability
http://faseidl.com/public/item/212584

@CloudExpo Stories
Information technology is an industry that has always experienced change, and the dramatic change sweeping across the industry today could not be truthfully described as the first time we've seen such widespread change impacting customer investments. However, the rate of the change, and the potential outcomes from today's digital transformation has the distinct potential to separate the industry into two camps: Organizations that see the change coming, embrace it, and successful leverage it; and...
The 20th International Cloud Expo has announced that its Call for Papers is open. Cloud Expo, to be held June 6-8, 2017, at the Javits Center in New York City, brings together Cloud Computing, Big Data, Internet of Things, DevOps, Containers, Microservices and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportunity. Submit your speaking proposal ...
You have great SaaS business app ideas. You want to turn your idea quickly into a functional and engaging proof of concept. You need to be able to modify it to meet customers' needs, and you need to deliver a complete and secure SaaS application. How could you achieve all the above and yet avoid unforeseen IT requirements that add unnecessary cost and complexity? You also want your app to be responsive in any device at any time. In his session at 19th Cloud Expo, Mark Allen, General Manager of...
20th Cloud Expo, taking place June 6-8, 2017, at the Javits Center in New York City, NY, will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy.
Cloud Expo, Inc. has announced today that Andi Mann returns to 'DevOps at Cloud Expo 2017' as Conference Chair The @DevOpsSummit at Cloud Expo will take place on June 6-8, 2017, at the Javits Center in New York City, NY. "DevOps is set to be one of the most profound disruptions to hit IT in decades," said Andi Mann. "It is a natural extension of cloud computing, and I have seen both firsthand and in independent research the fantastic results DevOps delivers. So I am excited to help the great t...
DevOps is being widely accepted (if not fully adopted) as essential in enterprise IT. But as Enterprise DevOps gains maturity, expands scope, and increases velocity, the need for data-driven decisions across teams becomes more acute. DevOps teams in any modern business must wrangle the ‘digital exhaust’ from the delivery toolchain, "pervasive" and "cognitive" computing, APIs and services, mobile devices and applications, the Internet of Things, and now even blockchain. In this power panel at @...
Internet of @ThingsExpo has announced today that Chris Matthieu has been named tech chair of Internet of @ThingsExpo 2017 New York The 7th Internet of @ThingsExpo will take place on June 6-8, 2017, at the Javits Center in New York City, New York. Chris Matthieu is the co-founder and CTO of Octoblu, a revolutionary real-time IoT platform recently acquired by Citrix. Octoblu connects things, systems, people and clouds to a global mesh network allowing users to automate and control design flo...
Bert Loomis was a visionary. This general session will highlight how Bert Loomis and people like him inspire us to build great things with small inventions. In their general session at 19th Cloud Expo, Harold Hannon, Architect at IBM Bluemix, and Michael O'Neill, Strategic Business Development at Nvidia, discussed the accelerating pace of AI development and how IBM Cloud and NVIDIA are partnering to bring AI capabilities to "every day," on-demand. They also reviewed two "free infrastructure" pr...
Financial Technology has become a topic of intense interest throughout the cloud developer and enterprise IT communities. Accordingly, attendees at the upcoming 20th Cloud Expo at the Javits Center in New York, June 6-8, 2017, will find fresh new content in a new track called FinTech.
SYS-CON Events has announced today that Roger Strukhoff has been named conference chair of Cloud Expo and @ThingsExpo 2017 New York. The 20th Cloud Expo and 7th @ThingsExpo will take place on June 6-8, 2017, at the Javits Center in New York City, NY. "The Internet of Things brings trillions of dollars of opportunity to developers and enterprise IT, no matter how you measure it," stated Roger Strukhoff. "More importantly, it leverages the power of devices and the Internet to enable us all to im...
Get deep visibility into the performance of your databases and expert advice for performance optimization and tuning. You can't get application performance without database performance. Give everyone on the team a comprehensive view of how every aspect of the system affects performance across SQL database operations, host server and OS, virtualization resources and storage I/O. Quickly find bottlenecks and troubleshoot complex problems.
"Dice has been around for the last 20 years. We have been helping tech professionals find new jobs and career opportunities," explained Manish Dixit, VP of Product and Engineering at Dice, in this SYS-CON.tv interview at 19th Cloud Expo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
An IoT product’s log files speak volumes about what’s happening with your products in the field, pinpointing current and potential issues, and enabling you to predict failures and save millions of dollars in inventory. But until recently, no one knew how to listen. In his session at @ThingsExpo, Dan Gettens, Chief Research Officer at OnProcess, discussed recent research by Massachusetts Institute of Technology and OnProcess Technology, where MIT created a new, breakthrough analytics model for ...
"At ROHA we develop an app called Catcha. It was developed after we spent a year meeting with, talking to, interacting with senior citizens watching them use their smartphones and talking to them about how they use their smartphones so we could get to know their smartphone behavior," explained Dave Woods, Chief Innovation Officer at ROHA, in this SYS-CON.tv interview at 19th Cloud Expo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
"Venafi has a platform that allows you to manage, centralize and automate the complete life cycle of keys and certificates within the organization," explained Gina Osmond, Sr. Field Marketing Manager at Venafi, in this SYS-CON.tv interview at DevOps at 19th Cloud Expo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
Effectively SMBs and government programs must address compounded regulatory compliance requirements. The most recent are Controlled Unclassified Information and the EU's GDPR have Board Level implications. Managing sensitive data protection will likely result in acquisition criteria, demonstration requests and new requirements. Developers, as part of the pre-planning process and the associated supply chain, could benefit from updating their code libraries and design by incorporating changes. In...
"ReadyTalk is an audio and web video conferencing provider. We've really come to embrace WebRTC as the platform for our future of technology," explained Dan Cunningham, CTO of ReadyTalk, in this SYS-CON.tv interview at WebRTC Summit at 19th Cloud Expo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
Whether your IoT service is connecting cars, homes, appliances, wearable, cameras or other devices, one question hangs in the balance – how do you actually make money from this service? The ability to turn your IoT service into profit requires the ability to create a monetization strategy that is flexible, scalable and working for you in real-time. It must be a transparent, smoothly implemented strategy that all stakeholders – from customers to the board – will be able to understand and comprehe...
Keeping pace with advancements in software delivery processes and tooling is taxing even for the most proficient organizations. Point tools, platforms, open source and the increasing adoption of private and public cloud services requires strong engineering rigor – all in the face of developer demands to use the tools of choice. As Agile has settled in as a mainstream practice, now DevOps has emerged as the next wave to improve software delivery speed and output. To make DevOps work, organization...
"Qosmos has launched L7Viewer, a network traffic analysis tool, so it analyzes all the traffic between the virtual machine and the data center and the virtual machine and the external world," stated Sebastien Synold, Product Line Manager at Qosmos, in this SYS-CON.tv interview at 19th Cloud Expo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.