Cloud Computing and Reliability

The Cloud offers some enticing advantages with respect to reliability

Eric Novikoff's Blog

IT managers and pundits speak of the reliability of a system in "nines." Two nines is the same as 99%, which comes to (100%-99%)*365 or 3.65 days of downtime per year, which is typical for non-redundant hardware if you include the time to reload the operating system and restore backups (if you have them) after a failure. Three nines is about 8.8 hours of downtime per year, four nines is about 53 minutes, and the holy grail of five nines is just over 5 minutes.
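The arithmetic behind the "nines" can be checked with a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes a 365-day year, as the calculation above does.

```python
# Yearly downtime implied by an availability of N nines.
# Assumption: a 365-day year, matching the (100%-99%)*365 calculation in the text.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(nines: int) -> float:
    """Return the yearly downtime, in minutes, for `nines` nines of uptime."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** (-nines)  # 2 nines -> 0.01, 3 nines -> 0.001, ...
    return unavailability * MINUTES_PER_YEAR


if __name__ == "__main__":
    for n in range(2, 6):
        minutes = downtime_minutes_per_year(n)
        print(f"{n} nines: {minutes:.2f} minutes/year ({minutes / 60:.2f} hours)")
```

Two nines works out to 5,256 minutes (3.65 days), three nines to about 8.76 hours, four nines to about 52.6 minutes, and five nines to about 5.3 minutes per year.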

From a user's point of view, downtime is downtime, but for a provider/vendor/web site manager, downtime is divided into planned and unplanned. Cloud computing can offer some benefits for planned downtime, but where it can have the largest effect on a business is in reducing unplanned downtime.

Planned downtime is usually the result of having to do some sort of software maintenance or release process, which is usually outside the domain of the cloud vendor, unless that vendor also offers IT operations services. Other sources of planned downtime are upgrades or scheduled equipment repairs. Most cloud vendors have some planned downtime, but because their business is based on providing high uptime, scheduled downtimes are kept to a minimum.

Unplanned downtime is where cloud vendors have the most to offer, and also the most to lose. Recent large outages at Amazon and Google have shown that even the largest cloud vendors can still have glitches that take considerable time to repair and give potential cloud customers a scare (perhaps because they didn't take some planned downtime?). On the other hand, cloud vendors have the experienced staff and proven processes that should produce overall hardware and network reliability that meets or exceeds that of the average corporate data center, and far exceeds anything you can achieve with colocated or self-managed servers.

However, despite claims of reliability, few cloud vendors have tight SLAs (service level agreements) that promise controlled downtime or offer rebates for excess downtime. Amazon goes in the opposite direction and doesn't offer any uptime guarantees, even cautioning users that their instance (or server) can disappear at any time and that they should plan accordingly. AppLogic-based clouds, provided by companies such as ENKI, are capable of offering better guarantees of uptime because of their inherent self-healing capabilities, which can enable 3-4 nines of uptime. (The exact number depends on how the AppLogic system is set up and administered, which affects the time needed for the system to heal itself.) However, any cloud computing system, even one based on AppLogic or similar technologies, can experience unplanned downtime for a variety of reasons, including the common culprit of human error. While I believe it is possible to produce a cloud computing service that exceeds four nines of uptime, the costs would be so high that few would buy it when they compared the price to the average cloud offering.


When you're purchasing cloud computing, it makes sense to look at the SLA of the vendor as well as the reliability of the underlying technology. But if your needs for uptime exceed what the vendors and their technology can offer, there are time-honored techniques for improving it, most of which involve doubling the number of computing nodes in your application. There's an old adage that each additional "9" of uptime you get doubles your cost, and that's because you need backup systems that are in place to take over if the primaries fail. This involves creating a system architecture for your application that allows for either active/passive failover (meaning that the backup nodes are running but not doing anything) or active/active failover (meaning that the backup nodes are normally providing application computing capability).
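To see why doubling the nodes buys extra nines, here is a rough back-of-the-envelope sketch. It rests on two idealized assumptions that are not claims about any particular vendor: node failures are independent of each other, and failover is instantaneous and always works. The function names are hypothetical. Under those assumptions, a service is down only when every redundant node is down at once.

```python
import math


def combined_availability(node_availability: float, nodes: int) -> float:
    """Availability of `nodes` redundant nodes where any one can carry the load.

    Assumes independent failures and perfect, instantaneous failover.
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    # The service is unavailable only if all nodes are unavailable together.
    return 1.0 - (1.0 - node_availability) ** nodes


def nines(availability: float) -> float:
    """Express an availability below 1.0 as a (possibly fractional) count of nines."""
    return -math.log10(1.0 - availability)


if __name__ == "__main__":
    single = 0.99  # one two-nines node
    pair = combined_availability(single, 2)
    print(f"one node:  {single:.4%} ({nines(single):.1f} nines)")
    print(f"two nodes: {pair:.4%} ({nines(pair):.1f} nines)")
```

In this idealized model a pair of 99% nodes reaches 99.99%. Real systems fall short of that because failures are often correlated (shared network, shared software bugs, human error) and failover itself takes time, which is one reason the testing discussed below matters.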

These solutions can be implemented in any cloud technology but they always require extra design and configuration effort for your application, and they should be tested rigorously to make sure they will work when the chips are down. Failover solutions are generally less expensive to implement in the Cloud because of the on-demand or pay-as-you go nature of cloud services, which means that you can easily size the backup server nodes to meet your needs and save on computing resources.
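The difference between the two failover styles can be sketched in a few lines. This is a simplified illustration, not how any particular cloud platform implements failover: the node names are placeholders, and the health check is passed in as a function so that the selection logic can be tested without real servers.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical type: a function that reports whether a named node is healthy.
HealthCheck = Callable[[str], bool]


def choose_active(nodes: Sequence[str], is_healthy: HealthCheck) -> Optional[str]:
    """Active/passive: send all traffic to the first healthy node in priority order.

    Returns None when no node is healthy (a full outage).
    """
    for node in nodes:
        if is_healthy(node):
            return node
    return None


def healthy_pool(nodes: Sequence[str], is_healthy: HealthCheck) -> List[str]:
    """Active/active: spread traffic across every node that is currently healthy."""
    return [node for node in nodes if is_healthy(node)]


if __name__ == "__main__":
    # Simulated health status; a real check would probe the node over the network.
    status = {"primary": False, "backup": True}
    check = lambda node: status.get(node, False)
    print(choose_active(["primary", "backup"], check))  # the backup takes over
    print(healthy_pool(["primary", "backup"], check))
```

Because the health check is injected, a test can simulate the primary failing and confirm that the backup is selected, which is a small-scale version of the rigorous failover testing recommended above.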

An important component of reliability is a good backup strategy. With cloud computing systems like AppLogic offering highly reliable storage as part of the package, many customers are tempted to skip backup. But data loss and the resulting unplanned downtime can result not just from failures in the cloud platform, but also from software bugs, human error, or malfeasance such as hacking. If you don't have a backup, you'll be down a long time - and this applies equally to cloud and non-cloud solutions. The advantage of cloud solutions is that there is usually an inexpensive and large storage facility coupled with the cloud computing offering, which gives you a convenient place to store your backups.

For the truly fanatical, backing up your data from one cloud vendor to another provides that extra measure of security. It pays to think through your backup strategy because most of today's backup software packages or remote backup services were designed for physical servers and not virtual environments having many virtual servers such as you might find in the cloud. This can mean very high software costs for doing backup if your backup software charges on a "per server" basis and your application is spread across many instances. If your cloud vendor has a backup offering, usually they have found a way to make backup affordable even if your application consists of many compute instances.

Another aspect of reliability that often escapes cloud computing customers new to the world of computing services is monitoring. It's very hard to react to unplanned downtime if you don't know your system is down. It's also hard to avoid unplanned downtime if you don't know you're about to run out of disk space or memory, or that your application is complaining about data corruption. A remote monitoring service can scan your servers in the cloud on a regular basis for faults and application problems, or even measure the performance of your application (like how long it takes to buy a widget in your web store), and report to you if anything is out of the ordinary. I say "service" because if you were to install your own monitoring server into your cloud and the cloud went down, so would your monitoring! At ENKI, we solve this problem by having our monitoring service hosted in a separate data center and under a different software environment than our primary cloud hosting service.
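As a rough illustration of the kinds of checks a monitoring service runs, here is a minimal Python sketch. It is not ENKI's monitoring system: the function names and thresholds are assumptions chosen for the example, and it uses only the standard library (shutil.disk_usage and time.perf_counter). As noted above, a real monitor should run somewhere other than the cloud it is watching.

```python
import shutil
import time
from typing import Callable, List


def low_disk(total_bytes: int, free_bytes: int, min_free_fraction: float = 0.10) -> bool:
    """Return True when free space has dropped below the given fraction of the disk."""
    return free_bytes < total_bytes * min_free_fraction


def too_slow(action: Callable[[], object], max_seconds: float) -> bool:
    """Run `action` once and return True if it took longer than `max_seconds`."""
    start = time.perf_counter()
    action()
    return time.perf_counter() - start > max_seconds


def run_checks(path: str = "/", min_free_fraction: float = 0.10) -> List[str]:
    """Return a list of human-readable warnings; an empty list means all is well."""
    warnings = []
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (bytes)
    if low_disk(usage.total, usage.free, min_free_fraction):
        warnings.append(f"low disk space on {path}: {usage.free} bytes free")
    return warnings


if __name__ == "__main__":
    for warning in run_checks("."):
        print("WARNING:", warning)
```

The too_slow helper stands in for a performance probe such as timing a test purchase in a web store; the action passed to it would be whatever transaction you want to measure.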

The last aspect of reliability is security. However, that would require another entire article to cover, since security in the cloud is a complex and relatively new topic.

To sum up, the Cloud offers some enticing advantages with respect to reliability, perhaps the largest of which is that you can give your data center operations responsibility to someone who theoretically can do a much better job at a lower cost than you can. However, to get very good reliability, you must still apply traditional approaches of redundancy and observability that have been used in physical data centers for decades - or, you have to find a cloud computing services provider that can implement them for you.

More Stories By Eric Novikoff

Eric Novikoff is COO of ENKI, a cloud services vendor. He has over 20 years of experience in the electronics and software industries, over a range of positions from integrated circuit designer to software/hardware project manager, to Director of Development at an Internet Software as a Service startup, Netsuite.com. His technical, project, and financial management skills have been honed in multiple positions at Hewlett-Packard and Agilent Technologies on a variety of product lines, including managing the development and roll-out of a worldwide CRM and sales automation application for Agilent's $350 million Automatic Test Equipment business. Novikoff also has a strong interest in SME (Small/Medium Size Enterprise) management, process development, and operations as a consequence of working at a web-based ERP service startup serving SMEs, and through his small-business ERP consulting work.

Most Recent Comments
faseidl 09/10/08 11:45:52 AM EDT

Despite what many pundits have to say, reliability issues will not be the downfall of cloud computing. Using cloud computing does not mean neglecting to architect solutions that meet their business requirements, including reliability requirements.

I wrote more about this idea here:

Cloud Computing and Reliability
http://faseidl.com/public/item/212584
