
Managing Cloud Applications

Cloud computing will change the processes and tools that IT organizations currently use

As enterprises evaluate if and how cloud computing fits into their core IT services, they must consider how they will manage cloud services as part of their day-to-day operations. This article examines how operational management of cloud computing differs from traditional methods, and looks at techniques for addressing those differences.

Cloud computing will change the processes and tools that IT organizations currently use. In a traditional datacenter environment, IT organizations have complete control of and visibility into their infrastructure. They install each piece of hardware and therefore have complete configuration control. All components in the network are accessible and can be monitored with the right tools. Most enterprises have invested heavily in complex tools to manage this environment: to identify service-affecting conditions and to analyze performance metrics so they can tune their systems for optimal performance.

For cloud computing services, the enterprise no longer has control of or visibility into the components of the service. Yet if the cloud is to replace a core service, how can the IT organization guarantee equivalent availability and performance service levels? In today's IT environment, problems that must be isolated between an enterprise and its vendor are the most difficult to resolve. Cloud vendors are painting a future in which an enterprise will pick multiple cloud services from a market of such services, which means these problems will become more common and more complex. Enterprises will not deploy cloud services, even if they are less costly and more agile, if those services cannot provide an acceptable level of service. The relationship between cloud vendors and enterprises must therefore evolve. Vendors must not only earn the trust of enterprises, but must provide mechanisms by which that trust can be verified in a transparent manner. One step toward that goal is management tooling that provides the in-depth views customers need and can prove that promised service levels are being met.

Let's look at how such a system might work, from a technical perspective. Most enterprise class management systems include the following basic features:

  • The ability to gather metric information from a variety of components, including:
    - Linux: CPU utilization, load, memory, swap, disk, processes, etc.
    - Windows: CPU, disk, memory, services, WMI metrics, etc.
    - Network: ping, TCP port response, latency, SNMP polls, etc.
    - Application: HTTP, web services, logs, etc.
  • Alert generation when metric thresholds are exceeded.
  • Automated notification of alerts.
  • Performance reports on metrics for system tuning.

The process of metric gathering should be as open as possible, supporting simple scripting options. Many current enterprise tools require specific and deep programming skills to extend monitoring. This limits the tool's usefulness, since most management systems are deployed by system administrators, not software developers. For managing cloud applications this openness is even more important, since interfacing with a specific cloud vendor means writing to its API, which is typically a REST-style interface.
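As an illustration, a check in this scripting model can be a few lines that print a status message and report a severity. This is a minimal Nagios-style sketch; the metric name and thresholds are hypothetical:

```python
def evaluate(metric_name, value, warn, crit):
    """Map a numeric metric onto Nagios-style severities (0=OK, 1=WARNING, 2=CRITICAL)."""
    if value >= crit:
        return 2, f"CRITICAL - {metric_name} at {value}%"
    if value >= warn:
        return 1, f"WARNING - {metric_name} at {value}%"
    return 0, f"OK - {metric_name} at {value}%"

# A wrapper script would print the message and exit with the severity code
status, message = evaluate("memory utilization", 98, warn=80, crit=95)
print(message)  # CRITICAL - memory utilization at 98%
```

Because the contract is just "print a line, return a severity," administrators can extend monitoring with shell, Python, or any scripting language rather than a vendor SDK.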

Enterprise operations typically have additional requirements. Ideally, a tool can integrate with the other tools the enterprise uses. For example, monitoring tools may get their configuration information from a configuration database, and alerts may be fed into a problem ticketing system. The more automation capability a tool has, the better its chance of fitting into current IT operational processes.

How might this system deal with the dynamic configuration of cloud systems? For many existing tools, the provisioning process is a problem. Existing systems management tools are designed around a host name/IP address model, not a virtualized model: the IP/host of each managed system must be defined in advance. Cloud instances, however, are typically defined dynamically. Take the case of Amazon EC2, where the host name and IP address are assigned at instance startup. As shown in Figure 1, to use these tools the instance must first be started, the provisioning parameters (IP/host) extracted from the Amazon API and entered into the management system, and the management system reloaded to implement the change. The exact mechanics vary by management system and cloud vendor, but all rely on a tight dependency between the cloud configuration and the monitoring system. This disconnect can cause a lag between the true cloud configuration and what is monitored, or worse, an incomplete monitoring system.
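A sketch of that provisioning step, assuming the boto3 library for the Amazon API; the config syntax and function names are illustrative, not from any particular management product:

```python
def hosts_from_response(response):
    """Extract (instance ID, public DNS name) pairs for running instances
    from a DescribeInstances-style reply."""
    hosts = []
    for reservation in response.get("Reservations", []):
        for inst in reservation.get("Instances", []):
            if inst.get("State", {}).get("Name") == "running":
                hosts.append((inst["InstanceId"], inst["PublicDnsName"]))
    return hosts

def monitor_config(hosts):
    """Render host definitions in a made-up monitoring config syntax."""
    return "\n".join(f"define host {{ host_name {iid}; address {dns} }}"
                     for iid, dns in hosts)

def refresh_from_ec2():
    """Query EC2 (requires boto3 and AWS credentials) and regenerate the
    host config; the management system would then be reloaded."""
    import boto3
    response = boto3.client("ec2").describe_instances()
    return monitor_config(hosts_from_response(response))
```

The reload step at the end is exactly the lag the article describes: until it runs, the monitoring system's view of the cloud is stale.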

Another issue with existing management tools is limited visibility into the cloud infrastructure's operational data. Each cloud vendor has its own configuration definitions and operational parameter fields; I call this data the cloud vendor's "metadata." In the case of Amazon EC2, the metadata includes the instance ID, Amazon image ID (AMI), security groups, location, public DNS name, and private DNS name. Existing tools are not designed to gather this metadata, yet when IT operations personnel are troubleshooting EC2 problems, it is difficult to understand the full scenario without it. Every cloud vendor has its own metadata, so the problem compounds with each additional vendor.

An alternative to using existing management tools is to rely on the vendor to provide the required visibility. Today, however, most vendor tools and APIs provide limited visibility: most infrastructure providers only show whether an instance is running or not. From the infrastructure provider's perspective this makes sense; they are responsible for the virtual server, not what a user might install on it. Amazon has recognized this issue and has tried to address it with its CloudWatch service, an optional service that lets the user gather additional instance metrics such as CPU utilization, disk read/write operations, and throughput from Amazon's APIs. However, Amazon only exposes the information - it is up to the user to turn the data into alerts or reports. Though there are some entry-level cloud tools that read API information for status, they do not provide the management features listed previously.
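Since CloudWatch only returns raw datapoints, the alerting logic has to live on the user's side. A minimal sketch, assuming boto3; the threshold, period, and instance ID are illustrative:

```python
import datetime

def alert_on_cpu(datapoints, threshold=90.0):
    """Turn raw CloudWatch datapoints into alert messages above a threshold."""
    return [f"ALERT: CPU {dp['Average']:.1f}% at {dp['Timestamp']}"
            for dp in datapoints if dp["Average"] > threshold]

def poll_cloudwatch(instance_id):
    """Fetch an hour of CPUUtilization averages (requires boto3 and AWS
    credentials) and run them through the user-supplied alert logic."""
    import boto3
    now = datetime.datetime.utcnow()
    resp = boto3.client("cloudwatch").get_metric_statistics(
        Namespace="AWS/EC2", MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=now - datetime.timedelta(hours=1), EndTime=now,
        Period=300, Statistics=["Average"])
    return alert_on_cpu(resp["Datapoints"])
```

Note that everything beyond the `get_metric_statistics` call - thresholds, notification, reporting - is user code, which is precisely the gap the article points out.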

Cloud-specific tools are usually not designed for use by enterprise IT operations. Simple web browser-oriented interfaces are fine for monitoring a few development instances, but enterprises may need to monitor hundreds of instances and thousands of metrics, which is beyond the capability of most web applications. For enterprise IT operations that need in-depth monitoring and rich functionality, vendor services alone are inadequate.

The preferred approach is to integrate high-function enterprise management tools with information from vendor APIs. This system is shown in Figure 2.

In this system, standard monitoring scripts can be deployed either under the control of agents on monitored instances, or as active checks from the management server. Supporting open source scripts, such as those from the Nagios plug-in project, will allow in-depth monitoring of many components, including those listed above. However, this basic monitoring information must be augmented with vendor API information. Vendor data may be queried from the agent, the management server, or both. This approach allows the management system to process vendor metadata combined with monitoring data. Views presented to operators and system administrators can then show much richer information.

Dynamic changes in the cloud must be recognized immediately by the system. One way to handle this is to drop the requirement that managed systems be pre-defined: events received from managed systems are processed as long as they are authenticated. This "event-based" model allows new instances to be managed as soon as they start.
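An event-based receiver of this kind might be sketched as follows, using an HMAC signature for authentication; the shared key and registry structure are my own illustration, not from the article:

```python
import hashlib
import hmac

SHARED_KEY = b"example-shared-secret"  # assumed provisioned into each instance image
registry = {}                          # host -> last event payload

def authenticated(payload: bytes, signature: str) -> bool:
    """Verify the event's HMAC-SHA256 signature against the shared key."""
    expected = hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

def handle_event(host: str, payload: bytes, signature: str) -> str:
    """Accept any authenticated event; unknown hosts are auto-registered on
    first contact rather than requiring pre-definition."""
    if not authenticated(payload, signature):
        return "rejected"
    registry[host] = payload.decode()
    return "accepted"
```

Because authentication replaces pre-definition, a freshly started instance is under management from its first signed event.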

Let's look at an example. Suppose we are monitoring a set of application servers running in the Amazon EC2 cloud. A standard script used to get memory utilization from the Linux system would result in the output, "Free memory at 98%. Critical severity."

In a traditional system, this information is associated with the host that it is run on. The management information received after the execution of such a script would look like Table 1.

However, if we combine the cloud metadata (using Amazon EC2), the event information would be as shown in Table 2.

An operator looking at this event information would have more complete information on the source and impact of this memory alert. Greater value can be realized by correlating all event information by vendor metadata. For example, grouping instances by their location (that is, which vendor datacenter the virtual instance is running in) might explain system behavior. Access to the metadata also gives the management system the opportunity to perform higher-level functional checks on the managed cloud application.
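The enrichment and correlation steps might look like the following sketch; all field names and values here are invented for the example:

```python
from collections import defaultdict

# Hypothetical metadata previously gathered from the EC2 API
METADATA = {
    "i-0a1b2c": {"ami": "ami-1234abcd", "security_group": "app-servers",
                 "location": "us-east-1a",
                 "public_dns": "ec2-198-51-100-7.compute-1.amazonaws.com"},
}

def enrich(event, metadata=METADATA):
    """Merge vendor metadata into a basic monitoring event."""
    return {**event, **metadata.get(event["host"], {})}

def group_by_location(events, metadata=METADATA):
    """Correlate events by the vendor datacenter recorded in the metadata."""
    groups = defaultdict(list)
    for ev in events:
        groups[enrich(ev, metadata).get("location", "unknown")].append(ev)
    return dict(groups)
```

A burst of alerts that all land in one location bucket immediately suggests a datacenter-level cause rather than per-instance faults.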

The following application scenario is based on a real user. The application consists of many server instances running in the Amazon EC2 cloud, organized into groups of servers, each group performing a different role; there may be up to 50 groups running in the application. The group role is determined by a parameter passed in the "User Data" field of the Amazon EC2 metadata.

The management system accesses the metadata and customizes its active checks based on the role it contains. Because dynamic changes are handled, the management system adapts as the number of instances within each role group changes. Existing management tools were already in place, so the cloud management system gathered some of its metric information from an existing tool rather than requiring re-instrumentation, which made the transition to managing the cloud easier for operations.
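The role-based check selection could be sketched like this; the user-data format, role names, and check lists are assumptions for illustration, not the real user's configuration:

```python
# Hypothetical mapping from group role to the active checks for that role
ROLE_CHECKS = {
    "web": ["http_response", "cpu", "memory"],
    "db":  ["tcp_3306", "disk", "memory"],
}

def checks_for(user_data: str):
    """Pick active checks from the role embedded in the instance's user data.
    Assumes a 'key=value;key=value' user-data format for this example."""
    fields = dict(kv.split("=", 1) for kv in user_data.split(";") if "=" in kv)
    return ROLE_CHECKS.get(fields.get("role"), ["ping"])
```

An instance that advertises no role still gets a baseline reachability check, so nothing starts up unmonitored.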

Since the application consists of many groups of instances, the status of a single instance is not as important as the status of the role's group. One type of higher-level check that was applied was to compute average (or, optionally, maximum or minimum) values across the group. Alerts can then be generated on group metrics rather than instance metrics, and the operator can drill down from the group to the instance values.
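A sketch of such a group-level check, with illustrative instance names and thresholds:

```python
def group_status(values, threshold, agg=lambda v: sum(v) / len(v)):
    """Evaluate a role group's metric: aggregate the per-instance values
    (average by default; pass max or min to change the aggregation) and
    alert on the group value. Per-instance values are returned for drill-down."""
    group_value = agg(list(values.values()))
    state = "CRITICAL" if group_value >= threshold else "OK"
    return state, group_value, values

state, avg, detail = group_status(
    {"i-01": 72.0, "i-02": 95.0, "i-03": 88.0}, threshold=90.0)
print(state, avg)  # OK 85.0 - one hot instance does not trip the group alert
```

Switching the aggregation to `max` would instead alert on the worst instance in the group, which is the choice between "is the service degraded?" and "is any server sick?".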

This example uses an infrastructure cloud provider's services. An equivalent scenario applies to platform service providers; the difference is that the monitoring metrics would exercise the platform provider's APIs rather than traditional measurements. Again, this implies a tight integration between the management system and the cloud vendor's services.

The management system in this scenario can be used as a basis for establishing vendor trust because of the following advantages:

  • A tight interface with the cloud that enables vendor configuration data to be integrated into the management system
  • Open monitoring capability enables deep monitoring beyond vendor APIs
  • Any monitoring information available from the vendor API can be incorporated
  • Any monitoring information from existing management tools can be incorporated
  • The ability to create higher-level management metrics based on lower-level measurements

These advantages also enable the system to serve as a basis for trust-enabling applications. For example, a billing report can be generated from the telemetry information gathered by the system. Built from the gathered metric data and the vendor's stated billing policy, this report would be independent of the vendor and could serve as an independent audit of the vendor's bill. Another example would be to gather the security group information from the vendor's configuration and perform the TCP port checks those groups define, verifying that the security policy stated by the vendor is actually enforced for this user's cloud configuration.
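The security-verification idea can be sketched with plain TCP connect probes; in a real check the allowed ports would be read from the vendor's security-group API rather than a hand-built set:

```python
import socket

def port_open(host, port, timeout=2.0):
    """Probe a TCP port with a plain connect attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def verify_policy(host, allowed_ports, probe_ports):
    """Return the ports whose observed state differs from the declared policy
    (open but not allowed, or allowed but unreachable)."""
    return [p for p in probe_ports
            if port_open(host, p) != (p in allowed_ports)]
```

An empty result means the observed port states match the policy the vendor declared; any port in the list is a discrepancy worth investigating.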

If we project this scenario into a future where an enterprise uses multiple cloud vendors, the management system would look like the one shown in Figure 3.

This system would be a consolidation point for all cloud services, and would translate the heterogeneous cloud services into a common view, simplifying IT operations.

For cloud computing to fulfill its promise of enabling enterprise IT organizations to improve the services they provide to their users, traditional IT operational processes and tools must adapt to new ways of interacting with external vendor services. I've examined some of the issues enterprises are encountering and have offered solutions. It is clear that these issues are a barrier to adoption of cloud services, and that the needs of enterprise operations must be addressed by the cloud community.

More Stories By Peter Loh

Peter Loh is CEO of Tap In Systems. He has been developing, marketing and selling network and systems management software for over 25 years. He has implemented IT management systems for large Fortune 500 companies, such as Bank of America, ATT, Visa and American Express. He spent 10 years in technical and marketing positions at IBM, and has since been involved with a number of start up companies. Most recently he was in charge of engineering for GroundWork Open Source, a company leveraging open source software to implement IT management solutions. Peter holds a BS in Electrical Engineering.

