IT Monitoring Clickbait | @CloudExpo #APM #Cloud

Here are a few common alerting problems, along with the reasons they often crop up and how to solve them

It is a sad but very real truth that many, dare I say most, IT professionals consider alerts to be the bane of their existence. After all, they're annoying, noisy, mostly useless and frequently false. Thus, those of us who specialize in IT monitoring are likely well acquainted with the sinking feeling that comes from discovering that an alert we so painstakingly crafted is being ignored by the team that receives it.

In that moment of professional heartbreak, you may have considered changing those alerts to make them more eye-catching, more interesting and more urgent. To achieve this, you might have even considered choosing from a menu of possible alert messages. For example:

  • Snarky: Hey server team! Do you even read these alerts anymore?!?
  • Hyperbolic: DANGER, WILL ROBINSON! Router will EXPLODE in 5 minutes!
  • Sympathetic (or just pathetic): Hey, I'm the IIS server and it just got really dark and cold in here. Can someone come turn the lights back on? I'm afraid of the dark.

Or you may have considered going the clickbait route. For example:

  • This server's response time dropped below 75 percent. You won't believe what happened next!
  • We showed these sysadmins the cluster failure at 2:15 a.m. Their reactions were priceless.
  • You swore you would never restart this service. What happened at 2:15 a.m. will change your mind forever.
  • Three naughty long-running queries you never hear about.
  • Hot, Hotter, Hottest! This wireless heat map reveals the Wi-Fi dead zones your access points are trying to hide!
  • Watch what happens when this VM ends up next to a noisy neighbor. The results will shock you!

While all of the above approaches are interesting to say the least, they, of course, miss the larger point: it is deceptively difficult to craft an alert that is meaningful, informative and actionable. To combat this issue and ensure teams are poring over your alerts in the future without needing tantrums, gimmicks or bribery, here are a few common alerting problems, along with the reasons they often crop up and how to solve them.

Problem: Multiple alerts (and tickets) for the same issue, every few minutes

Reason
This issue is called "sawtoothing." It describes a situation in which a particular incident or condition happens, then resolves, then happens again, and so on, with your monitoring system creating a new alert each time.

Solution
To solve this, first understand that some sawtoothing is an indication of a real problem that needs to be fixed, such as a device that is repeatedly rebooting. But usually this happens because a device is "riding the edge" of a trigger threshold; for example, a CPU alert is set to trigger at 90 percent, and a device is hovering between 88 and 92 percent.

There are a few common approaches to solving the issue:

  • Set a time-based delay in the alert trigger so that the device has to be over a certain percentage CPU for more than a pre-set number of minutes. Now, the alert will only find devices that are consistently and continuously over the limit.
  • Use the reset option built into any good monitoring solution and set it lower than the trigger value. For example, set the alert to trigger when CPU is over 90 percent for 10 minutes, but only reset it when CPU is under 80 percent for 20 minutes. The gap between the two values means the alert clears only once the device has actually settled down, not the moment it dips back under the trigger.
  • Use the ticket system API to create two-way communication between the monitoring solution and the ticket system that ensures a new ticket cannot be opened if there is already an existing ticket for a specific problem on that device.
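
The first two approaches (a time delay on the trigger and a lower reset value) can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the class name, the one-sample-per-minute polling interval and the event strings are hypothetical, and the 90/10 and 80/20 values are simply the example numbers from above, not any particular product's defaults.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class HysteresisAlert:
    """Trigger after a sustained breach; reset only after a sustained recovery."""
    trigger_pct: float = 90.0   # example trigger value from the text
    trigger_minutes: int = 10   # must stay over trigger_pct this long
    reset_pct: float = 80.0     # example reset value, lower than the trigger
    reset_minutes: int = 20     # must stay under reset_pct this long
    active: bool = False
    over_count: int = 0         # consecutive samples above trigger_pct
    under_count: int = 0        # consecutive samples below reset_pct

    def observe(self, cpu_pct: float) -> Optional[str]:
        """Feed one sample per minute (assumed); return "TRIGGER", "RESET" or None."""
        self.over_count = self.over_count + 1 if cpu_pct > self.trigger_pct else 0
        self.under_count = self.under_count + 1 if cpu_pct < self.reset_pct else 0
        if not self.active and self.over_count >= self.trigger_minutes:
            self.active = True
            return "TRIGGER"
        if self.active and self.under_count >= self.reset_minutes:
            self.active = False
            return "RESET"
        return None


def events(samples: List[float]) -> List[Tuple[int, str]]:
    """Replay per-minute samples and return (minute, event) pairs."""
    alert = HysteresisAlert()
    out = []
    for minute, sample in enumerate(samples, start=1):
        event = alert.observe(sample)
        if event is not None:
            out.append((minute, event))
    return out


# A device "riding the edge" between 88 and 92 percent never alerts.
sawtooth = events([88, 92] * 30)
# A sustained breach alerts once and clears only after a real recovery.
sustained = events([95] * 10 + [85] * 30 + [70] * 20)
```

With these settings the sawtoothing device produces no events at all, while the second device produces a single trigger and, much later, a single reset. Hovering at 85 percent does not clear the alert because it never drops below the reset value.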

Problem: A key device goes down - for example, the edge router at a remote site - and the team gets clobbered with alerts for every other device at the site

Reason
When a monitoring system can no longer reach a device, it often reports that device as "down." However, that doesn't necessarily mean the device itself has failed; a device upstream could be down, in which case nothing behind it can be monitored until the upstream device comes back up.

Solution
Any worthwhile monitoring solution will have an option to suppress alerts based on "upstream" or "parent-child" connections. Make sure this option is enabled and that the monitoring solution understands the device dependencies in your environment.
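
To make the idea concrete, here is a minimal sketch of parent-child suppression. It assumes the dependencies are available as a simple child-to-parent mapping; the function name and the device names are hypothetical, and a real monitoring solution will have its own way of storing and evaluating this.

```python
from typing import Dict, Optional, Set


def root_cause_alerts(parents: Dict[str, Optional[str]], unreachable: Set[str]) -> Set[str]:
    """Keep only unreachable devices that have no unreachable device upstream."""

    def has_unreachable_ancestor(device: str) -> bool:
        seen = set()  # guards against accidental loops in the dependency map
        parent = parents.get(device)
        while parent is not None and parent not in seen:
            if parent in unreachable:
                return True
            seen.add(parent)
            parent = parents.get(parent)
        return False

    return {d for d in unreachable if not has_unreachable_ancestor(d)}


# Hypothetical remote site: everything sits behind the edge router.
site = {
    "edge-router": None,
    "switch-1": "edge-router",
    "server-a": "switch-1",
    "server-b": "switch-1",
    "printer": "switch-1",
}

# The router fails, so every device at the site becomes unreachable.
router_outage = root_cause_alerts(site, set(site))
# Only one server fails; its upstream path is healthy.
single_failure = root_cause_alerts(site, {"server-a"})
```

In the first case only the edge router produces an alert; in the second, the server does, because nothing above it is unreachable.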

Problem: You have to set up multitudes of alerts because each machine is slightly different

Reason
You may find yourself having to set up the same general alert (CPU utilization, disk full, application down, etc.) for an ungodly number of devices because each machine requires a slightly different threshold, timing, recipient or other element.

Solution
We monitoring engineers find ourselves in this situation when we (or the tool we're using) don't leverage custom fields. Any sophisticated monitoring solution should allow for custom properties such as "CPU_Critical_Value." The property is set on a per-device basis, so that an alert goes from looking like this, "Alert when CPU % Utilization is >= 90%," to this, "Alert when CPU % Utilization is >= CPU_Critical_Value."

This solution allows each system to have its own customized threshold, while a single alert handles them all. The same technique can be used for alert recipients. Instead of maintaining separate but otherwise identical CPU alerts for the server, network and storage teams, give each device a custom field called "Owner_Group_Email" that holds an email group name. Then create a single alert that is sent to whatever address is in that field.
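
A rough sketch of how one alert definition can serve many devices follows. The property names "CPU_Critical_Value" and "Owner_Group_Email" come from the text above; everything else (the Device class, the fallback defaults, the device names and email addresses) is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical fallbacks for devices that have no custom value set.
DEFAULTS: Dict[str, object] = {
    "CPU_Critical_Value": 90.0,
    "Owner_Group_Email": "monitoring-team@example.com",
}


@dataclass
class Device:
    name: str
    cpu_pct: float
    custom: Dict[str, object] = field(default_factory=dict)

    def prop(self, key: str):
        """Per-device custom property, falling back to the default."""
        return self.custom.get(key, DEFAULTS[key])


def evaluate_cpu_alert(devices: List[Device]) -> List[Tuple[str, str]]:
    """One alert definition: CPU % Utilization >= CPU_Critical_Value."""
    notifications = []
    for device in devices:
        threshold = device.prop("CPU_Critical_Value")
        if device.cpu_pct >= threshold:
            message = f"{device.name}: CPU at {device.cpu_pct:.0f}% (threshold {threshold:.0f}%)"
            notifications.append((device.prop("Owner_Group_Email"), message))
    return notifications


devices = [
    Device("web01", 93),  # no custom values, so the defaults apply
    Device("batch01", 93, {"CPU_Critical_Value": 98.0,
                           "Owner_Group_Email": "storage-team@example.com"}),
    Device("db01", 85, {"CPU_Critical_Value": 80.0,
                        "Owner_Group_Email": "dba-team@example.com"}),
]
notifications = evaluate_cpu_alert(devices)
```

Here web01 and batch01 report the same CPU figure, but only web01 alerts, and db01 alerts at a lower reading than either. Each message goes to the group named on the device itself.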

Problem: Certain devices trigger at certain times because the work they're doing causes them to "run hot"

Reason
During the normal course of business, some systems have periods of high utilization that are completely normal, but also completely above the regular run rate. This could be due to month-end report processing; code compile sequences overnight or on the weekend; or any other cyclical, predictable operation.

The problem here is that the normal threshold for the condition in question is fine most of the time, but the "high usage" value is above it, so an alert triggers. If you instead raise the threshold for that system to the "high usage" level, you will miss important issues that occur below that higher threshold during normal periods.

Solution
Rather than triggering an alert on a set value - even if it is set per device as described earlier - you can use the monitoring data to your advantage. Remember, monitoring is not an alert or page, nor is it a blinky dot on a screen. Monitoring is nothing more (or less) than the regular, steady, ongoing collection of a consistent set of metrics from a set of devices. All the rest - alerts, emails, blinky dots and more - is the happy byproduct you enjoy when you do monitoring correctly.

If you've been collecting all that data, why not analyze it to see what "normal" looks like for each device? This is called a "baseline" and it reflects not just an overall average, but also the normal run rate per day and even per hour. If you can derive this "baseline" value, then your alert trigger can go from, "Alert when CPU % utilization is >= <some fixed value>," to, "Alert when CPU % utilization is >= 10% over the baseline for this time period."
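
As a rough illustration, a baseline can be as simple as the average reading for each weekday-and-hour slot. The sketch below makes several assumptions that are not in the text: it uses a plain mean, it buckets by (weekday, hour), it reads "10% over the baseline" as a relative margin (baseline times 1.10) rather than ten percentage points, and the function names and sample data are invented.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def build_baseline(history):
    """Average CPU per (weekday, hour) slot from (timestamp, cpu_pct) samples."""
    buckets = defaultdict(list)
    for ts, cpu_pct in history:
        buckets[(ts.weekday(), ts.hour)].append(cpu_pct)
    return {slot: mean(values) for slot, values in buckets.items()}


def exceeds_baseline(baseline, ts, cpu_pct, margin=0.10):
    """True when cpu_pct is at least `margin` (relative) over this slot's baseline."""
    expected = baseline.get((ts.weekday(), ts.hour))
    if expected is None:
        return False  # no history for this slot; a real system needs a fallback rule
    return cpu_pct >= expected * (1 + margin)


# Hypothetical history: a weekly 2 a.m. batch job runs hot, while the same
# server is quiet at 2 p.m. three days later.
batch_slot = datetime(2017, 11, 4, 2, 0)
quiet_slot = batch_slot + timedelta(days=3, hours=12)
history = []
for week, (hot, quiet) in enumerate([(95, 30), (97, 32), (96, 34)]):
    history.append((batch_slot + timedelta(weeks=week), hot))
    history.append((quiet_slot + timedelta(weeks=week), quiet))

baseline = build_baseline(history)
next_batch = batch_slot + timedelta(weeks=3)
next_quiet = quiet_slot + timedelta(weeks=3)

batch_alert = exceeds_baseline(baseline, next_batch, 97)  # normal for this slot
quiet_alert = exceeds_baseline(baseline, next_quiet, 45)  # unusual for this slot
```

A fixed 90 percent threshold would have paged someone for the routine batch run and stayed silent about the 45 percent reading, which is well above what this server normally does at that hour. The baseline comparison does the opposite.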

IT pros tried these weird monitoring tricks and the results will shock you!
When monitoring engineers implement and use the capabilities of their monitoring solutions to the fullest, the results are liberating for all parties. Alerts become both more specific and less frequent, which gives teams more time to actually get work done. This in turn causes those same teams to trust the alerts more and react to them in a timely fashion, which benefits the entire business. Best of all, everyone experiences the true value that good monitoring brings and starts engaging us monitoring engineers to create alerts and build insight that helps stabilize and improve the environment even more.

"Monitoring Team Saved This Company $$$" isn't some fake headline designed to get clicks. With a little work, it can be the truth for every organization.

For even more alerting insights, check out the latest episode of SolarWinds Lab here - what happens at 22:17 will blow you away!

More Stories By Leon Adato

Leon Adato is a Head Geek and technical evangelist at SolarWinds and is a Cisco® Certified Network Associate (CCNA), MCSE and SolarWinds Certified Professional (he was once a customer, after all). His 25 years of network management experience spans financial, healthcare, food and beverage, and other industries.
