IT Monitoring Clickbait | @CloudExpo #APM #Cloud

Here are a few common alerting problems, along with the reasons they often crop up and how to solve them

It is a sad but very real truth that many, dare I say most, IT professionals consider alerts to be the bane of their existence. After all, they're annoying, noisy, mostly useless and frequently false. Thus, those of us who specialize in IT monitoring are likely well acquainted with that familiar sinking feeling brought on by the discovery that the alert we so painstakingly crafted is being ignored by the team that receives it.

In that moment of professional heartbreak, you may have considered changing those alerts to make them more eye-catching, more interesting and more urgent. To achieve this, you might have even considered choosing from a menu of possible alert messages. For example:

  • Snarky: Hey server team! Do you even read these alerts anymore?!?
  • Hyperbolic: DANGER, WILL ROBINSON! Router will EXPLODE in 5 minutes!
  • Sympathetic (or just pathetic): Hey, I'm the IIS server and it just got really dark and cold in here. Can someone come turn the lights back on? I'm afraid of the dark.

Or you may have considered going the clickbait route. For example:

  • This server's response time dropped below 75 percent. You won't believe what happened next!
  • We showed these sysadmins the cluster failure at 2:15 a.m. Their reactions were priceless.
  • You swore you would never restart this service. What happened at 2:15 a.m. will change your mind forever.
  • Three naughty long-running queries you never hear about.
  • Hot, Hotter, Hottest! This wireless heat map reveals the Wi-Fi dead zones your access points are trying to hide!
  • Watch what happens when this VM ends up next to a noisy neighbor. The results will shock you!

While all of the above approaches are interesting to say the least, they, of course, miss the larger point: it is deceptively difficult to craft an alert that is meaningful, informative and actionable. To combat this issue and ensure teams are poring over your alerts in the future without needing tantrums, gimmicks or bribery, here are a few common alerting problems, along with the reasons they often crop up and how to solve them.

Problem: Multiple alerts (and tickets) for the same issue, every few minutes

Reason
This issue is called "sawtoothing." It describes a situation in which a particular incident or condition happens, then resolves, then happens again, and so on, and your monitoring system creates a new alert each time.

Solution
To solve this, first understand that some sawtoothing is an indication of a real problem that needs to be fixed, such as a device that is repeatedly rebooting. But usually this happens because a device is "riding the edge" of a trigger threshold; for example, a CPU alert is set to trigger at 90 percent, and a device is hovering between 88 and 92 percent.

There are a few common approaches to solving the issue:

  • Set a time-based delay in the alert trigger so that the device has to be over a certain percentage CPU for more than a pre-set number of minutes. Now, the alert will only find devices that are consistently and continuously over the limit.
  • Use the reset option built into any good monitoring solution and set it lower than the trigger value. For example, set the trigger when CPU is over 90 percent for 10 minutes, but only reset the alert when it's under 80 percent for 20 minutes. This reset option establishes a certain standard of stability within the environment.
  • Use the ticket system API to create two-way communication between the monitoring solution and the ticket system that ensures a new ticket cannot be opened if there is already an existing ticket for a specific problem on that device.
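The first two approaches can be combined into a small state machine. What follows is only a minimal sketch, not how any particular monitoring product implements it: the class name CpuAlert and its fields are hypothetical, the 90/10 and 80/20 values are the ones from the example above, and it assumes one CPU sample arrives per minute.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CpuAlert:
    """Trigger/reset state for one device (hypothetical names, example values)."""
    trigger_pct: float = 90.0   # fire when CPU stays above this...
    trigger_minutes: int = 10   # ...for this many consecutive samples
    reset_pct: float = 80.0     # clear when CPU stays below this...
    reset_minutes: int = 20     # ...for this many consecutive samples
    active: bool = False
    _over: int = 0              # consecutive samples above trigger_pct
    _under: int = 0             # consecutive samples below reset_pct

    def observe(self, cpu_pct: float) -> Optional[str]:
        """Feed one per-minute sample; return "TRIGGER", "RESET" or None."""
        self._over = self._over + 1 if cpu_pct > self.trigger_pct else 0
        self._under = self._under + 1 if cpu_pct < self.reset_pct else 0
        if not self.active and self._over >= self.trigger_minutes:
            self.active = True
            return "TRIGGER"
        if self.active and self._under >= self.reset_minutes:
            self.active = False
            return "RESET"
        return None
```

A device bouncing between 88 and 92 percent never accumulates ten consecutive samples over the trigger, so it stays quiet. Once the alert does fire, it stays open until the device has been comfortably below the reset value for the full reset period, which means one alert instead of dozens.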

Problem: A key device goes down - for example, the edge router at a remote site - and the team gets clobbered with alerts for every other device at the site

Reason
When a monitoring system cannot reach a particular device, it often reports that device as "down." However, that doesn't necessarily mean the device itself has failed; a device upstream could be down, in which case nothing behind it can be monitored until it comes back up.

Solution
Any worthwhile monitoring solution will have an option to suppress alerts based on "upstream" or "parent-child" connections. Make sure this option is enabled and that the monitoring solution understands the device dependencies in your environment.
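To make the idea concrete, here is a rough sketch of parent-child suppression. The device names and the PARENT map are invented for illustration, and a real monitoring solution will usually discover or store these dependencies for you. The logic simply keeps the "down" devices that have no "down" device anywhere upstream of them.

```python
# Hypothetical dependency map: each device points at its upstream parent.
# The edge router has no entry because nothing sits upstream of it.
PARENT = {
    "site-b-switch": "site-b-router",
    "site-b-server1": "site-b-switch",
    "site-b-server2": "site-b-switch",
}


def has_down_ancestor(device, down, parent):
    """Return True if any device upstream of `device` is also reported down."""
    seen = set()  # guards against accidental loops in the dependency map
    current = parent.get(device)
    while current is not None and current not in seen:
        if current in down:
            return True
        seen.add(current)
        current = parent.get(current)
    return False


def alerts_to_send(down, parent):
    """Suppress alerts for devices that are only unreachable via a down parent."""
    return {d for d in down if not has_down_ancestor(d, down, parent)}


if __name__ == "__main__":
    everything = {"site-b-router", "site-b-switch",
                  "site-b-server1", "site-b-server2"}
    print(alerts_to_send(everything, PARENT))
```

When the edge router fails and every device at the site appears down, only the router generates an alert. A server that fails on its own, with its parents still up, alerts as usual.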

Problem: You have to set up multitudes of alerts because each machine is slightly different

Reason
You may find yourself having to set up the same general alert (CPU utilization, disk full, application down, etc.) for an ungodly number of devices because each machine requires a slightly different threshold, timing, recipient or other element.

Solution
We monitoring engineers find ourselves in this situation when we (or the tool we're using) don't leverage custom fields. In other words, any sophisticated monitoring solution should allow for custom properties for things like "CPU_Critical_Value." This is set on a per-device basis, so that an alert goes from looking like this, "Alert when CPU % Utilization is >= 90%," to this, "Alert when CPU % Utilization is >= CPU_Critical_Value."

This solution allows each system to have its own customized threshold, but a single alert can handle it all. The same technique can be used for alert recipients. Instead of having a separate but identical CPU alert for each of the server, network and storage teams, each device can have a custom field called "Owner_Group_Email" that holds an email group name. Then you create a single alert that is sent to whatever address is in that field.
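As a rough illustration, the sketch below shows one alert definition driven entirely by per-device custom fields. The field names CPU_Critical_Value and Owner_Group_Email come from the text above; the device names, the email addresses and the fallback default of 90 percent for devices without their own value are assumptions made for the example.

```python
# Assumed fallback for devices that have no CPU_Critical_Value of their own.
DEFAULT_CPU_CRITICAL = 90.0

# Hypothetical inventory: custom properties stored per device.
DEVICES = {
    "web01": {"CPU_Critical_Value": 85.0,
              "Owner_Group_Email": "server-team@example.com"},
    "core-sw1": {"CPU_Critical_Value": 70.0,
                 "Owner_Group_Email": "network-team@example.com"},
    "san01": {"Owner_Group_Email": "storage-team@example.com"},
}


def evaluate_cpu_alert(name, cpu_pct, props):
    """One alert rule for every device: threshold and recipient come from props."""
    threshold = props.get("CPU_Critical_Value", DEFAULT_CPU_CRITICAL)
    if cpu_pct >= threshold:
        return {
            "to": props["Owner_Group_Email"],
            "message": f"{name}: CPU {cpu_pct:.0f}% >= {threshold:.0f}%",
        }
    return None  # below this device's own threshold, nothing to send


if __name__ == "__main__":
    for device, props in DEVICES.items():
        print(device, evaluate_cpu_alert(device, 87.0, props))
```

The same reading of 87 percent alerts the server team for web01 and the network team for core-sw1, and stays silent for san01, all from a single rule.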

Problem: Certain devices trigger at certain times because the work they're doing causes them to "run hot"

Reason
During the normal course of business, some systems have periods of high utilization that are completely normal, but also completely above the regular run rate. This could be due to month-end report processing; code compile sequences overnight or on the weekend; or any other cyclical, predictable operation.

The problem here is that the normal threshold for the condition in question is fine most of the time, but the "high usage" value is above it, so an alert triggers. Yet if you raise the threshold for that system to the "high usage" level, you will miss important issues that occur below that higher value.

Solution
Rather than triggering a threshold on a set value - even if it is set per device as described earlier - you can use the monitoring data to your advantage. Remember, monitoring is not an alert or page, nor is it a blinky dot on a screen. Monitoring is nothing more (or less) than the regular, steady, ongoing collection of a consistent set of metrics from a set of devices. All the rest - alerts, emails, blinky dots and more - is the happy byproduct you enjoy when you do monitoring correctly.

If you've been collecting all that data, why not analyze it to see what "normal" looks like for each device? This is called a "baseline" and it reflects not just an overall average, but also the normal run rate per day and even per hour. If you can derive this "baseline" value, then your alert trigger can go from, "Alert when CPU % utilization is >= <some fixed value>," to, "Alert when CPU % utilization is >= 10% over the baseline for this time period."
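Here is one way a baseline might be derived and used, offered as a sketch rather than a recipe. It assumes the history is available as ((weekday, hour), cpu_pct) pairs, uses a plain average per time slot as the baseline, and reads "10% over the baseline" as a relative margin (baseline times 1.10) rather than ten percentage points. The function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def build_baseline(history):
    """history: iterable of ((weekday, hour), cpu_pct) -> average per time slot."""
    buckets = defaultdict(list)
    for slot, cpu_pct in history:
        buckets[slot].append(cpu_pct)
    return {slot: mean(values) for slot, values in buckets.items()}


def over_baseline(cpu_pct, slot, baseline, margin=0.10):
    """True when cpu_pct is at least `margin` (relative) above the slot's baseline."""
    expected = baseline.get(slot)
    if expected is None:
        return False  # assumption: stay quiet when there is no history for the slot
    return cpu_pct >= expected * (1 + margin)


if __name__ == "__main__":
    # Mondays (weekday 0): heavy batch work at 02:00, light load at 14:00.
    history = [((0, 2), 80), ((0, 2), 90), ((0, 14), 30), ((0, 14), 40)]
    baseline = build_baseline(history)
    print(over_baseline(88, (0, 2), baseline))   # busy, but normal for that hour
    print(over_baseline(60, (0, 14), baseline))  # modest, but unusual for that hour
```

The point of the design is that 88 percent during the overnight batch window is treated as normal, while 60 percent in a normally quiet afternoon slot is flagged, even though it would never trip a fixed 90 percent threshold.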

IT pros tried these weird monitoring tricks and the results will shock you!
When monitoring engineers implement and use the capabilities of their monitoring solutions to the fullest, the results are liberating for all parties. Alerts become both more specific and less frequent, which gives teams more time to actually get work done. This in turn causes those same teams to trust the alerts more and react to them in a timely fashion, which benefits the entire business. Best of all, everyone experiences the true value that good monitoring brings and starts engaging us monitoring engineers to create alerts and build insight that helps stabilize and improve the environment even more.

"Monitoring Team Saved This Company $$$" isn't some fake headline designed to get clicks. With a little work, it can be the truth for every organization.

For even more alerting insights, check out the latest episode of SolarWinds Lab here - what happens at 22:17 will blow you away!

More Stories By Leon Adato

Leon Adato is a Head Geek and technical evangelist at SolarWinds and is a Cisco® Certified Network Associate (CCNA), MCSE and SolarWinds Certified Professional (he was once a customer, after all). His 25 years of network management experience spans financial, healthcare, food and beverage, and other industries.
