

@CloudExpo: Article

Part 2 | Understand the Impact of IT on Business

It Takes More than Advanced Correlation

View Part 1 here

Part 2 of a two-part blog series looking at the journey enterprise IT departments take as they increasingly seek to understand the relationships between, and the impact of, IT infrastructure performance on application performance and business services.

Stage 4: Correlation
Through observation, Fred notices that even though his alarms, based on dynamic baselines, catch problems in his environment, they're also catching busy days, quiet days, and even slightly odd days. He starts to realize that just looking at the level of metrics is necessary, but not sufficient. Fred also needs to look at how the metrics work together - he needs statistical correlation.

With a real-time correlation package in place alongside his baselines, Fred knows not only when the level of activity is unusual, but also when the profile of activity is unusual. Now, Fred can finally see the difference between odd and alarming. The number of alerts he sees is more manageable as well, since he's putting much more weight behind what the correlation has to tell him. With the end in sight, Fred decides to send the alarms he's getting directly to the NOC. Finally, he's confident that he's produced the best possible outcome.
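To make the distinction between "level" and "profile" concrete, here is a minimal sketch in Python. It is not Netuitive's method: the z-score baseline, the Pearson correlation check, the thresholds, and every function name are assumptions chosen for illustration. The reading of "odd" as an unusual level with a normal profile, and "alarming" as a broken profile, is also an interpretation of the story rather than something the article defines.

```python
from math import sqrt
from statistics import mean, stdev


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def level_is_unusual(history, current, z_limit=3.0):
    """Baseline check: is the current value far from its recent history?"""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_limit


def profile_is_unusual(window_a, window_b, expected_r, tolerance=0.5):
    """Correlation check: do two metrics still move together as expected?"""
    return abs(pearson(window_a, window_b) - expected_r) > tolerance


def classify(level_flag, profile_flag):
    """Assumed policy: a broken profile is alarming, a level shift alone is odd."""
    if profile_flag:
        return "alarming"
    if level_flag:
        return "odd"
    return "normal"


# Hypothetical data: web requests and database transactions normally rise together.
request_history = [100, 102, 98, 101, 99, 100]
busy_requests = [180, 190, 200, 210, 200]
busy_db_tx = [90, 95, 100, 105, 100]      # busy day: both metrics scale up together
stalled_db_tx = [100, 80, 60, 40, 50]     # requests rise while transactions fall

level_flag = level_is_unusual(request_history, busy_requests[-1])
busy_day = classify(level_flag, profile_is_unusual(busy_requests, busy_db_tx, 0.9))
bad_day = classify(level_flag, profile_is_unusual(busy_requests, stalled_db_tx, 0.9))
print(busy_day, bad_day)
```

In this sketch both days trip the baseline, but only the second one breaks the expected relationship between the two metrics, which is the kind of separation Fred is relying on.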

Had I written this two years ago, this would be the end. I would have introduced Netuitive's real-time analytics and correlation capabilities as a key differentiator for our solution, and I'd be nearly as satisfied with myself as Fred. But we'd both be wrong.

Stage 5: Integrated Knowledge
Fred gets a drop-in visit from Audrey, who manages the Level 1 technicians at Acmecorp. Her job is to ensure that her team either diagnoses and resolves issues or escalates issues as quickly as possible to the right Level 2 specialty team. Fred can tell that Audrey has something on her mind.

"Fred, what does this mean?" She hands him a printout of an alarm from Fred's new monitoring system.

He examines the printout. "Oh, this is telling you that the queue lengths between the BuyNow checkout and the credit card processing service are unusually high...at the same time."

"I get that," Audrey interrupts. "My point is, what can my team do with this? If we have to run to you for the big picture every time an alarm rolls out, we're not exactly living up to our mission of a rapid response. Granted, there's useful stuff here, but we're not recognizing it quickly enough. Some we never figure out, and for others, we know why it's alarming, but it's not even something we need to worry about."

Fred sighs, "Okay, let me work on it." Audrey leaves the printout on the desk as Fred stares at the ceiling. It dawns on him that instead of creating a better run book, he's better off taking his knowledge of the application and building it into the application as much as possible. With an embedded sense of how to interpret the statistics, Audrey's team can have the immediate benefit of Fred's experience even when he's not available. If he builds it right, Audrey's team will see only the issues they can either handle quickly on their own or escalate to the right Level 2 team or to development.

Fred takes out his worn notebook titled "Incidents, Outages and Misfortunes" and begins to consolidate its entries into a knowledge base of standard operating procedures specific to the BuyThis platform and its set of customer constraints. After discussions with Audrey, he decides to focus his attention on I/O contention issues that have caused four outages over the last two months. Through analysis of past incidents, they've noticed that a strong symptom of their I/O contention is when their message queue lengths rise suddenly at the same time that the transaction rate on a particular set of databases falls below the expected level. Fred includes the necessary conditions to indicate the problem, as well as his incident avoidance plan - in this case, looking for rogue batch processing jobs on the database servers. When he's done, he's defined several entries in his new knowledge base. They provide him with an automated way to diagnose a problem, and they give Audrey's team a heads-up on what he feels is the correct escalation procedure.
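A knowledge base entry of the kind Fred describes could be sketched as follows in Python. This is only an illustration of pairing a condition with an avoidance plan and an escalation path: the `KnowledgeEntry` structure, the metric names, the 2x and 70% thresholds, and the choice of the DBA team as the escalation target are all assumptions, not details from the article or from any product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Sample = Dict[str, float]


@dataclass(frozen=True)
class KnowledgeEntry:
    """One hypothetical knowledge base entry: a condition plus what to do about it."""
    name: str
    condition: Callable[[Sample], bool]
    diagnosis: str
    first_action: str
    escalate_to: str


def io_contention(sample: Sample) -> bool:
    """Queue length jumps while the database transaction rate drops below expected.

    The 2x and 70% thresholds are illustrative assumptions.
    """
    queue_jump = sample["queue_length"] >= 2.0 * sample["queue_length_baseline"]
    tx_drop = sample["db_tx_rate"] <= 0.7 * sample["db_tx_rate_expected"]
    return queue_jump and tx_drop


KNOWLEDGE_BASE: List[KnowledgeEntry] = [
    KnowledgeEntry(
        name="io_contention",
        condition=io_contention,
        diagnosis="Probable I/O contention on the database servers",
        first_action="Look for rogue batch processing jobs on the database servers",
        escalate_to="Level 2 DBA team",  # assumed escalation target
    ),
]


def diagnose(sample: Sample, entries: List[KnowledgeEntry] = KNOWLEDGE_BASE) -> List[KnowledgeEntry]:
    """Return every entry whose condition matches the current metrics."""
    return [entry for entry in entries if entry.condition(sample)]


# Hypothetical readings: queues are five times baseline while transactions stall.
contended = {
    "queue_length": 500.0, "queue_length_baseline": 100.0,
    "db_tx_rate": 40.0, "db_tx_rate_expected": 100.0,
}
# A busy day: queues are long, but the databases are keeping up.
busy = dict(contended, db_tx_rate=120.0)

for entry in diagnose(contended):
    print(f"{entry.diagnosis}: {entry.first_action} (escalate to {entry.escalate_to})")
```

Keeping the condition, the first action, and the escalation target in one record is what lets a Level 1 technician act on an alarm without first asking Fred what it means.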

"This," Fred says to himself, "is definitely better." After conferring with Audrey to make sure she and her team understand the plan and agree to follow the recommendations from the knowledge base, Fred moves the new entries into production and prepares himself for the next step.

Stage 6: Acceptance and Maintenance
It doesn't take long for Fred to realize his job isn't done with getting buy-in from Audrey's team. Although his late-night calls from the Level 1 NOC have tapered off thanks to his knowledge base, he finds himself spending more time with the Level 2 team leads. Some believe the knowledge Fred has poured into the system is good, but that it still needs tweaks for the particulars of their environment. Others, like the DBAs, feel it needs to be thoroughly overhauled before they can possibly make use of it.

Fred calls them together for a meeting, along with Audrey and the EVP in charge of the BuyThis platform, Kevin. "Guys," he begins. "I understand your concerns. You want more input into the kinds of knowledge we're embedding into the system. As I see it, I have a good set of knowledge I could use to get the ball rolling with this system. Not perfect, but a strong start, and we've already seen the benefits."

"However, we could do better, both in terms of tuning our existing knowledge base, as well as expanding it. Luckily..." and Fred pauses and looks around the table, "we've got the right set of experts around the table to begin the process."

With Kevin's blessing, Fred convenes a weekly session with the Level 1 and 2 department heads to proactively review the state of the current knowledge base, as well as to propose and vet possible enhancements based on recent performance. Though there is pushback at first, a combination of executive pressure and positive results brings even the reluctant around to Fred's way of thinking. Fred comes to realize that he would have been better off beginning the process with buy-in from the team leads, rather than starting off by mandate.

As the process of creating and refining knowledge base entries continues, Audrey's team becomes much more effective as they spend their critical moments on issue avoidance rather than diagnosis. BuyThis enters a streak of solid performance. Slowdowns and downtime are a thing of the past, the impact of IT and application performance on the business is now measurable, and the teams are more integrated and efficient than ever before.

As for Fred, he's even begun to look towards Stage 7, automating the response to certain knowledge base entries. For now, though, he's happy to be sleeping much better.

More Stories By Marcus Jackson

Marcus is Director of Product Management at Netuitive. He is responsible for the direction of Netuitive's flagship product, including analytics and data visualization. He has over 20 years of experience in software engineering and performance management. Previously, he headed development for Netuitive and the IEEE Computer Society. Marcus holds a bachelor's degree in Computer Science from Harvard University.

