@CloudExpo Authors: Liz McMillan, Pat Romanski, William Schmarzo, Hollis Tibbetts, Elizabeth White

Related Topics: @DevOpsSummit, Java IoT, Linux Containers, Agile Computing, @BigDataExpo

@DevOpsSummit: Blog Post

Application Failures in Production | @DevOpsSummit [DevOps]

The wealth of out-of-the-box insights you could obtain from a single urgent, albeit unspecific log message

How to Approach Application Failures in Production

In my recent article, "Software Quality Metrics for your Continuous Delivery Pipeline - Part III - Logging," I wrote about the good parts and the not-so-good parts of logging and concluded that logging usually fails to deliver what it is so often mistakenly used for: as a mechanism for analyzing application failures in production. In response to the heated debates on reddit.com/r/devops and reddit.com/r/programing, I want to demonstrate the wealth of out-of-the-box insights you could obtain from a single urgent, albeit unspecific log message if you only are equipped with the magic ingredient; full transaction context:

Examples of insights you could obtain from full transaction context on a single log message

Bear with me until I get to explain what this actually means and how it helps you get almost immediate answers to the most urgent questions when your users are struck by an application failure:

  • "How many users are affected and who are they?"
  • "Which tiers are affected by which errors and what is the root cause?"

Operator: I'm here because you broke something. (courtesy of ThinkGeek.com)

When All You Have Is a Lousy Log Message
Does this story sound familiar to you? It's a Friday afternoon and you just received the release artifacts from the development team belatedly, which need to be released by Monday morning. After spending the night and another day in operations to get this release out into production timely, you notice the Monday after that everything you have achieved in the end was some lousy log message:

08:55:26 SEVERE com.company.product.login.LoginLogic - LoginException occurred when processing Login transaction

While this scenario hopefully does not reflect a common case for you, it still shows an important aspect in the life of development and operations: working as an operator involves monitoring the production environment and providing assistance in troubleshooting application failures mainly with the help of log messages - things that developers have baked into their code. While certainly not all log messages need to be as poor as this one, getting down to the bottom of a production failure is often a tedious endeavor (see this comment on reddit by RecklessKelly who sometimes needs weeks to get his "Eureka moment") - if at all possible.

Why There Is No Such Thing as a 100% Error-Free Code
Production failures can become a major pain for your business with long-term effects: they will not only make your visitors buy elsewhere, but depending on the level of frustrations, your customers may choose to stay at your competition instead of giving you another chance.

As we all know, we just cannot get rid of application failures in production entirely. Agile methodologies, such as Extreme Programming or Scrum, aim to build quality into our processes; however, there is still no such thing as a 100% error-free application. "We need to write more tests!" you may argue and I would agree: disciplines such as TDD and ATDD should be an integral part of your software development process since they, if applied correctly, help you produce better code and fewer bugs. Still, it is simply impossible to test each and every corner of your application for all possible combinations of input parameters and application state. Essentially, we can run only a limited subset of all possible test scenarios. The common goal of developers and test automation engineers, hence, must be to implement a testing strategy, which allows them to deliver code of sufficient quality. Consequently, there is always a chance that something can go wrong, and, as a serious business, you will want to be prepared for the unpredictable and, additionally, have as much control over it as possible:

Why you cannot get rid of application failures in production: remaining failure probability

Without further ado, let's examine some precious out-of-the-box insights you could obtain if you are equipped with full transaction context and are able to capture all transactions.

Why this is important? Because it enables you to see the contributions of input parameters, processes, infrastructure and users at all times whenever a failure occurred, solve problems faster, and additionally use the presented information such as unexpected input parameters to further improve your testing strategy.

Initial Situation: Aggregated Log Messages
Instead of crawling a bunch of possibly distributed log files to determine the count of particular log messages, we may, first of all, want to have this done automatically for us just as they happen. This gives a good overview on the respective message frequencies and facilitates prioritization:

Aggregated log events: severity, logger name, message and count

What we see here (analysis view based on our PurePath technology) is that there have been 104 occurrences of the same log message in the application. We could also observe other captured event data, such as the severity level and the name of the logger instance (usually the name of the class that created the logger).

Question #1: How many users are affected and who are they?

Failed Business Transactions: "Logins" and "Logins by Username"

Having the full transactional context and not just the log message allows us to figure out which critical Business Transactions of our application are impacted. From the dashboard above we can observe that "Logins" and "Logins by Username" have failed: we see that 61 users attempted the 104 logins and who these users were by their username.

For questions 2 and 3, and for further insight, click here for the full article.

More Stories By Martin Etmajer

Leveraging his outstanding technical skills as a lead software engineer, Martin Etmajer has been a key contributor to a number of large-scale systems across a range of industries. He is as passionate about great software as he is about applying Lean Startup principles to the development of products that customers love.

Martin is a life-long learner who frequently speaks at international conferences and meet-ups. When not spending time with family, he enjoys swimming and Yoga. He holds a master's degree in Computer Engineering from the Vienna University of Technology, Austria, with a focus on dependable distributed real-time systems.

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.

@CloudExpo Stories
So you think you are a DevOps warrior, huh? Put your money (not really, it’s free) where your metrics are and prove it by taking The Ultimate DevOps Geek Quiz Challenge, sponsored by DevOps Summit. Battle through the set of tough questions created by industry thought leaders to earn your bragging rights and win some cool prizes.
We're entering the post-smartphone era, where wearable gadgets from watches and fitness bands to glasses and health aids will power the next technological revolution. With mass adoption of wearable devices comes a new data ecosystem that must be protected. Wearables open new pathways that facilitate the tracking, sharing and storing of consumers’ personal health, location and daily activity data. Consumers have some idea of the data these devices capture, but most don’t realize how revealing and...
A completely new computing platform is on the horizon. They’re called Microservers by some, ARM Servers by others, and sometimes even ARM-based Servers. No matter what you call them, Microservers will have a huge impact on the data center and on server computing in general. Although few people are familiar with Microservers today, their impact will be felt very soon. This is a new category of computing platform that is available today and is predicted to have triple-digit growth rates for some ...
Governments around the world are adopting Safe Harbor privacy provisions to protect customer data from leaving sovereign territories. Increasingly, global companies are required to create new instances of their server clusters in multiple countries to keep abreast of these new Safe Harbor laws. Is it worth it? In his session at 19th Cloud Expo, Adam Rogers, Managing Director of Anexia, Inc., will discuss how to keep your data legal and still stay in business.
SYS-CON Events announced today that MathFreeOn will exhibit at the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. MathFreeOn is Software as a Service (SaaS) used in Engineering and Math education. Write scripts and solve math problems online. MathFreeOn provides online courses for beginners or amateurs who have difficulties in writing scripts. In accordance with various mathematical topics, there are more tha...
In past @ThingsExpo presentations, Joseph di Paolantonio has explored how various Internet of Things (IoT) and data management and analytics (DMA) solution spaces will come together as sensor analytics ecosystems. This year, in his session at @ThingsExpo, Joseph di Paolantonio from DataArchon, will be adding the numerous Transportation areas, from autonomous vehicles to “Uber for containers.” While IoT data in any one area of Transportation will have a huge impact in that area, combining sensor...
SYS-CON Events announced today that SoftNet Solutions will exhibit at the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. SoftNet Solutions specializes in Enterprise Solutions for Hadoop and Big Data. It offers customers the most open, robust, and value-conscious portfolio of solutions, services, and tools for the shortest route to success with Big Data. The unique differentiator is the ability to architect and ...
The best way to leverage your Cloud Expo presence as a sponsor and exhibitor is to plan your news announcements around our events. The press covering Cloud Expo and @ThingsExpo will have access to these releases and will amplify your news announcements. More than two dozen Cloud companies either set deals at our shows or have announced their mergers and acquisitions at Cloud Expo. Product announcements during our show provide your company with the most reach through our targeted audiences.
Successful transition from traditional IT to cloud computing requires three key ingredients: an IT architecture that allows companies to extend their internal best practices to the cloud, a cost point that allows economies of scale, and automated processes that manage risk exposure and maintain regulatory compliance with industry regulations (FFIEC, PCI-DSS, HIPAA, FISMA). The unique combination of VMware, the IBM Cloud, and Cloud Raxak, a 2016 Gartner Cool Vendor in IT Automation, provides a co...
More and more brands have jumped on the IoT bandwagon. We have an excess of wearables – activity trackers, smartwatches, smart glasses and sneakers, and more that track seemingly endless datapoints. However, most consumers have no idea what “IoT” means. Creating more wearables that track data shouldn't be the aim of brands; delivering meaningful, tangible relevance to their users should be. We're in a period in which the IoT pendulum is still swinging. Initially, it swung toward "smart for smar...
@ThingsExpo has been named the Top 5 Most Influential Internet of Things Brand by Onalytica in the ‘The Internet of Things Landscape 2015: Top 100 Individuals and Brands.' Onalytica analyzed Twitter conversations around the #IoT debate to uncover the most influential brands and individuals driving the conversation. Onalytica captured data from 56,224 users. The PageRank based methodology they use to extract influencers on a particular topic (tweets mentioning #InternetofThings or #IoT in this ...
SYS-CON Events announced today that Niagara Networks will exhibit at the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Niagara Networks offers the highest port-density systems, and the most complete Next-Generation Network Visibility systems including Network Packet Brokers, Bypass Switches, and Network TAPs.
In an era of historic innovation fueled by unprecedented access to data and technology, the low cost and risk of entering new markets has leveled the playing field for business. Today, any ambitious innovator can easily introduce a new application or product that can reinvent business models and transform the client experience. In their Day 2 Keynote at 19th Cloud Expo, Mercer Rowe, IBM Vice President of Strategic Alliances, and Raejeanne Skillern, Intel Vice President of Data Center Group and ...
Data is the fuel that drives the machine learning algorithmic engines and ultimately provides the business value. In his session at Cloud Expo, Ed Featherston, a director and senior enterprise architect at Collaborative Consulting, will discuss the key considerations around quality, volume, timeliness, and pedigree that must be dealt with in order to properly fuel that engine.
SYS-CON Events announced today that StarNet Communications will exhibit at the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. StarNet Communications’ FastX is the industry first cloud-based remote X Windows emulator. Using standard Web browsers (FireFox, Chrome, Safari, etc.) users from around the world gain highly secure access to applications and data hosted on Linux-based servers in a central data center. ...
SYS-CON Events announced today that Cemware will exhibit at the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Use MATLAB functions by just visiting website mathfreeon.com. MATLAB compatible, freely usable, online platform services. As of October 2016, 80,000 users from 180 countries are enjoying our platform service.
Traditional on-premises data centers have long been the domain of modern data platforms like Apache Hadoop, meaning companies who build their business on public cloud were challenged to run Big Data processing and analytics at scale. But recent advancements in Hadoop performance, security, and most importantly cloud-native integrations, are giving organizations the ability to truly gain value from all their data. In his session at 19th Cloud Expo, David Tishgart, Director of Product Marketing ...
Virgil consists of an open-source encryption library, which implements Cryptographic Message Syntax (CMS) and Elliptic Curve Integrated Encryption Scheme (ECIES) (including RSA schema), a Key Management API, and a cloud-based Key Management Service (Virgil Keys). The Virgil Keys Service consists of a public key service and a private key escrow service. 

SYS-CON Events announced today that eCube Systems, the leading provider of modern development tools and best practices for Continuous Integration on OpenVMS, will exhibit at SYS-CON's @DevOpsSummit at Cloud Expo New York, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. eCube Systems offers a family of middleware products and development tools that maximize return on technology investment by leveraging existing technical equity to meet evolving business needs. ...
Effectively SMBs and government programs must address compounded regulatory compliance requirements. The most recent are Controlled Unclassified Information and the EU’s GDPR have Board Level implications. Managing sensitive data protection will likely result in acquisition criteria, demonstration requests and new requirements. Developers, as part of the pre-planning process and the associated supply chain, could benefit from updating their code libraries and design by incorporating changes.