|By Michael Kopp||
|December 12, 2012 07:30 AM EST||
Anyone who ever monitored or analyzed an application uses or has used averages. They are simple to understand and calculate. We tend to ignore just how wrong the picture is that averages paint of the world. To emphasis the point let me give you a real-world example outside of the performance space that I read recently in a newspaper.
The article was explaining that the average salary in a certain region in Europe was 1900 Euro's (to be clear this would be quite good in that region!). However when looking closer they found out that the majority, namely 9 out of 10 people, only earned around 1000 Euros and one would earn 10.000 (I over simplified this of course, but you get the idea). If you do the math you will see that the average of this is indeed 1900, but we can all agree that this does not represent the "average" salary as we would use the word in day to day live. So now let's apply this thinking to application performance.
The Average Response Time
The average response time is by far the most commonly used metric in application performance management. We assume that this represents a "normal" transaction, however this would only be true if the response time is always the same (all transaction run at equal speed) or the response time distribution is roughly bell curved.
A Bell curve represents the "normal" distribution of response times in which the average and the median are the same. It rarely ever occurs in real applications
In a Bell Curve the average (mean) and median are the same. In other words observed performance would represent the majority (half or more than half) of the transactions.
In reality most applications have few very heavy outliers; a statistician would say that the curve has a long tail. A long tail does not imply many slow transactions, but few that are magnitudes slower than the norm.
This is a typical Response Time Distribution with few but heavy outliers - it has a long tail. The average here is dragged to the right by the long tail.
We recognize that the average no longer represents the bulk of the transactions but can be a lot higher than the median.
You can now argue that this is not a problem as long as the average doesn't look better than the median. I would disagree, but let's look at another real-world scenario experienced by many of our customers:
This is another typical Response Time Distribution. Here we have quite a few very fast transactions that drag the average to the left of the actual median
In this case a considerable percentage of transactions are very, very fast (10-20 percent), while the bulk of transactions are several times slower. The median would still tell us the true story, but the average all of a sudden looks a lot faster than most of our transactions actually are. This is very typical in search engines or when caches are involved - some transactions are very fast, but the bulk are normal. Another reason for this scenario are failed transactions, more specifically transactions that failed fast. Many real-world applications have a failure rate of 1-10 percent (due to user errors or validation errors). These failed transactions are often magnitudes faster than the real ones and consequently distorted an average.
Of course performance analysts are not stupid and regularly try to compensate with higher frequency charts (compensating by looking at smaller aggregates visually) and by taking in minimum and maximum observed response times. However we can often only do this if we know the application very well, those unfamiliar with the application might easily misinterpret the charts. Because of the depth and type of knowledge required for this, it's difficult to communicate your analysis to other people - think how many arguments between IT teams have been caused by this. And that's before we even begin to think about communicating with business stakeholders!
A better metric by far are percentiles, because they allow us to understand the distribution. But before we look at percentiles, let's take a look a key feature in every production monitoring solution: Automatic Baselining and Alerting.
Automatic Baselining and Alerting
In real-world environments, performance gets attention when it is poor and has a negative impact on the business and users. But how can we identify performance issues quickly to prevent negative effects? We cannot alert on every slow transaction, since there are always some. In addition, most operations teams have to maintain a large number of applications and are not familiar with all of them, so manually setting thresholds can be inaccurate, quite painful and time-consuming.
The industry has come up with a solution called Automatic Baselining. Baselining calculates out the "normal" performance and only alerts us when an application slows down or produces more errors than usual. Most approaches rely on averages and standard deviations.
Without going into statistical details, this approach again assumes that the response times are distributed over a bell curve:
The Standard Deviation represents 33% of all transactions with the mean as the middle. 2xStandard Deviation represents 66% and thus the majority, everything outside could be considered an outlier. However most real world scenarios are not bell curved...
Typically, transactions that are outside two times standard deviation are treated as slow and captured for analysis. An alert is raised if the average moves significantly. In a bell curve this would account for the slowest 16.5 percent (and you can of course adjust that); however; if the response time distribution does not represent a bell curve, it becomes inaccurate. We either end up with a lot of false positives (transactions that are a lot slower than the average but when looking at the curve lie within the norm) or we miss a lot of problems (false negatives). In addition if the curve is not a bell curve, then the average can differ a lot from the median; applying a standard deviation to such an average can lead to quite a different result than you would expect. To work around this problem these algorithms have many tunable variables and a lot of "hacks" for specific use cases.
Why I Love Percentiles
A percentile tells me which part of the curve I am looking at and how many transactions are represented by that metric. To visualize this look at the following chart:
This chart shows the 50th and 90th percentile along with the average of the same transaction. It shows that the average is influenced far mor heavily by the 90th, thus by outliers and not by the bulk of the transactions
The green line represents the average. As you can see it is very volatile. The other two lines represent the 50th and 90th percentile. As we can see the 50th percentile (or median) is rather stable but has a couple of jumps. These jumps represent real performance degradation for the majority (50%) of the transactions. The 90th percentile (this is the start of the "tail") is a lot more volatile, which means that the outliers slowness depends on data or user behavior. What's important here is that the average is heavily influenced (dragged) by the 90th percentile, the tail, rather than the bulk of the transactions.
If the 50th percentile (median) of a response time is 500ms that means that 50% of my transactions are either as fast or faster than 500ms. If the 90th percentile of the same transaction is at 1000ms it means that 90% are as fast or faster and only 10% are slower. The average in this case could either be lower than 500ms (on a heavy front curve), a lot higher (long tail) or somewhere in between. A percentile gives me a much better sense of my real world performance, because it shows me a slice of my response time curve.
For exactly that reason percentiles are perfect for automatic baselining. If the 50th percentile moves from 500ms to 600ms I know that 50% of my transactions suffered a 20% performance degradation. You need to react to that.
In many cases we see that the 75th or 90th percentile does not change at all in such a scenario. This means the slow transactions didn't get any slower, only the normal ones did. Depending on how long your tail is the average might not have moved at all in such a scenario.!
In other cases we see the 98th percentile degrading from 1s to 1.5 seconds while the 95th is stable at 900ms. This means that your application as a whole is stable, but a few outliers got worse, nothing to worry about immediately. Percentile-based alerts do not suffer from false positives, are a lot less volatile and don't miss any important performance degradations! Consequently a baselining approach that uses percentiles does not require a lot of tuning variables to work effectively.
The screenshot below shows the Median (50th Percentile) for a particular transaction jumping from about 50ms to about 500ms and triggering an alert as it is significantly above the calculated baseline (green line). The chart labeled "Slow Response Time" on the other hand shows the 90th percentile for the same transaction. These "outliers" also show an increase in response time but not significant enough to trigger an alert.
Here we see an automatic baselining dashboard with a violation at the 50th percentile. The violation is quite clear, at the same time the 90th percentile (right upper chart) does not violate. Because the outliers are so much slower than the bulk of the transaction an average would have been influenced by them and would not have have reacted quite as dramatically as the 50th percentile. We might have missed this clear violation!
How Can We Use Percentiles for Tuning?
Percentiles are also great for tuning, and giving your optimizations a particular goal. Let's say that something within my application is too slow in general and I need to make it faster. In this case I want to focus on bringing down the 90th percentile. This would ensure sure that the overall response time of the application goes down. In other cases I have unacceptably long outliers I want to focus on bringing down response time for transactions beyond the 98th or 99th percentile (only outliers). We see a lot of applications that have perfectly acceptable performance for the 90th percentile, with the 98th percentile being magnitudes worse.
In throughput oriented applications on the other hand I would want to make the majority of my transactions very fast, while accepting that an optimization makes a few outliers slower. I might therefore make sure that the 75th percentile goes down while trying to keep the 90th percentile stable or not getting a lot worse.
I could not make the same kind of observations with averages, minimum and maximum, but with percentiles they are very easy indeed.
Averages are ineffective because they are too simplistic and one-dimensional. Percentiles are a really great and easy way of understanding the real performance characteristics of your application. They also provide a great basis for automatic baselining, behavioral learning and optimizing your application with a proper focus. In short, percentiles are great!
|rtalexander 11/21/12 12:58:00 AM EST|
Hey, could you post a reference or two that covers the theory and/or practicalities of the approach you describe?
The cloud is everywhere and growing, and with it SaaS has become an accepted means for software delivery. SaaS is more than just a technology, it is a thriving business model estimated to be worth around $53 billion dollars by 2015, according to IDC. The question is - how do you build and scale a profitable SaaS business model? In his session at 15th Cloud Expo, Jason Cumberland, Vice President, SaaS Solutions at Dimension Data, discussed the common mistakes businesses make when transitioning t...
May. 27, 2015 07:30 AM EDT Reads: 3,454
In their session at @ThingsExpo, Shyam Varan Nath, Principal Architect at GE, and Ibrahim Gokcen, who leads GE's advanced IoT analytics, focused on the Internet of Things / Industrial Internet and how to make it operational for business end-users. Learn about the challenges posed by machine and sensor data and how to marry it with enterprise data. They also discussed the tips and tricks to provide the Industrial Internet as an end-user consumable service using Big Data Analytics and Industrial C...
May. 27, 2015 07:30 AM EDT Reads: 5,678
Cloud and Big Data present unique dilemmas: embracing the benefits of these new technologies while maintaining the security of your organization's assets. When an outside party owns, controls and manages your infrastructure and computational resources, how can you be assured that sensitive data remains private and secure? How do you best protect data in mixed use cloud and big data infrastructure sets? Can you still satisfy the full range of reporting, compliance and regulatory requirements? In...
May. 27, 2015 07:00 AM EDT Reads: 4,521
Storage administrators find themselves walking a line between meeting employees’ demands to use public cloud storage services, and their organizations’ need to store information on-premises for security, performance, cost and compliance reasons. However, as file sharing protocols like CIFS and NFS continue to lose their relevance, simply relying only on a NAS-based environment creates inefficiencies that hurt productivity and the bottom line. IT wants to implement cloud storage it can purchase a...
May. 27, 2015 06:30 AM EDT Reads: 3,029
There has been a lot of discussion recently in the DevOps space over whether there is a unique form of DevOps for large enterprises or is it just vendors looking to sell services and tools. In his session at DevOps Summit, Chris Riley, a technologist, discussed whether Enterprise DevOps is a unique species or not. What makes DevOps adoption in the enterprise unique or what doesn’t? Unique or not, what does this mean for adopting DevOps in enterprise size organizations? He also explored differe...
May. 27, 2015 06:30 AM EDT Reads: 3,832
The move to the cloud brings a number of new security challenges, but the application remains your last line of defense. In his session at 15th Cloud Expo, Arthur Hicken, Evangelist at Parasoft, discussed how developers are extremely well-poised to perform tasks critical for securing the application – provided that certain key obstacles are overcome. Arthur Hicken has been involved in automating various practices at Parasoft for almost 20 years. He has worked on projects including database dev...
May. 27, 2015 06:00 AM EDT Reads: 3,008
In this scenarios approach Joe Thykattil, Technology Architect & Sales at TimeWarner / Navisite, presented examples that will allow business-savvy professionals to make informed decisions based on a sound business model. This model covered the technology options in detail as well as a financial analysis. The TCO (Total Cost of Ownership) and ROI (Return on Investment) demonstrated how to start, develop and formulate a business case that will allow both small and large scale projects to achieve...
May. 27, 2015 05:30 AM EDT Reads: 3,463
Cloud Foundry open Platform as a Service makes it easy to operate, scale and deploy application for your dedicated cloud environments. It enables developers and operators to be significantly more agile, writing great applications and deliver them in days instead of months. Cloud Foundry takes care of all the infrastructure and network plumbing that you need to build, run and operate your applications and can do this while patching and updating systems and services without any downtime.
May. 27, 2015 05:30 AM EDT Reads: 3,786
Are your Big Data initiatives resulting in Big Impact or Big Mess? In her session at Big Data Expo, Penelope Everall Gordon, Emerging Technology Strategist at 1Plug Corporation, shared her successes in improving Big Decision outcomes by building stories compelling to the target audience – and her failures when she lost sight of the plotline, distracted by the glitter of technology and the lure of buried insights. The cast of characters includes the agency head [city official? elected official?...
May. 27, 2015 05:30 AM EDT Reads: 3,671
Building low-cost wearable devices can enhance the quality of our lives. In his session at Internet of @ThingsExpo, Sai Yamanoor, Embedded Software Engineer at Altschool, provided an example of putting together a small keychain within a $50 budget that educates the user about the air quality in their surroundings. He also provided examples such as building a wearable device that provides transit or recreational information. He then reviewed the resources available to build wearable devices at ...
May. 27, 2015 04:30 AM EDT Reads: 4,291
After a couple of false starts, cloud-based desktop solutions are picking up steam, driven by trends such as BYOD and pervasive high-speed connectivity. In his session at 15th Cloud Expo, Seth Bostock, CEO of IndependenceIT, cut through the hype and the acronyms, and discussed the emergence of full-featured cloud workspaces that do for the desktop what cloud infrastructure did for the server. He also discussed VDI vs DaaS, implementation strategies and evaluation criteria.
May. 27, 2015 04:00 AM EDT Reads: 4,750
The emergence of cloud computing and Big Data warrants a greater role for the PMO to successfully manage enterprise transformation driven by these powerful trends. As the adoption of cloud-based services continues to grow, a governance model is needed to orchestrate enterprise cloud implementations and harness the power of Big Data analytics. In his session at Cloud Expo, Mahesh Singh, President of BigData, Inc., discussed how the Enterprise PMO takes center stage not only in developing the app...
May. 27, 2015 03:30 AM EDT Reads: 2,923
How do APIs and IoT relate? The answer is not as simple as merely adding an API on top of a dumb device, but rather about understanding the architectural patterns for implementing an IoT fabric. There are typically two or three trends: Exposing the device to a management framework Exposing that management framework to a business centric logic Exposing that business layer and data to end users. This last trend is the IoT stack, which involves a new shift in the separation of what stuff happe...
May. 27, 2015 03:00 AM EDT Reads: 5,970
We certainly live in interesting technological times. And no more interesting than the current competing IoT standards for connectivity. Various standards bodies, approaches, and ecosystems are vying for mindshare and positioning for a competitive edge. It is clear that when the dust settles, we will have new protocols, evolved protocols, that will change the way we interact with devices and infrastructure. We will also have evolved web protocols, like HTTP/2, that will be changing the very core...
May. 27, 2015 02:30 AM EDT Reads: 5,630
Connected devices and the Internet of Things are getting significant momentum in 2014. In his session at Internet of @ThingsExpo, Jim Hunter, Chief Scientist & Technology Evangelist at Greenwave Systems, examined three key elements that together will drive mass adoption of the IoT before the end of 2015. The first element is the recent advent of robust open source protocols (like AllJoyn and WebRTC) that facilitate M2M communication. The second is broad availability of flexible, cost-effective ...
May. 27, 2015 02:00 AM EDT Reads: 6,384
Collecting data in the field and configuring multitudes of unique devices is a time-consuming, labor-intensive process that can stretch IT resources. Horan & Bird [H&B], Australia’s fifth-largest Solar Panel Installer, wanted to automate sensor data collection and monitoring from its solar panels and integrate the data with its business and marketing systems. After data was collected and structured, two major areas needed to be addressed: improving developer workflows and extending access to a b...
May. 27, 2015 01:00 AM EDT Reads: 2,583
When an enterprise builds a hybrid IaaS cloud connecting its data center to one or more public clouds, security is often a major topic along with the other challenges involved. Security is closely intertwined with the networking choices made for the hybrid cloud. Traditional networking approaches for building a hybrid cloud try to kludge together the enterprise infrastructure with the public cloud. Consequently this approach requires risky, deep "surgery" including changes to firewalls, subnets...
May. 27, 2015 12:00 AM EDT Reads: 4,560
Containers Expo Blog covers the world of containers, as this lightweight alternative to virtual machines enables developers to work with identical dev environments and stacks. Containers Expo Blog offers top articles, news stories, and blog posts from the world's well-known experts and guarantees better exposure for its authors than any other publication. Bookmark Containers Expo Blog ▸ Here Follow new article posts on Twitter at @ContainersExpo
May. 26, 2015 11:00 PM EDT Reads: 1,218
There is little doubt that Big Data solutions will have an increasing role in the Enterprise IT mainstream over time. 8th International Big Data Expo, co-located with 17th International Cloud Expo - to be held November 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA - has announced its Call for Papers is open. As advanced data storage, access and analytics technologies aimed at handling high-volume and/or fast moving data all move center stage, aided by the cloud computing bo...
May. 26, 2015 08:45 PM EDT Reads: 2,165
The 5th International DevOps Summit, co-located with 17th International Cloud Expo – being held November 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA – announces that its Call for Papers is open. Born out of proven success in agile development, cloud computing, and process automation, DevOps is a macro trend you cannot afford to miss. From showcase success stories from early adopters and web-scale businesses, DevOps is expanding to organizations of all sizes, including the...
May. 26, 2015 05:30 PM EDT Reads: 4,578