Enough Is Enough!  Or Is It? Best Practices for Load Server Calibration
By Dan Boutin

Here’s a question I get asked all the time by customers and prospects alike: I use “that tool out west” and I can usually get 50 vUsers per load server for my current scripts/tests. How many vUsers can SOASTA get per load server?

My typical response: Do you currently use a load server calibration process?

The dead silence and blank stare I get in return usually means the person thinks they need a dedicated hardware lab, or that they have no idea why you’d want to calibrate a load server, much less have an iterative process for doing it.

Here’s one reason why it’s important, and why I get asked this all the time:

I know of several large financial institutions that do all of their performance testing inside the firewall, which means they own all of their own infrastructure, including dedicated servers used solely for load generation. With load generation requirements that can reach hundreds of thousands of vUsers, even a small optimization of a load server can save a company significant infrastructure costs in load server hardware alone. That is why, as part of our best practices, SOASTA advocates calibrating load servers when using CloudTest.

So, let’s walk through the process.

Load server calibration: What it is and why you need to do it
Load server calibration is an iterative process to accurately determine the appropriate number of virtual users to run from each load server.

Why calibrate? Two reasons:

  1. To identify the maximum number of users a load server can handle (per test clip/script) so computer resources are used in the most efficient way possible. If you are running very large tests where the number of load servers might be limited, you need to get as many users as possible on a load server. (Refer to my intro example as well.)
  2. To eliminate the load server as a potential bottleneck in the test. If the load server is overloaded, the first thing you will see in the results is an increase in average response time. If you don’t notice the load server is overloaded, you might incorrectly believe the target application is slowing down when in reality the load server itself is the problem.

So, how do you know when a load server is properly calibrated? Simple. Here are a few things to look for:

  • CPU — For cloud providers where instances are on shared hardware, the CPU should peak around 65-75% utilization. Average utilization should be around 65%. For bare-metal load servers, CPU utilization shouldn’t go beyond 90%.
  • Heap usage — The Java heap usage and garbage collection need to be “healthy” on the load servers. Heap usage should not increase as the test executes.
  • Network interface — No more than 90% utilization (i.e., 900-950 Mbit/second on a gigabit interface).
  • Errors — No weird errors are seen during the test.
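These thresholds are easy to eyeball on the dashboards, but you can also encode them as a quick pass/fail check over sampled monitoring data. Below is a minimal Python sketch under that assumption; the field names and the is_calibrated helper are illustrative, not part of the CloudTest API:

    # Hypothetical sketch: apply the calibration criteria above to a list of
    # per-interval monitoring samples. Field names are assumptions.
    def is_calibrated(samples, bare_metal=False):
        """samples: dicts with 'cpu_pct', 'heap_mb', 'net_mbit', 'errors'."""
        cpu_limit = 90 if bare_metal else 75          # shared cloud hardware peaks lower
        if max(s["cpu_pct"] for s in samples) > cpu_limit:
            return False
        # Heap should trend flat: compare the first and last thirds of the run.
        third = max(len(samples) // 3, 1)
        early = sum(s["heap_mb"] for s in samples[:third]) / third
        late = sum(s["heap_mb"] for s in samples[-third:]) / third
        if late > early * 1.2:                        # sustained growth: unhealthy GC
            return False
        if max(s["net_mbit"] for s in samples) > 900: # ~90% of a 1 Gbit interface
            return False
        if any(s["errors"] for s in samples):         # any "weird" errors disqualify
            return False
        return True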

When should you perform a calibration test?

The calibration process is highly recommended for all tests, but especially in these situations:

  • Whenever a large test is being executed (i.e. over 10,000 users). This ensures that server resources are being used in the most efficient manner possible.
  • When your analysis of previous test results leads you to believe that the load servers might be getting in the way of the load test. As an example, maybe you’ve noticed the load servers spiking up to high levels of CPU usage in the CloudTest Monitor Dashboard. Maybe the target application doesn’t appear to be slowing down — despite metrics on the SOASTA dashboards that indicate an increase in average response time.

The calibration process explained
Depending on how many locations you plan to run the test from, how many test clips you have, and how long the testing session will last, the calibration process can take a varying amount of time. However, if done correctly, the output is a fairly precise number of virtual users that should be applied to each clip on each instance size on each cloud provider.

The number of virtual users that a given load server will support is determined by ramping up a test to a specified number of users over a 10-minute period. This relatively slow ramp (in most cases) will help identify when the key metrics on the load servers start to deviate from the norm.
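Because the ramp is linear, the elapsed time at which a metric starts to deviate maps directly to an approximate virtual user count. A small Python sketch of that arithmetic (the users_at helper and its defaults are illustrative, matching the 10-minute ramp to 1,000 users used in this post):

    # Sketch: translate the deviation timestamp into an approximate VU count.
    def users_at(elapsed_s, target_users=1000, ramp_s=600):
        """Approximate VUs active elapsed_s seconds into a linear ramp."""
        return min(target_users, round(target_users * elapsed_s / ramp_s))

    # Example: CPU flatlined 6.5 minutes into a 10-minute ramp to 1,000 users.
    print(users_at(390))  # -> 650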

What you need to know before you get started:

  • How many test clips will run during the test session?
  • How many hours will the test session last?
  • What cloud providers will be used for the test, if any?
  • Which instance sizes with each cloud provider will be used?

Make sure the target application can sustain the needed number of users. As an example, let’s use 1,000 users, as it’s likely more than the load generator can push (given an average test case and a load server consistent with an AWS Large instance). Lower levels will also work, but if you can get clearance for 1,000, that will ensure you have headroom to start high and see issues as you ramp up.

Step 1:  Test clip setup
The first step in the calibration process is making sure you have created all the test clips that will run during the test session. Ensure that all the think times are appropriate to the application (meaning, make sure the test clip takes the correct amount of time to complete). Then, make sure that all memory optimizations are complete in the test clip.

The two most important things are scopes and clearing the responses in scripts. Scopes should be set as ‘private’ unless local/public is needed for scripts. The clearResponse function should be used in any scripts (like validations or extractions) that access the response of a message.

Step 2:  Test environment setup

  1. Start a grid with 1 result server and 1 load server in the location you want to calibrate (for example, Rackspace OpenStack London).
  2. Normally load is generated from large-size instances, but if you have a specific reason to generate load from another instance size (and it needs to be calibrated separately), you will need to run the calibration process below with a separate grid that only has that instance.
  3. Once the grid starts, verify that monitoring of the grid started as well. This monitoring data provides critical information for the calibration. The grid UI will confirm monitoring is started. You can also confirm it by opening the CloudTest monitor dashboard. If you see the load server and result server listed and capturing data, the monitoring was started correctly. The dashboard will look like this:

[Figure: CloudTest monitor dashboard showing the load server and result server capturing data]

Where to look during the calibration test
There are five main areas to watch during the calibration test:

  • Test results. Keep an eye on two main metrics: average response time and errors. If the average response time starts to rise (which it likely will as you ramp up to 1,000 users), start looking at the monitoring data. Try to determine if the increase in response time is due to the load increasing or because the load servers are getting in the way. Also, watch the errors. If you start to see “weird” errors (HTTP timeout errors, connection timeout errors, odd SSL errors, etc.), that might be an indication the load servers are overloaded.
  • Default monitoring dashboard. This is the monitoring dashboard that shows monitoring data over time. It is accessed from SOASTA Central by clicking on ‘Monitors’, finding the load server you just started, and clicking the ‘View Analytics’ link. A better option is to create your own monitoring dashboard showing these same metrics: if you link it to the monitoring when you start the composition, you get correlated metrics in the dashboards. Either way, this will be your primary monitoring dashboard during the test.
  • Linux ‘top’. This command gives you deeper information than the default CPU chart provides. Specifically, on EC2, it shows how much CPU is being stolen by the hypervisor (%st). Because EC2 routinely steals CPU, the maximum CPU utilization you might see in the SOASTA dashboard is ~75%; the CPU is effectively maxed out if top shows ~25% st. Before the test starts, SSH into the load server (assuming the load server is Linux-based) and type ‘top’ at the command line. Keep this window up throughout the calibration test. (A parsing sketch follows this list.)
  • CloudTest monitor dashboard. This is the dashboard mentioned above. It is less useful in this test as there is only a single load. This dashboard is most useful during large tests where lots of load servers and result servers are active. You can see all the metrics on a single dashboard.
  • Monitoring combined charts. This dashboard is very useful since it shows the different monitoring metrics against Virtual Users and Send Rate. The only thing to be aware of is that you need to add the monitoring to the composition before the test starts: go to Composition Properties -> Monitoring -> Enable Server Monitoring and check the monitor of the grid whose monitoring data you want to save.
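As promised, here is a minimal Python sketch of pulling %id (idle) and %st (steal) out of a top CPU summary line. The sample line and the thresholds are illustrative:

    import re

    # Sample CPU summary line as printed by top on a busy EC2 instance.
    CPU_LINE = "%Cpu(s):  7.3 us,  2.1 sy,  0.0 ni,  0.5 id,  0.2 wa,  0.0 hi,  0.4 si, 24.8 st"

    def cpu_fields(line):
        """Return {'id': ..., 'st': ...} parsed from a top CPU summary line."""
        return {key: float(val) for val, key in re.findall(r"([\d.]+)\s*(id|st)", line)}

    fields = cpu_fields(CPU_LINE)
    if fields["id"] < 1.0 and fields["st"] > 20.0:
        print("Load server CPU is effectively maxed out (idle ~0, steal ~25%)")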

Step 3: Create and execute the test composition
Given that the goal is to identify as closely as possible where the load server starts to get overloaded, the longer the ramp-up time, the better. However, you have to be realistic too. You don’t have days to do this. So, in this example, a 10-minute ramp to 1,000 users seems to be adequate.

Set up the test composition as follows:

  1. On Track 1, put the test clip being calibrated
  2. Set the track to use a ‘dedicated load server’
  3. Set composition to “load” mode
  4. Set 1 load server
  5. Set 1,000 virtual users on that load server
  6. Set ramp-up time on the track to 10 minutes
  7. Set the track with Renew Parallel Repeats
  8. Go to Composition Properties -> Monitoring and check the checkbox for the ‘Test Servers’ monitor (note that this will change on each stop/start of the grid, so confirm this setting if you restart the grid)

Once the test composition is set up correctly, load and play the composition to start the calibration test.

How to know when the load server is overloaded: five metrics to watch
Identifying when a server is overloaded is as much an art as it is a science. The metrics discussed below will lead you in the right direction, but how conservative you might be has an impact as well. Whenever you think you’ve identified a point in the test where you believe the load server is overloaded, note the number of virtual users. Then let the test run a few minutes more. It’s possible the metric was temporarily out of line.

Another thing to keep in mind is that the load server might have been overloaded before the test reached the virtual user level you noted as a possible limit. Monitoring might have detected it after the fact.

Here are the five key metrics to watch:

1. Both the default monitoring dashboard and top
If %id in top is close to 0, or the CPU metric in the Default Monitoring dashboard is basically flatlining around 75%-80% CPU, note the number of virtual users the test is at. The server is likely overloaded at this point.

2. Heap usage/garbage
This metric is just as important to watch as overall CPU usage; between the two, you can normally figure out when the server is overloaded. The primary metric to watch is the “JVM Heap Usage” widget on the Default Monitoring dashboard. Make sure it goes down (i.e., garbage collection) after periods of increase. If it keeps going up and doesn’t come down much, that means Java can’t do garbage collection appropriately. This also means the CPU is probably overloaded as it continually tries to do garbage collection. Be patient though: major garbage collections can occur after longer periods of time. CPU utilization will tell you if you have any chance of a decent garbage collection. If CPU is high and garbage collection isn’t happening, you have probably overloaded the load server.

Below is an example of healthy garbage collection. Note how the heap usage increases and garbage collection returns it to a proper level. The trend should be effectively flat.

[Figure: healthy heap usage, a sawtooth pattern with an effectively flat trend]

Below is an example of unhealthy garbage collection. Note that some minor garbage collection is occurring, but no major garbage collection is occurring. This indicates the load server is overloaded.

[Figure: unhealthy heap usage, minor collections only with a steadily rising baseline]
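One way to make this judgment programmatically is to trend the post-collection lows: in a healthy sawtooth they stay level, while on an overloaded server they keep climbing. A minimal Python sketch with illustrative numbers (in practice heap_mb would come from the JVM Heap Usage monitoring data):

    # Sketch: detect a creeping post-GC baseline. Purely illustrative.
    def gc_baseline_rising(heap_mb, tolerance_mb=200):
        """True if post-collection heap lows keep climbing (unhealthy GC)."""
        lows = [heap_mb[i] for i in range(1, len(heap_mb) - 1)
                if heap_mb[i] < heap_mb[i - 1] and heap_mb[i] <= heap_mb[i + 1]]
        if len(lows) < 2:
            return False  # not enough collections observed yet; be patient
        return lows[-1] - lows[0] > tolerance_mb

    healthy   = [500, 900, 1300, 520, 950, 1400, 510, 980, 1350, 530]
    unhealthy = [500, 900, 1300, 800, 1400, 2100, 1700, 2600, 3400, 3000]
    print(gc_baseline_rising(healthy))    # False: lows hover near 500 MB
    print(gc_baseline_rising(unhealthy))  # True: lows climb toward the heap cap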

3. Network interface
Watch the Default Monitoring dashboard. As the load increases, the amount of bandwidth used will increase. If the amount of bandwidth used approaches 950 Mbits/second, the network interface is becoming a bottleneck in the test and the load server is overloaded. In practice, this is rarely the bottleneck with the load servers.

4. Disk IO
Watch the Default Monitoring dashboard. There are two disk-related metrics that will tell you if heavy disk use is occurring. This is very unlikely to be a bottleneck with the SOASTA load servers since most processing on the servers is done in memory.

5. Test results
Watch error rates and average response time. If any of the metrics above are starting to become a bottleneck, likely the average response time will increase and/or error rates will climb rapidly, in a hockey stick-like fashion.

Important: Memory usage can be a misleading metric
Do not rely on this metric to determine if the load server is overloaded. Java will take up all the memory — but that doesn’t mean there is a memory limitation. JVM heap usage is the metric that should be used instead.

What to do when the load server becomes overloaded

When you determine the load server is overloaded, you need to stop the test composition and change the number of virtual users to the level at which you identified the bottleneck. Here’s how (sketched in code after this list):

  1. Let’s say you identified a bottleneck at 600 virtual users. Set the composition to 600 virtual users.
  2. Reset the ramp-time to 10 minutes.
  3. Go through the same process as described above.
  4. If you find that over time 600 users is still too high, then lower the number of virtual users on the load server and restart the test.
  5. If the test runs without any bottleneck, then you’ve pretty much confirmed the number of virtual users you can run on the load server.
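The step-down loop above can be sketched in a few lines of Python. The run_calibration_test callable stands in for executing the composition and evaluating the metrics discussed earlier; it is hypothetical, not a CloudTest API:

    # Sketch of the iterative step-down: start at the VU count where the
    # bottleneck appeared and keep reducing until a full run stays healthy.
    def calibrate(start_vus, run_calibration_test, step=50):
        vus = start_vus
        while vus > 0:
            if run_calibration_test(vus):   # True = no bottleneck observed
                return vus                  # confirmed per-server VU count
            vus -= step                     # still overloaded: back off, retry
        raise RuntimeError("Load server could not sustain any load; check setup")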

What to do when you think the virtual user threshold has been identified

Once you find the number of virtual users that seems to work with the test clip, you need to run that test for as long as the test session is scheduled to last. If it is two hours, then run that same test for two hours. This will help flush out any long-term problems with the test. If the JVM heap usage continues to go up (toward the 6GB limit on EC2), it is possible the test might die when it approaches 6GB heap usage.
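If you do see the heap creeping up during the long run, a rough growth rate tells you whether you will hit that limit before the session ends. A quick illustrative calculation in Python (the numbers are made up):

    # Sketch: estimate minutes of headroom before the heap cap is reached.
    def minutes_until_heap_cap(start_mb, now_mb, elapsed_min, cap_mb=6144):
        growth_per_min = (now_mb - start_mb) / elapsed_min
        if growth_per_min <= 0:
            return None  # flat or shrinking heap: no exhaustion expected
        return (cap_mb - now_mb) / growth_per_min

    # Heap grew from 1,500 MB to 2,700 MB over 60 minutes: ~172 minutes of
    # headroom, so a 2-hour session would just barely survive.
    print(minutes_until_heap_cap(1500, 2700, 60))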

It isn’t always possible to run this longer test. If you have lots of test clips or a limited amount of time before the test, you might not be able to run the test clip for hours. If that’s the case, don’t fret. Take a look at the results you’ve gathered to this point. Do you think that the number of virtual users you’ve identified as your limit is aggressive or conservative? If you think you are on the edge and being too aggressive with the number of virtual users, then cut that number back a bit.

Again, this process is a combination of art and science. Go with your gut.

Next steps and considerations
Once you’ve completed this process for a single test clip, you will need to repeat this process for each cloud vendor and instance size you plan to run this test clip on. Additionally, you will need to repeat this process for each test clip.

One important thing to note about cloud providers is that not every instance is the same. Take Amazon EC2, for example. Not every “m1.large” instance is identical: some have different processor types (Intel vs. AMD), and some have different processor speeds. You don’t control whether you get a faster or slower m1.large, so it is possible you calibrated your test on one of the faster instances. As a result, when you run the larger test, some of your load might come from slower load servers, and you won’t know in advance whether you randomly get fast or slow ones. Again, this is a time when you have to make a call about how conservative you want to be. Maybe you subtract 100 users from the total number of virtual users to account for this. Or, if you don’t think it will make a big difference, just let it go.

One way to do a simple validation of your virtual user number is to stop the grid and restart it. You will get another random server. If you repeat the calibration test 2-3 times with different load servers and get the same result, you can be confident you have the right number of virtual users.

What NOT to do: An alternative way to get more virtual users out of a given load server is by artificially extending the test clip. In essence, this means increasing the think times during the test. This is not a recommended approach as it artificially decreases the number of HTTP requests being sent to the target application and won’t properly simulate the expected number of virtual users.

Advanced calibration tips
If this is a large test that you are going to do often, and therefore you would like a more precise calibration, you can calibrate with the same variations of load, but do so by increasing the number of load generators in use at each step, rather than by changing the load on a single load generator. Keep each single load server to a low number of users (at a point where you are pretty sure that it is not at capacity). You then track the same response time, errors, and other metrics as you did using the single load generator.

By comparing your two curves, you can get an idea of where the true capacity limits are. For example, if the curves degrade but are identical, then that suggests that the target under test is degrading rapidly, and load generator capacity is hardly even a factor.

On the other hand, if you see degradation in the first step (one load generator), but the second step (multiple load generators) shows no degradation (the charts show a nice linear increase in the factors as load increases), then that suggests the target site is not degrading at all, and all of the degradation you saw in the first step was due to load generator capacity being reached.

You may see something in-between, because often both sides are degrading at different rates as the load increases. In that case, you need to make a judgment call by looking at the two curves.

Let me illustrate
Here is how that might work in this case. Assume we’re testing a very robust site and, based on the observations made so far, that we can get a fairly large number of users per load generator. (In this example we’ll start with a lower VU number and work our way up.)

Step one: Run tests on a single load generator, with test runs at 100, 200, 300 … 2000 users, or until you see CPU and memory limitations come into play, or until significant degradation occurs. Plot a graph of the factors like response time, error rate, and throughput. If you see no degradation in the graph (everything is linear and the error rate is constant) until a CPU or memory limit is reached, then you are done and can stop — you have found the capacity point and need go no further. Otherwise, go to step two.

Step two: Run the same tests with a constant 100 users per load generator, with 1, 2, 3, 4… 20 load generators, or until you reach the stopping point of step one. Plot a graph of the factors like response time, error rate, and throughput.

Step three: Compare the two graphs and reach a conclusion as to the capacity of load generator vs. the capacity of the site under test. If you can, make a judgment of where the capacity of the load generator is reached. (Note that if the site under test is itself degrading rapidly, you might not be able to reach a conclusion about the load generator capacity.)
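A minimal Python sketch of that comparison, with illustrative response times keyed by total virtual users:

    # Sketch: attribute degradation by comparing the single-generator ramp
    # (step one) against the constant-per-generator scale-out (step two).
    def attribute_degradation(single_gen_ms, scaled_out_ms, slack=1.10):
        """Return the VU level where the lone generator became the bottleneck."""
        for vus in sorted(set(single_gen_ms) & set(scaled_out_ms)):
            if single_gen_ms[vus] / scaled_out_ms[vus] > slack:
                # Scale-out run is healthy here, so the single generator
                # itself was the limit at roughly this VU level.
                return vus
        return None  # curves track each other: degradation is the site's

    single = {100: 210, 400: 220, 800: 260, 1200: 540}
    scaled = {100: 205, 400: 215, 800: 225, 1200: 240}
    print(attribute_degradation(single, scaled))  # -> 800 (approximate capacity)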

Final takeaways
Whether you are driving load from the cloud or from your own internal load generators, calibration of your load generators is considered a SOASTA best practice. Certainly cost savings is one main reason — either for cloud infrastructure costs (pennies) or internal hardware costs (potentially hundreds of thousands of dollars).

(SOASTA CloudTest can drive load from either source, as well as drive load from internal and cloud-based load generators at the same time — in the same test.  But I’ll explain that in an upcoming post.)

Calibration is also key to ensuring that you have the most accurate performance test baseline point for each test run. After all, our goal is to test the website or application, not the load server! Still have questions? Leave a comment below or ping me on Twitter.
