Some thoughts on stress testing web applications with JMeter (part 2)
In this second part on testing web applications with JMeter, I will mainly write about running the test plans, recording the results and interpreting them.
When do I stop?
One of the main questions you have to ask yourself when you start stress testing a web application is: when do I stop? The question is not as easy as it seems: the answer depends on your initial objectives and on “scientific” criteria that let you decide when those objectives have been met. In the end, it comes down to measuring and interpreting the “results” of your stress tests.
Before going any further, we should spend some time on the measurable outcomes of a stress test. There are mainly 2 interesting measures that you can record when you run a stress test on a web application:
- The throughput: the number of requests per unit of time (seconds, minutes, hours) that are sent to your server during the test.
- The response time: the elapsed time from the moment a given request is sent to the server until the moment the last bit of information has returned to the client.
The throughput is the real load processed by your server during a run but it does not tell you anything about the performance of your server during this same run. This is the reason why you need both measures in order to get a real idea about your server’s performance during a run. The response time tells you how fast your server is handling a given load.
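To make the relation between the two measures concrete, here is a minimal sketch. It assumes each sample is a hypothetical pair (start time in seconds, elapsed time in milliseconds); the function name and the sample format are mine, not JMeter's. The duration of the run is taken from the first request sent to the last response received.

```python
from statistics import mean

def throughput_and_response_time(samples):
    """samples: list of (start_time_s, elapsed_ms) pairs (hypothetical format).

    Returns (requests per second, mean response time in ms).
    """
    if not samples:
        raise ValueError("no samples")
    first_start = min(start for start, _ in samples)
    # The run ends when the last response has been fully received.
    last_end = max(start + elapsed / 1000.0 for start, elapsed in samples)
    duration = last_end - first_start
    rate = len(samples) / duration if duration > 0 else float("inf")
    return rate, mean(elapsed for _, elapsed in samples)

if __name__ == "__main__":
    demo = [(0.0, 200), (0.5, 300), (1.0, 250), (1.5, 500)]  # made-up data
    rate, avg = throughput_and_response_time(demo)
    print(f"{rate:.2f} req/s, mean response time {avg:.1f} ms")
```

The same throughput can hide very different response times, which is why the function returns both values together.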
We are now much closer to finding an answer to our initial question: you can stop stress testing your application when, for a measured throughput, the measured response time is “too high”. This is the right answer in an ideal world where information systems behave in a deterministic manner … another way to answer our question could also be: you can stop stress testing your application when your system crashes / collapses / starts to behave unexpectedly … 🙂
However, I will stick to our first answer for a while as it contains another interesting question: what is a “high” response time for a web application (or any application or information system used by real people)? A very interesting answer is given in the article already mentioned in my previous post and in this one as well. To make it short, usability studies make it possible to define response time limits at which the user’s interaction with an information system changes radically. These limits are tightly related to human nature: psychology as well as brain performance 🙂
- 0.1 second is about the limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result.
- 1.0 second is about the limit for the user’s flow of thought to stay uninterrupted, even though the user will notice the delay. Normally, no special feedback is necessary during delays of more than 0.1 but less than 1.0 second, but the user does lose the feeling of operating directly on the data.
- 10 seconds is about the limit for keeping the user’s attention focused on the dialogue. For longer delays, users will want to perform other tasks while waiting for the computer to finish, so they should be given feedback indicating when the computer expects to be done. Feedback during the delay is especially important if the response time is likely to be highly variable, since users will then not know what to expect.
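As a small illustration, the three limits above can be turned into a helper that tells you which band a measured response time falls into. This is only a sketch: the function name and the band labels are my own wording, only the thresholds come from the list above.

```python
def perceived_delay(seconds):
    """Map a response time (in seconds) to the usability band it falls into."""
    if seconds <= 0.1:
        return "instantaneous"
    if seconds <= 1.0:
        return "noticeable, flow of thought uninterrupted"
    if seconds <= 10.0:
        return "attention still on the dialogue"
    return "attention lost, progress feedback needed"

if __name__ == "__main__":
    for value in (0.05, 0.7, 7.0, 25.0):
        print(f"{value:>5} s -> {perceived_delay(value)}")
```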
Using these limits allows us to give a precise end point to the stress tests of a system; it helps us define, in collaboration with our client (or users), what an acceptable response time is. For example, the last time I ran stress tests for a client, we agreed that the acceptable upper limit of the response times for his system was 7 seconds: he wanted to know how many concurrent users his system would handle.
The remaining problem now is how to measure / estimate the throughput and response times of our system using JMeter: some simple statistics and mathematics are needed here.
Run your test plan and record the meaningful measures …
First of all, JMeter provides us with several different “listeners” that record these 2 variables in various ways (graphs, tables, trees, files). I would say that most of these “listeners” are useless or, to put it differently, that one of them is a must-have because it puts all the necessary information in your hands: the Summary Report.
In order to understand this report and to implement scenarios efficiently we must keep the following things in mind:
- JMeter records response times and throughput for each “sampler” of each “thread group” defined in your test plan.
- In the Summary Report, one line is displayed for each different “sampler” based on the sampler’s names: you can group or differentiate samplers in the report just by playing with their names.
- Each “sampler” is executed many times: the Summary Report provides us with mean values (and standard deviations) for the throughput and response times of each named “sampler”.
- Global values (mean and standard deviation) for throughput and response times are also calculated in the Summary Report.
- The Summary Report allows you to store the measures of each run in a “csv” file: you can thus analyse and interpret the results in a spreadsheet program.
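If you prefer a script to a spreadsheet, the per-sampler figures can be recomputed from the raw results file. The sketch below assumes the CSV file has a header row containing at least the columns `label` and `elapsed` (elapsed time in milliseconds); whether your file looks like this depends on how JMeter is configured to save results, so treat the column names as an assumption. The data in the example is made up.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, pstdev

def summarize(csv_file):
    """Group elapsed times (ms) by sampler label.

    Returns {label: (number of samples, mean, standard deviation)}.
    """
    by_label = defaultdict(list)
    for row in csv.DictReader(csv_file):
        by_label[row["label"]].append(int(row["elapsed"]))
    return {label: (len(v), mean(v), pstdev(v)) for label, v in by_label.items()}

if __name__ == "__main__":
    # Tiny in-memory stand-in for a results file (made-up data).
    demo = io.StringIO(
        "timeStamp,elapsed,label,success\n"
        "1000,120,Login,true\n"
        "1100,180,Login,true\n"
        "1200,900,Search,true\n"
    )
    for label, (n, avg, std) in summarize(demo).items():
        print(f"{label}: n={n} mean={avg:.0f} ms std={std:.0f} ms")
```

Samplers sharing the same name end up in the same group, which mirrors the way the Summary Report groups lines by sampler name.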
Other reports are also useful particularly at the beginning when building and testing your scenarios:
- The View Results Tree is very handy when “debugging” a scenario as it lets you inspect all the HTTP Requests and Responses exchanged with the server. The drawback is that it consumes too much memory to be used in a large stress test.
- The View Results in Table listener is also useful in the early stages of the stress test implementation as it gives a good and fast overview of the execution of a test plan. However, this listener also consumes too much memory to be used in a large stress test.
- I have also found some very interesting JMeter plugins on a Google Code project. One of them, the “Active Threads Over Time” helped me a lot when trying to set the ramp up throughput by playing with the “ramp up” and “number of threads” parameters of the thread group.
One more element that you should have in mind when performing stress tests is the performance bottleneck of the computer running the tests themselves:
- It is very common when running stress tests on large production systems to reach the limits of the computer running the tests before reaching the limits of the tested server.
- When the computer running the tests is reaching its limits (memory, number of threads, cpu …), all the measures recorded by the stress test tool are wrong or at least biased.
- There are two ways to face this problem: (1) optimize your scenarios and the way you run them, or (2) set up a distributed infrastructure.
(1) In the JMeter manual, you will find the following advice in section 16.6 of the Best Practices page:
Some suggestions on reducing resource usage.
- Use non-GUI mode: jmeter -n -t test.jmx -l test.jtl
- Use as few Listeners as possible; if using the -l flag as above they can all be deleted or disabled.
- Rather than using lots of similar samplers, use the same sampler in a loop, and use variables (CSV Data Set) to vary the sample.
- Don’t use functional mode
- Use CSV output rather than XML
- Only save the data that you need
- Use as few Assertions as possible
If your test needs large amounts of data – particularly if it needs to be randomised – create the test data in a file that can be read with CSV Dataset. This avoids wasting resources at run-time.
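As an illustration of that last piece of advice, here is a minimal sketch that generates a file of randomised test data before the run, so that nothing has to be computed while the test is running. The file name, the user name pattern and the password length are hypothetical choices of mine.

```python
import csv
import random
import string

def write_credentials(path, count, seed=None):
    """Write `count` made-up (username, password) rows to a CSV file."""
    rng = random.Random(seed)  # a seed makes the file reproducible
    alphabet = string.ascii_letters + string.digits
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for i in range(count):
            password = "".join(rng.choice(alphabet) for _ in range(12))
            writer.writerow([f"user{i:05d}", password])

if __name__ == "__main__":
    write_credentials("users.csv", 1000, seed=42)
```

The resulting file can then be referenced from a CSV Data Set in the test plan.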
(2) In the JMeter manual, you will find the Remote Testing page, which gives precise instructions for setting up a distributed testing environment, and a PDF describing how it all works architecture-wise. My experience is that it is all very easy to set up and that it gives excellent results: in the end, it comes down to running the “jmeter-server” scripts on the slaves and declaring the slave hosts in the master’s configuration file (the remote_hosts property in jmeter.properties). The only 2 or 3 little problems I came across with distributed testing are:
- Do not forget to give memory to your JMeter slaves and master: the default values are very low. The heap size (-Xms and -Xmx) is a JVM option, so it is set in the JMeter startup script (or through the JVM_ARGS environment variable) rather than in jmeter.properties.
- If you use external resources such as a CSV Data Set, you must have them on all your slave installations at the same location (a full path is needed in your scenario).
- Beware of multiple thread groups and schedulers: they leak huge amounts of memory on the slaves.
Last but not least, you should never perform your stress tests against a server or infrastructure that was just started. Servers usually need a warm-up before they reach their full speed: this is particularly true for the Java platform where you surely don’t want to measure class loading time, JSP compilation time or native compilation time.
Interpret the results …
In order to interpret the results of a stress test, it is important to understand some basic elements of statistics:
(1) The mean value (μ)
The following equation shows how the mean value (μ) is calculated:
μ = 1/n * Σ_{i=1…n} x_{i}
The mean value of a given measure is what is commonly referred to as the average value of this measure. An important thing to understand is that the mean value can be very misleading as it does not show you how close (or far) your values are from the average. An example is always better than a long explanation.
Let’s assume that we are measuring response times in milliseconds in 2 different stress tests:
Stress Test 1:
- x_{1}=100
- x_{2}=110
- x_{3}=90
- x_{4}=900
- x_{5}=890
- x_{6}=910
gives you μ = 1/6 * (100 + 110 + 90 + 900 + 890 + 910) = 500 ms
Stress Test 2:
- x_{1}=490
- x_{2}=510
- x_{3}=535
- x_{4}=465
- x_{5}=590
- x_{6}=410
gives you μ = 1/6 * (490 + 510 + 535 + 465 + 590 + 410) = 500 ms
In both cases the mean value (μ) is the same. However, if you look closely at the response times you will see that in the first case the values are “far” from the mean value, whereas in the second case the values are “close” to the mean value. It is quite obvious with this example that a measure of this distance to the mean value is needed in order to draw any kind of conclusion based on the mean value.
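The two examples are easy to check with a few lines of Python. This is only a sketch using the standard `statistics` module; the variable names are mine.

```python
from statistics import mean

# Response times in milliseconds, taken from the two examples above.
test_1 = [100, 110, 90, 900, 890, 910]
test_2 = [490, 510, 535, 465, 590, 410]

for name, values in (("Stress Test 1", test_1), ("Stress Test 2", test_2)):
    spread = max(values) - min(values)
    print(f"{name}: mean = {mean(values)} ms, max - min = {spread} ms")
```

Both means come out at 500 ms, while the gap between the smallest and the largest value already hints at how differently the two runs behave.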
(2) The standard deviation (σ)
The following equation shows how the standard deviation (σ) is calculated:
σ = √( 1/n * Σ_{i=1…n} (x_{i}-μ)^{2} )
The standard deviation (σ) measures how far, on average, the values lie from their mean (μ): it is the square root of the mean squared distance to the mean. In other words, it gives us a good idea of the dispersion or variability of the measures around their mean value. Let’s go back to our example and calculate the standard deviation of each of our theoretical stress tests:
Stress Test 1:
σ = sqrt( 1/6 * ( (100-500)^2 + (110-500)^2 + (90-500)^2 + (900-500)^2 + (890-500)^2 + (910-500)^2 ) ) = sqrt(960400/6) ≈ 400 ms
Stress Test 2:
σ = sqrt( 1/6 * ( (490-500)^2 + (510-500)^2 + (535-500)^2 + (465-500)^2 + (590-500)^2 + (410-500)^2 ) ) = sqrt(18850/6) ≈ 56 ms
The 2 values of the standard deviation calculated above are very different:
- in the first case, the standard deviation is high compared to the mean value, which shows us that our measures are very variable (or mostly far from the mean value) and that the mean value is not very significant.
- in the second case, the standard deviation is low compared to the mean value, which shows us that our measures are not dispersed (or mostly close to the mean value) and that the mean value is significant.
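Here is a minimal sketch of the calculation, written out by hand so that each step is visible. It computes the population standard deviation, that is the square root of the mean squared deviation (dividing by n, not by n-1); the function name is mine.

```python
from math import sqrt

def std_dev(values):
    """Population standard deviation: sqrt of the mean squared deviation."""
    mu = sum(values) / len(values)
    return sqrt(sum((x - mu) ** 2 for x in values) / len(values))

# Response times in milliseconds, taken from the two examples above.
test_1 = [100, 110, 90, 900, 890, 910]
test_2 = [490, 510, 535, 465, 590, 410]

for name, values in (("Stress Test 1", test_1), ("Stress Test 2", test_2)):
    sigma = std_dev(values)
    mu = sum(values) / len(values)
    # The ratio sigma / mu is a quick way to judge "high" versus "low".
    print(f"{name}: sigma = {sigma:.1f} ms, sigma / mean = {sigma / mu:.2f}")
```

The standard library function `statistics.pstdev` gives the same result.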
(3) The sampling size and the quality of the measure
Another interesting question is whether our calculated mean value is a good estimation of the “real” mean value. In other words, when calculating the mean value of the response time during a test case, do we have a good estimation of the “real” mean response time of the same scenario repeated indefinitely? In probability theory, the Central Limit Theorem states conditions under which the mean of a sufficiently large number of independent random variables, each with finite mean and variance, will be approximately normally distributed.
The measures of response times and throughput obtained during stress tests can reasonably be treated as meeting these conditions, as we usually have a large number of independent and random measures with a finite mean value and standard deviation (both calculated by JMeter). We can thus assume that the mean values of the response time and the throughput are approximately normally distributed.
This allows us to calculate a Confidence Interval for these mean values. The Confidence Interval gives us a measure of the quality of our mean values, as it tells us how much the calculated mean may vary (the interval) for a predefined level of confidence. You can for example decide to calculate your Confidence Interval at 95%: if the same test were repeated many times, about 95% of the intervals calculated this way would contain the “real” mean value. Conversely, you can decide to calculate the confidence associated with a given interval around your mean value (see the examples below).
The following equation shows how the Confidence Interval (CI) is calculated:
CI = [μ - Z*σ/√n, μ + Z*σ/√n]
where:
- μ is the calculated mean value of our sample,
- σ is the calculated standard deviation of our sample,
- n is the number of measures in our sample,
- and Z is the value for which the area under the “bell shaped curve” of the standard normal distribution between -Z and +Z equals the chosen Confidence C (in other words, the area between 0 and Z is half of C).
The following table gives values of Z for various given values of Confidence C:
C | Z |
---|---|
0.80 | 1.281551565545 |
0.90 | 1.644853626951 |
0.95 | 1.959963984540 |
0.98 | 2.326347874041 |
0.99 | 2.575829303549 |
0.995 | 2.807033768344 |
0.998 | 3.090232306168 |
0.999 | 3.290526731492 |
0.9999 | 3.890591886413 |
0.99999 | 4.417173413469 |
Source: http://en.wikipedia.org/wiki/Normal_distribution
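The values in this table do not have to be looked up: they can be recomputed. The sketch below uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later); the function name `z_value` is mine. Since the confidence C is the area between -Z and +Z, Z is the point below which a fraction (1 + C) / 2 of the standard normal distribution lies.

```python
from statistics import NormalDist

def z_value(confidence):
    """Z such that the area between -Z and +Z under the standard
    normal curve equals `confidence` (0 < confidence < 1)."""
    return NormalDist().inv_cdf((1 + confidence) / 2)

if __name__ == "__main__":
    for c in (0.80, 0.90, 0.95, 0.99):
        print(f"C = {c:.2f} -> Z = {z_value(c):.6f}")
```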
If we go back to our previous examples, with standard deviations of sqrt(960400/6) ≈ 400 ms for Stress Test 1 and sqrt(18850/6) ≈ 56 ms for Stress Test 2, we can calculate the confidence intervals of our mean values at 95%:
CI_{1} = [500 - 1.96*400/sqrt(6); 500 + 1.96*400/sqrt(6)] ≈ [180; 820]
CI_{2} = [500 - 1.96*56/sqrt(6); 500 + 1.96*56/sqrt(6)] ≈ [455; 545]
This means that we can be 95% confident that the “real” mean response time lies in the calculated interval.
We can also calculate the confidence associated with the interval [490, 510]:
10 = Z1 * 400 / sqrt(6) => Z1 = 10 * sqrt(6) / 400 => Z1 ≈ 0.06 => C1 ≈ 5%
10 = Z2 * 56 / sqrt(6) => Z2 = 10 * sqrt(6) / 56 => Z2 ≈ 0.44 => C2 ≈ 34%
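Both calculations can be scripted directly from the raw measures. This is a minimal sketch under the same normal approximation, using the population standard deviation (`statistics.pstdev`, the square root of the mean squared deviation); the function names are mine.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev

def confidence_interval(values, confidence=0.95):
    """Interval [mu - Z*sigma/sqrt(n), mu + Z*sigma/sqrt(n)]."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    mu = mean(values)
    half_width = z * pstdev(values) / sqrt(len(values))
    return mu - half_width, mu + half_width

def confidence_for_half_width(values, half_width):
    """Confidence C associated with the interval mu +/- half_width."""
    z = half_width * sqrt(len(values)) / pstdev(values)
    return 2 * NormalDist().cdf(z) - 1

# Response times in milliseconds, taken from the two examples above.
test_1 = [100, 110, 90, 900, 890, 910]
test_2 = [490, 510, 535, 465, 590, 410]

for name, values in (("Stress Test 1", test_1), ("Stress Test 2", test_2)):
    low, high = confidence_interval(values)
    c = confidence_for_half_width(values, 10)
    print(f"{name}: 95% CI = [{low:.0f}; {high:.0f}], C for +/-10 ms = {c:.0%}")
```

With only 6 measures per test this remains an illustration of the mechanics, not a result to rely on.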
Notes:
These are just given as examples of how to calculate the confidence interval … the conditions are not met for the Central Limit Theorem with such a small sample.
The last 2 examples were made using the following Standard Normal Distribution Tables.
Conclusion
As a conclusion, we can say that the best way to interpret our stress test results is to use the Summary Report provided by JMeter and to store it in a “csv” file for every run. In this report we can find the mean response time, the mean throughput, the standard deviation of the response time and the standard deviation of the throughput, for every named sampler and globally for the run.
Based on the explanations above, I recommend the following methodology:
- If we have a high number of samples (which is usually the case in stress tests) and a low standard deviation, then we can conclude with little risk that we have a good estimation of the mean value of both the response time and the throughput of our system, and that the “real” values will be close to the calculated mean values.
- If we have a high number of samples (which is usually the case in stress tests) and a high standard deviation, we probably have a good estimation of the mean value but should nevertheless consider estimating a confidence interval. In any case, if the variability of the measure is high, a technical investigation is needed, as variability of response times and throughput is usually a sign of instability in the system tested.
- If we have a low number of samples and a high standard deviation, then we almost certainly have a very bad estimation of the mean value, which means that we are measuring the wrong thing, the wrong way.
Monitor your systems while you run the tests …
It is often useful to monitor the system (and its various components) while you are stressing it. Various tools may be used that vary from one platform to another. On the Java platform you may use the excellent “jvisualvm” provided with the latest versions of the JDK and interacting with the various monitoring hooks integrated in the JVM.
Monitoring Java Web Applications is a subject in itself … I can try to share my thoughts on it some time … in another post 😉
Some thoughts on stress testing web applications with JMeter (Part 1)
A small intro …
Now that I am almost finished with the “stress test” task I was talking about in my previous post, I have several thoughts and some experience to share on the subject. I am also planning to write about Java web application profiling in a following post, as it relates to the results of a “stress test” task.
The tool I have used to carry out the stress tests is JMeter (the latest version available at the time of this writing), so I will write about JMeter. However, I am interested in any feedback (experience) concerning other tools (or JMeter).
State clearly your objectives …
It is important that you state your objectives clearly as the overall methodology of the stress tests will greatly depend on these objectives.
Some classical examples follow:
- Give a precise estimate of the maximum load that a given system may serve (peak): this is usually done in order to help plan the future infrastructure of a live system.
- Find precisely the bottlenecks of a live system during a peak: this is usually done as a preliminary task to profiling and performance tuning tasks.
- Find precisely the origin of possible leaks (memory, connections to resources, various resources) during a long run: this is also usually done as a preliminary task to profiling and tuning tasks.
- Prove that the system you have implemented can hold a theoretical load: usually this was a client’s requirement expressed during the very early stages of a project (for example in the call for tender)
- Any combination of the aforementioned objectives …
These different objectives lead to different types of scenarios. In my opinion, a good methodology is always to try to implement scenarios that are as close as possible to real and typical use cases of the system you want to test. However, in some cases (bullets 2 and 3 above) you may need to write artificial scenarios that will help you pinpoint a functionality of your system that has performance problems.
The following paragraph is about writing “real case” scenarios and test plans covering the aforementioned objectives.
Write good quality scenarios and test plans …
First, a distinction must be made between “scenarios” on one hand and “test plans” on the other:
A scenario is (or at least should be) an actual use case of your application carried out by a single user. In JMeter terms, a scenario is a combination of “samplers” and “controllers” that will be executed by a single “thread” of a “thread group”.
A test plan is the “way” a given scenario will be executed in order to achieve a given objective (as the ones described in the previous paragraph). In JMeter terms, the “way” the scenario will be executed mainly means playing with the following variables on the thread group: the number of threads, the ramp up time and the number of loops executed by a thread.
It is very important to understand the exact meaning of these 3 parameters:
- The “number of threads” in a thread group is the actual number of threads spawned by JMeter, each one of them used to execute the scenario. In other words, this variable is the number of users executing a “real life” use case on your system. This number is not the number of concurrent / parallel users executing a “real life” use case on your system: the concurrency of the users depends on both the duration of your scenario and the ramp up time configured on the thread group.
- The “ramp up time” in a thread group is the actual time taken by JMeter to spawn all the threads. If the ramp up time is small compared to the number of threads and the mean duration of a scenario then the number of concurrent threads accessing your system will be high and vice versa. A rough estimation of the throughput (number of requests per second) during the ramp up period of your test plan is: number of threads / ramp up time (in seconds).
- The “number of loops” in a thread group is the actual number of times that the scenario will be executed by each thread.
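To get a feeling for how these 3 parameters interact, here is a rough sketch. It assumes that the threads are started at regular intervals over the ramp up time and that every execution of the scenario takes the same, known duration; both are simplifications, and all names are mine.

```python
def start_rate(num_threads, ramp_up_s):
    """Threads started per second during the ramp up period."""
    return num_threads / ramp_up_s

def thread_start_times(num_threads, ramp_up_s):
    """Start offsets in seconds, assuming an even spread over the ramp up."""
    step = ramp_up_s / num_threads
    return [i * step for i in range(num_threads)]

def peak_concurrency(num_threads, ramp_up_s, scenario_s, loops=1):
    """Rough estimate of the maximum number of threads active at once."""
    starts = thread_start_times(num_threads, ramp_up_s)
    busy = scenario_s * loops  # time each thread stays active
    # The peak can only be reached at a moment when some thread starts.
    return max(sum(1 for s in starts if s <= t < s + busy) for t in starts)

if __name__ == "__main__":
    # 100 threads, scenario lasting 5 seconds, one loop (made-up numbers).
    print(peak_concurrency(100, 50, 5))  # long ramp up: few concurrent users
    print(peak_concurrency(100, 1, 5))   # short ramp up: all users concurrent
```

With a ramp up time of 50 seconds the 100 threads never overlap much, while with a ramp up time of 1 second they all run at the same time: the “number of threads” alone says little about concurrency.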
Now let’s go back to the implementation of “real case” scenarios using JMeter. I recommend this interesting article on the subject sent to me by a colleague (thanks Petros 😉). Some very good methodological hints are given concerning the writing of scenarios in the first paragraphs. Basically, I can give 3 main hints on the subject that are easy to follow and implement with JMeter:
- Keep scenarios simple:
Each scenario should correspond to one use case. This makes things much simpler and more logical, particularly when it comes to interpreting the results of the stress tests.
- Use “recording” techniques to generate your scenario from a “real” usage of the application:
JMeter comes with a proxy component which, when started, will record all the HTTP Request and Response cycles originating from a web browser configured to access your system through this proxy. There are well-known problems with the usage of this proxy when dealing with HTTPS: often, a simple solution is to do all the recording in HTTP and switch the protocol to HTTPS in your scenario afterwards (this supposes that you can make your system run under HTTP for the time of the recording).
- Don’t forget to record the “think time” of the users:
The “think time” of a user is the elapsed time between 2 user actions. During this time, the user may be thinking about what to do next, answering an urgent call on the phone, talking with a friend … this must be part of the scenario. Fortunately, JMeter lets you record these “think times” and translate them into “Gaussian Waits” inside your scenario (see the article mentioned above for hints on how to do it). In any case, you should always have “waits” in your scenarios that simulate the “think times” of real users as realistically as possible.
- Read the JMeter User’s Manual, particularly the “Component Reference”, in order to find all possibilities provided by the tool. For example:
You can use an external csv file containing (username, password) pairs in order to have each thread log in to your system with different credentials.
You can use regular expressions to parse HTTP Responses and extract the data necessary to chain your samplers.
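To illustrate what a “Gaussian Wait” means, here is a small sketch that draws think times around a constant value with a normally distributed variation. It is my own approximation of the idea, not JMeter's implementation; the parameter names and the numbers are hypothetical.

```python
import random

def think_time_ms(offset_ms, deviation_ms, rng=random):
    """Draw one pause: a constant offset plus Gaussian noise, never negative."""
    return max(0.0, rng.gauss(offset_ms, deviation_ms))

if __name__ == "__main__":
    rng = random.Random(1)  # seeded so that the run is reproducible
    pauses = [think_time_ms(3000, 500, rng) for _ in range(5)]
    print([round(p) for p in pauses])
```

Most pauses fall close to the offset, a few are much shorter or longer, which is closer to the behaviour of real users than a fixed wait.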
Once you have your scenario ready, you must configure your test plan in order to meet your objectives. The tuning of the main parameters of your test plan (number of threads, ramp up and number of loops) is often a “trial and error” procedure. However, we can give the 3 following hints:
- You should try to have a constant throughput during a run:
It is often very difficult to “control” the throughput, particularly during the ramp up period.
- If your objective is to simulate a “peak”:
You should have a “high” number of threads and a “low” ramp up time and number of loops.
- If your objective is to simulate a “long run”:
You should have a “medium” number of threads, a “higher” ramp up time and a “high” number of loops.
Note: The terms “high”, “higher”, “medium” and “low” are deliberately qualitative in the 3 bullets above as they depend on the system you are testing.
To be continued …
This post is already too long: it seems I have too much to say on the subject 😉 Never mind, I will carry on in a following post tomorrow covering the remaining subjects: running the test plans, recording the meaningful measures, interpreting the results, monitoring the systems …