Sunday, April 15, 2012

More Lessons Learned from Performance Testing SharePoint

Performance testing SharePoint, or any web-based application, can be quite tricky.  Recently my team launched an upgraded Corporate Web Site based on SharePoint 2007.  The launch was quite challenging, mainly due to mistakes made during performance testing (see Lessons Learned from Intranet Launch).

This post is dedicated to the lessons learned from the performance testing of Corporate Web Site. 

Prior to launch we ran through our performance test scenarios 3 times.  Each time the output showed that we could scale way beyond the existing implementation of our Corporate Web Site (referred to as Violin from here on).

The performance test scenarios had been chosen based on traffic patterns and pages determined to be high risk for performance (This was good). 

Our key performance requirement stated that the web servers must support 38 page views / sec with response time < 5 sec (This was good).  This is a nice, well-defined requirement, although some could argue that the 38 page views should be broken down into specific types of pages (ex. 10 home page views, 7 chapter page views, …).
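A quick back-of-the-envelope check for a target like this is Little's Law: concurrent users ≈ throughput × (response time + think time).  A minimal sketch in Python; the response and think times below are made-up numbers for illustration, not values from our test plan:

```python
def required_users(target_pages_per_sec, avg_response_s, avg_think_s):
    """Little's Law: concurrency = throughput * (response time + think time)."""
    return target_pages_per_sec * (avg_response_s + avg_think_s)

# Hypothetical: 38 page views/sec, 2 s average response time, 30 s average think time
print(required_users(38, 2, 30))   # -> 1216 concurrent virtual users
```

Notice that the think time term dominates the estimate, which foreshadows the lesson further down.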

We also had a performance goal stating that processor utilization should not go above 80% on web servers for more than 5 seconds (This was good).
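If you want something watching that goal for you instead of eyeballing perfmon during the run, a small watcher along these lines works.  This is a rough sketch using the third-party psutil package; the 80% / 5 second numbers match the goal above, everything else is illustrative:

```python
import time
import psutil  # third-party: pip install psutil

THRESHOLD = 80.0   # percent processor utilization
WINDOW = 5.0       # seconds the threshold may be exceeded before we complain

over_since = None
while True:
    cpu = psutil.cpu_percent(interval=1)   # averaged over the last second
    if cpu > THRESHOLD:
        over_since = over_since or time.time()
        if time.time() - over_since > WINDOW:
            print(f"CPU above {THRESHOLD}% for more than {WINDOW}s (currently {cpu}%)")
    else:
        over_since = None
```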

For the final test we replayed traffic from IIS logs taken during the peak traffic window (when we received the most requests / sec).  This was a bit tricky because my Load Runner resource told me that replaying IIS logs was not supported by Load Runner.  So he and I had to massage the data inside the IIS logs into a form Load Runner could run the tests from (this felt wrong at the time, but I cannot say if it was a mistake).
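For anyone attempting the same thing, the massaging mostly came down to pulling the method and URL (plus query string) out of the W3C extended log entries and writing them into a flat list the load tool could iterate over.  A minimal sketch; the field names depend on your #Fields: directive, and the file names are hypothetical:

```python
import csv

def extract_urls(log_path, out_path):
    """Pull GET page requests out of an IIS W3C log so they can be replayed as a URL list."""
    fields = []
    with open(log_path) as log, open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for line in log:
            if line.startswith("#Fields:"):
                fields = line.split()[1:]      # e.g. date time cs-method cs-uri-stem cs-uri-query ...
                continue
            if line.startswith("#") or not line.strip():
                continue                       # skip other comment lines and blanks
            row = dict(zip(fields, line.split()))
            if row.get("cs-method") != "GET":
                continue
            url = row.get("cs-uri-stem", "")
            if row.get("cs-uri-query", "-") != "-":
                url += "?" + row["cs-uri-query"]
            writer.writerow([url])

extract_urls("u_ex_peak_hour.log", "replay_urls.csv")
```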

We used Load Runner (sorry, I do not know the version) for all of the performance tests.  The Load Runner clients were located within the same data center as our web servers, but they were on different network segments.

When we ran the tests we engaged several people from the operations team (Network, Windows Server, SQL Server DBA and SharePoint Admin).  These people were tasked with monitoring the components related to their area of expertise.  They were also required to collect performance statistics and report them back so they could be included in the overall performance test report (This was good).

Each time we ran the tests we were able to reach levels of about 90 page views / sec on one server with avg. response time < 5 seconds (we have 4 load-balanced WFEs in our farm).  So we were high-fiving and slapping each other on the back.  As far as we were concerned the performance requirements were met; check them off, we are done.

We did notice an occasional CPU spike, but we were able to correlate this back to pages expiring in the output cache.  So this was not a concern.

Well once we went live we discovered that something was gravely wrong.

After going live we discovered that the output cache hit ratio was not aligned with the numbers we were seeing during performance testing.  We were getting a LOT fewer output cache hits.  This resulted in the servers having to do a lot more work than originally anticipated.
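One way to keep an eye on this, before and after go-live, is to compare cache hit/miss counters, or failing that to infer a rough hit ratio from the time-taken column of the IIS logs, since output-cached pages come back in a few milliseconds while uncached renders take far longer.  A sketch of the log-based estimate; the 100 ms cut-off is an assumption you would have to tune for your own farm, and it only works if time-taken logging is enabled:

```python
def estimate_cache_hit_ratio(log_path, fast_ms=100):
    """Rough proxy for the output cache hit ratio: the share of .aspx requests
    served faster than fast_ms (requires the time-taken field in the IIS log)."""
    fields, fast, total = [], 0, 0
    with open(log_path) as log:
        for line in log:
            if line.startswith("#Fields:"):
                fields = line.split()[1:]
                continue
            if line.startswith("#") or not line.strip():
                continue
            row = dict(zip(fields, line.split()))
            if not row.get("cs-uri-stem", "").endswith(".aspx"):
                continue                        # only count page requests
            total += 1
            if int(row.get("time-taken", "0")) < fast_ms:
                fast += 1
    return fast / total if total else 0.0

print(estimate_cache_hit_ratio("u_ex_post_launch.log"))   # hypothetical file name
```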

What could have happened? We thought we did everything right with the performance tests.  What went wrong?

Well, after much soul searching (and re-reading the basics of performance testing) it hit me:

Oh $hit we didn’t model user variations and think times. 

Does that really make that big of a difference?  Yeah, it does.  We ran a high number of requests, but the proportion of cached vs. un-cached requests was out of balance.  Had we taken into consideration user think times and other variations (browser type, user location), we would have had far fewer hits against the output cache, and far more requests the servers actually had to render.
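To see why, here is a toy simulation (not our actual test): virtual users request pages whose output-cached copies expire every 60 seconds, and the cache is keyed per browser type, the way a vary-by-browser cache profile behaves.  With no think time and a single browser the cached copy only has to be rebuilt once per expiry window, so nearly every request is a hit; add realistic think times and a browser mix and the picture changes.  All of the numbers are made up:

```python
import random

def simulate(users, think_time_s, browsers, duration_s=600, expiry_s=60, pages=10):
    """Toy model: output cache keyed by (page, browser), entries expire after
    expiry_s seconds. Returns the cache hit ratio."""
    random.seed(42)
    cache = {}                   # key -> time the cached copy was last built
    hits = misses = 0
    events = []                  # (timestamp, page, browser) for every request
    for _ in range(users):
        t, browser = random.uniform(0, expiry_s), random.choice(browsers)
        while t < duration_s:
            events.append((t, random.randrange(pages), browser))
            # scripts with no think time still pace at roughly 0.5 s per request
            t += random.expovariate(1.0 / think_time_s) if think_time_s else 0.5
    for t, page, browser in sorted(events):
        key = (page, browser)
        if key in cache and t - cache[key] < expiry_s:
            hits += 1
        else:
            cache[key] = t       # miss: page is rendered and re-cached
            misses += 1
    return hits / (hits + misses)

print(simulate(200, 0, ["IE"]))                                          # no think time, one browser
print(simulate(200, 60, ["IE7", "IE8", "Firefox", "Safari", "Mobile"]))  # think times + browser mix
```

The first run's hit ratio comes out close to 100%, while the second misses a far larger share of requests, and those misses are exactly the rendering work the servers end up doing.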

Classic 101 Performance Testing Mistake.  Oh well, you pick yourself up, dust yourself off and vow not to make the same mistake again.

User think times are critical when doing performance testing (especially for web applications that rely on ASP.Net Output Caching to meet performance goals).

Just as important as think times, you need to look at the IIS logs (or your web analytics reports) to understand browser and locale differences.  This is extremely critical if you have the output cache configured so that it treats requests with these differences as un-cached (or separately cached) page requests.
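A quick way to get that breakdown is to tally the cs(User-Agent) column in the IIS logs; the same idea works for Accept-Language or client IP geography if your cache profile varies by those.  A minimal sketch, with the same caveat that the field has to be present in your #Fields: line:

```python
from collections import Counter

def browser_mix(log_path):
    """Tally user-agent strings from an IIS W3C log to see the real browser mix."""
    fields, counts = [], Counter()
    with open(log_path) as log:
        for line in log:
            if line.startswith("#Fields:"):
                fields = line.split()[1:]
                continue
            if line.startswith("#") or not line.strip():
                continue
            row = dict(zip(fields, line.split()))
            counts[row.get("cs(User-Agent)", "unknown")] += 1
    return counts.most_common(10)

for agent, requests in browser_mix("u_ex_peak_hour.log"):   # hypothetical file name
    print(requests, agent)
```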

While not as important as think times and end-user variations, the spread of client IP addresses your tests come from matters if you are doing performance testing through a load balancer configured with session affinity.

All of the tests we ran looked like they were coming from 2 IPs.  While I cannot prove this invalidated the test results, it looks like some sort of caching efficiency was realized somewhere in the stack (switch, NIC, IIS, …).
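This is easy to check: count the distinct c-ip values in the IIS logs captured during the test run and compare that to a production log.  If all of your load lands on a couple of addresses, session affinity and connection reuse can make the farm look better than it will under a real user population.  Same parsing caveats and hypothetical file names as the earlier sketches:

```python
from collections import Counter

def client_ip_counts(log_path):
    """Count requests per client IP (c-ip) in an IIS W3C log."""
    fields, counts = [], Counter()
    with open(log_path) as log:
        for line in log:
            if line.startswith("#Fields:"):
                fields = line.split()[1:]
                continue
            if line.startswith("#") or not line.strip():
                continue
            counts[dict(zip(fields, line.split())).get("c-ip", "unknown")] += 1
    return counts

counts = client_ip_counts("u_ex_loadtest.log")
print(len(counts), "distinct client IPs")
print(counts.most_common(5))
```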

References

Microsoft Patterns and Practices: Performance Testing Guidance for Web Applications

Microsoft Office Server Online: Configure page output cache settings

MSDN: Output Caching and Cache Profiles

