This post provides some lessons learned from the launch of our Corporate Intranet. After about two weeks of poor performance and stability problems, we stabilized the site and resolved most of the issues. The lessons learned here are common, and I'm sure our team was not the first (nor the last) to make these mistakes.
Our Corporate Intranet supports about 21 Business Areas / Business Units (BA/BUs). When I say Intranet, I am referring to a content publishing web site that provides announcements, the latest news, corporate policies, and other information that is important for employees to be aware of. It is not a place for employees to collaborate as teams; that is done by a separate set of SharePoint 2007 sites.
The Intranet is hosted on something we call the Common Web Platform (CWP). What makes it common is that a single set of features and functionality powers our Intranet, Extranet, and Internet publishing sites.
In 2007 I started working on a project to upgrade CWP from its existing infrastructure (SharePoint 2003 / MS Content Management Server 2002) to SharePoint 2007. The first major component to roll out under the new SharePoint 2007 version of CWP was the Intranet site.
Our intranet is not small. It contains approximately 67,000 webs, 650,000 documents, and 70,000 web pages. The business requirements for sharing content between BA/BUs led us to determine that putting all of this content in one site collection was the best choice. I still believe this was the right decision, but it did cause us to create a SharePoint content database that is around 330 GB.
Launch day was actually quite calm from my perspective. Yes, we had a large site, but I felt we had done an excellent job with performance testing, so the launch would actually go quite smoothly. I do not want to go into specifics, but the performance testing we had done showed that the new SharePoint 2007 site would be able to scale to roughly three times more page views and users than the existing platform.
Everything was not perfect; in fact, it was far from it. We had quite a number of lingering issues from content migration. We also had some application bugs that just would not go away. But everyone agreed that these could be solved, so we decided to go forward with the launch.
So at approximately noon Eastern US time on March 25th, 2009, we had the DNS team flip the switch, and all traffic rolled off the old environment and onto the new one. It was one of the smoothest cutovers I have ever been associated with; I even heard some people say that they did not know we had flipped the switch.
The next morning, as Europe came online, the proverbial $hit hit the fan. I'm not going to go into the blow-by-blow details, but I will say that a dedicated team of engineers who wanted nothing more than to see this new platform succeed went to work, along with MS Premier Support. On Tuesday, April 7th, the task force was closed, as everyone agreed that while the new Intranet had some problems, it was stable and performance was acceptable to end users.
This was a really tough one to troubleshoot. What made it tough was the inconsistency of the crashes; we could never tie them back to one specific event or one clear set of patterns. The only consistency was that the site crashed during peak traffic loads (from 2 AM to 9 AM Eastern US time). Intranet availability dipped to about 60% during these two weeks.
As I stated, after about two weeks of pure hell we got things stable. During that time we did a lot of analysis and made a few changes. So, in no particular order, here are the things we changed, why we changed them, and what I personally learned.
One of our field controls makes calls to a web service that in turn makes calls to a database to retrieve some data. It is pretty basic stuff. Well, to make a long story short, the web service ended up in our SharePoint solution package, and our field control ended up making a call back into the same Application Domain to pull data from a database. Yes, I know, not a very smart thing to do.
Anyway, during performance testing we put together specific KPIs to watch for this web service. We saw no major problems with it, but we put it on the list of things to change once the application went into maintenance mode.
While we never linked any outages specifically to calls to this web service, we did see a major improvement in overall stability when the web service was moved to a separate application pool.
So the lesson learned is to keep the Application Pools that host SharePoint sites dedicated to SharePoint sites (do not have those Application Pools host non-SharePoint IIS sites).
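To illustrate the shape of the fix, here is a minimal sketch (not our actual control code) of a helper the field control could use to call the data web service over HTTP at a URL read from configuration, so the service can live in its own IIS site and application pool completely outside SharePoint. The "DataServiceUrl" setting, the endpoint, and the query-string shape are hypothetical names invented for this example.

    using System;
    using System.Configuration;
    using System.IO;
    using System.Net;

    // Hypothetical helper called by the field control. The service lives in a
    // separate IIS site / application pool, e.g. http://services.example.com/DocumentData.ashx
    public static class DataServiceClient
    {
        public static string GetDocumentData(string documentId)
        {
            // Service location comes from configuration, not from the SharePoint app pool.
            string serviceUrl = ConfigurationManager.AppSettings["DataServiceUrl"];

            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(
                serviceUrl + "?id=" + Uri.EscapeDataString(documentId));
            request.UseDefaultCredentials = true;
            request.Timeout = 5000; // fail fast so a slow service cannot tie up SharePoint worker threads

            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }
    }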
One of the problems that definitely caused outages was table locks at the SQL Server level. We traced the table locks back to SQL that was being generated by a CAML query we used to show documents associated with a given web page.
SQL Server will escalate to a full table lock once a single query takes out roughly 5,000 or more row locks, so it is very important that you set a row limit when using SPQuery and CrossListQueryCache objects. When SharePoint generates the SQL for a cross-list query, it will set a default row limit of 2 million items. I'm not sure if it does the same thing for SPQuery, but better safe than sorry.
So the lesson learned here is always set a row limit that is less than 5000 when using SPQuery and CrossListQueryCache.
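To make this concrete, here is a minimal sketch, assuming the MOSS 2007 publishing object model, of capping the row limit on both SPQuery and CrossListQueryCache. The CAML, field names, and the limit of 200 are illustrative, not our production values.

    using System.Data;
    using Microsoft.SharePoint;
    using Microsoft.SharePoint.Publishing;

    public static class QueryExamples
    {
        public static void RunQueries(SPWeb web, SPList list)
        {
            // Single-list query: always cap the rows you are willing to pull back.
            SPQuery query = new SPQuery();
            query.Query = "<Where><Eq><FieldRef Name='ContentType' /><Value Type='Text'>Document</Value></Eq></Where>";
            query.RowLimit = 200; // well under the ~5,000 lock-escalation threshold
            SPListItemCollection items = list.GetItems(query);

            // Cross-list query: without an explicit RowLimit the generated SQL
            // defaults to a very large limit, so set one here as well.
            CrossListQueryInfo info = new CrossListQueryInfo();
            info.Webs = "<Webs Scope='SiteCollection' />";
            info.Lists = "<Lists ServerTemplate='101' />"; // 101 = document libraries
            info.Query = "<Where><Eq><FieldRef Name='ContentType' /><Value Type='Text'>Document</Value></Eq></Where>";
            info.RowLimit = 200;
            info.UseCache = true;

            CrossListQueryCache cache = new CrossListQueryCache(info);
            DataTable results = cache.GetSiteData(web);
        }
    }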
The CAML query referenced in Item 2 above was using the FileRef field to filter the result list. Unfortunately, FileRef is a special field inside of SharePoint, meaning it does not lend itself to being indexed (see Index List Field). So the SQL queries that were generated from the CAML were doing full table scans, which is another big performance hit and can cause unwanted database locks.
So in the end we abandoned using a CAML query to get the documents and instead pulled them using the SPWeb.GetListItem method. At first there was a huge debate on our team, because fundamentally it is better to reduce round trips to the database. We were going from essentially one call to the DB to two calls per file in our document list field control (note: SPWeb.GetListItem results in at least two calls to the DB, one to get the list's field information and one to pull the list item data).
Our control has a limit of 200 documents that can be displayed. So we knew the maximum number of times we would call GetListItem per page would be 200. We also knew that the average number of documents per page was 3. So most pages had very few documents to display.
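Here is a minimal sketch of the GetListItem approach, assuming the control already holds the server-relative URLs of the documents it needs to render; the names and lack of error handling are simplifications for illustration only.

    using System.Collections.Generic;
    using Microsoft.SharePoint;

    public static class DocumentListRenderer
    {
        // The control caps the number of documents it will display.
        private const int MaxDocuments = 200;

        public static List<SPListItem> GetDocuments(SPWeb web, IList<string> documentUrls)
        {
            List<SPListItem> documents = new List<SPListItem>();

            foreach (string url in documentUrls)
            {
                if (documents.Count >= MaxDocuments)
                {
                    break;
                }

                // Each GetListItem call costs roughly two database round trips, but with
                // an average of three documents per page this is far cheaper than a
                // cross-list CAML query on FileRef that escalates to a table lock.
                documents.Add(web.GetListItem(url));
            }

            return documents;
        }
    }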
Our team is looking at alternate approaches. One idea is to add a field containing a GUID to each document, then index that field and go back to doing queries using that new field. We have a lot of testing to do before we decide to go in that direction.
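As a rough illustration of that idea (which, again, we have not adopted), the sketch below adds a hypothetical indexed text field named DocumentGuid to a document library and queries on it instead of FileRef; the field name and values are made up for the example.

    using System;
    using Microsoft.SharePoint;

    public static class IndexedFieldIdea
    {
        public static void AddIndexedGuidField(SPList documentLibrary)
        {
            // Add a plain text field to hold the GUID, then mark it as indexed so
            // queries against it do not require a full scan.
            documentLibrary.Fields.Add("DocumentGuid", SPFieldType.Text, false);
            SPField field = documentLibrary.Fields["DocumentGuid"];
            field.Indexed = true;
            field.Update();
            documentLibrary.Update();
        }

        public static SPListItemCollection QueryByGuid(SPList documentLibrary, Guid documentGuid)
        {
            SPQuery query = new SPQuery();
            query.Query = "<Where><Eq><FieldRef Name='DocumentGuid' />" +
                          "<Value Type='Text'>" + documentGuid.ToString("D") + "</Value></Eq></Where>";
            query.RowLimit = 200; // stay well under the lock-escalation threshold
            return documentLibrary.GetItems(query);
        }
    }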
So the lesson learned was do not write CAML queries that use FileRef as the primary field for filtering the results.
Okay, this one requires a separate blog post. I promise to post an entry with this information very soon. In the meantime, I can say that the key mistake made with performance testing was not taking user sessions and think times into consideration. We had the right URLs (we took these straight from the logs of the production machines), but we ran through them too fast, which created a situation where requests were served output-cached versions of the pages when, under normal load, they would not have been.
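As a simple illustration of what I mean by user sessions and think times, here is a minimal sketch (not our actual test harness) of a virtual user replaying its own list of URLs with a randomized pause between requests; the pacing numbers are made up for the example.

    using System;
    using System.Collections.Generic;
    using System.Net;
    using System.Threading;

    public static class PacedUserSession
    {
        public static void Replay(IList<string> sessionUrls)
        {
            Random random = new Random();

            foreach (string url in sessionUrls)
            {
                using (WebClient client = new WebClient())
                {
                    client.UseDefaultCredentials = true; // simulate an authenticated intranet user
                    client.DownloadString(url);
                }

                // Pause between requests the way a real reader would; firing the URLs
                // back-to-back is what made our pages look output-cached when, under
                // real load, they would not have been.
                Thread.Sleep(TimeSpan.FromSeconds(random.Next(10, 40)));
            }
        }
    }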
Granted, we had a rough launch because the performance testing did not catch the critical application issues. But I do not want to leave people with the impression that everything we did was wrong. Our team did a lot of things right, and often these things get forgotten when something goes wrong. So here is a short list of the things we did right:
- We used 64-bit hardware for all our servers (SQL and Web Front Ends).
- We used the caching options for Publishing sites effectively (Output Cache, Object Cache, and BLOB Cache); a sample BLOB cache setting is shown after this list.
- We discovered a major memory leak in our code during performance testing and fixed it before going live.
- We put together a well-defined set of Solutions and Features for our application (so we can deploy easily).
- We built a team of people with really deep knowledge of building SharePoint Publishing sites.
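For reference, the BLOB cache is switched on in the web application's web.config. The snippet below is a typical example of the setting; the location, file types, and size shown here are illustrative, not our production values.

    <!-- Inside the <SharePoint> section of the web application's web.config -->
    <BlobCache location="D:\BlobCache\Intranet"
               path="\.(gif|jpg|jpeg|png|css|js)$"
               maxSize="10"
               enabled="true" />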
MSDN: Best Practices: Common Coding Issues When Using the SharePoint Object Model
Microsoft TechNet: Tune Web server performance (Office SharePoint Server)
SharePoint for End Users: Manage large SharePoint lists for better performance
Reza Alirezaei’s Blog: 20 key Points Arising, or Inferred, From “Working with large lists in MOSS 2007” Paper