As many of you probably noticed, we have been experiencing some stability issues on dotnetnuke.com over the past couple of days. Every couple of hours the website would seem to "hang" and pages would no longer be served to visitors. To get things working again, we would need to restart the website or manually recycle the application pool. Then the site would return to normal operation - at least for a period of time, before it would "hang" once again. So what was the problem?
Well, the first thing we wanted to verify was that browser requests were actually reaching the server consistently. We were not getting 404 errors, but the browser would sit there indefinitely trying to load a page. We tried pinging the server and had no problems. We tried tracert, and sometimes it appeared to time out just before reaching the server. We are hosted by MaximumASP, so we contacted them to determine if it was potentially a DNS problem. They informed us that their network firewall prevents tracerts from completing - so the behavior we were seeing was expected. We connected remotely to the server and tried accessing the site locally. Pages would not be served - which ruled out DNS.
The next thing we identified was that we had performed a number of upgrades to the site this past week. We had upgraded DotNetNuke from 4.6.0 to 4.6.2. We had upgraded to the new version of Forums ( 4.4.3 ). We had upgraded to the new version of Blog ( 3.3.1 ). And we had upgraded the content for the Online Help. Had one of these upgrades caused the problem?
We decided to do some snooping on the web server. Looking at IIS, we could see that the site was running fine. Loading perfmon, we could see that the box was not overloaded. Next we looked at the Event Logs for the machine. Nothing stood out as a red flag. We restarted the IIS service, recycled the app pool, and restarted the application. While the website was up and running, we logged in as the host user and took a look at the Event Viewer in DotNetNuke. Nothing there identified a serious problem. We looked at the Application_Start entries and saw that they were not firing repeatedly - so we knew the site was not constantly recycling on its own. Still, the site would "hang" after a period of time. So what next?
We decided to connect remotely to the SQL Server 2005 machine and take a look. Running perfmon, we could see that CPU utilization was pegged at 100%. Basically, SQL Server was maxed out and therefore was not returning data to the web application. Restarting the web application would unblock SQL Server, which would settle at <20% CPU utilization until it suddenly spiked to 100% again. At first we thought there must be a rogue query or transaction causing SQL Server to become blocked. We ran SQL Profiler to try to identify the offender. But still no luck. So what next?
We went back to the web server. We opened taskmgr, went to the Processes tab, and sorted the table by the Image Name column to group the w3wp.exe processes together. Each w3wp.exe process represents an individual application pool on the server. There were 11 of them. And looking at the Mem Usage column, we could tell that the total memory used was > 2.0 GB. So what is the significance of this? Well, if you refer back to my Performance blog:
You will remember that a 32-bit Windows box only has 2.0 GB of memory available to all application pools. That memory is allocated equally across the application pools. So with 11 active application pools, the dotnetnuke.com app pool did not have enough memory available to satisfy its needs - it was starving. How did this cause the hangs?
The DotNetNuke web application relies on caching to achieve optimal performance. When the application needs data, it makes a call to the database and then stores the result in the ASP.NET cache so that it does not need to call the database on subsequent requests ( retrieving data in-process from the cache is far more efficient than retrieving data out-of-process from a database ). But this model falls apart when there is not enough memory available. ASP.NET will attempt to insert the data into the cache, but will be unsuccessful because the cache is already full. As a result, when the application needs that same data in the future, it is forced to go to the database again ( and again and again and again.... ). In our case the database was being hammered so hard that CPU utilization was pinned at 100%, blocking SQL Server threads. This meant data was not being passed back to the web application, which in turn meant the web application could not issue a response to the web browser. So it would "hang". So how to fix it?
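To make the failure mode concrete, here is a minimal sketch of that cache-aside pattern in Python ( purely illustrative - the real application uses the ASP.NET cache, not this toy class ). When the cache has room, each piece of data costs one database call; when the cache is full and inserts silently fail, every single request falls through to the database:

```python
class BoundedCache:
    """Toy stand-in for the ASP.NET cache: inserts silently fail when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = {}

    def get(self, key):
        return self.items.get(key)

    def insert(self, key, value):
        if len(self.items) < self.capacity:
            self.items[key] = value  # success: future reads are in-process

db_calls = 0

def query_database(key):
    # Each call here represents an out-of-process round trip to SQL Server.
    global db_calls
    db_calls += 1
    return "row-for-%s" % key

def get_data(cache, key):
    value = cache.get(key)
    if value is None:                 # cache miss...
        value = query_database(key)   # ...go to the database
        cache.insert(key, value)      # may silently fail if the cache is full
    return value

# Healthy case: cache has room, so each key hits the database only once.
healthy = BoundedCache(capacity=10)
for _ in range(3):
    for k in range(5):
        get_data(healthy, k)
print(db_calls)   # 5 database calls for 15 requests

# Starved case: cache is effectively full, so every request hammers the database.
db_calls = 0
starved = BoundedCache(capacity=0)
for _ in range(3):
    for k in range(5):
        get_data(starved, k)
print(db_calls)   # 15 database calls for 15 requests
```

Same 15 requests in both cases - but with no cache headroom, every one of them lands on the database, which is exactly the load pattern that was pinning SQL Server at 100%.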
Well, as it turns out, we did not actually need 11 application pools on our web server. We had provisioned them that way to provide isolation for the various applications, but the reality is that we could group some of the applications together. Consolidating down to 5 application pools resulted in more than a 2X increase in the memory available to each remaining pool. With more memory available, DotNetNuke could cache data efficiently and take the burden off of SQL Server, bringing it back down to <20% CPU utilization.
The moral of the story is that there are some serious complexities in diagnosing ASP.NET site issues. Your journey will take many twists and turns as you follow the trail of evidence. Assuming that the problem lies in the web application itself will often lead you down the wrong path. More often than not, the problem will be related to your specific server environment. So it is also critical to understand the behavior and constraints of the Windows server environment, as this is what will allow you to connect the various dots.