So there have been lots of references in recent weeks to the "dogfooding" we have been doing on the dotnetnuke.com site. Running in production introduces a variety of stimuli which cannot be accurately simulated in development. For the benefit of the community ( many of whom will be able to relate ) I am going to explain an interesting item which consumed a couple of days last week...
So during the dogfooding exercise we have been monitoring the DNN Event Log quite closely to examine any exceptions which are logged. This extra attention has revealed some longstanding issues which affect not only the most recent release package ( 4.3.2 on ASP.NET 2.0 ) but also legacy release packages ( DNN 3.x on ASP.NET 1.1 ). An important aspect of the issue I am going to describe is related to running a site live on the Internet ( as opposed to running a site on an internal Intranet where this issue would never manifest itself ), where there are often more "users" hitting your application than you would ever imagine. In addition, this post will describe how one problem can lead to the discovery of other problems as you work through the issue from top to bottom.
So early last week, the DNN Event Log was logging errors every 2-3 seconds. Now this has some very bad side effects. The DNN Event Logging mechanism caches event items in memory for a specific time interval and then performs a batch update to insert them into the database. This model is more efficient than direct logging and results in better application performance. That is, except for the case when an application service throws repeated exceptions. In the case of a busy site ( like dotnetnuke.com ), the amount of data in the cache eventually exceeds the allowable memory for the site and IIS forces an application restart ( thinking there is a memory leak ). So this meant the site was up and down like a yo-yo while this was going on. So what was causing the issue?
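To make the batching behavior concrete, here is a minimal sketch of the idea ( this is not the actual DNN logging API; the class and threshold are invented for illustration ). Entries accumulate in an in-memory buffer and are written to the database in one batch. When an exception fires every few seconds on a busy site, the buffer grows faster than it is flushed and memory climbs until IIS recycles the application.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only -- not the DNN LogController. Log entries are
// buffered in memory and flushed to the database in batches, which is
// cheaper than one INSERT per event but risky when exceptions are logged
// continuously on a high-traffic site.
public class BatchedEventLogger
{
    private readonly List<string> _buffer = new List<string>();
    private readonly int _flushThreshold;

    public BatchedEventLogger(int flushThreshold)
    {
        _flushThreshold = flushThreshold;
    }

    public void Log(string message)
    {
        lock (_buffer)
        {
            _buffer.Add(DateTime.UtcNow + " " + message);
            if (_buffer.Count >= _flushThreshold)
            {
                Flush();
            }
        }
    }

    private void Flush()
    {
        // The real component performs a batch insert into the event-log
        // table; here we simply report and clear the buffer.
        Console.WriteLine("Flushing " + _buffer.Count + " entries to the database");
        _buffer.Clear();
    }
}
```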
Well in reviewing the Event Log, there were constant errors being thrown by the UrlRewriter component. This component accepts a "search engine friendly" logical URL and converts it into an application-friendly physical URL based on some regular expression rule processing. The errors were related to parsing a String to an Int32 value. In reviewing the UrlRewriter code, the first discovery was that the component was not coded very defensively. It assumed that parameter values would be passed in the proper format for processing. In the case of critical application parameters like TabID, the UrlRewriter would throw unhandled exceptions if a non-numeric value was passed as a parameter value. This seemed like the culprit so we patched the code and deployed a new assembly. However, errors continued to be logged.
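The defensive fix amounts to validating the value before converting it, rather than letting the conversion throw. A rough sketch of the pattern is below ( the method name and fallback behavior are hypothetical, not the actual patched DNN code ).

```csharp
using System;

// Illustration of defensive parsing: validate the raw query-string value
// and fall back to a default instead of throwing when a non-numeric
// TabID ( e.g. "787,795" ) comes in from a malformed request.
public static class TabIdParser
{
    public static int ParseTabId(string rawValue, int fallbackTabId)
    {
        int tabId;
        if (!string.IsNullOrEmpty(rawValue) && int.TryParse(rawValue, out tabId))
        {
            return tabId;
        }
        // Non-numeric or missing -- return the fallback rather than throwing.
        return fallbackTabId;
    }
}
```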
Unable to find a suitable explanation, we decided to add some more descriptive information to the exception in hopes of discovering some additional clues. We added some more Request attributes and almost immediately stumbled upon the source of the problem.
ROBOTS!
It appeared that when search engine indexers attempted to index the site, they would use their own parsing logic to create URLs which they could follow. The exceptions indicated that MSN Search Bot, GoogleBot, Yahoo Slurp, etc... were visiting the site regularly and choking on the friendly URLs. It appeared that when a bot encountered a relative URL in the source of a page like "/default.aspx?tabid=795", it would try to append the site path to create an absolute URL, resulting in "www.dotnetnuke.com/tabid/787/default.aspx?tabid=795". Then when the UrlRewriter rules processed the request, it would result in "www.dotnetnuke.com/default.aspx?tabid=787&tabid=795". Now based on the W3C URI specifications, this is a valid URL ( parameters can be defined more than once and interpreted as a collection ). However, from a DNN perspective, it would try to convert the TabID parameter value "787,795" to an Int32 and fail with a parsing exception. So this explained why the exception was occurring, the frequency, and why it never happened during development. But how to fix the problem...
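A tiny demonstration of why the value arrives as "787,795": when the same key appears twice in a query string, ASP.NET's parsed collection joins the values with a comma, so reading the key as a single string produces something Int32 parsing cannot handle. ( This is just a console illustration of the framework behavior, not DNN code. )

```csharp
using System;
using System.Collections.Specialized;
using System.Web;

// Shows how duplicate query-string keys are surfaced by ASP.NET.
class QueryStringDemo
{
    static void Main()
    {
        NameValueCollection query =
            HttpUtility.ParseQueryString("tabid=787&tabid=795");

        Console.WriteLine(query["tabid"]);              // "787,795"
        // int.Parse(query["tabid"]);                   // would throw FormatException
        Console.WriteLine(query.GetValues("tabid")[0]); // "787" -- the first value only
    }
}
```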
Since we were forced to add custom debug statements to the exception in order to get the clues to diagnose the problem, we realized that the general exception log in DNN was not always adequate. So we added 2 new attributes which are collected for every exception: RawURL ( which includes the full QueryString ) and UserAgent ( which identifies browsers as well as robots ). This should definitely help in diagnosing issues in the future ( and allowed us to remove our custom debug statements ).
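Conceptually, the extra context is just two values pulled off the current request at the time the exception is recorded. The sketch below shows the idea ( the ExceptionInfo type is illustrative, not the actual DNN exception log schema ).

```csharp
using System;
using System.Web;

// Illustrative container for the extra request context recorded with each
// exception: the raw URL ( including query string ) and the user agent,
// which is what exposed the search-engine robots.
public class ExceptionInfo
{
    public string Message;
    public string RawUrl;
    public string UserAgent;

    public static ExceptionInfo FromRequest(Exception ex, HttpRequest request)
    {
        ExceptionInfo info = new ExceptionInfo();
        info.Message = ex.Message;
        info.RawUrl = request.RawUrl;        // full path plus query string
        info.UserAgent = request.UserAgent;  // e.g. "Googlebot/2.1 (...)"
        return info;
    }
}
```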
The relative URLs in the source of the page were being created by the Links module. And the reason they were formed as "/default.aspx?tabid=795" was because a core method in Globals ( LinkClick() ) was not producing a search-friendly URL. This had been fixed in DNN 3.3/4.3, but search engines still had old versions of pages in their cache to contend with.
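For readers unfamiliar with the friendly-URL scheme, the transformation the fixed code performs is simply moving the TabID out of the query string and into the path. The snippet below illustrates that mapping only; it is not the actual Globals.LinkClick() implementation.

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative only: rewrite "/default.aspx?tabid=795" into the friendly
// path form "/tabid/795/default.aspx" that search engines can follow safely.
class FriendlyUrlDemo
{
    static string MakeFriendly(string url)
    {
        Match m = Regex.Match(url, @"^/default\.aspx\?tabid=(\d+)$", RegexOptions.IgnoreCase);
        return m.Success ? "/tabid/" + m.Groups[1].Value + "/default.aspx" : url;
    }

    static void Main()
    {
        Console.WriteLine(MakeFriendly("/default.aspx?tabid=795")); // /tabid/795/default.aspx
    }
}
```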
Finally, the UrlRewriter component needed some more intelligent rules processing. The code was modified so that it parses the parameters and checks whether a parameter is already part of the URL before adding it, which eliminates the scenario where the TabID parameter was being added twice. In addition to these defensive coding changes, some minor enhancements were added which allow you to rewrite the domain portion of the URL in addition to the path ( i.e. forums.dotnetnuke.com redirects to www.dotnetnuke.com/tabid/795/default.aspx ). These changes were posted to dotnetnuke.com and the exceptions are no longer being logged.
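The "check before adding" rule can be sketched as follows ( names and structure are illustrative, not the actual DNN UrlRewriter code ): before appending a parameter extracted from the friendly path, look for that key in the existing query string and skip it if it is already present.

```csharp
using System;
using System.Collections.Specialized;
using System.Web;

// Sketch of the more defensive rewrite rule: never append a query-string
// parameter that the incoming request already carries, so "tabid" cannot
// end up in the rewritten URL twice.
class RewriteDemo
{
    static string AppendIfMissing(string queryString, string key, string value)
    {
        NameValueCollection existing = HttpUtility.ParseQueryString(queryString);
        if (existing[key] != null)
        {
            return queryString;                  // already present -- skip it
        }
        string pair = key + "=" + value;
        return queryString.Length == 0 ? pair : queryString + "&" + pair;
    }

    static void Main()
    {
        // The robot-built URL already carries tabid=795; the /tabid/787/ path
        // segment would have added a duplicate without the check.
        Console.WriteLine(AppendIfMissing("tabid=795", "tabid", "787")); // tabid=795
        Console.WriteLine(AppendIfMissing("", "tabid", "787"));          // tabid=787
    }
}
```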
This is just one example among many where production deployments reveal situations that would never happen during development. And it is critical to have a web application framework like DNN with an elaborate logging mechanism that allows your site to continue functioning normally while collecting, in the background, the information necessary to diagnose issues as they occur.