I discovered something interesting this week using the Google webmaster tools & thought I'd share it. I had a look at the diagnostic tools (many of these are only available if you "verify" your site), and Google listed several hundred URLs (mostly forum posts) that it had not indexed, with an explanation that it had skipped each page because it could not find a robots.txt file for my site.
A robots.txt file is used to provide instructions to web "robots" like the one Google uses to index web sites. The absence of a robots.txt file is generally taken to mean that robots should go ahead and index everything, but from the results I saw in Google's diagnostics, it appears you will get a better indexing result from Google by explicitly including a robots.txt file in your site.
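To make that behaviour concrete, here is a minimal sketch of how a well-behaved robot consults robots.txt before fetching a page. It uses Python's standard urllib.robotparser module, and example.com is just a stand-in for a real site; none of this is specific to Google's crawler.

from urllib.robotparser import RobotFileParser

# Hypothetical site URL, for illustration only.
SITE = "https://www.example.com"

rp = RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()  # fetch and parse robots.txt; a missing file (404) is treated as "allow everything"

# Ask whether a given user-agent may fetch a given URL.
for url in (SITE + "/Default.aspx", SITE + "/Admin/SomePage.aspx"):
    print(url, "->", "allowed" if rp.can_fetch("*", url) else "disallowed")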
Here's my robots.txt: I ended up disallowing all the sub-folders in DNN except for /Portals, even though many of them would never get linked to anyway. (For example, blocking /bin is a bit of overkill, since no page would ever link to its contents.) There's a quick way to test these rules shown after the listing.
User-agent: *
Disallow: /Admin/
Disallow: /App_Browser/
Disallow: /App_Code/
Disallow: /App_Data/
Disallow: /App_GlobalResources/
Disallow: /bin/
Disallow: /Components/
Disallow: /Config/
Disallow: /Controls/
Disallow: /DesktopModules/
Disallow: /Documentation/
Disallow: /Install/
Disallow: /js/
Disallow: /Providers/
Disallow: /Resources/
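As a sanity check, you can feed a few of the rules above to a parser and confirm which paths end up blocked. This is just an illustrative sketch using Python's urllib.robotparser with made-up page URLs; it isn't anything DNN or Google requires.

from urllib.robotparser import RobotFileParser

# A subset of the rules from the listing above, kept short for the example.
rules = """User-agent: *
Disallow: /Admin/
Disallow: /bin/
Disallow: /DesktopModules/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Content pages and /Portals (where DNN keeps uploaded files) stay crawlable;
# the infrastructure folders do not.
tests = [
    "/Default.aspx",
    "/Portals/0/logo.gif",
    "/Admin/Users.aspx",
    "/bin/DotNetNuke.dll",
]
for path in tests:
    print(path, "->", "allowed" if rp.can_fetch("*", path) else "disallowed")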
If you just want a robots.txt file present without blocking anything (an empty Disallow line tells robots they may index the whole site, the same effect as having no file at all but without the missing-file error), a minimal version would be:
User-agent: *
Disallow:
I couldn't find any documentation to back this observation up, but after I added the robots.txt file above and waited a couple of days, the errors disappeared from my Google diagnostics page and the pages were indexed. If anyone else out there is using the Google webmaster tools & can check whether their results match mine, I'd appreciate it if you'd let me know by posting a blog comment.