Quite often when working on a new version of a site you will have a development, test, or upgrade copy that might be around for a while. It is also possible that, if you work for a third party, you might stage client sites on your server for a period of time before go-live. At first glance this all seems commonplace and not something you would be concerned about. However, that is not the case. Search engines have become overly aggressive in indexing sites, including those that have no direct backlinks but might have been e-mailed or distributed by similar means. Before I get too far into the specific ins and outs of this topic, I want to start with why it is so important.
Why Is "Not Indexing" So Important?
There are actually a number of reasons why we want to be confident that our dev/staging sites are not indexed by search engines.
- These are typically unstable sites that may contain errors, broken features, or incomplete content.
- The existing production site has the desired content, and that is where users should be getting it.
- We don't want to be penalized for duplicate content.
- We don't want to publicly expose our development URLs or systems.
The above are just a few of the reasons why we want to keep this content private.
Why Not Just Enable Basic Authentication on the Site?
One of the most common recommendations you will get on this topic is to require Windows or Basic authentication for all test/development domains. Although effective, this solution is not always the best choice, for a number of reasons.
- It does not accurately reflect the production environment, so configuration differences could cause issues later.
- It is not possible in most shared hosting environments.
- It requires creating and managing additional accounts outside of DotNetNuke.
- It can cause problems for mobile browsers.
In some situations you might be able to get by this way, but not in all of them. A sketch of what that kind of lockdown looks like follows below for reference.
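Purely to illustrate the approach being ruled out here, this is roughly what the lockdown boils down to in a plain ASP.NET web.config. It is a sketch, not something to drop into a DNN site's web.config as-is, since DNN relies on Forms authentication and anonymous access for its own login.

<system.web>
  <!-- Use Windows accounts and reject every anonymous request -->
  <authentication mode="Windows" />
  <authorization>
    <deny users="?" />
  </authorization>
</system.web>

Note that the Windows accounts involved live entirely outside of DNN, which is exactly the extra account management described above.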
Blocking the Robots for Good!
So how exactly do we block the robots from our test/development installations for good? A two-pronged approach is really the best way to go about it.
Add a Restrictive Robots.txt
The first thing we should do for all of these environments is create a restrictive robots.txt file that tells the search crawlers we do not want them to index our site. To do so, simply create a text file named "robots.txt" at the root of the website and place the following content in it.
User-agent: *
Disallow: /
This tells all crawlers that they are not allowed to index any content on the site. However, this is only a partial solution: robots.txt is merely a "suggestion," and some crawlers will bypass it. Additionally, if the site's content has already been indexed, this will not remove it from the index.
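As a side note that goes beyond the two-pronged approach described here, IIS 7 and later can also send an X-Robots-Tag header on every response, which the major search engines treat much like the ROBOTS meta tag and which covers non-HTML files as well. A minimal sketch, assuming you can edit the site's web.config:

<system.webServer>
  <httpProtocol>
    <customHeaders>
      <!-- Sent with every response, including PDFs, images, and other static files -->
      <add name="X-Robots-Tag" value="noindex, nofollow" />
    </customHeaders>
  </httpProtocol>
</system.webServer>

Like the meta tag change below, this must be removed before the site goes to production.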
Modify the ROBOTS meta Tag
The other key driver for search crawlers is the ROBOTS meta tag. By default, all pages in DotNetNuke use a value of "INDEX, FOLLOW", which tells the crawlers that they should index the content on the site AND that they should follow links to other destinations. On production sites this is exactly what we want; it is horrible, however, if a bot finds a link to a test/development site.
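For reference, the tag that ends up in the rendered page looks something like the following (the exact output can vary slightly between DNN versions).

<meta id="MetaRobots" name="ROBOTS" content="INDEX, FOLLOW" />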
To get around this, the only solid solution I have found so far is to make a small core modification. This is not ideal, but there is currently no configuration point for it within DNN. To set a default value for all pages, we first need to stop the existing DNN value from rendering and then add our own custom value.
First find the following line of code within Default.aspx at the root of the site.
<meta name="ROBOTS" runat="server" id="MetaRobots" />
Once you have found it, modify it to look like the following; this stops DNN from rendering its default tag.
<meta name="ROBOTS" runat="server" id="MetaRobots" visible="false" />
Once this has been done, add your custom value on the next line to tell the crawlers that the content is off limits, similar to the following.
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS" />
With this completed, you are all set!
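Putting the two changes together, the relevant portion of Default.aspx ends up looking like this:

<meta name="ROBOTS" runat="server" id="MetaRobots" visible="false" />
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS" />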
WARNING!!!
It is VERY, VERY, VERY important to note that if you ever transition one of these test/development sites to production, you must remember to reverse the changes listed above BEFORE you deploy. NOINDEX, NOFOLLOW will cause search engines to remove the content from their indexes when they find it. If this happens to your production URLs, you will lose page rank!
Closing Thoughts
With these two simple steps you can safely hide your in-progress work from the search engines while still allowing your team easy access to the sites. Just don't forget the warning above! Feel free to share your comments and experiences below.
This article has been cross-posted from my Personal Blog.