


Rough to get Slashdotted? Try getting “SharePoint’ed”…

There was a time when DotNetNuke didn't gather a lot of site statistics. Very early on, we used our own internal site stats features to keep some relevant information, but as most hosts know, that architecture is not terribly efficient on a massive scale. As we have grown and continued to add interactive features to the site ( heavily used forums, active blogs, information repositories, benefactor & vendor management, etc. )… our need to understand our site usage at a more detailed level has increased dramatically. So when we migrated to a new web server before the holidays, we finally got around to installing some log analysis software.

Like many organizations, we weren't too concerned about getting indexed. In fact, we’d normally consider it a pretty good thing to get found & indexed when we are still a growing organization in search of broader uptake. But then we took a look at our first week of recorded site traffic…

Oh, my.

So let’s take a look at what we were facing in December. The two screenshots below show (1) spider traffic on www.dotnetnuke.com and (2) IP traffic. Look at these and let’s see what we notice.

[Screenshot: DecemberSpiders.jpg, December spider traffic on www.dotnetnuke.com]

So who’s that “MS Search Robot” that soaked up 495GB of data transfer in December?  Not to mention 7 ¾ million page views on 393 visits?

[Screenshot: DecemberIP.jpg, December IP traffic on www.dotnetnuke.com]

Hmm… and who are those IP addresses with just a few visits and millions of page views… and ( by the way ) responsible for more than half of our bandwidth consumption?
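If you don't have a log analysis package handy, a rough version of the same question can be put to the raw logs directly. The sketch below is only an illustration, not the tool we used: it assumes W3C extended format logs (the kind IIS writes, with a `#Fields:` header and spaces inside the user agent replaced by `+`), and the sample lines, IP addresses and the `tally_by_field` helper are all made up for the example.

```python
from collections import defaultdict

def tally_by_field(lines, key_field):
    """Sum hits and bytes sent per distinct value of key_field.

    Assumes W3C extended log lines preceded by a '#Fields:' directive
    that names the columns (an assumption about the log format).
    """
    fields = []
    totals = defaultdict(lambda: {"hits": 0, "bytes": 0})
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # Column names for every data line that follows.
            fields = line[len("#Fields:"):].split()
            continue
        if line.startswith("#") or not fields:
            continue  # other directives, or data before any header
        row = dict(zip(fields, line.split()))
        key = row.get(key_field, "-")
        sent = row.get("sc-bytes", "0")
        totals[key]["hits"] += 1
        totals[key]["bytes"] += int(sent) if sent.isdigit() else 0
    # Heaviest bandwidth consumers first.
    return sorted(totals.items(), key=lambda kv: kv[1]["bytes"], reverse=True)

# Hypothetical sample data; real logs carry many more columns.
SAMPLE_LOG = """\
#Fields: c-ip cs(User-Agent) cs-uri-stem sc-bytes
192.0.2.10 Mozilla/4.0+(compatible;+MSIE+4.01;+Windows+NT;+MS+Search+4.0+Robot) /Default.aspx 64000
192.0.2.10 Mozilla/4.0+(compatible;+MSIE+4.01;+Windows+NT;+MS+Search+4.0+Robot) /tabid/795/Default.aspx 72000
198.51.100.7 Mozilla/5.0+(Windows) /Default.aspx 15000
""".splitlines()

by_agent = tally_by_field(SAMPLE_LOG, "cs(User-Agent)")
by_ip = tally_by_field(SAMPLE_LOG, "c-ip")

for name, stats in by_agent:
    print(stats["hits"], stats["bytes"], name)
```

Sorting by bytes rather than hits is a deliberate choice here: a crawler with a handful of "visits" can still dominate bandwidth, which is exactly the pattern in the December numbers.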

If you go googling for “MS Search Robot”… you’re not going to find much except for a bunch of other people asking, “Hey, what’s this MS Search Robot?” But if you check your raw server logs, the name looks a little bit different… “MS Search 4.0 Robot”. Keep digging and you’ll find some obscure references ( mostly circa 2003 ). But the one you’re really looking for is here:

http://support.microsoft.com/default.aspx?scid=kb;en-us;284022#XSLTH3163121123120121120120

You might notice the fine print there…

IMPORTANT: Limit the number of site hops to the absolute minimum number necessary. When you perform an Internet crawl, you might index millions of documents in just a few site hops.

Yep.  We can validate that.

Turns out… as we had the opportunity to follow up with some of these IP address owners ( some representing large companies, professional organizations, etc. ) we quickly discovered what was happening. All of them were quite happy to work cooperatively with us, and often they were not even aware of the load being generated on their own systems.

Local SharePoint installations ( some development, some production ) were crawling their internal networks.  Within their internal networks they had ( one or more ) default installations of DotNetNuke… each of which contains front page links back to www.dotnetnuke.com ( i.e. for the information Links and Sponsors modules ).  SharePoint ( without specific inclusions defined ) was just following links… So what else did that fine print say?

The site path rule strategy that is recommended when you are crawling Internet sites is to create an exclusion rule for the entire HTTP URL space (http://*), and then create inclusion rules for only those sites that you want to index.

Oh yeah.  Basically that the default settings are a little impolite to other sites and that you should change them.  Please make a note.  *grin*
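To make the recommended strategy concrete, here is a minimal sketch of the "exclude everything, then include only what you want" idea. It is a simplified model and not SharePoint's actual rule engine: the `should_crawl` function, the rule lists and the `example.local` host names are hypothetical, and it assumes an inclusion rule wins over the blanket `http://*` exclusion, as the quoted guidance implies.

```python
from fnmatch import fnmatchcase

# Blanket exclusion for the whole HTTP URL space, as the fine print recommends.
EXCLUDE_RULES = ["http://*"]

# Hypothetical internal sites that the crawler is actually meant to index.
INCLUDE_RULES = [
    "http://intranet.example.local/*",
    "http://wiki.example.local/*",
]

def should_crawl(url, include_rules=INCLUDE_RULES, exclude_rules=EXCLUDE_RULES):
    """Return True if the URL may be crawled under the given path rules."""
    # Assumption: an explicit inclusion takes priority over any exclusion.
    if any(fnmatchcase(url, rule) for rule in include_rules):
        return True
    if any(fnmatchcase(url, rule) for rule in exclude_rules):
        return False
    # No rule matched at all: follow the link (the impolite default).
    return True

print(should_crawl("http://intranet.example.local/hr/policy.aspx"))
print(should_crawl("http://www.dotnetnuke.com/Default.aspx"))
```

With both rule lists empty, every link is followed, which is roughly the situation those default installations were in.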

Now that we have the proper exclusions in our robots.txt file ( see below )… we’re no longer being hammered by this particular bot.  Are you?

User-agent: Mozilla/4.0 (compatible; MSIE 4.01; Windows NT; MS Search 4.0 Robot) Microsoft
Disallow: /
