
Robots, Spiders, and robots.txt


mlulm


I think there should be a place to keep track of which spiders should be listed in our SID killer script and what should go in our robots.txt file.

 

You can use this link to check your SID killer with different user agents:

http://www.wannabrowser.com/

 

There is a list of robots at

http://tamingthebeast.net/articles2/search...ine-spiders.htm

It also discusses robots.txt.

 

This is the SID killer script I am using. I got it off the forum and it seems to work. I would like to know how the list should be adjusted: are there spiders I need but don't have, or some I have but don't need?

// Add more spiders as you find them. MAKE SURE THEY ARE LOWER CASE!
   $spiders = array("googlebot", "gigabot", "almaden.ibm.com", "appie 1.1", "augurfind", "baiduspider", "bannana_bot", "bdcindexer", "docomo", "fast-webcrawler", "frooglebot", "geobot", "ia_archiver", "henrythemiragorobot", "infoseek", "sidewinder", "lachesis", "mercator", "moget/1.0", "nationaldirectory-webspider", "naverrobot", "ncsa beta", "netresearchserver", "ng/1.0", "osis-project", "polybot", "pompos", "scooter", "seventwentyfour", "slurp/si", "[email protected]", "steeler/1.3", "szukacz", "teoma", "turnitinbot", "vagabondo", "zao/0", "zyborg/1.0");

   // strpos() does a literal substring match; the original used ereg(),
   // which treats each entry as a regular expression and is deprecated.
   $agent = strtolower(getenv("HTTP_USER_AGENT"));
   foreach ($spiders as $val) {
       if (strpos($agent, $val) !== false) {
           // Edit out one of these as necessary depending upon your version of html_output.php
           //$sess = NULL;
           $sid = NULL;
           break;
       }
   }

 

I would like to see a robots.txt file that works well with these SID killers.
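For reference, a starting-point robots.txt might look like the following; the Disallow paths are illustrative osCommerce pages and would need adjusting to your own store layout:

```
# Illustrative example - adjust the paths to your own store
User-agent: *
Disallow: /login.php
Disallow: /account.php
Disallow: /shopping_cart.php
Disallow: /checkout_shipping.php
Disallow: /checkout_payment.php
```

This keeps well-behaved spiders away from session-sensitive pages in the first place, so the SID killer only has to handle the rest.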


That list of spider UAs in the code was compiled by grep'ing approximately 3.5 gigabytes' worth of transfer logs for requests for robots.txt.

 

As spiders are the only things that request the robots.txt file, these are the ones that the grep came up with.
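That log-mining step can also be sketched in PHP. This is a rough equivalent, not the grep I actually ran; it assumes a combined-format Apache access log (the user agent being the last quoted field), and the function name is mine:

```php
<?php
// Collect the user agents of clients that requested robots.txt
// from a combined-format Apache access log.
function spiders_from_log($logfile) {
    $agents = array();
    foreach (file($logfile) as $line) {
        // Only look at requests for robots.txt
        if (strpos($line, "GET /robots.txt") === false) {
            continue;
        }
        // In combined log format the user agent is the last quoted field
        if (preg_match('/"([^"]*)"$/', trim($line), $m)) {
            $agents[strtolower($m[1])] = true;
        }
    }
    // Keys of the map give a de-duplicated list
    return array_keys($agents);
}
```

Each distinct UA that came back then went into the $spiders array by hand, trimmed down to a stable lower-case fragment.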

 

There are going to be some that were missed, and some of those listed may now be dead. All you have to do is find a resource on the web with a list of spider UAs and add them to the list (in lower case).

 

E.g., say you find a spider called "OscommerceExperimentalSpider - v2.0"; I would add it in as:

 

$spiders = array("googlebot", "gigabot", "almaden.ibm.com", "appie 1.1", "augurfind", "baiduspider", "bannana_bot", "bdcindexer", "docomo", "fast-webcrawler", "frooglebot", "geobot", "ia_archiver", "henrythemiragorobot", "infoseek", "sidewinder", "lachesis", "mercator", "moget/1.0", "nationaldirectory-webspider", "naverrobot", "ncsa beta", "netresearchserver", "ng/1.0", "oscommerceexperimental", "osis-project", "polybot", "pompos", "scooter", "seventwentyfour", "slurp/si", "[email protected]", "steeler/1.3", "szukacz", "teoma", "turnitinbot", "vagabondo", "zao/0", "zyborg/1.0");

 

It's that easy.

 

I have this set up so that all I have to do is update one text file on each server, and every site this is implemented on is automatically updated when I find a new spider. Easy.
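One way to do the shared-text-file approach (the file path and function name here are hypothetical, not from the actual setup) is to keep one lower-case UA fragment per line and load it in place of the hard-coded array:

```php
<?php
// Load spider UA fragments (one per line) from a shared text file,
// so every site picks up new entries automatically.
// The path passed in is up to you - e.g. "/home/shared/spiders.txt".
function load_spiders($file) {
    $spiders = array();
    foreach (file($file) as $line) {
        $line = strtolower(trim($line));
        if ($line != '') {          // skip blank lines
            $spiders[] = $line;
        }
    }
    return $spiders;
}
```

The SID killer then becomes `$spiders = load_spiders("/home/shared/spiders.txt");` followed by the same foreach loop, and adding a new spider means editing one file per server.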


How did you decide to use just this part "oscommerceexperimental" of "OscommerceExperimentalSpider - v2.0"?

Could you just use "bot" to cover all the spiders that have "bot" in their name?

 

Thanks burt, your list has saved me a lot of trouble; my site was being smashed by the googlebot.


You want to make sure that the term you use is not too generic, as otherwise you might end up not giving a SID to visitors who need one.

 

"bot" is too generic, as it would not give a SID to the following (assuming they appeared in the browser's UA):

 

bottom

botulism

bottle

us robotics

 

Lots of ISPs give their customers a version of IE with their logo and details - if you were with an ISP named Botticelli Internet, for example, these people wouldn't get a SID when they visited (if you used "bot" as a UA sniffer)...

 

Make sense?
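The point can be seen with a small literal-substring check; the helper function here is mine, not part of the original script, but it matches the same way the SID killer's loop does:

```php
<?php
// True if any of the given lower-case fragments appears in the UA,
// using the same literal substring test as the SID killer loop.
function is_spider($user_agent, $spiders) {
    $user_agent = strtolower($user_agent);
    foreach ($spiders as $val) {
        if (strpos($user_agent, $val) !== false) {
            return true;
        }
    }
    return false;
}

// A fragment like "bot" catches real spiders, but it also catches an
// innocent ISP-branded browser, which would then lose its SID.
// A specific fragment like "googlebot" only catches the spider.
```

So the rule of thumb is: take the longest stable portion of the spider's UA (e.g. "oscommerceexperimental" rather than "bot"), so version bumps still match but ordinary browsers never do.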

Link to comment
Share on other sites

Archived

This topic is now archived and is closed to further replies.
