mlulm
Posted January 26, 2003

I think there should be a place to keep track of which spiders belong in our SID killer script and what should go in our robots.txt file.

You can use this link to check your SID killer against different User-Agents: http://www.wannabrowser.com/

There is a list of robots at http://tamingthebeast.net/articles2/search...ine-spiders.htm which also talks about robots.txt.

This is the SID killer script I am using. I got it off the forum and it seems to work. I would like to know how the list should be adjusted: are there spiders I need but don't have, or some I have and don't need?

// Add more spiders as you find them. MAKE SURE THEY ARE LOWER CASE!
$spiders = array("googlebot", "gigabot", "almaden.ibm.com", "appie 1.1",
                 "augurfind", "baiduspider", "bannana_bot", "bdcindexer",
                 "docomo", "fast-webcrawler", "frooglebot", "geobot",
                 "ia_archiver", "henrythemiragorobot", "infoseek",
                 "sidewinder", "lachesis", "mercator", "moget/1.0",
                 "nationaldirectory-webspider", "naverrobot", "ncsa beta",
                 "netresearchserver", "ng/1.0", "osis-project", "polybot",
                 "pompos", "scooter", "seventwentyfour", "slurp/si",
                 "[email protected]", "steeler/1.3", "szukacz",
                 "teoma", "turnitinbot", "vagabondo", "zao/0", "zyborg/1.0");

// If the visitor's User-Agent matches a known spider, blank the session
// ID so crawlers never get a SID appended to the URLs they index.
foreach ($spiders as $Val) {
    if (ereg($Val, strtolower(getenv("HTTP_USER_AGENT")))) {
        // Edit out one of these as necessary depending upon your version of html_output.php
        //$sess = NULL;
        $sid = NULL;
        break;
    }
}

I would like to see some robots.txt files that work well with the SID killers.
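If you'd rather not rely on wannabrowser, you can also test the check offline by faking the User-Agent before running the same loop. A rough sketch only: the putenv() call just simulates what the web server would normally set, and the Googlebot UA string is merely an example.

// Standalone test of the spider check: fake a User-Agent, then run the
// same lower-case substring match the SID killer uses.
putenv("HTTP_USER_AGENT=Googlebot/2.1 (+http://www.googlebot.com/bot.html)");

$spiders = array("googlebot", "gigabot", "teoma");   // shortened list for the test
$is_spider = false;
foreach ($spiders as $Val) {
    if (ereg($Val, strtolower(getenv("HTTP_USER_AGENT")))) {
        $is_spider = true;
        break;
    }
}
echo $is_spider ? "spider: SID withheld\n" : "normal visitor: SID given\n";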
burt
Posted January 26, 2003

I compiled that list of spider UAs by grep'ing approximately 3.5 gigabytes worth of transfer logs for requests for robots.txt. As spiders are the only things that request the robots.txt file, these are the ones the grep came up with. Some will have been missed, and some of those listed may now be dead.

All you have to do is find a resource on the web which has a list of spider UAs and add them to the list (in lower case). Say, for example, you find a spider called "OscommerceExperimentalSpider - v2.0"; I would add it in as:

$spiders = array("googlebot", "gigabot", "almaden.ibm.com", "appie 1.1",
                 "augurfind", "baiduspider", "bannana_bot", "bdcindexer",
                 "docomo", "fast-webcrawler", "frooglebot", "geobot",
                 "ia_archiver", "henrythemiragorobot", "infoseek",
                 "sidewinder", "lachesis", "mercator", "moget/1.0",
                 "nationaldirectory-webspider", "naverrobot", "ncsa beta",
                 "netresearchserver", "ng/1.0", "oscommerceexperimental",
                 "osis-project", "polybot", "pompos", "scooter",
                 "seventwentyfour", "slurp/si", "[email protected]",
                 "steeler/1.3", "szukacz", "teoma", "turnitinbot",
                 "vagabondo", "zao/0", "zyborg/1.0");

It's that easy. I have this set up so that all I have to do is update one text file on each server, and every site that this is implemented on is automatically updated when I find a new spider. Easy.
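burt hasn't posted the code for his one-file setup, but a minimal sketch of the idea might look like this, assuming a shared file (the name spiders.txt and its path are made up here) with one lower-case UA fragment per line:

// Load the spider list from a shared text file instead of hard-coding it,
// so every shop on the server picks up new entries automatically.
$spiders = file('/home/shared/spiders.txt');   // hypothetical path

foreach ($spiders as $Val) {
    $Val = trim($Val);            // file() keeps the trailing newline
    if ($Val == '') continue;     // skip blank lines
    if (ereg($Val, strtolower(getenv("HTTP_USER_AGENT")))) {
        $sid = NULL;              // known spider: withhold the session ID
        break;
    }
}

Each shop then just includes this snippet, and adding a new spider means appending one line to the text file.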
mlulm (Author)
Posted January 26, 2003

How did you decide to use just the "oscommerceexperimental" part of "OscommerceExperimentalSpider - v2.0"? Could you just use "bot" to cover all the spiders that have "bot" in their name?

Thanks burt, your list has saved me a lot of trouble; my site was being smashed by the googlebot.
burt
Posted January 26, 2003

You want to make sure that the term you use is not too generic, as otherwise you might end up not giving a SID to visitors who need one. "bot" is too generic: it would not give a SID to any visitor whose browser UA happened to contain, for example:

bottom
botulism
bottle
us robotics

Lots of ISPs give their customers a version of IE with their logo and details in the UA. If you were with an ISP named "botticelli internet", for example, you wouldn't get a SID if you visited (if "bot" were used as the UA sniffer). Make sense? There's a quick demonstration below.
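A short sketch to make the false-positive problem concrete: run this and every one of the strings matches, even though only googlebot is actually a spider.

// Demonstration of why "bot" is too generic as a spider test: a plain
// substring match flags all of these User-Agent fragments, so real
// visitors would be denied a session ID.
$agents = array("bottom", "botulism", "bottle", "us robotics",
                "botticelli internet", "googlebot");

foreach ($agents as $ua) {
    if (ereg("bot", strtolower($ua))) {
        echo "$ua => treated as a spider (no SID)\n";
    } else {
        echo "$ua => normal visitor (gets a SID)\n";
    }
}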
mlulm (Author)
Posted January 26, 2003

Does it make sense to put the most active spiders first in the list, to make the script faster? What do you suggest for robots.txt? Is it needed?
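For what it's worth, the bare-bones robots.txt examples I've seen for osCommerce look something like this. Just a sketch, not a recommendation from this thread: the /catalog/ prefix and the file names assume a default install, so adjust them to your own setup.

# Sketch only: keep spiders out of session- and account-related pages
# so they index products, not carts (paths assume a default install)
User-agent: *
Disallow: /catalog/shopping_cart.php
Disallow: /catalog/account.php
Disallow: /catalog/login.php
Disallow: /catalog/checkout_shipping.php
Disallow: /catalog/checkout_payment.php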
Archived
This topic is now archived and is closed to further replies.