
New Spiders - lots of them for Burt's SID killer fix


JenRed

Recommended Posts

Posted

This refers to the topic "New Spider":

http://www.oscommerce.com/forums/viewtopic.php?t=36577

 

 

This is not for Ian's SID Killer mod, but for the one that changes code in catalog/includes/functions/html_output.php

 

I have been cutting and pasting like mad (RSI!!!) from some of the sites that list robot names and have come up with this big list. There may be some double-ups - I presume this is okay. Anybody who spots an error please post the correction here.

 

 

 

So...

 

 

in catalog/includes/functions/html_output.php

 

after:

// Add the session ID when moving from HTTP and HTTPS servers or when SID is defined 

   if ( (ENABLE_SSL == true ) && ($connection == 'SSL') && ($add_session_id == true) ) { 

     $sid = tep_session_name() . '=' . tep_session_id(); 

   } elseif ( ($add_session_id == true) && (tep_not_null(SID)) ) { 

     $sid = SID; 

   }

 

add this:

// Add more Spiders as you find them.  MAKE SURE THEY ARE LOWER CASE! 

$spiders = array("almaden.ibm.com", "abachobot", "aesop_com_spiderman", "appie 1.1", "ah-ha.com", "acme.spider", "ahoy", "iron33", "ia_archiver", "acoon", "spider.batsch.com", "crawler", "atomz", "antibot", "wget", "roach.smo.av.com-1.0", "altavista-intranet", "asterias2.0", "augurfind", "fluffy", "zyborg", "wire", "wscbot", "yandex", "yellopet-spider", "libwww-perl", "speedfind", "supersnooper", "webwombat", "marvin/infoseek", "whizbang", "nazilla", "uk searcher spider", "esismartspider", "surfnomore ", "kototoi", "scrubby", "baiduspider", "bannana_bot", "bdcindexer", "docomo", "fast-webcrawler", "frooglebot", "geobot", "googlebot", "googlebot/2.1", "henrythemiragorobot", "rabot", "pjspider", "architextspider", "henrythemiragorobot", "gulliver", "deepindex", "dittospyder", "jack", "infoseek", "sidewinder", "lachesis", "moget/1.0", "nationaldirectory-webspider", "picosearch", "naverrobot", "ncsa beta", "moget/2.0", "aranha", "netresearchserver", "ng/1.0", "osis-project", "polybot", "xift", "nationaldirectory", "piranha", "shark", "psbot", "pinpoint", "alkalinebot", "openbot", "pompos", "teomaagent", "zyborg", "gulliver", "architext", "fast-webcrawler", "seventwentyfour", "toutatis", "iltrovatore-setaccio", "sidewinder", "incywincy", "hubater", "slurp/si", "slurp", "partnersite", "diibot", "nttdirectory_robot", "griffon", "geckobot", "kit-fireball", "gencrawler", "ezresult", "mantraagent", "t-rex", "mp3bot", "ip3000", "lnspiderguy", "architectspider", "steeler/1.3", "szukacz", "teoma", "maxbot.com", "bradley", "infobee", "teoma_agent1", "turnitinbot", "vagabondo", "w3c_validator", "zao/0", "zyborg/1.0", "netresearchserver", "slurp", "ask jeeves", "ia_archiver", "scooter", "mercator", "crawler@fast", "crawler", "infoseek sidewinder", "lycos_spider", "fluffy the spider", "ultraseek", "anthill", "walhello appie", "arachnophilia", "arale", "araneo ", "aretha", "arks", "aspider", "atn worldwide", "atomz", "backrub", "big brother", "bjaaland", "blackwidow", "die blinde 
kuh", "bloodhound", "borg-bot", "bright.net caching robot", "bspider", "cactvs chemistry spider", "calif", "cassandra", "digimarc marcspider/cgi", "checkbot", "christcrawler.com", "churl", "cienciaficcion.net", "cmc/0.01", "collective", "combine system", "conceptbot", "coolbot", "web core / roots", "xyleme robot", "internet cruiser robot", "cusco", "cyberspyder link test", "deweb(c) katalog/index", "dienstspider", "digger", "digital integrity robot", "direct hit grabber", "dnabot", "download express", "dragonbot", "dwcp (dridus' web cataloging project)", "e-collector", "ebiness", "eit link verifier robot", "elfinbot", "emacs-w3 search engine", "ananzi", "esther", "evliya celebi", "nzexplorer", "fastcrawler", "fluid dynamics search engine robot", "felix ide", "wild ferret", "web hopper", "fetchrover", "fido", "hamahakki", "kit-fireball", "fish search", "fouineur", "robot francoroute", "freecrawl", "funnelweb", "gammaspider", "focusedcrawler", "gazz", "gcreep", "getbot", "geturl", "golem", "googlebot", "grapnel/0.01 experiment", "griffon", "gromit", "northern light gulliver", "gulper bot", "hambot", "harvest", "havindex", "html index", "hometown spider pro", "wired digital", "dig", "htmlgobble", "hyper-decontextualizer", "iajabot", "ibm_planetwide", "popular iconoclast", "ingrid", "imagelock", "informant", "infoseek robot 1.0", "infoseek sidewinder", "infospiders", "inspector web", "intelliagent", "i, robot", "israeli-search", "javabee", "jbot java web robot", "jcrawler", "jobo java web robot", "jobot", "joebot", "jumpstation", "image.kapsi.net", "katipo", "kdd-explorer", "kilroy", "ko_yappo_robot", "labelgrabber", "larbin", "legs", "link validator", "linkscan", "linkwalker", "lockon", "logo.gif crawler", "lycos", "mac wwwworm", "magpie", "marvin/infoseek", "mattie", "mediafox", "merzscope", "nec-meshexplorer", "mindcrawler", "mnogosearch", "momspider", "monster", "motor", "muncher", "ferret", "mwd.search", "internet shinchakubin", "netcarta webmap engine", 
"netmechanic", "netscoop", "newscan-online", "nhse web forager", "nomad", "the northstar robot", "occam", "hku www octopus", "openfind data gatherer", "orb search", "pack rat", "pageboy", "parasite", "patric", "pegasus", "the peregrinator", "perlcrawler 1.0", "phantom", "phpdig", "piltdownman", "pimptrain.com", "pioneer", "html_analyzer", "portal juice spider", "pgp key agent", "plumtreewebaccessor", "poppi", "portalb spider", "psbot", "getterroboplus puu", "the python robot", "raven search", "rbse spider", "resume robot", "roadhouse", "road runner", "robbie", "computingsite robi/1.0", "robocrawl spider", "robofox", "robozilla", "roverbot", "rules", "safetynet robot", "search.aus-au.com", "sleek", "searchprocess", "senrigan", "sg-scout", "shagseeker", "shai'hulud", "sift", "simmany", "site valet", "open text", "sitetech-rover", "skymob.com", "slcrawler", "smart spider", "snooper", "solbot", "speedy spider", "spider_monkey", "spiderbot", "spiderline", "spiderman", "spiderview", "spry wizard robot", "site searcher", "suke", "suntek", "sven", "tach black widow", "tarantula", "tarspider", "tcl w3", "techbot", "templeton", "teomatechnologies", "titin", "titan", "tkwww", "tlspider", "ucsd", "udmsearch", "url check", "url spider pro", "valkyrie", "verticrawl", "victoria", "vision-search", "voyager", "vwbot", "the nwi robot", "w3m2", "wallpaper", "the world wide web wanderer", "w@pspider", "webbandit", "webcatcher", "webcopy", "webfoot", "weblayers", "weblinker", "webmirror", "moose", "webquest", "digimarc marcspider", "webreaper", "webs", "websnarf", "webspider", "webvac", "webwalk", "webwalker", "webwatch", "wget", "whatuseek winona", "whowhere", "weblog monitor", "w3mir", "webstolperer", "web wombat", "the world wide web worm", "wwwc", "webzinger", "xget", "nederland.zoek", "mantraagent", "moget", "t-h-u-n-d-e-r-s-t-o-n-e", "muscatferret", "voilabot", "sleek spider", "kit_fireball", "semanticdiscovery/0.1", "inktomisearch.com ", "webcrawler");



// get useragent and force to lowercase just once 

$useragent = strtolower(getenv("HTTP_USER_AGENT")); 



foreach($spiders as $Val) { 

   if (!(strpos($useragent, $Val) === false)) { 

     // found a spider, kill the sid/sess 

     // Edit out one of these as necessary depending upon your version of html_output.php 

     // $sess = NULL;

      $sid = NULL;

     break; 

   } 

} 

// End spider stopper code
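One thing to watch: strpos() takes the haystack first and the needle second, so the user agent string must be the first argument and the spider name the second. Here is a minimal standalone sketch of the matching logic; the function name and the sample user agents are mine, for illustration only:

```php
<?php
// Illustrative sketch of the spider check above; not part of osCommerce itself.
// Returns true when any known spider substring appears in the user agent.
function is_spider($useragent, array $spiders) {
    $useragent = strtolower($useragent);  // the list is lower case, so compare lower case
    foreach ($spiders as $spider) {
        // haystack first, needle second: look for the spider name inside the UA
        if (strpos($useragent, $spider) !== false) {
            return true;
        }
    }
    return false;
}

$spiders = array("googlebot", "slurp", "ia_archiver");
var_dump(is_spider("Mozilla/5.0 (compatible; Googlebot/2.1)", $spiders)); // bool(true)
var_dump(is_spider("Mozilla/4.0 (Windows NT 5.1)", $spiders));            // bool(false)
```

With the arguments the other way around, the check asks whether the whole user agent string appears inside the short spider name, which can never be true, so the SID would never be suppressed.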

 

 

Thanks to all those whose posts I have been reading to understand this!

 

Jen

 

http://www.redinstead.com.au/catalog

(not finished yet but please test for me!) :)

I haven't lost my mind - I have it backed up on disk somewhere.

  • Replies 155
Posted

so I did it right then? :)

 

Jen


Posted

:D Great spider hunt job :D

 

saved me a few hundred keystrokes 8)

Posted

You know, just thinking about this: since html_output.php is called often during a visitor's browsing of the site, wouldn't it be prudent to store these values in a table and then just execute a SQL statement that returns 1 if it finds the user agent in the table? It would also be easy to build an admin interface to add new spiders, edit old ones, or display all of them.
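As a rough illustration of that idea, here is a self-contained sketch using SQLite via PDO; a live osCommerce install would use its MySQL connection and tep_db_query() instead, and the spiders table and its name column are invented for this example:

```php
<?php
// Hypothetical table-driven spider lookup (table/column names invented here).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE spiders (name TEXT NOT NULL)");

// An admin interface would manage these rows; seed a few by hand here.
$insert = $db->prepare("INSERT INTO spiders (name) VALUES (?)");
foreach (array("googlebot", "slurp", "ia_archiver") as $name) {
    $insert->execute(array($name));
}

// Returns true if any stored spider name is a substring of the user agent.
function agent_is_spider(PDO $db, $useragent) {
    $stmt = $db->prepare(
        "SELECT 1 FROM spiders WHERE :ua LIKE '%' || name || '%' LIMIT 1");
    $stmt->execute(array(':ua' => strtolower($useragent)));
    return $stmt->fetchColumn() !== false;
}

var_dump(agent_is_spider($db, "Mozilla/5.0 (compatible; Googlebot/2.1)")); // bool(true)
var_dump(agent_is_spider($db, "Mozilla/4.0 (Windows NT 5.1)"));            // bool(false)
```

Whether this is actually faster than the in-memory array loop would depend on query caching; the main win is the admin interface, not raw speed.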

-------------------------------------------------------------------------------------------------------------------------

NOTE: As of Oct 2006, I'm not as active in this forum as I used to be, but I still work with osC quite a bit.

If you have a question about any of my posts here, your best bet is to contact me though either Email or PM in my profile, and I'll be happy to help.

Posted

that thought has been tossed around by a few people but no one has gotten to it yet on their to-do list :wink:

Posted

Wouldn't it be easier to check whether the session belongs to an actual customer than to cross-reference hundreds of spiders?

 

If the session doesn't belong to a valid customer, don't show the SID in the URL. Bang: it handles every spider known.
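A sketch of that idea as a standalone function. The sid_for() name is mine, and treating the presence of customer_id in the session as "valid customer" is an assumption (osCommerce registers customer_id in the session at login, checked via tep_session_is_registered()), so treat this as an outline rather than a drop-in patch:

```php
<?php
// Rough sketch of the "only append the SID for real customers" idea.
// $session stands in for the osCommerce session data; names are illustrative.
function sid_for(array $session, $session_name, $session_id) {
    // Only logged-in customers get a SID appended to links.
    if (isset($session['customer_id'])) {
        return $session_name . '=' . $session_id;
    }
    return NULL;  // anonymous visitor or spider: no SID in the URL
}

var_dump(sid_for(array('customer_id' => 42), 'osCsid', 'abc123')); // "osCsid=abc123"
var_dump(sid_for(array(), 'osCsid', 'abc123'));                    // NULL
```

The objection raised in the thread applies here: a SID is assigned before a visitor logs in, so this would also strip the SID from genuine customers who are still browsing anonymously.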

Posted

Probably a stupid question, Wayne, but how would you do that? SIDs are assigned before a customer is valid.


Posted

that's a good question - I was thinking the same thing :?

Posted

How is this different than Ian's attempt?

 

What are the issues with his and yours?

 

I've already applied his, should I switch?

- - - -

Sometimes, ignorance is bliss.

Posted

There are a few bugs in Ian's that seem to show up when moving from SSL to non-SSL pages.

 

In particular, if you use a shared SSL certificate, you cannot use Ian's code at the moment.

 

There are no bugs with this one, it's just a little inefficient.


Posted

Just looking at Jen's additions...

 

are the entries with spaces in the names going to work?

... if you want to REALLY see something that doesn't set up right out of the box without some tweaking,

try being a Foster Parent!

Posted

I removed the entries in Jen's additions that are covered by the $spider_agent[] "hard tests" (i.e. those containing spider, bot and crawler in the name). If you are using that option, you can use the following to cut down the size of the array quite a bit. I don't know how much difference that makes, but maybe it helps :).

 

// Add more Spiders as you find them.  MAKE SURE THEY ARE LOWER CASE! 

$spiders = array("almaden.ibm.com", "appie 1.1", "ah-ha.com", "ahoy", "iron33", "ia_archiver", "acoon", "atomz", "wget", "roach.smo.av.com-1.0", "altavista-intranet", "asterias2.0", "augurfind", "fluffy", "zyborg", "wire", "yandex", "libwww-perl", "speedfind", "supersnooper", "webwombat", "marvin/infoseek", "whizbang", "nazilla", "surfnomore ", "kototoi", "scrubby", "bdcindexer", "docomo", "gulliver", "deepindex", "dittospyder", "jack", "infoseek", "sidewinder", "lachesis", "moget/1.0", "picosearch", "ncsa beta", "moget/2.0", "aranha", "netresearchserver", "ng/1.0", "osis-project", "xift", "nationaldirectory", "piranha", "shark", "pinpoint", "pompos", "teomaagent", "zyborg", "gulliver", "architext", "seventwentyfour", "toutatis", "iltrovatore-setaccio", "sidewinder", "incywincy", "hubater", "slurp/si", "slurp", "partnersite", "griffon", "kit-fireball", "ezresult", "mantraagent", "t-rex", "ip3000", "steeler/1.3", "szukacz", "teoma", "bradley", "infobee", "teoma_agent1", "vagabondo", "w3c_validator", "zao/0", "zyborg/1.0", "netresearchserver", "slurp", "ask jeeves", "ia_archiver", "scooter", "mercator", "infoseek sidewinder", "ultraseek", "anthill", "walhello appie", "arachnophilia", "arale", "araneo ", "aretha", "arks", "atn worldwide", "atomz", "backrub", "big brother", "bjaaland", "blackwidow", "die blinde kuh", "bloodhound", "calif", "cassandra", "churl", "cienciaficcion.net", "cmc/0.01", "collective", "combine system", "web core / roots", "cusco", "cyberspyder link test", "deweb(c) katalog/index", "digger", "direct hit grabber", "download express", "dwcp (dridus' web cataloging project)", "e-collector", "ebiness", "emacs-w3 search engine", "ananzi", "esther", "evliya celebi", "nzexplorer", "felix ide", "wild ferret", "web hopper", "fetchrover", "fido", "hamahakki", "kit-fireball", "fish search", "fouineur", "freecrawl", "funnelweb", "gazz", "gcreep", "geturl", "golem", "grapnel/0.01 experiment", "griffon", "gromit", "northern light gulliver", "harvest", 
"havindex", "html index", "wired digital", "dig", "htmlgobble", "hyper-decontextualizer", "ibm_planetwide", "popular iconoclast", "ingrid", "imagelock", "informant", "infoseek sidewinder", "inspector web", "intelliagent", "israeli-search", "javabee", "jumpstation", "image.kapsi.net", "katipo", "kdd-explorer", "kilroy", "labelgrabber", "larbin", "legs", "link validator", "linkscan", "linkwalker", "lockon", "lycos", "mac wwwworm", "magpie", "marvin/infoseek", "mattie", "mediafox", "merzscope", "nec-meshexplorer", "mnogosearch", "monster", "motor", "muncher", "ferret", "mwd.search", "internet shinchakubin", "netcarta webmap engine", "netmechanic", "netscoop", "newscan-online", "nhse web forager", "nomad", "occam", "hku www octopus", "openfind data gatherer", "orb search", "pack rat", "pageboy", "parasite", "patric", "pegasus", "the peregrinator", "phantom", "phpdig", "piltdownman", "pimptrain.com", "pioneer", "html_analyzer", "pgp key agent", "plumtreewebaccessor", "poppi", "getterroboplus puu", "raven search", "roadhouse", "road runner", "robbie", "computingsite robi/1.0", "robofox", "robozilla", "rules", "search.aus-au.com", "sleek", "searchprocess", "senrigan", "sg-scout", "shagseeker", "shai'hulud", "sift", "simmany", "site valet", "open text", "sitetech-rover", "skymob.com", "snooper", "site searcher", "suke", "suntek", "sven", "tach black widow", "tarantula", "tcl w3", "templeton", "teomatechnologies", "titin", "titan", "tkwww", "ucsd", "udmsearch", "url check", "valkyrie", "verticrawl", "victoria", "vision-search", "voyager", "w3m2", "wallpaper", "the world wide web wanderer", "webbandit", "webcatcher", "webcopy", "webfoot", "weblayers", "weblinker", "webmirror", "moose", "webquest", "webreaper", "webs", "websnarf", "webvac", "webwalk", "webwalker", "webwatch", "wget", "whatuseek winona", "whowhere", "weblog monitor", "w3mir", "webstolperer", "web wombat", "the world wide web worm", "wwwc", "webzinger", "xget", "nederland.zoek", "mantraagent", "moget", 
"t-h-u-n-d-e-r-s-t-o-n-e", "muscatferret", "kit_fireball", "semanticdiscovery/0.1", "inktomisearch.com"); 



// get useragent and force to lowercase just once 

$useragent = strtolower(getenv("HTTP_USER_AGENT")); 



foreach($spiders as $Val) { 

   if (!(strpos($useragent, $Val) === false)) { 

     // found a spider, kill the sid/sess 

     // Edit out one of these as necessary depending upon your version of html_output.php 

     // $sess = NULL; 

      $sid = NULL; 

     break; 

   } 

} 

// End spider stopper code

 

Thanks Jen for compiling that incredible list!

 

David

Posted

I have tried to install this in my html_output.php but keep getting a parse error.

 

Parse error: parse error, expecting `')'' in /home/mrsfield/public_html/includes/functions/html_output.php on line 284

 

I have tried several different versions from here and contributions and get the same error.

 

What am I doing wrong??

Posted

if you copied and pasted the code from the thread, you may have picked up stray characters, which need to be deleted from the end of each line :wink:

Posted

I tried deleting any stray characters, but it didn't seem to work; same error. I replaced my html_output.php with the code from this thread:

http://www.oscommerce.com/forums/viewtopic.php...ight=sid+killer

 

It seems to work, but I wanted to add the other spiders, but keep getting parse error.........

 

I think it's probably something real simple that I'm missing, but I can't find it... :oops:

Posted

I got that error when I accidentally deleted one of the "," between the spider names. You might double-check that.

Posted

I double-checked the "," and all are there. I even found a few names with a space between the end of the name and the closing ", but that didn't help either..............

Posted

When I updated html_output.php using Jen's list of spiders, my cart started looping at checkout and you could proceed no further. :cry:

 

It's a shame (for me!) because that's a long list!!!!

Posted

I think you are all going rather over the top with these spider lists. You only need to worry about 20 spiders, absolute tops.

 

Most of the spiders I'm seeing mentioned (across multiple threads, not just this one!) will never ever visit your site... and if they do, so what? You have a SID showing in a search engine that no-one will ever use...

 

Concentrate on the big boys only. The little guys will just fade away.

Posted

Just checked, and Google is now listing my site sans SIDs!! :D

 

Just wanted to take time to say thanks to Gary. Thank you.

 

Also, those are good comments about the search engine list.

Posted

What does this spider list do?

 

Is there a problem with search engines showing a session id when the spider hits your site? What is the fix for this?

 

What good does it do to list all the spiders?

Archived

This topic is now archived and is closed to further replies.
