
New Spiders - lots of them for Burt's SID killer fix


JenRed


Reference for this topic, "New Spider":

http://www.oscommerce.com/forums/viewtopic.php?t=36577

 

 

This is not for Ian's SID Killer mod, but for the one that changes code in catalog/includes/functions/html_output.php.

 

I have been cutting and pasting like mad (RSI!!!) from some of the sites that list robot names, and have come up with this big list. There may be some double-ups; I presume this is okay. Anybody who spots an error, please post the correction here.

 

 

 

So, in catalog/includes/functions/html_output.php,

 

after:

// Add the session ID when moving from HTTP and HTTPS servers or when SID is defined 

   if ( (ENABLE_SSL == true ) && ($connection == 'SSL') && ($add_session_id == true) ) { 

     $sid = tep_session_name() . '=' . tep_session_id(); 

   } elseif ( ($add_session_id == true) && (tep_not_null(SID)) ) { 

     $sid = SID; 

   }

 

add this:

// Add more Spiders as you find them.  MAKE SURE THEY ARE LOWER CASE! 

$spiders = array("almaden.ibm.com", "abachobot", "aesop_com_spiderman", "appie 1.1", "ah-ha.com", "acme.spider", "ahoy", "iron33", "ia_archiver", "acoon", "spider.batsch.com", "crawler", "atomz", "antibot", "wget", "roach.smo.av.com-1.0", "altavista-intranet", "asterias2.0", "augurfind", "fluffy", "zyborg", "wire", "wscbot", "yandex", "yellopet-spider", "libwww-perl", "speedfind", "supersnooper", "webwombat", "marvin/infoseek", "whizbang", "nazilla", "uk searcher spider", "esismartspider", "surfnomore ", "kototoi", "scrubby", "baiduspider", "bannana_bot", "bdcindexer", "docomo", "fast-webcrawler", "frooglebot", "geobot", "googlebot", "googlebot/2.1", "henrythemiragorobot", "rabot", "pjspider", "architextspider", "henrythemiragorobot", "gulliver", "deepindex", "dittospyder", "jack", "infoseek", "sidewinder", "lachesis", "moget/1.0", "nationaldirectory-webspider", "picosearch", "naverrobot", "ncsa beta", "moget/2.0", "aranha", "netresearchserver", "ng/1.0", "osis-project", "polybot", "xift", "nationaldirectory", "piranha", "shark", "psbot", "pinpoint", "alkalinebot", "openbot", "pompos", "teomaagent", "zyborg", "gulliver", "architext", "fast-webcrawler", "seventwentyfour", "toutatis", "iltrovatore-setaccio", "sidewinder", "incywincy", "hubater", "slurp/si", "slurp", "partnersite", "diibot", "nttdirectory_robot", "griffon", "geckobot", "kit-fireball", "gencrawler", "ezresult", "mantraagent", "t-rex", "mp3bot", "ip3000", "lnspiderguy", "architectspider", "steeler/1.3", "szukacz", "teoma", "maxbot.com", "bradley", "infobee", "teoma_agent1", "turnitinbot", "vagabondo", "w3c_validator", "zao/0", "zyborg/1.0", "netresearchserver", "slurp", "ask jeeves", "ia_archiver", "scooter", "mercator", "crawler@fast", "crawler", "infoseek sidewinder", "lycos_spider", "fluffy the spider", "ultraseek", "anthill", "walhello appie", "arachnophilia", "arale", "araneo ", "aretha", "arks", "aspider", "atn worldwide", "atomz", "backrub", "big brother", "bjaaland", "blackwidow", "die blinde 
kuh", "bloodhound", "borg-bot", "bright.net caching robot", "bspider", "cactvs chemistry spider", "calif", "cassandra", "digimarc marcspider/cgi", "checkbot", "christcrawler.com", "churl", "cienciaficcion.net", "cmc/0.01", "collective", "combine system", "conceptbot", "coolbot", "web core / roots", "xyleme robot", "internet cruiser robot", "cusco", "cyberspyder link test", "deweb(c) katalog/index", "dienstspider", "digger", "digital integrity robot", "direct hit grabber", "dnabot", "download express", "dragonbot", "dwcp (dridus' web cataloging project)", "e-collector", "ebiness", "eit link verifier robot", "elfinbot", "emacs-w3 search engine", "ananzi", "esther", "evliya celebi", "nzexplorer", "fastcrawler", "fluid dynamics search engine robot", "felix ide", "wild ferret", "web hopper", "fetchrover", "fido", "hamahakki", "kit-fireball", "fish search", "fouineur", "robot francoroute", "freecrawl", "funnelweb", "gammaspider", "focusedcrawler", "gazz", "gcreep", "getbot", "geturl", "golem", "googlebot", "grapnel/0.01 experiment", "griffon", "gromit", "northern light gulliver", "gulper bot", "hambot", "harvest", "havindex", "html index", "hometown spider pro", "wired digital", "dig", "htmlgobble", "hyper-decontextualizer", "iajabot", "ibm_planetwide", "popular iconoclast", "ingrid", "imagelock", "informant", "infoseek robot 1.0", "infoseek sidewinder", "infospiders", "inspector web", "intelliagent", "i, robot", "israeli-search", "javabee", "jbot java web robot", "jcrawler", "jobo java web robot", "jobot", "joebot", "jumpstation", "image.kapsi.net", "katipo", "kdd-explorer", "kilroy", "ko_yappo_robot", "labelgrabber", "larbin", "legs", "link validator", "linkscan", "linkwalker", "lockon", "logo.gif crawler", "lycos", "mac wwwworm", "magpie", "marvin/infoseek", "mattie", "mediafox", "merzscope", "nec-meshexplorer", "mindcrawler", "mnogosearch", "momspider", "monster", "motor", "muncher", "ferret", "mwd.search", "internet shinchakubin", "netcarta webmap engine", 
"netmechanic", "netscoop", "newscan-online", "nhse web forager", "nomad", "the northstar robot", "occam", "hku www octopus", "openfind data gatherer", "orb search", "pack rat", "pageboy", "parasite", "patric", "pegasus", "the peregrinator", "perlcrawler 1.0", "phantom", "phpdig", "piltdownman", "pimptrain.com", "pioneer", "html_analyzer", "portal juice spider", "pgp key agent", "plumtreewebaccessor", "poppi", "portalb spider", "psbot", "getterroboplus puu", "the python robot", "raven search", "rbse spider", "resume robot", "roadhouse", "road runner", "robbie", "computingsite robi/1.0", "robocrawl spider", "robofox", "robozilla", "roverbot", "rules", "safetynet robot", "search.aus-au.com", "sleek", "searchprocess", "senrigan", "sg-scout", "shagseeker", "shai'hulud", "sift", "simmany", "site valet", "open text", "sitetech-rover", "skymob.com", "slcrawler", "smart spider", "snooper", "solbot", "speedy spider", "spider_monkey", "spiderbot", "spiderline", "spiderman", "spiderview", "spry wizard robot", "site searcher", "suke", "suntek", "sven", "tach black widow", "tarantula", "tarspider", "tcl w3", "techbot", "templeton", "teomatechnologies", "titin", "titan", "tkwww", "tlspider", "ucsd", "udmsearch", "url check", "url spider pro", "valkyrie", "verticrawl", "victoria", "vision-search", "voyager", "vwbot", "the nwi robot", "w3m2", "wallpaper", "the world wide web wanderer", "w@pspider", "webbandit", "webcatcher", "webcopy", "webfoot", "weblayers", "weblinker", "webmirror", "moose", "webquest", "digimarc marcspider", "webreaper", "webs", "websnarf", "webspider", "webvac", "webwalk", "webwalker", "webwatch", "wget", "whatuseek winona", "whowhere", "weblog monitor", "w3mir", "webstolperer", "web wombat", "the world wide web worm", "wwwc", "webzinger", "xget", "nederland.zoek", "mantraagent", "moget", "t-h-u-n-d-e-r-s-t-o-n-e", "muscatferret", "voilabot", "sleek spider", "kit_fireball", "semanticdiscovery/0.1", "inktomisearch.com ", "webcrawler");



// get useragent and force to lowercase just once 

$useragent = strtolower(getenv("HTTP_USER_AGENT")); 



foreach ($spiders as $Val) { 
   // Note: the user agent is the haystack; the spider name is the needle. 
   if (strpos($useragent, $Val) !== false) { 
     // found a spider, kill the sid/sess 
     // Edit out one of these as necessary depending upon your version of html_output.php 
     // $sess = NULL; 
     $sid = NULL; 
     break; 
   } 
} 
// End spider stopper code
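If you want to verify the matching logic outside of osCommerce, here is a standalone sketch. The `is_spider()` helper name is mine, not part of the mod; note the `strpos()` argument order, where the user agent is the haystack and the spider token is the needle.

```php
<?php
// Standalone sanity check of the spider-matching logic; the is_spider()
// helper is for illustration only, not part of the contribution.
function is_spider($useragent, array $spiders) {
    // force to lowercase once, as the mod does
    $useragent = strtolower($useragent);
    foreach ($spiders as $val) {
        // user agent is the haystack, spider token is the needle
        if (strpos($useragent, $val) !== false) {
            return true;
        }
    }
    return false;
}

$spiders = array("googlebot", "slurp", "ia_archiver");
var_dump(is_spider("Mozilla/5.0 (compatible; Googlebot/2.1)", $spiders)); // bool(true)
var_dump(is_spider("Mozilla/4.0 (compatible; MSIE 6.0)", $spiders));      // bool(false)
```

If you swap the two `strpos()` arguments, the test looks for the whole user-agent string inside the short spider name, which essentially never matches.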

 

 

Thanks to all those whose posts I have been reading to understand this!

 

Jen

 

http://www.redinstead.com.au/catalog

(not finished yet but please test for me!) :)

I haven't lost my mind - I have it backed up on disk somewhere.



You know, just thinking about this: since html_output.php is called often during a visitor's browsing of the site, wouldn't it be prudent to store these values in a table, and then just execute a SQL statement that returns 1 if it finds the user agent in the table? This would also make it easy to build an admin interface to add new spiders, edit old ones, or display them all.
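As a rough sketch of that idea (the table name, column layout, and query are my assumptions, not working contribution code):

```php
<?php
// Hypothetical sketch of the table-based lookup suggested above.
// The table layout and query are assumptions, not part of any contribution:
//
//   CREATE TABLE spiders (
//     spiders_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
//     user_agent VARCHAR(64) NOT NULL
//   );
//
// One query then replaces the big in-script array; it returns a row when
// any stored token appears inside the visitor's user-agent string:
//
//   SELECT 1 FROM spiders
//    WHERE '<lowercased user agent>' LIKE CONCAT('%', user_agent, '%')
//    LIMIT 1;
//
// The same membership test, expressed in plain PHP against in-memory rows:
function spider_in_rows($useragent, array $rows) {
    $useragent = strtolower($useragent);
    foreach ($rows as $row) {
        if (strpos($useragent, $row['user_agent']) !== false) {
            return true;
        }
    }
    return false;
}
```

An admin page could then INSERT, UPDATE, or SELECT from that table like any other osCommerce admin screen, which is the maintenance benefit being suggested.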

-------------------------------------------------------------------------------------------------------------------------

NOTE: As of Oct 2006, I'm not as active in this forum as I used to be, but I still work with osC quite a bit.

If you have a question about any of my posts here, your best bet is to contact me through either email or PM via my profile, and I'll be happy to help.


Wouldn't it be easier to check whether the session is an actual customer session than to cross-reference hundreds of spiders?

 

If the session isn't a valid customer, don't show the SID in the URL. Bang, it handles every spider known.


Probably a stupid question, Wayne, but how would you do that? SIDs are assigned before a customer is valid.



There are a few bugs in Ian's that seem to show up when moving from SSL to non-SSL pages.

 

In particular, if you use a shared SSL certificate, you cannot use Ian's code at the moment.

 

There are no bugs with this one; it's just a little inefficient.



Just looking at Jen's additions...

 

are the entries with spaces in the names going to work?

... if you want to REALLY see something that doesn't set up right out of the box without some tweaking,

try being a Foster Parent!


I removed the entries in Jen's additions that are covered by the $spider_agent[] "hard tests" (i.e. those containing "spider", "bot", or "crawler" in the name). If you are using that option, you can use the following to cut down the size of the array quite a bit. I don't know how much difference that makes, but maybe it helps :).

 

// Add more Spiders as you find them.  MAKE SURE THEY ARE LOWER CASE! 

$spiders = array("almaden.ibm.com", "appie 1.1", "ah-ha.com", "ahoy", "iron33", "ia_archiver", "acoon", "atomz", "wget", "roach.smo.av.com-1.0", "altavista-intranet", "asterias2.0", "augurfind", "fluffy", "zyborg", "wire", "yandex", "libwww-perl", "speedfind", "supersnooper", "webwombat", "marvin/infoseek", "whizbang", "nazilla", "surfnomore ", "kototoi", "scrubby", "bdcindexer", "docomo", "gulliver", "deepindex", "dittospyder", "jack", "infoseek", "sidewinder", "lachesis", "moget/1.0", "picosearch", "ncsa beta", "moget/2.0", "aranha", "netresearchserver", "ng/1.0", "osis-project", "xift", "nationaldirectory", "piranha", "shark", "pinpoint", "pompos", "teomaagent", "zyborg", "gulliver", "architext", "seventwentyfour", "toutatis", "iltrovatore-setaccio", "sidewinder", "incywincy", "hubater", "slurp/si", "slurp", "partnersite", "griffon", "kit-fireball", "ezresult", "mantraagent", "t-rex", "ip3000", "steeler/1.3", "szukacz", "teoma", "bradley", "infobee", "teoma_agent1", "vagabondo", "w3c_validator", "zao/0", "zyborg/1.0", "netresearchserver", "slurp", "ask jeeves", "ia_archiver", "scooter", "mercator", "infoseek sidewinder", "ultraseek", "anthill", "walhello appie", "arachnophilia", "arale", "araneo ", "aretha", "arks", "atn worldwide", "atomz", "backrub", "big brother", "bjaaland", "blackwidow", "die blinde kuh", "bloodhound", "calif", "cassandra", "churl", "cienciaficcion.net", "cmc/0.01", "collective", "combine system", "web core / roots", "cusco", "cyberspyder link test", "deweb(c) katalog/index", "digger", "direct hit grabber", "download express", "dwcp (dridus' web cataloging project)", "e-collector", "ebiness", "emacs-w3 search engine", "ananzi", "esther", "evliya celebi", "nzexplorer", "felix ide", "wild ferret", "web hopper", "fetchrover", "fido", "hamahakki", "kit-fireball", "fish search", "fouineur", "freecrawl", "funnelweb", "gazz", "gcreep", "geturl", "golem", "grapnel/0.01 experiment", "griffon", "gromit", "northern light gulliver", "harvest", 
"havindex", "html index", "wired digital", "dig", "htmlgobble", "hyper-decontextualizer", "ibm_planetwide", "popular iconoclast", "ingrid", "imagelock", "informant", "infoseek sidewinder", "inspector web", "intelliagent", "israeli-search", "javabee", "jumpstation", "image.kapsi.net", "katipo", "kdd-explorer", "kilroy", "labelgrabber", "larbin", "legs", "link validator", "linkscan", "linkwalker", "lockon", "lycos", "mac wwwworm", "magpie", "marvin/infoseek", "mattie", "mediafox", "merzscope", "nec-meshexplorer", "mnogosearch", "monster", "motor", "muncher", "ferret", "mwd.search", "internet shinchakubin", "netcarta webmap engine", "netmechanic", "netscoop", "newscan-online", "nhse web forager", "nomad", "occam", "hku www octopus", "openfind data gatherer", "orb search", "pack rat", "pageboy", "parasite", "patric", "pegasus", "the peregrinator", "phantom", "phpdig", "piltdownman", "pimptrain.com", "pioneer", "html_analyzer", "pgp key agent", "plumtreewebaccessor", "poppi", "getterroboplus puu", "raven search", "roadhouse", "road runner", "robbie", "computingsite robi/1.0", "robofox", "robozilla", "rules", "search.aus-au.com", "sleek", "searchprocess", "senrigan", "sg-scout", "shagseeker", "shai'hulud", "sift", "simmany", "site valet", "open text", "sitetech-rover", "skymob.com", "snooper", "site searcher", "suke", "suntek", "sven", "tach black widow", "tarantula", "tcl w3", "templeton", "teomatechnologies", "titin", "titan", "tkwww", "ucsd", "udmsearch", "url check", "valkyrie", "verticrawl", "victoria", "vision-search", "voyager", "w3m2", "wallpaper", "the world wide web wanderer", "webbandit", "webcatcher", "webcopy", "webfoot", "weblayers", "weblinker", "webmirror", "moose", "webquest", "webreaper", "webs", "websnarf", "webvac", "webwalk", "webwalker", "webwatch", "wget", "whatuseek winona", "whowhere", "weblog monitor", "w3mir", "webstolperer", "web wombat", "the world wide web worm", "wwwc", "webzinger", "xget", "nederland.zoek", "mantraagent", "moget", 
"t-h-u-n-d-e-r-s-t-o-n-e", "muscatferret", "kit_fireball", "semanticdiscovery/0.1", "inktomisearch.com"); 



// get useragent and force to lowercase just once 

$useragent = strtolower(getenv("HTTP_USER_AGENT")); 



foreach ($spiders as $Val) { 
   // Note: the user agent is the haystack; the spider name is the needle. 
   if (strpos($useragent, $Val) !== false) { 
     // found a spider, kill the sid/sess 
     // Edit out one of these as necessary depending upon your version of html_output.php 
     // $sess = NULL; 
     $sid = NULL; 
     break; 
   } 
} 
// End spider stopper code
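For anyone unfamiliar with the $spider_agent[] "hard tests" mentioned above, they can be sketched like this (the `hard_test()` helper name is mine; the three tokens are the ones named in that post):

```php
<?php
// Sketch of the "hard tests": any user agent containing "spider", "bot",
// or "crawler" is treated as a spider outright, which is why entries with
// those substrings can be dropped from the array. Helper name is mine.
function hard_test($useragent) {
    $useragent = strtolower($useragent);
    foreach (array('spider', 'bot', 'crawler') as $token) {
        if (strpos($useragent, $token) !== false) {
            return true;
        }
    }
    return false;
}

var_dump(hard_test("Mozilla/5.0 (compatible; Googlebot/2.1)")); // bool(true)
var_dump(hard_test("Scooter/3.2"));                             // bool(false)
```

Note that agents like "scooter" or "slurp" contain none of the three tokens, so they still need their own entries in the array.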

 

Thanks Jen for compiling that incredible list!

 

David


I have tried to install this in my html_output.php but keep getting a parse error.

 

Parse error: parse error, expecting `')'' in /home/mrsfield/public_html/includes/functions/html_output.php on line 284

 

I have tried several different versions from here and contributions and get the same error.

 

What am I doing wrong??


If you copied and pasted the code from the thread, you may have picked up stray characters, which need to be deleted from the end of each line. :wink:


I tried deleting any stray characters and it didn't seem to work; same error. I replaced my html_output.php with the code from this thread:

http://www.oscommerce.com/forums/viewtopic.php...ight=sid+killer

 

It seems to work, but I wanted to add the other spiders, and I keep getting the parse error.........

 

I think it's probably something real simple that I'm missing, but I can't find it... :oops:


When I updated html_output.php using Jen's list of spiders, my cart started looping at checkout and you could proceed no further. :cry:

 

It's a shame (for me!) because that's a long list!!!!


I think you are all going rather over the top with these spider lists. You only need to worry about 20 spiders, absolute tops.

 

Most of the spiders I'm seeing mentioned (across multiple threads, not just this one!) will never, ever visit your site... and if they do, so what? You have a SID showing in a search engine that no one will ever use...

 

Concentrate on the big boys only. The little guys will just fade away.
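Following that advice, a much shorter array would do the job. This subset is drawn only from names already listed earlier in this thread; treat it as a starting point, not an authoritative roster:

```php
<?php
// A pragmatic short list per the advice above: major engines only.
// Every token here appears in the longer lists earlier in the thread.
$spiders = array(
    "googlebot",        // Google
    "slurp",            // Inktomi / Yahoo!
    "fast-webcrawler",  // AllTheWeb
    "teoma",            // Ask Jeeves
    "scooter",          // AltaVista
    "ia_archiver",      // Alexa / Internet Archive
    "architextspider",  // Excite
    "lycos",            // Lycos
);
```

A list this size also makes the per-request foreach loop cheap enough that the table-based approach discussed earlier becomes unnecessary.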


Just checked, and Google is now listing my site sans SIDs!! :D

 

Just wanted to take time to say thanks to Gary. Thank you.

 

Also, those are good comments about the search engine list.


What does this spider list do?

 

Is there a problem with search engines showing a session ID when a spider hits your site? What is the fix for this?

 

What good does it do to list all the spiders?


Archived

This topic is now archived and is closed to further replies.
