JenRed Posted April 1, 2003

This refers to the topic "New Spider": http://www.oscommerce.com/forums/viewtopic.php?t=36577

This is not for Ian's SID Killer mod, but for the one that changes code in catalog/includes/functions/html_output.php.

I have been cutting and pasting like mad (RSI!!!) from some of the sites that list robot names and have come up with this big list. There may be some double-ups - I presume this is okay. Anybody who spots an error, please post the correction here.

So... in catalog/includes/functions/html_output.php, after:

// Add the session ID when moving from HTTP and HTTPS servers or when SID is defined
if ( (ENABLE_SSL == true) && ($connection == 'SSL') && ($add_session_id == true) ) {
  $sid = tep_session_name() . '=' . tep_session_id();
} elseif ( ($add_session_id == true) && (tep_not_null(SID)) ) {
  $sid = SID;
}

add this:

// Spider stopper code - add more spiders as you find them. MAKE SURE THEY ARE LOWER CASE!
$spiders = array(
  "almaden.ibm.com", "abachobot", "aesop_com_spiderman", "appie 1.1", "ah-ha.com", "acme.spider", "ahoy", "iron33",
  "ia_archiver", "acoon", "spider.batsch.com", "crawler", "atomz", "antibot", "wget", "roach.smo.av.com-1.0",
  "altavista-intranet", "asterias2.0", "augurfind", "fluffy", "zyborg", "wire", "wscbot", "yandex",
  "yellopet-spider", "libwww-perl", "speedfind", "supersnooper", "webwombat", "marvin/infoseek", "whizbang", "nazilla",
  "uk searcher spider", "esismartspider", "surfnomore ", "kototoi", "scrubby", "baiduspider", "bannana_bot", "bdcindexer",
  "docomo", "fast-webcrawler", "frooglebot", "geobot", "googlebot", "googlebot/2.1", "henrythemiragorobot", "rabot",
  "pjspider", "architextspider", "henrythemiragorobot", "gulliver", "deepindex", "dittospyder", "jack", "infoseek",
  "sidewinder", "lachesis", "moget/1.0", "nationaldirectory-webspider", "picosearch", "naverrobot", "ncsa beta", "moget/2.0",
  "aranha", "netresearchserver", "ng/1.0", "osis-project", "polybot", "xift", "nationaldirectory", "piranha",
  "shark", "psbot", "pinpoint", "alkalinebot", "openbot", "pompos", "teomaagent", "zyborg",
  "gulliver", "architext", "fast-webcrawler", "seventwentyfour", "toutatis", "iltrovatore-setaccio", "sidewinder", "incywincy",
  "hubater", "slurp/si", "slurp", "partnersite", "diibot", "nttdirectory_robot", "griffon", "geckobot",
  "kit-fireball", "gencrawler", "ezresult", "mantraagent", "t-rex", "mp3bot", "ip3000", "lnspiderguy",
  "architectspider", "steeler/1.3", "szukacz", "teoma", "maxbot.com", "bradley", "infobee", "teoma_agent1",
  "turnitinbot", "vagabondo", "w3c_validator", "zao/0", "zyborg/1.0", "netresearchserver", "slurp", "ask jeeves",
  "ia_archiver", "scooter", "mercator", "crawler@fast", "crawler", "infoseek sidewinder", "lycos_spider", "fluffy the spider",
  "ultraseek", "anthill", "walhello appie", "arachnophilia", "arale", "araneo ", "aretha", "arks",
  "aspider", "atn worldwide", "atomz", "backrub", "big brother", "bjaaland", "blackwidow", "die blinde kuh",
  "bloodhound", "borg-bot", "bright.net caching robot", "bspider", "cactvs chemistry spider", "calif", "cassandra", "digimarc marcspider/cgi",
  "checkbot", "christcrawler.com", "churl", "cienciaficcion.net", "cmc/0.01", "collective", "combine system", "conceptbot",
  "coolbot", "web core / roots", "xyleme robot", "internet cruiser robot", "cusco", "cyberspyder link test", "deweb(c) katalog/index", "dienstspider",
  "digger", "digital integrity robot", "direct hit grabber", "dnabot", "download express", "dragonbot", "dwcp (dridus' web cataloging project)", "e-collector",
  "ebiness", "eit link verifier robot", "elfinbot", "emacs-w3 search engine", "ananzi", "esther", "evliya celebi", "nzexplorer",
  "fastcrawler", "fluid dynamics search engine robot", "felix ide", "wild ferret", "web hopper", "fetchrover", "fido", "hamahakki",
  "kit-fireball", "fish search", "fouineur", "robot francoroute", "freecrawl", "funnelweb", "gammaspider", "focusedcrawler",
  "gazz", "gcreep", "getbot", "geturl", "golem", "googlebot", "grapnel/0.01 experiment", "griffon",
  "gromit", "northern light gulliver", "gulper bot", "hambot", "harvest", "havindex", "html index", "hometown spider pro",
  "wired digital", "dig", "htmlgobble", "hyper-decontextualizer", "iajabot", "ibm_planetwide", "popular iconoclast", "ingrid",
  "imagelock", "informant", "infoseek robot 1.0", "infoseek sidewinder", "infospiders", "inspector web", "intelliagent", "i, robot",
  "israeli-search", "javabee", "jbot java web robot", "jcrawler", "jobo java web robot", "jobot", "joebot", "jumpstation",
  "image.kapsi.net", "katipo", "kdd-explorer", "kilroy", "ko_yappo_robot", "labelgrabber", "larbin", "legs",
  "link validator", "linkscan", "linkwalker", "lockon", "logo.gif crawler", "lycos", "mac wwwworm", "magpie",
  "marvin/infoseek", "mattie", "mediafox", "merzscope", "nec-meshexplorer", "mindcrawler", "mnogosearch", "momspider",
  "monster", "motor", "muncher", "ferret", "mwd.search", "internet shinchakubin", "netcarta webmap engine", "netmechanic",
  "netscoop", "newscan-online", "nhse web forager", "nomad", "the northstar robot", "occam", "hku www octopus", "openfind data gatherer",
  "orb search", "pack rat", "pageboy", "parasite", "patric", "pegasus", "the peregrinator", "perlcrawler 1.0",
  "phantom", "phpdig", "piltdownman", "pimptrain.com", "pioneer", "html_analyzer", "portal juice spider", "pgp key agent",
  "plumtreewebaccessor", "poppi", "portalb spider", "psbot", "getterroboplus puu", "the python robot", "raven search", "rbse spider",
  "resume robot", "roadhouse", "road runner", "robbie", "computingsite robi/1.0", "robocrawl spider", "robofox", "robozilla",
  "roverbot", "rules", "safetynet robot", "search.aus-au.com", "sleek", "searchprocess", "senrigan", "sg-scout",
  "shagseeker", "shai'hulud", "sift", "simmany", "site valet", "open text", "sitetech-rover", "skymob.com",
  "slcrawler", "smart spider", "snooper", "solbot", "speedy spider", "spider_monkey", "spiderbot", "spiderline",
  "spiderman", "spiderview", "spry wizard robot", "site searcher", "suke", "suntek", "sven", "tach black widow",
  "tarantula", "tarspider", "tcl w3", "techbot", "templeton", "teomatechnologies", "titin", "titan",
  "tkwww", "tlspider", "ucsd", "udmsearch", "url check", "url spider pro", "valkyrie", "verticrawl",
  "victoria", "vision-search", "voyager", "vwbot", "the nwi robot", "w3m2", "wallpaper", "the world wide web wanderer",
  "w@pspider", "webbandit", "webcatcher", "webcopy", "webfoot", "weblayers", "weblinker", "webmirror",
  "moose", "webquest", "digimarc marcspider", "webreaper", "webs", "websnarf", "webspider", "webvac",
  "webwalk", "webwalker", "webwatch", "wget", "whatuseek winona", "whowhere", "weblog monitor", "w3mir",
  "webstolperer", "web wombat", "the world wide web worm", "wwwc", "webzinger", "xget", "nederland.zoek", "mantraagent",
  "moget", "t-h-u-n-d-e-r-s-t-o-n-e", "muscatferret", "voilabot", "sleek spider", "kit_fireball", "semanticdiscovery/0.1", "inktomisearch.com ",
  "webcrawler");

// get the user agent and force it to lower case just once
$useragent = strtolower(getenv("HTTP_USER_AGENT"));
foreach ($spiders as $Val) {
  // note: strpos() takes the haystack (the user agent) first and the needle (the spider name) second
  if (strpos($useragent, $Val) !== false) {
    // found a spider - kill the sid/sess
    // Edit out one of these as necessary, depending upon your version of html_output.php
    // $sess = NULL;
    $sid = NULL;
    break;
  }
}
// End spider stopper code
Thanks to all those whose posts I have been reading to understand this!

Jen
http://www.redinstead.com.au/catalog (not finished yet but please test for me!) :)
burt Posted April 1, 2003

Good work, thanks Jen.
JenRed (Author) Posted April 1, 2003

So I did it right, then? :)

Jen
Guest Posted April 2, 2003

:D Great spider hunt job :D saved me a few hundred keystrokes 8)
wizardsandwars Posted April 2, 2003

You know, just thinking about this: since html_output.php is called many times during a visitor's browsing of the site, wouldn't it be prudent to store these values in a table and then just execute a SQL statement that returns 1 if it finds the user agent in the table? This would also make it easy to build an admin interface to add new spiders, edit old ones, or display all of them.
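Something along these lines, perhaps (completely untested - the table and column names here are just made up for illustration, not from any existing contribution):

// hypothetical table:
//   CREATE TABLE spiders (spiders_id int NOT NULL auto_increment,
//                         spider_name varchar(64) NOT NULL,
//                         PRIMARY KEY (spiders_id));
$useragent = tep_db_input(strtolower(getenv('HTTP_USER_AGENT')));
$spider_query = tep_db_query("select count(*) as total from spiders where '" . $useragent . "' like concat('%', spider_name, '%')");
$spider = tep_db_fetch_array($spider_query);
if ($spider['total'] > 0) $sid = NULL; // user agent matched a known spider

(Whether a query on every link call actually beats the in-memory loop is another question - you would probably want to run it once per page and remember the result.)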
Guest Posted April 2, 2003

That thought has been tossed around by a few people, but no one has gotten to it on their to-do list yet :wink:
Guest Posted April 2, 2003

Wouldn't it be easier to check whether the session belongs to an actual customer than to cross-reference hundreds of spiders? If the session isn't a valid customer session, don't show the SID in the URL. Bang - that handles every spider known.
wizardsandwars Posted April 2, 2003

Probably a stupid question, Wayne, but how would you do that? SIDs are assigned before a customer is valid.
Guest Posted April 2, 2003

That's a good question - I was thinking the same thing :?
drakonan Posted April 5, 2003

How is this different from Ian's attempt? What are the issues with his and with this one? I've already applied his - should I switch?
wizardsandwars Posted April 5, 2003

There are a few bugs in Ian's that seem to show up when moving from SSL to non-SSL pages. In particular, if you use a shared SSL certificate, you cannot use Ian's code at the moment. There are no known bugs with this one; it's just a little inefficient.
mugitty Posted April 5, 2003

Just looking at Jen's additions... are the entries with spaces in the names going to work?
DavidR Posted April 11, 2003

I removed the entries in Jen's additions that are covered by the $spider_agent[] "hard tests" (i.e. those containing spider, bot or crawler in the name - see the sketch after the code). If you are using that option, you can use the following to cut the size of the array down quite a bit. I don't know how much difference that makes, but maybe it helps :).

// Spider stopper code - add more spiders as you find them. MAKE SURE THEY ARE LOWER CASE!
$spiders = array(
  "almaden.ibm.com", "appie 1.1", "ah-ha.com", "ahoy", "iron33", "ia_archiver", "acoon", "atomz",
  "wget", "roach.smo.av.com-1.0", "altavista-intranet", "asterias2.0", "augurfind", "fluffy", "zyborg", "wire",
  "yandex", "libwww-perl", "speedfind", "supersnooper", "webwombat", "marvin/infoseek", "whizbang", "nazilla",
  "surfnomore ", "kototoi", "scrubby", "bdcindexer", "docomo", "gulliver", "deepindex", "dittospyder",
  "jack", "infoseek", "sidewinder", "lachesis", "moget/1.0", "picosearch", "ncsa beta", "moget/2.0",
  "aranha", "netresearchserver", "ng/1.0", "osis-project", "xift", "nationaldirectory", "piranha", "shark",
  "pinpoint", "pompos", "teomaagent", "zyborg", "gulliver", "architext", "seventwentyfour", "toutatis",
  "iltrovatore-setaccio", "sidewinder", "incywincy", "hubater", "slurp/si", "slurp", "partnersite", "griffon",
  "kit-fireball", "ezresult", "mantraagent", "t-rex", "ip3000", "steeler/1.3", "szukacz", "teoma",
  "bradley", "infobee", "teoma_agent1", "vagabondo", "w3c_validator", "zao/0", "zyborg/1.0", "netresearchserver",
  "slurp", "ask jeeves", "ia_archiver", "scooter", "mercator", "infoseek sidewinder", "ultraseek", "anthill",
  "walhello appie", "arachnophilia", "arale", "araneo ", "aretha", "arks", "atn worldwide", "atomz",
  "backrub", "big brother", "bjaaland", "blackwidow", "die blinde kuh", "bloodhound", "calif", "cassandra",
  "churl", "cienciaficcion.net", "cmc/0.01", "collective", "combine system", "web core / roots", "cusco", "cyberspyder link test",
  "deweb(c) katalog/index", "digger", "direct hit grabber", "download express", "dwcp (dridus' web cataloging project)", "e-collector", "ebiness", "emacs-w3 search engine",
  "ananzi", "esther", "evliya celebi", "nzexplorer", "felix ide", "wild ferret", "web hopper", "fetchrover",
  "fido", "hamahakki", "kit-fireball", "fish search", "fouineur", "freecrawl", "funnelweb", "gazz",
  "gcreep", "geturl", "golem", "grapnel/0.01 experiment", "griffon", "gromit", "northern light gulliver", "harvest",
  "havindex", "html index", "wired digital", "dig", "htmlgobble", "hyper-decontextualizer", "ibm_planetwide", "popular iconoclast",
  "ingrid", "imagelock", "informant", "infoseek sidewinder", "inspector web", "intelliagent", "israeli-search", "javabee",
  "jumpstation", "image.kapsi.net", "katipo", "kdd-explorer", "kilroy", "labelgrabber", "larbin", "legs",
  "link validator", "linkscan", "linkwalker", "lockon", "lycos", "mac wwwworm", "magpie", "marvin/infoseek",
  "mattie", "mediafox", "merzscope", "nec-meshexplorer", "mnogosearch", "monster", "motor", "muncher",
  "ferret", "mwd.search", "internet shinchakubin", "netcarta webmap engine", "netmechanic", "netscoop", "newscan-online", "nhse web forager",
  "nomad", "occam", "hku www octopus", "openfind data gatherer", "orb search", "pack rat", "pageboy", "parasite",
  "patric", "pegasus", "the peregrinator", "phantom", "phpdig", "piltdownman", "pimptrain.com", "pioneer",
  "html_analyzer", "pgp key agent", "plumtreewebaccessor", "poppi", "getterroboplus puu", "raven search", "roadhouse", "road runner",
  "robbie", "computingsite robi/1.0", "robofox", "robozilla", "rules", "search.aus-au.com", "sleek", "searchprocess",
  "senrigan", "sg-scout", "shagseeker", "shai'hulud", "sift", "simmany", "site valet", "open text",
  "sitetech-rover", "skymob.com", "snooper", "site searcher", "suke", "suntek", "sven", "tach black widow",
  "tarantula", "tcl w3", "templeton", "teomatechnologies", "titin", "titan", "tkwww", "ucsd",
  "udmsearch", "url check", "valkyrie", "verticrawl", "victoria", "vision-search", "voyager", "w3m2",
  "wallpaper", "the world wide web wanderer", "webbandit", "webcatcher", "webcopy", "webfoot", "weblayers", "weblinker",
  "webmirror", "moose", "webquest", "webreaper", "webs", "websnarf", "webvac", "webwalk",
  "webwalker", "webwatch", "wget", "whatuseek winona", "whowhere", "weblog monitor", "w3mir", "webstolperer",
  "web wombat", "the world wide web worm", "wwwc", "webzinger", "xget", "nederland.zoek", "mantraagent", "moget",
  "t-h-u-n-d-e-r-s-t-o-n-e", "muscatferret", "kit_fireball", "semanticdiscovery/0.1", "inktomisearch.com");

// get the user agent and force it to lower case just once
$useragent = strtolower(getenv("HTTP_USER_AGENT"));
foreach ($spiders as $Val) {
  // note: strpos() takes the haystack (the user agent) first and the needle (the spider name) second
  if (strpos($useragent, $Val) !== false) {
    // found a spider - kill the sid/sess
    // Edit out one of these as necessary, depending upon your version of html_output.php
    // $sess = NULL;
    $sid = NULL;
    break;
  }
}
// End spider stopper code
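For reference, the "hard test" option I mean is just the generic substring check - roughly this (a sketch from memory; your copy of the mod may differ):

// generic "hard test": any user agent containing one of these words is treated as a spider
$spider_agent = array("bot", "spider", "crawler");
foreach ($spider_agent as $agent) {
  if (strpos($useragent, $agent) !== false) {
    $sid = NULL;
    break;
  }
}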
"rules", "search.aus-au.com", "sleek", "searchprocess", "senrigan", "sg-scout", "shagseeker", "shai'hulud", "sift", "simmany", "site valet", "open text", "sitetech-rover", "skymob.com", "snooper", "site searcher", "suke", "suntek", "sven", "tach black widow", "tarantula", "tcl w3", "templeton", "teomatechnologies", "titin", "titan", "tkwww", "ucsd", "udmsearch", "url check", "valkyrie", "verticrawl", "victoria", "vision-search", "voyager", "w3m2", "wallpaper", "the world wide web wanderer", "webbandit", "webcatcher", "webcopy", "webfoot", "weblayers", "weblinker", "webmirror", "moose", "webquest", "webreaper", "webs", "websnarf", "webvac", "webwalk", "webwalker", "webwatch", "wget", "whatuseek winona", "whowhere", "weblog monitor", "w3mir", "webstolperer", "web wombat", "the world wide web worm", "wwwc", "webzinger", "xget", "nederland.zoek", "mantraagent", "moget", "t-h-u-n-d-e-r-s-t-o-n-e", "muscatferret", "kit_fireball", "semanticdiscovery/0.1", "inktomisearch.com"); // get useragent and force to lowercase just once $useragent = strtolower(getenv("HTTP_USER_AGENT")); foreach($spiders as $Val) { if (!(strpos($Val, $useragent) === false)) { // found a spider, kill the sid/sess // Edit out one of these as necessary depending upon your version of html_output.php // $sess = NULL; $sid = NULL; break; } } // End spider stopper code Thanks Jen for compiling that incredible list! David Link to comment Share on other sites More sharing options...
chfields Posted April 11, 2003

I have tried to install this in my html_output.php but keep getting a parse error:

Parse error: parse error, expecting `')'' in /home/mrsfield/public_html/includes/functions/html_output.php on line 284

I have tried several different versions from here and from the contributions and get the same error. What am I doing wrong?
Guest Posted April 11, 2003

If you copied and pasted the code from the thread, you may have picked up stray characters, which need to be deleted from the end of each line :wink:
chfields Posted April 11, 2003

I tried deleting any stray characters and it didn't seem to work - same error. I replaced my html_output.php with the code from this thread: http://www.oscommerce.com/forums/viewtopic.php...ight=sid+killer It seems to work, but when I try to add the other spiders I keep getting the parse error. I think it's probably something really simple that I'm missing, but I can't find it... :oops:
DavidR Posted April 11, 2003

I got that error when I accidentally deleted one of the "," between the spider names. You might double-check that.
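For example, this is all it takes to produce exactly that "expecting `')''" message:

$spiders = array("slurp" "scooter");  // missing comma between entries -> parse error
$spiders = array("slurp", "scooter"); // fixed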
chfields Posted April 11, 2003

I double-checked the "," - they are all there - and even found a few names with spaces between the end of the name and the closing quote, but that didn't help either...
mlulm Posted April 12, 2003

There is a Spider Killer contribution that appears to be an improvement: it puts all the spider agents in a separate file and checks once per page instead of on every link, which should save time. It's working for me. http://www.oscommerce.com/community/contributions,1089
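The once-per-page idea is roughly this (my own sketch, not the contribution's actual code - the file and variable names are guesses): do the user-agent check a single time in includes/application_top.php and remember the result, so the per-link code only has to test a flag.

// in includes/application_top.php (sketch)
$spider_flag = false;
$useragent = strtolower(getenv('HTTP_USER_AGENT'));
$spiders = file(DIR_WS_INCLUDES . 'spiders.txt'); // one spider name per line
foreach ($spiders as $spider) {
  $spider = trim($spider);
  if (($spider != '') && (strpos($useragent, $spider) !== false)) {
    $spider_flag = true;
    break;
  }
}

// then in tep_href_link() the per-link check collapses to:
// if ($spider_flag == true) $sid = NULL;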
raym Posted April 12, 2003

When I updated html_output.php using Jen's list of spiders, my cart started looping at checkout and you could proceed no further. :cry: It's a shame (for me!) because that's a long list!!!!
burt Posted April 12, 2003

I think you are all going rather over the top with these spider lists. You only need to worry about 20 spiders, absolute tops. Most of the spiders I'm seeing mentioned (across multiple threads, not just this one!) will never, ever visit your site... and if they do, so what? You have a SID showing in a search engine that no one will ever use. Concentrate on the big boys only; the little guys will just fade away.
DavidR Posted April 12, 2003

Maybe we should have a thread with the current "top 20" listed :)

David
raym Posted April 14, 2003

Just checked, and Google is now listing my site sans SIDs!! :D I just wanted to take the time to say thanks to Gary - thank you. Also, those are good comments about the search engine list.
burt Posted April 14, 2003

:) Don't get many thanks - thank you! :)
pacman Posted April 16, 2003

What does this spider list do? Is there a problem with search engines showing a session ID when a spider hits your site? What is the fix for this? What good does it do to list all the spiders?
This topic is now archived and is closed to further replies.