masat Posted April 10, 2004 Posted April 10, 2004 How do you determine what to put in the spider list to kill the SID when the site is indexed? Here is what my visited list shows: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp). March 15, 2004, 4:23 am http://coppercolander.com/advanced_search....6a8bb57ec4a5d69 I am guessing it is Yahoo! Slurp, but could someone verify this for me? Also, I have noticed previously that many of the SE robots seem to get hung up at the advanced_search.php link, i.e. in the above report that spider managed to get about 200 entries into my visit report and never went any further. I have been working on a contrib that gives the above report, and it seems to have some drawbacks. I've been trying to get the install put together so it is somewhat smooth. I'll try to get it posted soon 'cause I would really like some help with it. Tim *edit* - URL How do you know when you know what you want to do for the rest of your life?
peterr Posted April 10, 2004 Posted April 10, 2004 Hi, This thread will help you, I'm sure: http://www.oscommerce.com/forums/index.php?showtopic=79300 Peter
masat Posted April 10, 2004 Author Posted April 10, 2004 Thanks Peter for your reply, but I don't see where that thread answered the question, unless you're suggesting I place a disallow in the robots.txt file, which I currently do not use anyway. I will implement the robots file later. I currently have these bots listed in the /includes/spiders.txt file: sleek spider slurp/si [email protected] slurp It is appearing to me that since Yahoo acquired Inktomi a while back, they are using the Slurp bot to index sites now, and the slurp entry as I have it listed in the /includes/spiders.txt file is not keeping this bot from generating SIDs. ...But I don't know what to place in the /includes/spiders.txt file to prevent SID generation for this particular flavor of Slurp. I am kind of curious about how Yahoo is using these indexing visits also. Tim How do you know when you know what you want to do for the rest of your life?
peterr Posted April 10, 2004 Posted April 10, 2004 Hi, Do you have the string 'yahoo' defined in spiders.txt? When there was discussion about msnbot (it started crawling and no-one knew; too late, it got the SIDs), I added: msnbot and then noticed from my web server logs that the spider was called 'msnbot/0.11', so (panic, panic) I added: msnbot/0.11 to spiders.txt, all to find out later, after looking at the osC code, that the check works like a wildcard (a substring match), so in regards to msnbot, only: msnbot is needed. Since the first string that appeared in your logs was 'Yahoo' (case is not important in the user agent, but don't have uppercase in your spiders.txt, I _think_), I'd place the string 'yahoo' in your spiders.txt; it's certainly not going to do any harm. Trouble is, you will have to wait till the next crawl to get rid of the SIDs. Peter
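A minimal sketch of that "wildcard" behaviour (the same lowercase-plus-strpos() substring check that application_top.php performs; the full osC code is posted later in this thread), showing why the short msnbot entry also catches msnbot/0.11:

<?php
// minimal sketch of the spiders.txt check: lowercase the user agent, then look for
// each spiders.txt entry as a substring -- 'msnbot' is found inside 'msnbot/0.11 ...'
$user_agent = strtolower("msnbot/0.11 (+http://search.msn.com/msnbot.htm)");
$spider_flag = false;
$spiders = file('includes/spiders.txt');   // assumes the stock catalog layout
foreach ($spiders as $spider) {
    $spider = trim($spider);
    if ($spider != '' && strpos($user_agent, $spider) !== false) {
        $spider_flag = true;   // matched: osC would then skip starting a session (no SID)
        break;
    }
}
echo $spider_flag ? 'spider' : 'not a spider';
?>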
peterr Posted April 10, 2004 Posted April 10, 2004 Hi, Silly question I know, but you do have the following set: Admin | Config | Sessions | Prevent Spider Sessions | True Peter
peterr Posted April 10, 2004 Posted April 10, 2004 Hi, When I was concerned about spiders and how osC would handle them, I made up this small PHP script and forced the user_agent to whatever I needed to test for.

<?php
// include the osCommerce application code (this also runs the real session/spider check)
require('includes/application_top.php');

// echo the values application_top.php set from the real browser request
echo $session_started; echo '<br>';
echo $user_agent; echo '<br>';
echo $spider_flag; echo '<br>';
echo $spiders[$i]; echo '<br>';
echo $SID; echo '<br>';

// start the session -- osCommerce code, re-run here with a hard-coded user agent
$session_started = false;
if (SESSION_FORCE_COOKIE_USE == 'True') {
  tep_setcookie('cookie_test', 'please_accept_for_session', time()+60*60*24*30, $cookie_path, $cookie_domain);
  if (isset($HTTP_COOKIE_VARS['cookie_test'])) {
    tep_session_start();
    $session_started = true;
  }
} elseif (SESSION_BLOCK_SPIDERS == 'True') {
  //$user_agent = strtolower(getenv('HTTP_USER_AGENT'));
  //$user_agent = strtolower("Googlebot/2.1 (+http://www.googlebot.com/bot.html)");
  $user_agent = strtolower("msnbot/0.11 (+http://search.msn.com/msnbot.htm)");
  $spider_flag = false;
  if (tep_not_null($user_agent)) {
    $spiders = file(DIR_WS_INCLUDES . 'spiders.txt');
    for ($i=0, $n=sizeof($spiders); $i<$n; $i++) {
      if (tep_not_null($spiders[$i])) {
        if (is_integer(strpos($user_agent, trim($spiders[$i])))) {
          $spider_flag = true;
          break;
        }
      }
    }
  }
  if ($spider_flag == false) {
    tep_session_start();
    $session_started = true;
  }
} else {
  tep_session_start();
  $session_started = true;
}

// set SID once, even if empty
$SID = (defined('SID') ? SID : '');

// echo the values again, now reflecting the hard-coded user agent
echo $session_started; echo '<br>';
echo $user_agent; echo '<br>';
echo $spider_flag; echo '<br>';
echo $spiders[$i]; echo '<br>';
echo $SID; echo '<br>';
?>

For your testing, force the user_agent to be:

$user_agent = strtolower("Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)");

Place the script in your 'catalog' path. Peter
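As a small usage note (a sketch only): once the script is uploaded to the catalog directory, the only line that needs editing to test a different crawler is the hard-coded $user_agent, e.g. switching to the Googlebot string that is already there commented out:

// test how spiders.txt copes with Googlebot instead of msnbot/Slurp
$user_agent = strtolower("Googlebot/2.1 (+http://www.googlebot.com/bot.html)");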
masat Posted April 10, 2004 Author Posted April 10, 2004 Hey Peter, You are right, it couldn't hurt. I placed yahoo in spiders.txt and uploaded it to the site. I placed your script as bot_test.php on my test site, ran it, and got the following output: 1 osCsid=40d0e25000d013a326857e08fc2fedb0 1 osCsid=40d0e25000d013a326857e08fc2fedb0 What does this tell us? Tim How do you know when you know what you want to do for the rest of your life?
masat Posted April 10, 2004 Author Posted April 10, 2004 Sorry, I missed one reply there... Sessions was set to False so I set it to True, and the bot_test.php script returned: 1 mozilla/4.0 (compatible; msie 6.0; windows nt 5.0; crazy browser 1.0.5; alexa toolbar; .net clr 1.1.4322) osCsid=bf423d569bee3b536898bdd463436e2b mozilla/5.0 (compatible; yahoo! slurp; http://help.yahoo.com/help/us/ysearch/slurp) 1 slurp osCsid=bf423d569bee3b536898bdd463436e2b Same question... what does this tell us? It looks to me like bot_test is generating session IDs. What say you, my friend? Tim How do you know when you know what you want to do for the rest of your life?
peterr Posted April 10, 2004 Posted April 10, 2004 Hi,

"Sessions was set to False so I set it to True, and the bot_test.php script returned: 1 mozilla/4.0 (compatible; msie 6.0; windows nt 5.0; crazy browser 1.0.5; alexa toolbar; .net clr 1.1.4322) osCsid=bf423d569bee3b536898bdd463436e2b mozilla/5.0 (compatible; yahoo! slurp; http://help.yahoo.com/help/us/ysearch/slurp) 1 slurp osCsid=bf423d569bee3b536898bdd463436e2b It looks to me like bot_test is generating session IDs"

Okay, there are five echoes before and five after the hard-coded user agent.

1. The first display is:
Session started
Your browser as the user agent
No spider found
No spider to display
SID is displayed

2. The second display is:
Session has NOT started
The hard-coded user agent
Spider has been found
The spider name is slurp
SID is displayed

The PHP strpos() function will find the position of the first occurrence of a string, and the string is obtained from the file 'spiders.txt'. It found 'slurp' simply because 'slurp' comes before 'yahoo' in the file, so the loop matches it first. So, you have slurp defined, and it has been recognised. The only thing you wouldn't want is the SID of course, but that is only because the preceding code has picked it up from your browser user agent, or an already existing session that you have going (check your cookies). I could find a 'tep_session_stop()', but did find a 'tep_session_close()' function, and a 'tep_session_destroy()' function, so try adding one of those just after the first lot of echoes, and see if that gets rid of the SID for the hard-coded user agent. Peter
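A sketch of where that call could go (assuming tep_session_destroy() is the stock osC session function; the same call appears in the application_top.php snippet Tim posts below):

// sketch only: placed in bot_test.php just after the first five echoes,
// to end the session application_top.php opened for the real browser request
if ($session_started == true) {
    tep_session_destroy();
}

That is purely for the test script; it is not something to add to a live application_top.php.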
peterr Posted April 10, 2004 Posted April 10, 2004 Hi, Oops, 5 minutes and the EDIT button disappears again. "I could find a 'tep_session_stop()', but did find a 'tep_session_close()' function, and a 'tep_session_destroy()' function, so ..." should be "I could not find a 'tep_session_stop()', but did find a 'tep_session_close()' function, and a 'tep_session_destroy()' function, so ..." Peter
masat Posted April 10, 2004 Author Posted April 10, 2004 Well, this is a bit beyond my comfort zone as it may be, but here is what I did... I found this snippet in application_top.php:

// verify the browser user agent if the feature is enabled
if (SESSION_CHECK_USER_AGENT == 'True') {
  $http_user_agent = getenv('HTTP_USER_AGENT');
  if (!tep_session_is_registered('SESSION_USER_AGENT')) {
    $SESSION_USER_AGENT = $http_user_agent;
    tep_session_register('SESSION_USER_AGENT');
  }
  if ($SESSION_USER_AGENT != $http_user_agent) {
    tep_session_destroy();
    tep_redirect(tep_href_link(FILENAME_LOGIN));
  }
}

I removed the redirect line and pasted it just before the ?> in the bot_test.php script. I saw no change in the output other than the SID changing. I removed all three references to "slurp" in the spiders.txt file and saw no difference. I'm not sure why, but there are no cookies listed in my cookies folder. It may be because I'm running on localhost??? Are cookies used when sessions are enabled? I was kind of understanding that cookies are only used when sessions were not able to be used for some reason. Should the snippet I placed effectively kill the session? Tim How do you know when you know what you want to do for the rest of your life?
♥yesudo Posted April 10, 2004 Posted April 10, 2004 TIA This is the spider list I have(anyone know where I can get an up to date one ?: acoon ah-ha.com ahoy almaden.ibm.com altavista-intranet ananzi anthill appie 1.1 arachnophilia arale araneo aranha architext aretha arks ask jeeves asterias2.0 atn worldwide atomz augurfind backrub baiduspider bannana_bot bdcindexer big brother bjaaland blackwidow bloodhound bradley calif cassandra churl cienciaficcion.net cmc/0.01 collective combine system computingsite robi/1.0 crawler crawler@fast cusco cyberspyder link test deepindex deweb? katalog/index die blinde kuh dig digger direct hit grabber dittospyder docomo download express dwcp (dridus' web cataloging project) ebiness e-collector emacs-w3 search engine esther evliya celebi ezresult fast-webcrawler felix ide ferret fetchrover fido fish search fluffy fluffy the spider fouineur freecrawl frooglebot funnelweb gazz gcreep geobot getterroboplus puu geturl golem googlebot grapnel/0.01 experiment griffon gromit gulliver hamahakki harvest havindex henrythemiragorobot hku www octopus html index html_analyzer htmlgobble hubater hyper-decontextualizer ia_archiver ibm_planetwide iltrovatore-setaccio image.kapsi.net imagelock incywincy infobee informant infoseek infoseek sidewinder ingrid inktomisearch.com inspector web intelliagent internet shinchakubin ip3000 iron33 israeli-search jack javabee jumpstation katipo kdd-explorer kilroy kit_fireball kit-fireball kototoi labelgrabber lachesis larbin legs libwww-perl link validator linkscan linkwalker lockon lycos lycos_spider mac wwwworm magpie mantraagent marvin/infoseek mattie mediafox mercator merzscope mnogosearch moget moget/1.0 moget/2.0 monster moose motor muncher muscatferret mwd.search nationaldirectory nationaldirectory-webspider naverrobot nazilla ncsa beta nec-meshexplorer nederland.zoek netcarta webmap engine netmechanic netresearchserver netscoop newscan-online ng/1.0 nhse web forager nomad northern light gulliver nzexplorer occam open text openfind data gatherer orb search osis-project pack rat pageboy parasite partnersite patric pegasus pgp key agent phantom phpdig picosearch piltdownman pimptrain.com pinpoint pioneer piranha plumtreewebaccessor polybot pompos poppi popular iconoclast raven search roach.smo.av.com-1.0 road runner roadhouse robbie robofox robozilla rules scooter scrubby search.aus-au.com searchprocess semanticdiscovery/0.1 senrigan seventwentyfour sg-scout shagseeker shai'hulud shark sidewinder sift simmany site searcher site valet sitetech-rover skymob.com sleek sleek spider slurp slurp/si Slurp [email protected] snooper speedfind steeler/1.3 suke suntek supersnooper surfnomore sven szukacz tach black widow tarantula tcl w3 templeton teoma teoma_agent1 teomaagent teomatechnologies the peregrinator the world wide web wanderer the world wide web worm t-h-u-n-d-e-r-s-t-o-n-e titan titin tkwww toutatis t-rex turnitinbot ucsd udmsearch ultraseek url check vagabondo valkyrie verticrawl victoria vision-search voilabot voyager w3c_validator w3m2 w3mir walhello appie wallpaper web core / roots web hopper web wombat webbandit webcatcher webcopy webfoot weblayers weblinker weblog monitor webmirror webquest webreaper webs websnarf webstolperer webvac webwalk webwalker webwatch webwombat webzinger wget whatuseek winona whizbang whowhere wild ferret wire wired digital wwwc xget xift yandex Yahoo Yahoo! Slurp YahooSeeker/1.1 zao/0 zyborg zyborg/1.0 Your online success is Paramount.
mrsym2 Posted April 10, 2004 Posted April 10, 2004 I am using a spider simulator to check my site. If I set Prevent Spider Sessions = True and Force Cookie Use = False, I get: 10. http://aquatin.com/index.php/cPath/21?osCs...8726808451c18ee If I set Prevent Spider Sessions = True and Force Cookie Use = True, I get: 10. http://aquatin.com/index.php/cPath/21 What advantages/disadvantages are there to these options? I'm afraid I don't fully understand the role all these options play in SEO.
peterr Posted April 10, 2004 Posted April 10, 2004 Hi 'yesudo', As you will see from this post: http://www.oscommerce.com/forums/index.php?sho...60entry342554 it appears you should have the user agents in lowercase. You have the Yahoo entries as mixed case, and as the user agent is converted to all lowercase chars before the search for the spider name (in the osC code), the spider name of 'Yahoo' wouldn't be found. However, as the user agent usually has the URL, which would contain the string 'yahoo' somewhere, it would find a match. :D Also, you don't have 'msnbot'? Peter PHP refs http://au2.php.net/manual/en/function.strtolower.php http://au2.php.net/manual/en/function.strpos.php
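A standalone way to see the case issue (just a sketch using the two PHP functions linked above; it does not touch osC at all):

<?php
// sketch: strpos() is case sensitive, so a mixed-case spiders.txt entry such as
// 'Yahoo' can never be found inside the already-lowercased user agent string
$user_agent = strtolower('Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)');

var_dump(strpos($user_agent, 'Yahoo'));   // bool(false) -- capital 'Y', no match
var_dump(strpos($user_agent, 'yahoo'));   // an integer (position of the first 'yahoo') -- a match
?>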
♥yesudo Posted April 10, 2004 Posted April 10, 2004 Hi Peter, Thanx for the spot on msnbot - I think I must have deleted it accidentally at some stage - BIG mistake. Apologies, I got a bit confused with: "However, as the user agent usually has the URL, which would contain the string 'yahoo' somewhere, it would find a match." Would you mind explaining that differently? Many thanx. Your online success is Paramount.
peterr Posted April 12, 2004 Posted April 12, 2004 Hi, I think a lot of people got caught out with 'msnbot'; I sure did. They crawled the site before I had it in 'spiders.txt', so now the SIDs _may_ turn up in their search engine (if that's what the M$ msnbot is all about).

"Apologies, I got a bit confused with: However, as the user agent usually has the URL, which would contain the string 'yahoo' somewhere, it would find a match. Would you mind explaining that differently?"

Okay, I'm no PHP guru, so this is just my understanding of how this code works: /catalog/includes/application_top.php - lines 167 to 203

// start the session
$session_started = false;
if (SESSION_FORCE_COOKIE_USE == 'True') {
  tep_setcookie('cookie_test', 'please_accept_for_session', time()+60*60*24*30, $cookie_path, $cookie_domain);
  if (isset($HTTP_COOKIE_VARS['cookie_test'])) {
    tep_session_start();
    $session_started = true;
  }
} elseif (SESSION_BLOCK_SPIDERS == 'True') {
  $user_agent = strtolower(getenv('HTTP_USER_AGENT'));
  $spider_flag = false;
  if (tep_not_null($user_agent)) {
    $spiders = file(DIR_WS_INCLUDES . 'spiders.txt');
    for ($i=0, $n=sizeof($spiders); $i<$n; $i++) {
      if (tep_not_null($spiders[$i])) {
        if (is_integer(strpos($user_agent, trim($spiders[$i])))) {
          $spider_flag = true;
          break;
        }
      }
    }
  }
  if ($spider_flag == false) {
    tep_session_start();
    $session_started = true;
  }
} else {
  tep_session_start();
  $session_started = true;
}

// set SID once, even if empty
$SID = (defined('SID') ? SID : '');

If someone's spiders.txt has the following:

....
Yahoo
Yahoo! Slurp
YahooSeeker/1.1
....

then (as I understand it) the osC PHP code converts the user agent name to lowercase, and we have:

mozilla/5.0 (compatible; yahoo! slurp; http://help.yahoo.com/help/us/ysearch/slurp)

as the agent name. Now, the PHP function strpos() won't find a match on the "Yahoo! Slurp" entry, because 'Yahoo! Slurp' <> 'yahoo! slurp', and it won't find a match on these entries either:

Yahoo
YahooSeeker/1.1

but if you added 'yahoo' (lowercase) to spiders.txt, then it should find a match on all three user agents. In fact, my understanding of these two PHP functions is that you should only need to place 'yahoo' in your spiders.txt. This is because, to me, strpos() would return a match if the string it was trying to find was 'yahoo' and the user agent was any of the following:

Yahoo
Yahoo! Slurp
YahooSeeker/1.1

(because remember, osC converts the user agent to all lowercase before the strpos() call).

You could try this code:

//$user_agent = strtolower(getenv('HTTP_USER_AGENT'));
$user_agent = strtolower('Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)');
$spider_flag = false;
if (tep_not_null($user_agent)) {
  $spiders = file(DIR_WS_INCLUDES . 'spiders.txt');
  for ($i=0, $n=sizeof($spiders); $i<$n; $i++) {
    if (tep_not_null($spiders[$i])) {
      if (is_integer(strpos($user_agent, trim($spiders[$i])))) {
        $spider_flag = true;
        break;
      }
    }
  }
}
echo $spider_flag;
echo $spiders[$i];

first with your _current_ 'spiders.txt', and then add the string 'yahoo' to the file 'spiders.txt' and run the small script again. I'm 99% certain that the first time around the var $spider_flag will be false, and the second time around it will return true. This is simply because the PHP strpos() function appears to be case sensitive (I couldn't see any references to it being otherwise), so if the user agent name gets converted to all lowercase, the only way you will get a match (true) is for the spider names to be lowercase.
My experience is application programming, definitely NOT web programming; I'm still an infant at this stuff, but the above is simply how I see the code working. Someone please correct me if I have misunderstood the use of these two PHP functions. Hope that helps, :) Peter
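Following on from that last point about lowercase entries (a hypothetical helper, not part of osC or of any contrib mentioned above; it assumes it is run from the catalog directory, next to includes/spiders.txt):

<?php
// hypothetical helper: list any spiders.txt entries containing uppercase characters,
// since those can never match the lowercased user agent string that osC searches
$spiders = file('includes/spiders.txt');
foreach ($spiders as $line => $spider) {
    $spider = trim($spider);
    if ($spider != '' && $spider !== strtolower($spider)) {
        echo 'line ' . ($line + 1) . ': change "' . $spider . '" to "' . strtolower($spider) . '"<br>';
    }
}
?>

Run against the list posted earlier in the thread, it would flag the mixed-case lines such as the Yahoo entries.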
♥yesudo Posted April 12, 2004 Posted April 12, 2004 Excellent - Thanx for that Peter. Your online success is Paramount.