stevel (Author), posted December 12, 2006: OK, good. Notice that the user agent contains "slurp". This means that it will not be assigned a NEW session, but if the spider comes in with a session ID already in the URL, it will keep it regardless. This problem can come up if you ran for a while without "Prevent Spider Sessions" on (you DO have it on in admin, right?) and Yahoo remembered the URLs with session IDs. Use this contrib to take care of that.
-- Steve. Contributions: Country-State Selector, Login Page a la Amazon, Protection of Configuration, Updated spiders.txt, Embed Links with SID in Description
boxtel, posted December 12, 2006: Steve, even though that contrib works flawlessly, it has one crucial limitation:
    RewriteCond %{HTTP_USER_AGENT} !(msnbot|slurp|googlebot) [NC]
This implies that you need to add every spider known to mankind to this line, which is basically what you have been doing with spiders.txt. So why not reuse spiders.txt for that, as realized in Chemo's PHP version of this:
    if ( $spider_flag == true ){
      if ( eregi(tep_session_name(), $_SERVER['REQUEST_URI']) ){
        $location = tep_href_link(basename($_SERVER['SCRIPT_NAME']), tep_get_all_get_params(array(tep_session_name())), 'NONSSL', false);
        header("HTTP/1.0 301 Moved Permanently");
        header("Location: $location"); // redirect...bye bye
      }
    }
This goes in application_top.php after spider identification and, if you use SEO URLs, after the inclusion of that class.
-- Treasurer MFC
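For readers on current PHP, where `eregi()` no longer exists (it was removed in PHP 7.0), the same idea can be sketched without any osCommerce functions. This is only an illustration under stated assumptions: `uri_without_session()` is a hypothetical helper, not part of osCommerce or of Chemo's code, the session parameter name `osCsid` is taken from the URLs quoted later in the thread, and PHP 7.1 or newer is assumed.

```php
<?php
// Hypothetical helper (not osCommerce code): returns the request URI with the
// session parameter removed, or null when the URI carries no such parameter.
function uri_without_session(string $requestUri, string $sessionName): ?string
{
    $parts = parse_url($requestUri);
    if ($parts === false || !isset($parts['query'])) {
        return null; // unparsable URI or no query string at all
    }

    parse_str($parts['query'], $params);
    if (!array_key_exists($sessionName, $params)) {
        return null; // nothing to strip, so no redirect is needed
    }

    unset($params[$sessionName]);
    $query = http_build_query($params);
    $path  = $parts['path'] ?? '/';

    return $query === '' ? $path : $path . '?' . $query;
}

// Assumed call site, after $spider_flag has been set elsewhere:
// $clean = uri_without_session($_SERVER['REQUEST_URI'], 'osCsid');
// if ($spider_flag && $clean !== null) {
//     header('Location: ' . $clean, true, 301); // permanent redirect to the clean URL
//     exit;
// }
```

Returning null when there is nothing to strip keeps the redirect from firing on every spider request, which would otherwise loop.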
stevel (Author), posted December 12, 2006: I'm away from my sources so can't look at the code, but my recollection is that the search of spiders.txt (and hence the setting of $spider_flag) is skipped if the spider came in with a session ID already in the URL. Good thing in general as that's an expensive operation, though if there's a cookie set, you could skip it. I was not thinking of listing every spider, just those known to be a problem, but the method you propose is nice if it doesn't impact normal users.
boxtel, posted December 12, 2006: The lookup in spiders.txt is only skipped if you force cookies.
boxtel, posted December 12, 2006: Well, and of course also if you do not prevent spider sessions, or when the user agent is empty. But the former should never have been an option and the latter is obvious.
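The conditions boxtel lists can be written down as a small decision function. This is a sketch of what the posts describe, not a copy of the osCommerce source: the function name and its parameters are hypothetical stand-ins for the "Force Cookie Use" and "Prevent Spider Sessions" admin settings and the incoming user agent.

```php
<?php
// Sketch of the decision described in the thread (hypothetical names, not the
// osCommerce source): does the spiders.txt lookup run for this request?
function spider_lookup_runs(bool $forceCookies, bool $preventSpiderSessions, string $userAgent): bool
{
    if ($forceCookies) {
        return false; // forced cookies: spiders.txt is not consulted
    }
    if (!$preventSpiderSessions) {
        return false; // option switched off: no spider check is made
    }
    return trim($userAgent) !== ''; // empty user agent: nothing to match against
}
```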
NancyL7, posted December 12, 2006: (Replying to stevel: "This problem can come up if you ran for a while without "prevent spider sessions" on (you DO have it on in admin, right?) and Yahoo remembered the URL with sessions.") Actually, I did not have Prevent Spider Sessions set to true (I'm rather red-faced right now). Thank you for pointing this out. I set this up before I ever opened my store, but didn't know much about osC (still don't, but getting better). Thanks for your help! Nancy
NancyL7, posted December 14, 2006: Thanks Steve. Turning that setting on prevented the 'carts' and the session IDs, which allows me to tell which connections are spiders and which are customers. Q (if I may): Is it normal for the Yahoo Slurp spider and the Google bots to sit on your site 24x7? If they do, is that a good thing or not? Google is on there about 3/4 of the time, and Yahoo has at least one connection all the time. Should this concern me? Thanks!! Nancy
stevel (Author), posted December 14, 2006: Well, it isn't unusual when they are first indexing your site, especially if you have lots of links and it looks as if the URLs are different. Eliminating sessions can help. Another thing you can do, which I think is mentioned earlier in this thread, is to disable display of the product listing sort links if there is no session. Another is to not display "buy now" links without a session.
NancyL7, posted December 15, 2006: Thanks! I'll try it!
Guest, posted February 5, 2007: I've got a spider that's not being detected with the current spiders.txt. The log line is:
    72.14.199.68 - - [05/Feb/2007:02:30:59 +0100] "GET /shop/rss.php HTTP/1.1" 200 1749 www.perfectpassion.co.uk "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)" "-"
Thanks, Tom
stevel (Author), posted February 5, 2007: Do you see it crawling your entire site? From what I can find out, Feedfetcher is just looking for RSS and Atom feeds. I generally leave out of spiders.txt those that don't pull up store pages. Are there other hits, especially those with session IDs?
Guest, posted February 6, 2007: As far as I've found, it only fetches the RSS feed. No probs, I'll just ignore it then!
stevel (Author), posted February 6, 2007: Fine. Adding that one to spiders.txt would not accomplish anything anyway. But if you do see spiders getting session IDs, then by all means let me know! I just posted an update to the contrib. The rate of new spiders has fallen off quite a bit; I had not seen a new one for a couple of months.
AWWWW.WAHWAH, posted February 7, 2007: 38.98.120.75. This constantly crawls my site.
-- Remember what the Bible says: He who is without sin, cast the first rock. And I shall smoketh it.
stevel (Author), posted February 7, 2007: That's an IP, not a user agent. What's the user agent string from the access log? That IP is not assigned to a specific domain.
Guest, posted February 17, 2007: Does anyone know what this is? It's showing up in my website constantly. At this moment, it's been in my site for over 17 consecutive hours. I looked up the IP address and I see this (copying and pasting):
    OrgName: Inktomi Corporation
    OrgID: INKT
    Address: 701 First Ave
    City: Sunnyvale
    StateProv: CA
    PostalCode: 94089
    Country: US
    NetRange: 74.6.0.0 - 74.6.255.255
    CIDR: 74.6.0.0/16
    NetName: INKTOMI-BLK-6
    NetHandle: NET-74-6-0-0-1
    Parent: NET-74-0-0-0-0
    NetType: Direct Allocation
    NameServer: NS1.YAHOO.COM
    NameServer: NS2.YAHOO.COM
    NameServer: NS3.YAHOO.COM
    NameServer: NS4.YAHOO.COM
    NameServer: NS5.YAHOO.COM
And it's creating session IDs for everything from product pages to the privacy policy. ??? Andrea
stevel (Author), posted February 17, 2007: That's Yahoo Slurp. If you have "Prevent Spider Sessions" set to TRUE, you should not be getting sessions. What user agent shows in your access logs for these entries?
Guest, posted February 17, 2007: Prevent Spider Sessions has been set to True forever, ever since I watched a Googlebot put $4000 worth of stuff in the shopping cart a long time ago. :) I see this:
    Agent: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Is that what you're after? Andrea
stevel (Author), posted February 17, 2007: Yes. Are you using my updated spiders.txt? It should have the string "slurp" in it, which will catch this. What's the URL of your store? I can check to see if the spider check is working properly.
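The kind of match Steve refers to, an entry such as "slurp" occurring anywhere in the user agent, can be sketched as below. The helper name and the three-entry list are assumptions for illustration only; a real store would read the entries from spiders.txt, and the actual osCommerce matching code may differ in detail.

```php
<?php
// Hypothetical helper: true when any non-empty entry of $spiders occurs
// (case-insensitively) somewhere in the user agent string.
function matches_spider_list(string $userAgent, array $spiders): bool
{
    $ua = strtolower($userAgent);
    foreach ($spiders as $entry) {
        $needle = strtolower(trim($entry)); // trim the newline left by file()
        if ($needle !== '' && strpos($ua, $needle) !== false) {
            return true;
        }
    }
    return false;
}

// Illustrative entries only; in practice something like file('spiders.txt').
$spiders = ["slurp\n", "googlebot\n", "msnbot\n"];

$yahoo = 'Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)';
var_dump(matches_spider_list($yahoo, $spiders)); // bool(true)
```

With this list, the Feedfetcher user agent posted earlier in the thread would not match, because it contains "google" but not "googlebot".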
Guest, posted February 17, 2007: My website is soapoperaworld.com. I only replaced my spiders.txt file with your updated file around an hour ago. I'm seeing another thing now that I've seen a million times before, too. An IP address lookup shows this:
    OrgName: Microsoft Corp
    OrgID: MSFT
    Address: One Microsoft Way
    City: Redmond
    StateProv: WA
    PostalCode: 98052
    Country: US
    NetRange: 65.52.0.0 - 65.55.255.255
    CIDR: 65.52.0.0/14
    NetName: MICROSOFT-1BLK
    NetHandle: NET-65-52-0-0-1
    Parent: NET-65-0-0-0-0
    NetType: Direct Assignment
    NameServer: NS1.MSFT.NET
    NameServer: NS5.MSFT.NET
    NameServer: NS2.MSFT.NET
    NameServer: NS3.MSFT.NET
    NameServer: NS4.MSFT.NET
The weird thing is that it wasn't there an hour ago, yet Who's Online says it's been there for over 18 hours. It seems to have replaced the Yahoo Slurp entry.
    Agent: msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)
It's creating session IDs, as well.
Guest, posted February 17, 2007: That entry for the MSNbot has now been replaced again with Yahoo Slurp. Is that normal, for different spiders to be exchanging places? I mean, whatever is going on, it's now been there for over 18 hours and the IP address keeps jumping every few minutes, back and forth, from MSN to Yahoo Slurp. Here's what I'm seeing in Who's Online at the moment:
    18:31:49 0 Guest 74.6.69.160 19:56:53 14:28:28 /product_info.php?products_id=1300&osCsid=14c31bf87f746de0c19a3e
As you can see, the spider is sitting on product number 1300 with a session ID attached.
stevel (Author), posted February 17, 2007 (edited February 17, 2007 by stevel): I tried your site with various user agent strings and all looks well. My guess is that you had Prevent Spider Sessions off for a while and these spiders picked up the links with session IDs. It's not that they're getting new ones. You should be able to see this by looking for the first access from a given IP: if it comes in with a session ID, then Prevent Spider Sessions isn't going to remove it. There is a contrib, Spider Session Remover, which you can use to try to get the spiders to remove the session IDs. I'm not sure what you're seeing that makes you think there's "switching" going on.
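The check Steve suggests, looking at the first access from each IP, can be roughed out against an access log. This is a sketch under assumptions: the helper is hypothetical, it assumes the client IP is the first field and the request line is the first quoted field (as in the log line posted earlier in the thread), and the sample lines use made-up documentation addresses rather than real log data.

```php
<?php
// Hypothetical helper: for each client IP, report whether its FIRST request
// in the given access-log lines already carried a session ID parameter.
function first_hit_has_session(array $logLines, string $sessionName = 'osCsid'): array
{
    $firstHit = [];
    foreach ($logLines as $line) {
        // Assumed layout: IP first, then the first quoted field is "METHOD URI PROTOCOL".
        if (!preg_match('/^(\S+) .*?"[A-Z]+ (\S+)[^"]*"/', $line, $m)) {
            continue; // skip lines that do not look like requests
        }
        [, $ip, $uri] = $m;
        if (!array_key_exists($ip, $firstHit)) {
            $pattern = '/[?&]' . preg_quote($sessionName, '/') . '=/';
            $firstHit[$ip] = (bool) preg_match($pattern, $uri);
        }
    }
    return $firstHit;
}

// Made-up sample lines (documentation IP range), not taken from any real log.
$sample = [
    '192.0.2.1 - - [17/Feb/2007:10:00:00 +0100] "GET /product_info.php?products_id=1300&osCsid=abc HTTP/1.1" 200 100',
    '192.0.2.2 - - [17/Feb/2007:10:00:05 +0100] "GET /index.php HTTP/1.1" 200 100',
];
print_r(first_hit_has_session($sample));
```

An IP whose first hit already carries the session parameter arrived with a remembered URL, which is the case Prevent Spider Sessions does not cover.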
gregy, posted March 13, 2007: Hi guys, all spiders get listed for me, but for the last few weeks one has been showing up as "Mozilla". Copy/paste of those lines:
    Active Bot with session 00:46:40 Mozilla 85.10.36.100 21:40:47 22:27:27 /product_print.php?products_id=497&language=Sl No Not Found
    Active Bot with session 00:00:00 Mozilla 74.6.87.44 22:27:22 22:27:22 /customer_testimonials.php?testimonial_id=21 No Not Found
    Active Bot with session 00:18:24 Mozilla 85.10.36.125 22:08:01 22:26:25 /product_print.php?products_id=472&language=It No Not Found
    Active Bot with session 00:00:00 Mozilla 74.6.69.167 22:23:17 22:23:17 /cookie_usage.php No Not Found
    Inactive Bot with session 00:00:00 Mozilla 74.6.69.220 22:22:16 22:22:16 /index.php No Not Found
    Inactive Bot with session 07:11:20 Mozilla 66.249.66.2 15:10:46 22:22:06 /customer_testimonials.php?testimonial_id=17 No Not Found
Any idea? Thanks
stevel (Author), posted March 13, 2007: Post again with the relevant lines from your web server access log. All I can do is look up IP addresses here.
    74.6.87.44 is someone from Slovenia
    85.10.36.125 is Yahoo. This should not show up with a session unless it is following a link with a session ID
    74.6.69.167 and 74.6.69.220 are also Yahoo
    66.249.66.2 is Google
I do not trust whatever contrib you are using that displays these lines. It is clearly misidentifying at least some of these. It is interesting that one of the Yahoo visitors got to your cookie_usage page. That suggests that you have active links to the cart or "buy now" and that this particular visitor has no session, another reason to think that the "with session" is bogus.
gregy, posted March 14, 2007: I'll reinstall the contrib to see what happens. Thanks