krnl Posted June 8, 2005

Hi all,

I have set 'Prevent Spider Sessions' to true in my configuration, and I have also updated to the latest contribution of the .../includes/spiders.txt file. Yet I still have a lot of search bots that are assigned session IDs when they crawl my site. Yahoo! Slurp is on there right now with a session ID, even though slurp is clearly included in spiders.txt.

Does anyone have any pointers that may help me to resolve this issue?

Thanks,
Rick
http://all-in-general.com
♥Vger Posted June 8, 2005

First off - it's against forum rules to advertise in your signature, and including a web URL is considered advertising (whether you intended it or not). Best to remove it from your signature, because if you don't then I'm sure it will be done for you.

Yahoo spiders are tough little beggars to track down, especially since they took over Inktomi. Try adding these to spiders.txt:

ink inktomi Inktomi Slurpcat@inktomi

Vger
krnl Posted June 8, 2005 (Author)

Sorry, I did not realize that about URLs in signatures. Problem corrected.

The IP in question does resolve to inktomisearch.com, but the user-agent should still be caught by the 'slurp' regex in spiders.txt. I'll try adding your suggestions to spiders.txt and see if that helps.

68.142.251.176 - - [08/Jun/2005:08:51:10 -0400] "GET /product_info.php?products_id=803&osCsid=232d6f30d874f000c559f64771fdaf16 HTTP/1.0" 200 31807 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

Thanks!
Rick
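The log line above can be checked for two separate things: whether the user agent contains a spiders.txt entry, and whether the requested URL already carries an osCsid. Here is a minimal Python sketch of both checks. It assumes the shop compares spiders.txt entries as plain substrings of the lowercased user agent (an assumption, not confirmed in this thread); the function names and the short spider list are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical excerpt of spiders.txt entries; the real file is much longer.
SPIDERS = ["slurp", "inktomi", "googlebot"]


def is_spider(user_agent, spiders=SPIDERS):
    """Return True if any non-empty spider entry occurs in the user agent.

    Assumption: entries are matched as substrings of the lowercased
    user agent, so the entries themselves should be lowercase.
    """
    ua = user_agent.lower()
    entries = (entry.strip() for entry in spiders)
    return any(entry in ua for entry in entries if entry)


def has_session_id(request_path):
    """Return True if the request's query string has an osCsid parameter."""
    query = urlsplit(request_path).query
    return "osCsid" in parse_qs(query)


# Values copied from the access log entry in the post above.
ua = "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
path = "/product_info.php?products_id=803&osCsid=232d6f30d874f000c559f64771fdaf16"

print(is_spider(ua))         # True
print(has_session_id(path))  # True
```

If both checks come back True, the spider is being recognized, yet it is requesting a URL that already contains a session ID. That may mean the URL was harvested earlier, rather than a new session being handed out on this visit.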
Thomas_Burke Posted June 8, 2005

Have you tried this contribution: http://www.oscommerce.com/community/contributions,2819 (Spider Session Remover)

Spider Session Remover v1.0 (Jan 15th 2005)
==================================

This is the official release of the Spider Session Remover.

This contribution uses Apache mod_rewrite to look for specific spiders, remove the session ID (osCsid) from the URL, and return a '301' back to the spider.

Basically, if the spider tries to do this:

GET /www.example.com/product_info.php?products_id=24&osCsid=ac8d8926059625ecb8dd9115f91d5f8a

the Apache mod_rewrite will rewrite the URL to be:

GET /www.example.com/product_info.php?products_id=24

and also return a "301" (Moved Permanently) to the spider.

The problem
=========

You may use one of the following:

* 2-2MS2 "Prevent Spider Sessions" admin feature is set to true.
* SID Killer contribution (http://www.oscommerce.com/community/contributions,952)
* Spider Killer for MS1 contribution (http://www.oscommerce.com/community/contributions,1089)

All of these features are very good, and aim to prevent spiders from adding a session ID (osCsid) to the URL.

However, what if a spider started to crawl your website BEFORE you enabled one of the above features? The previously harvested URLs with SIDs in them can show up as results in search engines. Afterwards, often many months later, you will still see the spider trying to access the URLs it harvested earlier, with the session ID in them.

In summary, URLs with session IDs were harvested PRIOR to any session disabling. These URLs are now indexed in search engines, and the spiders continue to re-visit your website using the URLs with the 'osCsid' in them.

The Solution
=========

So, how do we remove these session IDs for the spiders that continue to use the previously harvested URLs?

Use Apache mod_rewrite to look for the spider agent name and, if the condition is true, rewrite the URL without the 'osCsid' in it and ALSO return a "301" back to the spider.

What results from the mod_rewrite
========================

As to what the search engines will do: they'll see the 301 Moved Permanently response, re-fetch the page from the new (osCsid-less) URL given in that response, and, after a while, update their database to use the new URL.
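The contribution itself does this with Apache mod_rewrite rules, which are not shown in this thread. As a rough illustration of the same logic only, here is a minimal Python sketch: if the user agent looks like a spider and the URL carries an osCsid, answer with a 301 pointing at the same URL minus the session ID. The function name and the list of agent fragments are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical user-agent fragments; the contribution's own list is not
# shown in this thread.
SPIDER_AGENTS = ("slurp", "googlebot", "msnbot")


def redirect_for_spider(url, user_agent):
    """Return (301, clean_url) when a spider requests a URL containing
    an osCsid parameter; return None when the request needs no rewrite."""
    ua = user_agent.lower()
    if not any(name in ua for name in SPIDER_AGENTS):
        return None  # ordinary visitor: leave the session alone

    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in params if key != "osCsid"]
    if len(kept) == len(params):
        return None  # no session ID in the URL, nothing to strip

    clean_url = urlunsplit(parts._replace(query=urlencode(kept)))
    return 301, clean_url


slurp = "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
old_url = "/product_info.php?products_id=24&osCsid=ac8d8926059625ecb8dd9115f91d5f8a"

print(redirect_for_spider(old_url, slurp))
# (301, '/product_info.php?products_id=24')
```

Ordinary browsers fall through with None, so customers keep their sessions. Only spiders holding an old, session-tagged URL get redirected.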
krnl Posted June 8, 2005 (Author)

I think that's exactly what I need, as spiders certainly crawled my site before I came to the realization of the session ID issue. I'll be installing this today and hopefully it will solve my problem.

Since session IDs are such an issue when dealing with search bots, shouldn't something like this be added as 'default' to the main osCommerce package in the future? Or at the very least, default the "Prevent Spider Sessions" value to "YES" instead of "NO".

Thanks Thomas!
Rick
Thomas_Burke Posted June 8, 2005

It certainly works. Slurp visited my site before I installed the contribution, and I could see the session ID in Yahoo's cache. I checked Yahoo's cache again after I installed the contribution, and the session IDs were gone.

Be sure to back up first. There's a risk of killing your site when you tinker with it. :o