pstrid Posted July 20, 2004

Not sure why, but whenever Inktomi's bot crawls our site, it is still getting session IDs. Any ideas or help would be much appreciated. Here are my details.

Current settings (Sessions):

Force Cookie Use: False
Check SSL Session ID: False
Check User Agent: False
Check IP Address: False
Prevent Spider Sessions: True
Recreate Session: True

And here is my current spiders.txt file:

$Id: spiders.txt,v 1.2 2003/05/05 17:58:17 dgw_ Exp $
ask jeeves
crawler
crawler@fast
docomo
fast-webcrawler
frooglebot
geobot
googlebot
infoseek
lycos_spider
ncsa beta
polybot
scooter
slurp/si
[email protected]
teoma
voilabot
w3c_validator
Yahooseeker
YahooSeeker/1.1
inktomisearch.com
lj1029.inktomisearch.com
inktomisearch

When I check my tracking, I see visits from inktomisearch.com with session IDs attached to each page it crawls, for example:

/contact_us.php?osCsid=47a78760759c755ac3b3c376a29a4cd0

Am I missing something? Thanks.
peterr Posted July 20, 2004

Hi,

What is the exact user agent name? Can you post one line from your web server logs, showing the full details from Inktomi?

Peter
peterr Posted July 20, 2004

Hi,

See some sample code here: http://www.oscommerce.com/forums/index.php?showtopic=103222

Peter
pstrid (Author) Posted July 20, 2004

Peter,

Is this what you are looking for? lj1159.inktomisearch.com seems to access my robots.txt file, and then one of the following does the crawl:

lj1038.inktomisearch.com
lj1015.inktomisearch.com
lj1236.inktomisearch.com
lj1220.inktomisearch.com
pstrid (Author) Posted July 20, 2004

Or something more like this:

5 1 0 0 0 18169971 0 0 18169971 0 0 15262 lj1165.inktomisearch.com
peterr Posted July 20, 2004

Hi,

Just noticed you have some uppercase characters (Yahoo...) in your spiders.txt file. You should change them to ALL lowercase, because of this:

$user_agent = strtolower(getenv('HTTP_USER_AGENT'));

The variable $user_agent is then compared to each value in the file 'spiders.txt', so an entry containing uppercase characters can never match the lowercased user agent.

Can you paste the complete line from your web server logs, like this:

64.68.82.159 - - [30/Jun/2004:10:18:05 -0400] "GET /contact_us.php HTTP/1.0" 200 23537 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

Thanks,
Peter
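To make the lowercase point concrete, here is a minimal sketch of that kind of check. It is not the actual osCommerce code: the function name is_spider_user_agent is hypothetical, and substring matching against each spiders.txt entry is an assumption about how the comparison is done. The second user-agent string is made up for illustration.

```php
<?php
// Hypothetical helper (not part of osCommerce): returns true when the
// lowercased user agent contains any non-empty entry from the spider list.
// Substring matching is an assumption about how the comparison works.
function is_spider_user_agent($userAgent, array $spiders)
{
    // Mirrors the strtolower() call quoted above.
    $userAgent = strtolower($userAgent);
    if ($userAgent === '') {
        return false;
    }
    foreach ($spiders as $entry) {
        $entry = trim($entry); // drop the newline left by reading a file
        if ($entry === '') {
            continue;
        }
        if (strpos($userAgent, $entry) !== false) {
            return true;
        }
    }
    return false;
}

// Entries as they might be read from spiders.txt, one per line.
$spiders = array('googlebot', 'YahooSeeker/1.1');

// The lowercase entry 'googlebot' is found in the lowercased user agent.
var_dump(is_spider_user_agent('Googlebot/2.1 (+http://www.googlebot.com/bot.html)', $spiders)); // bool(true)

// Illustrative user agent: the mixed-case entry never matches a lowercased string.
var_dump(is_spider_user_agent('YahooSeeker/1.1', $spiders)); // bool(false)
```

Under the same assumption, an entry such as inktomisearch.com only helps if that text appears in the User-Agent header itself. A hostname seen in tracking or in the first column of the access log is not part of that header, so it would not be matched by a check like this.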
peterr Posted July 20, 2004

Hi,

Just noticed my 'awstats' shows "Inktomi Slurp", and we have a "slurp" entry in 'spiders.txt'. I see you don't have 'slurp' on its own, so it might be wise to add this:

slurp

...... to spiders.txt

Peter
pstrid (Author) Posted July 20, 2004

Changed all to lowercase and added 'slurp'.

Here are two different lines from my log file:

66.196.90.175 - - [18/Jul/2004:10:06:44 -0500] "GET /robots.txt HTTP/1.0" 404 3152

and

66.196.90.54 - - [18/Jul/2004:07:29:12 -0500] "GET /product_info.php?products_id=73&osCsid=49ea1123434915cefd633388ae2655af HTTP/1.0" 200 4412
peterr Posted July 20, 2004

Hi,

You can get rid of the "404" messages for robots.txt by placing this in your webroot path (usually called 'public_html'):

User-agent: *
Disallow: /images/
Disallow: /includes/

and call the file robots.txt of course. :)

Your web server log files are not showing the additional portions, the referrer and user agent. See this one:

66.196.90.54 - - [18/Jul/2004:07:29:12 -0500] "GET /product_info.php?products_id=73&osCsid=49ea1123434915cefd633388ae2655af HTTP/1.0" 200 4412

It should have the referrer and user agent appended to it. Do you have a control panel, like CPanel?

The only other thing I noticed is you have:

Recreate Session True

I'm 99% certain we usually leave that set to false, but honestly, I can't remember what it does (doh !!).

Peter
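As a rough way to check whether a given log line carries the referrer and user agent, the sketch below splits an access log line into its fields. The function name parse_access_log_line and the regular expression are illustrative assumptions, written only against the two line shapes quoted in this thread; real log formats depend on the server configuration.

```php
<?php
// Hypothetical helper: splits an access log line shaped like the samples in
// this thread. The trailing "referrer" "user agent" pair is optional, so the
// same pattern accepts both the short and the long form.
function parse_access_log_line($line)
{
    $pattern = '/^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)(?: "([^"]*)" "([^"]*)")?$/';
    if (!preg_match($pattern, trim($line), $m)) {
        return null; // line does not look like either form
    }
    return array(
        'host'       => $m[1],
        'time'       => $m[2],
        'request'    => $m[3],
        'status'     => (int) $m[4],
        'bytes'      => $m[5],
        // preg_match omits trailing groups that did not take part in the match.
        'referrer'   => isset($m[6]) ? $m[6] : null,
        'user_agent' => isset($m[7]) ? $m[7] : null,
    );
}

// Long form, as in Peter's Googlebot example.
$long = parse_access_log_line('64.68.82.159 - - [30/Jun/2004:10:18:05 -0400] "GET /contact_us.php HTTP/1.0" 200 23537 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"');
var_dump($long['user_agent']); // string: Googlebot/2.1 (+http://www.googlebot.com/bot.html)

// Short form, as in the lines posted from the site's own log.
$short = parse_access_log_line('66.196.90.175 - - [18/Jul/2004:10:06:44 -0500] "GET /robots.txt HTTP/1.0" 404 3152');
var_dump($short['user_agent']); // NULL
```

A null user_agent here means the log format itself does not record it, so the exact string to add to spiders.txt cannot be read from lines like these.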
This topic is now archived and is closed to further replies.