tigergirl Posted June 13, 2007

My problem: for the last 5 days, Yahoo Slurp has been using up huge amounts of bandwidth and seems to be stuck, repeatedly trying the following:

1) buy_now
2) notify
3) write_review
4) constantly hitting the cookie_usage page
5) hitting pages that are disallowed in robots.txt, i.e. contact_us & conditions

From the spiders.txt update contribution thread, I have done:

1) added the code mentioned for includes/modules/product_listing (case 'PRODUCT_LIST_BUY_NOW':)
2) added cookie_usage.php to robots.txt
3) added product_reviews_write to robots.txt
4) added

User-agent: Slurp
Crawl-delay: 5000

but that's not helping, as the IP addresses are different every time it accesses pages.

Please can you tell me what else I should do? Should I add the code for includes/functions/general.php for sort links? Do I need to add the same code for includes/modules/product_listing to other files? If so, which files, please? My robots.txt says Disallow: /con, which I thought would cover the contact_us & conditions files - why is Slurp not obeying that? And why is Slurp still coming to cookie_usage.php without going anywhere else?

I'm a bit stuck with this - I don't want to banish Slurp, but it's not being very well behaved and is p-ing me off somewhat. Thanks so much in advance to anyone who can help.

Tiger

PS Sessions aren't a problem, I've already fixed that...
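For reference, my robots.txt currently has roughly this in it - I'm only showing the lines relevant here and I'm not pasting the whole file, so treat the exact grouping and paths as approximate (my shop sits in the web root):

User-agent: *
Disallow: /con
Disallow: /cookie_usage.php
Disallow: /product_reviews_write.php

User-agent: Slurp
Crawl-delay: 5000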
Guest Posted June 13, 2007

robots.txt and the index/noindex etc. meta tags won't do much; they can be ignored. Search engines may index pages despite a noindex tag or a folder being blocked in robots.txt.

First check whether the IPs are really legit (do they match the robot?).

All the requests with the action parameter will be redirected to the cookie usage page, so see if adding a 301 redirect header makes a difference. In catalog\includes\functions\general.php, find this code in the tep_redirect function:

header('Location: ' . $url);

and add above it:

header("HTTP/1.1 301 Moved Permanently");

Then check whether the spider still hits the same link more than once.
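In other words, the relevant part of tep_redirect would end up looking roughly like this - a sketch based on the stock MS2 function, so your copy may differ a little if other contributions have touched it:

  function tep_redirect($url) {
    if ( (ENABLE_SSL == true) && (getenv('HTTPS') == 'on') ) { // we are loading an SSL page
      if (substr($url, 0, strlen(HTTP_SERVER)) == HTTP_SERVER) { // NONSSL url
        $url = HTTPS_SERVER . substr($url, strlen(HTTP_SERVER)); // change it to SSL
      }
    }

    header("HTTP/1.1 301 Moved Permanently"); // added line: tell the spider the redirect is permanent
    header('Location: ' . $url); // existing redirect line

    tep_exit();
  }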
tigergirl (Author) Posted June 13, 2007

Hi Enigma, thanks for replying. I'm pretty sure the IPs are Slurp. I looked a few up - Inktomi Corporation - and they look and act like bots, except that it comes from a different IP address almost every time. Here are a few examples:

74.6.22.204 - - [13/Jun/2007:03:21:38 +0000] "GET /cookie_usage.php HTTP/1.0" 200 24503 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
74.6.22.204 - - [12/Jun/2007:23:21:37 +0000] "GET /product_reviews.php?cPath=60&products_id=71&action=notify HTTP/1.0" 302 0 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
74.6.24.45 - - [12/Jun/2007:15:21:37 +0000] "GET /shopping_cart.php HTTP/1.0" 200 22985 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
74.6.25.213 - - [12/Jun/2007:08:21:31 +0000] "GET /products_new.php?page=1&action=buy_now&products_id=106 HTTP/1.0" 302 0 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
74.6.25.106 - - [12/Jun/2007:05:44:09 +0000] "GET /product_reviews.php?products_id=73&action=notify HTTP/1.0" 302 0 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

The bot gets a 302, but then there's no log entry for cookie_usage.php immediately afterwards - those come hours later - so I was assuming it was just visiting the page again and again. Is that normal?

I think the code I added to not show the links when there is no SID worked for buy_now in includes/modules/product_listing.php. I changed

case 'PRODUCT_LIST_BUY_NOW':
  $lc_align = 'center';
  $lc_text = '<a href="' . tep_href_link(basename($PHP_SELF), tep_get_all_get_params(array('action')) . 'action=buy_now&products_id=' . $listing['products_id']) . '">' . tep_image_button('button_buy_now.gif', IMAGE_BUTTON_BUY_NOW) . '</a> ';
  break;

to

case 'PRODUCT_LIST_BUY_NOW':
  $lc_align = 'center';
  if ($session_started) {
    $lc_text = '<a href="' . tep_href_link(basename($PHP_SELF), tep_get_all_get_params(array('action')) . 'action=buy_now&products_id=' . $listing['products_id']) . '">' . tep_image_button('button_buy_now.gif', IMAGE_BUTTON_BUY_NOW) . '</a> ';
  } else {
    $lc_text = ' ';
  }
  break;

I don't want the bots trying to do the actions, as it's wasting bandwidth. I'm not sure what the piece of code you suggested would do?

From my logs, the problem with actions (buy_now, notify, write_review) is with these files:

product_reviews_info.php
products_new.php
product_reviews.php
product_info.php

Will the SID code work for these files? If so, how should it be written? And can you give the exact paths of those files?

Thanks again Enigma, you're always around to help.

Tiger
Guest Posted June 13, 2007

Spiders won't create sessions - you don't want them to do that. But they will attempt to use the buy_now etc. links, and that should redirect them to cookie_usage.php. The reason for that extra line is to indicate the redirection type: 301 for a permanent redirect (307 for temporary). You want the permanent one, as spiders won't use your cart facility. See if that makes a difference; it would be easier than trying to modify the site's layout for spiders.
tigergirl (Author) Posted June 13, 2007

Hi Enigma, I'll add the code you suggested and monitor the outcome. Are you saying that if the bots get the 301 Moved Permanently, they should stop following that link on that specific page? Won't that take ages, as there are so many IPs from Yahoo?

Tiger
Guest Posted June 13, 2007

Give it a bit of time. If you can, save a copy of the log somewhere and see if there are any duplicates from future visits. Spiders should respect the 301, theoretically at least; that is the simplest method. If you're concerned about bandwidth usage, try the cache HTML contribution and see if it makes a difference: it sends a 304 header for revisited pages, and spiders monitor that, so they won't revisit until the page content expires. These methods are simpler; otherwise you're going to need to eliminate the links behind the buttons whenever a spider is present, and that's way too complicated for this.
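Just to give you an idea of what the 304 mechanism does - this is only a rough sketch of a conditional GET, not the cache contribution's actual code, and the filemtime check is just a stand-in for however you decide the page is unchanged:

  // If the spider sends If-Modified-Since and the page hasn't changed
  // since then, answer 304 Not Modified and stop, so no page body is sent.
  $last_modified = filemtime(__FILE__); // assumption: page changes when this file changes

  if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) &&
      strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) >= $last_modified) {
    header('HTTP/1.1 304 Not Modified');
    exit;
  }

  // Otherwise send a Last-Modified header so the spider can ask conditionally next time,
  // then output the page as normal.
  header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $last_modified) . ' GMT');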