Spider Bad Behaviour - Action=buy_now Notify Write Review - How Do I Prevent This?


tigergirl

Posted

My problem:

For the last 5 days, Yahoo Slurp has been using up huge amounts of bandwidth. It seems to be stuck and keeps trying to do the following actions:

 

1) buy_now

2) notify

3) write_review

4) constantly hitting the cookie_usage page

5) hitting pages that are disallowed in robots.txt i.e. contact_us & conditions

 

From the spiders.txt update contribution thread, I have done:

1) added the code mentioned for includes/modules/product_listing (case 'PRODUCT_LIST_BUY_NOW':)

2) added cookie_usage.php to robots.txt

3) added product_reviews_write to robots.txt

4) I've added User-agent: Slurp with Crawl-delay: 5000, but that's not helping, as the IP addresses are different every time it accesses pages

 

Please can you tell me what else I should do?

Should I add the code for includes/functions/general.php for sort links?

Do I need to add the same code for includes/modules/product_listing to other files? If so - which files please?

My robots.txt says Disallow: /con which I thought would cover the contact_us & conditions files - why is slurp not obeying that?

Why is Slurp still coming to cookie_usage.php without going anywhere else?
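
One thing worth checking, offered as a sketch rather than a diagnosis: under the robots exclusion convention, a crawler that finds a group naming it specifically follows only that group and ignores the User-agent: * group. If the Slurp group holds nothing but a Crawl-delay line, the Disallow rules elsewhere in the file would not apply to Slurp. The fragment below repeats the rules in both groups. It assumes the shop lives in the web root (as the log entries later in this thread suggest) and that the review form is product_reviews_write.php; the delay value is illustrative only, and its unit is whatever Yahoo's own documentation defines.

```text
User-agent: Slurp
Crawl-delay: 10
Disallow: /con
Disallow: /cookie_usage.php
Disallow: /product_reviews_write.php

User-agent: *
Disallow: /con
Disallow: /cookie_usage.php
Disallow: /product_reviews_write.php
```

Disallow values are matched as path prefixes, so /con would cover both /contact_us.php and /conditions.php. Crawlers also cache robots.txt, so a change may take a while to show up in the logs.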

 

I'm a bit stuck with this - I don't want to banish Slurp, but it's not being very well behaved and is p-ing me off somewhat.

 

thanks so much in advance for anyone who can help.

Tiger

 

PS Sessions aren't a problem as already fixed that...

I'm feeling lucky today......maybe someone will answer my post!

I do try and answer a simple post when I can just to give something back.

------------------------------------------------

PM me? - I'm not for hire

Posted

robots.txt and the index/noindex meta tags won't do much, because crawlers can ignore them. Search engines may index secure pages even when a noindex tag exists or the folder is listed in robots.txt.

 

First check whether the IPs are really legit (do they match the robot?). All requests with the action parameter will be redirected to the cookie usage page, so see if adding a 301 redirect header makes a difference. In catalog\includes\functions\general.php, find this code in the tep_redirect function:

 

header('Location: ' . $url);

 

add above it:

header( "HTTP/1.1 301 Moved Permanently" );

 

and check if the spider will hit the same link more than once.
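
As a rough sketch of what that change amounts to: the status line is sent first, then the Location header. The helper names below (redirect_headers, send_redirect) are hypothetical and are not part of osCommerce; the real tep_redirect function does more than this. PHP keeps a 3xx status that is already set when a Location header follows, which is why the order matters.

```php
<?php
// Hypothetical stand-in for the Location step inside tep_redirect().
// Builds the header lines for a permanent (301) or temporary (302) redirect.
function redirect_headers($url, $permanent = true) {
    $status = $permanent ? '301 Moved Permanently' : '302 Found';
    return array(
        'HTTP/1.1 ' . $status,   // status line first
        'Location: ' . $url,     // then the target
    );
}

// Sends the headers; this only works before any page output has been sent.
function send_redirect($url, $permanent = true) {
    foreach (redirect_headers($url, $permanent) as $line) {
        header($line);
    }
    exit;
}
```

Without the extra status line, PHP answers a Location header with a 302, which is the temporary kind.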

Posted
Hi Enigma,

thanks for replying.

I'm pretty sure the IPs are Slurp. I looked up a few (Inktomi Corporation), and they look and act like bots, except that almost every time it comes with a different IP address. Here are a few examples:

 

74.6.22.204 - - [13/Jun/2007:03:21:38 +0000] "GET /cookie_usage.php HTTP/1.0" 200 24503 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
74.6.22.204 - - [12/Jun/2007:23:21:37 +0000] "GET /product_reviews.php?cPath=60&products_id=71&action=notify HTTP/1.0" 302 0 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
74.6.24.45 - - [12/Jun/2007:15:21:37 +0000] "GET /shopping_cart.php HTTP/1.0" 200 22985 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
74.6.25.213 - - [12/Jun/2007:08:21:31 +0000] "GET /products_new.php?page=1&action=buy_now&products_id=106 HTTP/1.0" 302 0 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
74.6.25.106 - - [12/Jun/2007:05:44:09 +0000] "GET /product_reviews.php?products_id=73&action=notify HTTP/1.0" 302 0 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
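
Addresses like these can be checked more firmly with a reverse DNS lookup followed by a forward lookup, since a user-agent string alone is easy to fake. The sketch below is one way to do that in PHP. The function names are hypothetical, and the hostname suffix to pass in (for Slurp, something like '.crawl.yahoo.net') is an assumption that should be confirmed against Yahoo's own documentation.

```php
<?php
// Returns true when $host ends with one of the allowed domain suffixes.
function host_matches_suffix($host, array $suffixes) {
    $host = strtolower(rtrim($host, '.'));
    foreach ($suffixes as $suffix) {
        $suffix = strtolower($suffix);
        if (substr($host, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}

// Reverse lookup, suffix check, then a forward lookup that must
// lead back to the original IP address.
function is_verified_crawler($ip, array $suffixes) {
    $host = gethostbyaddr($ip);
    if ($host === false || $host === $ip) {
        return false; // malformed input or no reverse record
    }
    if (!host_matches_suffix($host, $suffixes)) {
        return false;
    }
    $forward = gethostbynamel($host);
    return is_array($forward) && in_array($ip, $forward, true);
}
```

A call such as is_verified_crawler('74.6.22.204', array('.crawl.yahoo.net')) needs working DNS, and its result depends on the records that exist at the time it runs.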

 

The bot gets a 302, but no log entry for cookie_usage.php shows up immediately afterwards. Those come hours later, so I was assuming it was just visiting the page again and again - is that normal?

 

I think the code I added to not show links if there is no SID worked for buy_now on this file: includes/modules/product_listing.php

change

case 'PRODUCT_LIST_BUY_NOW':
	   $lc_align = 'center';
	   $lc_text = '<a href="' . tep_href_link(basename($PHP_SELF), tep_get_all_get_params(array('action')) . 'action=buy_now&products_id=' . $listing['products_id']) . '">' . tep_image_button('button_buy_now.gif', IMAGE_BUTTON_BUY_NOW) . '</a> ';
	   break;

to

case 'PRODUCT_LIST_BUY_NOW':
	   $lc_align = 'center';
	   if ($session_started) {
	     $lc_text = '<a href="' . tep_href_link(basename($PHP_SELF), tep_get_all_get_params(array('action')) . 'action=buy_now&products_id=' . $listing['products_id']) . '">' . tep_image_button('button_buy_now.gif', IMAGE_BUTTON_BUY_NOW) . '</a> ';
	   } else {
	     $lc_text = ' ';
	   }
	   break;

 

I don't want the bots trying to do the actions as it's wasting bandwidth. I'm not sure what the piece of code you suggested would do???

 

From my logs, the problem with actions (buy_now, notify, write review) is with these files:

product_reviews_info.php

products_new.php

product_reviews.php

product_info.php

 

Will the SID code work for these files? If so, how should it be written? And can you give the exact paths of these files?
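
The same session check could in principle be wrapped in a small helper and reused wherever an action link is built. This is only a sketch: link_if_session is a hypothetical name, not an osCommerce function, and where each action link is generated in the files listed above is not stated in this thread and would need to be found first.

```php
<?php
// Hypothetical helper: return the link markup only when a session exists,
// mirroring the $session_started check used in product_listing.php.
function link_if_session($session_started, $link_html, $fallback = ' ') {
    return $session_started ? $link_html : $fallback;
}
```

In product_listing.php the buy_now case would then read $lc_text = link_if_session($session_started, $link); where $link holds the existing anchor markup.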

 

Thanks again Enigma, you're always around to help.

 

Tiger

Posted

Spiders won't create sessions, and you don't want them to. But they will attempt to use the buy now etc. links, and that should redirect them to cookie_usage.php. So the reason for that line was to indicate the redirection type: 301 for a permanent redirect (307 for temporary). You want the permanent one, as spiders won't use your cart facility. See if that makes a difference. It would be easier than trying to modify the site's layout for spiders.

Posted

Hi Enigma,

I'll add the code you suggested and monitor the outcome. Are you saying that if the bots get the 301 Moved Permanently, they should stop following that link on that specific page? Won't that take ages as there are so many IPs from Yahoo?

 

Tiger

Posted

Give it a bit of time. If you can, save a copy of the log somewhere and see if there are any duplicates from future visits. Spiders should respect the 301, theoretically at least. That is the simplest method. If you're concerned about bandwidth usage, try the cache HTML contribution and see if it makes a difference. It sends a 304 header for revisited pages, and spiders monitor that, so they won't revisit until the page content expires.
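
For reference, the 304 mechanism works roughly as sketched below. This is a generic conditional GET example and not the code of the cache HTML contribution; both function names are hypothetical, and $last_modified is assumed to be a Unix timestamp the shop can supply for the page.

```php
<?php
// Decide whether a conditional GET can be answered with 304 Not Modified.
// $if_modified_since is the raw request header value (null if absent);
// $last_modified is a Unix timestamp for the page's last change.
function is_not_modified($if_modified_since, $last_modified) {
    if ($if_modified_since === null || $if_modified_since === '') {
        return false;
    }
    $since = strtotime($if_modified_since);
    return $since !== false && $last_modified <= $since;
}

// Call before any output: sends Last-Modified and, where possible, a 304.
function send_cache_headers($last_modified) {
    $header = isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])
        ? $_SERVER['HTTP_IF_MODIFIED_SINCE'] : null;
    header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $last_modified) . ' GMT');
    if (is_not_modified($header, $last_modified)) {
        header('HTTP/1.1 304 Not Modified');
        exit; // a 304 response carries no body, which is where the bandwidth is saved
    }
}
```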

 

These methods are simpler. Otherwise you're going to need to eliminate the links behind the buttons when a spider is present, but that's way too complicated for this.
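
For anyone who does want to try the spider-detection route, the detection step itself is small. A minimal sketch, assuming a list of lowercase tokens such as the ones kept in spiders.txt; is_spider is a hypothetical name and the token list shown in the usage note is illustrative:

```php
<?php
// Returns true when the user agent contains any of the spider tokens.
function is_spider($user_agent, array $spider_tokens) {
    $user_agent = strtolower($user_agent);
    foreach ($spider_tokens as $token) {
        $token = strtolower(trim($token));
        if ($token !== '' && strpos($user_agent, $token) !== false) {
            return true;
        }
    }
    return false;
}
```

Called as is_spider($_SERVER['HTTP_USER_AGENT'], array('slurp', 'googlebot')), it would flag the Slurp user agent from the log lines earlier in this thread. The harder part, as noted, is changing every template that prints an action link.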

Archived

This topic is now archived and is closed to further replies.
