
Google bots are out in full force today.


wizardsandwars


Ian,

 

Thank you very much. I have updated my code now and hope it will help next time around.

The bots have left in the last half hour and everything seems quiet, thank god. Either that or they were bots on the drink and have now passed out!!! :lol:

Thank you all for the help, and I look forward to the day I can return it. After this I am going to try to rise above being a PHP gobshite and learn a little, so I can be a not-so-PHP gobshite.

 

Thanks again...


I was also concerned about the overhead, but the problems Google gave us over the past few days pushed that issue into second place.

In all honesty, I have not noticed any slow-down of my site since adding this code (maybe it was slow already!), but it is hard to tell what a 28 kbps dial-up user would experience.

Anyway, the cut-down solution mentioned earlier in this thread will probably be okay for those worried about overhead. Hopefully we can get some feedback on both solutions.

 

// Add the session ID when moving between HTTP and HTTPS servers, or when SID is defined
if ( (ENABLE_SSL == true) && ($connection == 'SSL') && ($add_session_id == true) ) {
  $sid = tep_session_name() . '=' . tep_session_id();
} elseif ( ($add_session_id == true) && (tep_not_null(SID)) ) {
  $sid = SID;
}

// New code to trap spiders: drop the session ID for known crawlers
if (eregi("googlebot", getenv("HTTP_USER_AGENT")) ||
    eregi("internetseer", getenv("HTTP_USER_AGENT")) ||
    eregi("WebCrawler", getenv("HTTP_USER_AGENT"))) {
  $sid = NULL;
}

if (isset($sid)) {
  $link .= $separator . $sid;
}

Ian-san

Flawlessnet


[quote]
There is one thing I don't like about this solution - it costs. Lots of eregis and stuff executing when they are mostly (95% I guess) not needed.
[/quote]

Here's my modified solution; it has run for more than one month, and now all googlebot visits come without any SID. Actually, I think this solution can catch most of the robots without too much overhead.

[quote]

// search engines don't want session IDs
$spider_footprint = array( "bot", "rawler", "pider", "ppie", "rchitext", "aaland", "igout4u", "cho", "ferret", "ulliver
$spider_ip = array( "216.239.46.", "213.10.10.116", "213.10.10.117", "213.10.10.118", "64.41.153.100", "192.134.99.192",

$agent = getenv('HTTP_USER_AGENT');
$host_ip = getenv('REMOTE_ADDR');
$is_spider = 0;

// Is it a spider? Check the User-Agent for known footprints first.
$i = 0;
while ($i < count($spider_footprint)) {
  if (stristr($agent, $spider_footprint[$i])) {
    $is_spider = 1;
    break;
  }
  $i++;
}

// No footprint match - fall back to checking known spider IP addresses.
if (!$is_spider) {
  $i = 0;
  while ($i < count($spider_ip)) {
    if (strstr($host_ip, $spider_ip[$i])) {
      $is_spider = 1;
      break;
    }
    $i++;
  }
}

// if the visitor is a spider, don't attach the session ID
if ($is_spider) {
  $sid = NULL;
}

[/quote]

 

Looking at the code, you can see that for googlebot the very first comparison already matches ("bot" is a case-insensitive substring of "Googlebot"), which stops all further comparing. So only the less popular robots cost slightly more time.
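For example, a quick standalone check of the footprint matching - the User-Agent string here is just an illustrative sample:

   // "bot" is a case-insensitive substring of Googlebot's User-Agent,
   // so the very first entry in $spider_footprint already matches.
   $agent = 'Googlebot/2.1 (+http://www.googlebot.com/bot.html)';
   var_dump(stristr($agent, 'bot') !== false); // bool(true)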

 

By checking your web server's log, you can adjust the order of the entries in $spider_footprint to save a little more time.

 

In my opinion, this hack doesn't add any noticeable server load. Save your time for something more interesting, such as installing PHP Accelerator, which can cut your PHP scripts' running time almost in half.

Kenneth Wang

VA3RRW/BD4RR


Sorry, I made a mistake when I copied and pasted the code in my last post.

 

The spider definition should look like this:

   $spider_footprint = array(
       "bot", "rawler", "pider", "ppie", "rchitext", "aaland", "igout4u",
       "cho", "ferret", "ulliver", "arvest", "tdig", "rchiver", "eeves",
       "inkwalker", "ycos", "ercator", "uscatferret", "yweb", "omad",
       "eternews", "cooter", "lurp", "oila", "oyager", "ebbase",
       "eblayers", "get");

   $spider_ip = array(
       "216.239.46.", "213.10.10.116", "213.10.10.117", "213.10.10.118",
       "64.41.153.100", "192.134.99.192", "24.11.13.173", "209.143.1.46",
       "210.159.73.34", "210.159.73.35", "203.183.218.4", "63.195.193.17",
       "193.131.167.144", "194.168.54.11", "66.7.131.132", "216.243.113.1");

Kenneth Wang

VA3RRW/BD4RR


There are many things one can do for site optimization. The site I did, zoomone.com (an adult DVD shop), was listed straight away.

How did I do it? Simple - good links. Lots of links from popular sites will get you into Google; this is the only way. Meta tags are useless.


We're not worried about getting listed, though. The thing is (if you read all of the prior posts you will see), Google has been on our site for days now - 4 or 5 for me already - and a lot of people are having their sites go down because Google is crawling them so intensely. We are all trying to figure out a way to get them to index the pages quicker and not get hung up on the session IDs.

So getting listed is the least of people's worries at the moment; we are trying to get Google to leave.

Brandon Sweet


I have noticed that for the most part the bots are dropping the SID for the product_info pages but are retaining it for the cPath and the other pages. Basically, from what I can tell, product_info is the only one that makes them drop the SID.

Brandon Sweet


[quote]So getting listed is the least of people's worries at the moment; we are trying to get Google to leave.[/quote]

 

Use robots.txt. It will help immediately.

 

BTW - I mailed you about your signature; you didn't answer or react. Please read the forum rules:

 

* The forum is for information exchange only. Commercial advertising is not allowed.

 

The same goes for some other people - please don't abuse the bandwidth we pay for to promote your services. You can all use your profile to promote yourself - but not the signature.

 

No offense meant - but the rules are here for a reason.

You can't have everything. That's why trains have difficulty crossing oceans, and hippos did not adapt to fly. -- from the OpenBSD mailinglist.


I did. I added a few lines that another person on this topic suggested, and it still hasn't worked. The bots are still looking at the files that I have "disallowed" in the robots.txt.

Here is my robots.txt file, with the exception of a few other things...

 

# Robots.txt for Bedtimegirl.com
# Email: [email protected]

User-agent: *
Disallow: /catalog/address_book_process.php
Disallow: /catalog/account.php
Disallow: /catalog/account_edit.php
Disallow: /catalog/account_edit_process.php
Disallow: /catalog/account_history.php
Disallow: /catalog/account_history_info.php
Disallow: /catalog/address_book.php
Disallow: /catalog/checkout_process.php
Disallow: /catalog/advanced_search.php
Disallow: /catalog/advanced_search_result.php
Disallow: /catalog/checkout_address.php
Disallow: /catalog/checkout_confirmation.php
Disallow: /catalog/checkout_payment.php
Disallow: /catalog/checkout_success.php
Disallow: /catalog/conditions.php
Disallow: /catalog/contact_us.php
Disallow: /catalog/create_account.php
Disallow: /catalog/create_account_process.php
Disallow: /catalog/create_account_success.php
Disallow: /catalog/download.php
Disallow: /catalog/info_shopping_cart.php
Disallow: /catalog/login.php
Disallow: /catalog/logoff.php
Disallow: /catalog/password_forgotten.php
Disallow: /catalog/popup_image.php
Disallow: /catalog/popup_search_help.php
Disallow: /catalog/privacy.php
Disallow: /catalog/products_new.php
Disallow: /catalog/product_notifications.php
Disallow: /catalog/product_reviews.php
Disallow: /catalog/product_reviews_info.php
Disallow: /catalog/product_reviews_write.php
Disallow: /catalog/redirect.php
Disallow: /catalog/reviews.php
Disallow: /catalog/shipping.php
Disallow: /catalog/shopping_cart.php
Disallow: /catalog/specials.php
Disallow: /catalog/tell_a_friend.php
Disallow: /catalog/disclaimer.php
Disallow: /catalog/admin/
Disallow: /catalog/download/
Disallow: /catalog/images/
Disallow: /catalog/includes/
Disallow: /catalog/pub/

 

Is there something in this that I did wrong that you could clarify for me?

Also, sorry about the signature. I closed the pop-up "new PM" window when I got it and didn't look at it after I finished reading the forums.

It is corrected now.

Brandon Sweet


[quote]The bots are still looking at the files that I have "disallowed" in the robots.txt.[/quote]

 

Did you store the robots.txt in the doc_root? Is it readable by world?

 

Did you try the simplest possible robots.txt, one that disallows everything? (Disallow rules match by prefix, so no trailing wildcard is needed.)

User-agent: *
Disallow: /catalog/

 

 

This shouldn't happen. Google is known to follow the rules robots.txt provides carefully. Other bots might act differently.

Maybe you should contact Google about this.

 

[quote]Also, sorry about the signature. I closed the pop-up "new PM" window when I got it and didn't look at it after I finished reading the forums.

It is corrected now.[/quote]

 

Thanks. Again - no offense meant. I am simply trying to enforce the forum rules to make sure we all live happily ;-)

You can't have everything. That's why trains have difficulty crossing oceans, and hippos did not adapt to fly. -- from the OpenBSD mailinglist.


Yes, I tried the simple file, BUT I really don't want to disallow the entire catalog. The reason being: if Google is still gathering info from my site, which in all reality they are, and I disallow the catalog directory, it could go one of two ways.

Either A) the bots send info back to Google that my catalog directory is off limits, and all of the info they have gathered will not be indexed,

or B) it will stop the bots in their tracks; they will return home, compile what they have gotten so far, and use that for their indexing.

It's Russian roulette when it comes to that decision, and I'm way too chicken to gamble on it.

 

For the most part I have the bots trained to view the product_info.php file without SIDs, but when a bot hits a category path it still gets shown the SID.

Here's what I'm thinking: if you look at the layout, when you click on a product it uses the product_info.php file to show that product. However, from what I have looked at, the cPath is found in the default.php file itself. If there were a way to have the cPath in an external file, this might work out.

What do you think, and how hard would it be to do?

Brandon Sweet


The comments regarding performance are very valid, especially given the number of times html_output.php must get called. Am I right in thinking it gets called for every link that is built, as well as for the URL?

What if the code were changed to check for a match in the database? This would only need two queries - one for the footprint and one for the IP - and I imagine it would be a lot quicker than the current code.

This would also make maintaining the lists a lot easier - updating tables rather than editing the html_output.php module.
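Something along these lines, perhaps - only a sketch, assuming two hypothetical tables spiders_footprint(footprint) and spiders_ip(ip_prefix), and using the standard osCommerce tep_db_* wrappers:

   // Sketch: spider detection via two MySQL lookups instead of PHP loops.
   // The table and column names are made up for illustration.
   $agent = tep_db_input(getenv('HTTP_USER_AGENT'));
   $host_ip = tep_db_input(getenv('REMOTE_ADDR'));
   $is_spider = false;

   // One query for the User-Agent footprints...
   $check = tep_db_query("select count(*) as total from spiders_footprint where '" . $agent . "' like concat('%', footprint, '%')");
   $row = tep_db_fetch_array($check);
   if ($row['total'] > 0) $is_spider = true;

   // ...and, only if needed, one for the known spider IP prefixes.
   if (!$is_spider) {
     $check = tep_db_query("select count(*) as total from spiders_ip where '" . $host_ip . "' like concat(ip_prefix, '%')");
     $row = tep_db_fetch_array($check);
     if ($row['total'] > 0) $is_spider = true;
   }

   if ($is_spider) $sid = NULL;

Whether one or two database round-trips per page actually beat a short in-memory loop is worth benchmarking, of course - the loop only costs a few string compares.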

 

Jon.


Jan,

 

Looking at the raw logs on my web server, the reason the beefed-up robots.txt file is not working is that the googlebot only read it once - several days and 60,000 hits ago, when it first kicked off its program.

Now it's just working off the lists of URLs that it programmatically generated before we put in the hack.

Essentially, we've stopped it from continuing to add to its list of URLs to crawl, but it needs to finish crawling the ones it already had on its list.

I have seen a dramatic decrease in googlebot hits over the last 3 days; however, googlebot is still present. I would *STRONGLY* advise anyone to make *CERTAIN* that they have this hack in place before the next googlebot bombardment.

-------------------------------------------------------------------------------------------------------------------------

NOTE: As of Oct 2006, I'm not as active in this forum as I used to be, but I still work with osC quite a bit.

If you have a question about any of my posts here, your best bet is to contact me through either email or PM in my profile, and I'll be happy to help.


Hehe, that was why I was coming here - to tell everyone what you just said about the robots.txt getting read days ago.

I received almost 100,000 hits from Google in 4 or 5 days already. Everything should be set up for the next crawl, though......

Brandon Sweet


[quote]Did you store the robots.txt in the doc_root? Is it readable by world?[/quote]

 

I assume that you store robots.txt in the same location as the highest-level index.php / default.php page on your site, e.g. alongside the first file that is loaded when the site is opened using its normal URL?

Or should you also put it in other directories as well? E.g. if you are linked to a specific product, you might open the site with http://www.yourname.com/catalog/products_i...products_id=xxx

So would you put a robots.txt there as well?

Ian-san

Flawlessnet


The robots.txt should go in your web root - the same directory you land in when you type in www.<your domain>.

BTW, we could have blocked the googlebot from our website immediately by altering the .htaccess file; however, I was worried that if the bot found it did not have access to the site, it would not index it at all.

-------------------------------------------------------------------------------------------------------------------------

NOTE: As of Oct 2006, I'm not as active in this forum as I used to be, but I still work with osC quite a bit.

If you have a question about any of my posts here, your best bet is to contact me through either email or PM in my profile, and I'll be happy to help.


Just a thought,

I don't know PHP, but would it be possible to have allprods.php changed so it works like Catalog Anywhere? Then block the bots completely from the shop installation folder, leaving them only the allprods page in the root, which would have a link to every product in the shop. Those links would be listed in Google, and as far as I can see the robots.txt will not affect a web browser, so the links would still work for visitors.

That is the way I think it could work, but from a programmer's view, would this be possible, and would it help to solve the bot problem?

Please be kind, and if the above is a load of rubbish, pretend I was pissed outa my mind at the time of writing :-) I know I will be!!


I'm not as knowledgeable as most programmers here - I'm more of a database programmer - but that sounds like a terrific idea to me.

It avoids the redirect issue as well.

Do any web-type programmers have any thoughts on Cobra's suggestion?

-------------------------------------------------------------------------------------------------------------------------

NOTE: As of Oct 2006, I'm not as active in this forum as I used to be, but I still work with osC quite a bit.

If you have a question about any of my posts here, your best bet is to contact me through either email or PM in my profile, and I'll be happy to help.


[quote]I don't know PHP, but would it be possible to have allprods.php changed so it works like Catalog Anywhere?[/quote]

As I released the later versions of allprods.php, I am willing to change it to accommodate the spiders as well as possible. If the community can tell me exactly (or as exactly as possible) what to do, I will take a look at it. For now, I'm not quite sure what is needed.

 

M@rcel

Greetings from Marcel

|Current version|Documentation|Contributions|


I guess what this is saying is that when you enter the site, you should only be directed to the catalog if you are a person. If it is a robot, it should be directed to allprods. Then, using a robots.txt file, we can block all robot access to the catalog site.

I think the trick must be to move the spider-detection script up a level, so it is only actioned on entry (and so saves processing time), and to include a redirect to either allprods or the catalog; see the sketch below.

So, allprods may need to be 'unlinked' from the store and act in a stand-alone way.
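A minimal sketch of that entry-point routing - purely hypothetical, reusing the $spider_footprint matching from earlier in the thread (shortened here) and assuming allprods.php sits next to the entry page:

   // Hypothetical top of the entry page (index.php / default.php):
   // run the spider detection once, then send robots to allprods.php.
   $agent = getenv('HTTP_USER_AGENT');
   $spider_footprint = array("bot", "rawler", "pider", "lurp", "cooter");
   $is_spider = false;
   for ($i = 0; $i < count($spider_footprint); $i++) {
     if (stristr($agent, $spider_footprint[$i])) {
       $is_spider = true;
       break;
     }
   }
   if ($is_spider) {
     header('Location: allprods.php'); // robots get the stand-alone page
     exit;
   }
   // ...humans fall through into the normal store code.

(As discussed further down the thread, redirecting bots carries its own risks with Google, so treat this as an illustration of the idea rather than a recommendation.)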

 

However, in my store, http://www.nowsayit.com, I use allprods both for robots and as a visible link in the left column so customers can see the entire catalogue.

I also use another 'unlinked' version of allprods as a hidden link (you can see it if you click the last dot on the extreme right of the date in my footer), both for robots and to provide a printable catalogue. I have no evidence that it actually works for robots, but it does make a useful catalogue :)

I am not sure if this idea is suggesting an unlinked file like my catalogue idea or not? Or does anyone want the code for this 'catalogue' script?

Ian-san

Flawlessnet


The problem with redirects is that there seems to be pretty substantial evidence that Google does not want anyone to redirect its bots. They may not find out that you are redirecting them, but if they do, they may not like it.

Additionally, using the "read" function instead of redirecting the bot DEFINITELY does not work.

However, we know that the robots.txt DOES work, and it is not redirecting the bot anywhere.

What Cobra is suggesting is that we move the allprods page up to the web root, and then amend the robots.txt to cover all files and directories in the web root EXCEPT the allprods page. So when a robot came, it could ONLY crawl the allprods page and the links it references.

Not a bad idea, I think. It would ensure that the allprods page is indexed. The robot would still have access to the rest of the site, though, through the links on the allprods page, so we would still need the "bot detector" to adjust the SID accordingly.
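The robots.txt for that scheme might look something like this - a sketch only, since the original robots.txt standard has no Allow rule, so every path except allprods.php has to be disallowed explicitly (the directory names here are just the usual osCommerce layout):

   # Sketch: let robots reach only /allprods.php in the web root.
   User-agent: *
   Disallow: /catalog/
   Disallow: /images/
   Disallow: /includes/
   # ...one Disallow line per remaining file and directory;
   # allprods.php is the only path left uncovered.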

-------------------------------------------------------------------------------------------------------------------------

NOTE: As of Oct 2006, I'm not as active in this forum as I used to be, but I still work with osC quite a bit.

If you have a question about any of my posts here, your best bet is to contact me through either email or PM in my profile, and I'll be happy to help.


OK, that I understand. What I do not understand, however, is how the links on allprods.php will be indexed. If the bots are not allowed to enter the catalog directory tree, then the product's title, description, keywords, and other meta tags and/or text contents are not available for the bots to index. I am afraid that this will lead to unusable indexes, if the pages are indexed at all.

 

M@rcel

Greetings from Marcel

|Current version|Documentation|Contributions|


M@rcel,

 

That's a good question. To tell you the truth, I'm not sure whether listing a directory in the robots.txt denies the robots access to that directory through a link, as well as directly.

Directory directly, lol - say that 10 times fast, lol.

 

Like I said, I'm not as experienced on the web as I am in a database. One of the web gurus will have to answer this.

-------------------------------------------------------------------------------------------------------------------------

NOTE: As of Oct 2006, I'm not as active in this forum as I used to be, but I still work with osC quite a bit.

If you have a question about any of my posts here, your best bet is to contact me through either email or PM in my profile, and I'll be happy to help.


We have to remember that after indexing, a customer must be able to follow the same path as the robot. So allprods needs to link into the site, and probably should be within the catalog root if this is not the same as the site root.

Also, it seems intuitively obvious (god, big words now :shock: ) that if a robots.txt is to work at all, it must stop the robot no matter how it enters the site. So for me at least, this means that I will not shut the door on robots going to the main pages such as product_info etc.

It also seems obvious that allprods should carry all the meta tags that you want the robot to see.

So in summary: I cannot see that we can avoid having a spider-detect script within the site, a 'stop all' robots.txt looks risky to me, and allprods should carry all the meta tags and link to the site in a clear way so that robots go there first.

That sounds to me like what we already have, actually?

Ian-san

Flawlessnet


Archived

This topic is now archived and is closed to further replies.
