
osCommerce

The e-commerce.

Is Google Ignoring ROBOTS.TXT?


ChrisW123

I'm confused.

 

I just did a keyword search in Google to check my website's listing, and it appears that they have indexed pages that I have disallowed in robots.txt, such as checkout_shipping.php in the example below.

 

ImageCritique Photography - Online Image Download of Cheap Stock ...

Online image downloads of cheap stock photos. We have ... photography needs.

ImageCritique Photography - Online Image Download, Cheap Stock Photos, ...

https://st15.startlogic.com/~imagecri/checkout_shipping.php - 23k - Dec 19, 2004

 

What would cause this? In robots.txt (in my Root (Catalog folder)) I have:

 

User-agent: *

....

Disallow: /checkout_shipping.php

....

 

And WHY, of all pages, would they select checkout_shipping.php in the first place?!! The result above was the 3rd item on the FIRST page of the results. So it appears they have concluded that checkout_shipping.php best fits the search phrase I used. But why? Why not my home page? The content, title, and keywords on my home page match the search phrase I used better than the checkout_shipping.php page does. So I'm wondering why they would use it?

 

Any ideas? Is it just a GoogleBot mystery? :)

You do not have a checkout_shipping.php; you have a

~imagecri/checkout_shipping.php

Treasurer MFC


Not sure what's going on with your issue, but when I went to your webpage using Firefox, the background is white and the text is all but invisible (Royalty Free Online Image Download Galleries).

 

Check your stylesheet:

 

Currently:

BODY {
  color: #F2F2F2;
  background: #121015
  margin: 0px;

Should be:

BODY {
  color: #F2F2F2;
  background: #121015;
  margin: 0px;

 

Looks like you missed the ";" after the background color, so Firefox throws away the whole declaration. It's an easy fix.

 

Bob G.

Installed Contributions: CCGV, Close Popup, Dynamic Meta Tags, Easy Populate, Froogle Data Feeder, Google Position, Infobox Header Entire Row, Live Support for OSC, PayPal Seal with CC images, Report_m Sales, Shop by Price Revised, SQL Updater, Who's Online Enhancement, Footer, GNA EP Assistant and still going.


"robots.txt" can ONLY be in the SERVER's root, never in a "catalog" root, i.e.

 

www.mydomain.com/robots.txt

 

but NOT

 

www.otherdomain.com/~username/robots.txt

www.mydomain.com/catalog/robots.txt

 

or the like!

 

See my Help on 'robots.txt'.
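
To make the rule above concrete, here is a minimal Python sketch of how a crawler could work out where to look for robots.txt. The function name robots_url is made up for this illustration; only the scheme and host of the page URL are used, so any directory such as /catalog/ or /~username/ is ignored.

```python
from urllib.parse import urlsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt location that governs page_url.

    Only the scheme and host matter; the path of the page is dropped.
    """
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


# Placeholder hosts taken from the examples above.
print(robots_url("http://www.mydomain.com/catalog/index.php"))
print(robots_url("http://www.otherdomain.com/~username/index.php"))
```

Both calls point at the server root, which is why a robots.txt placed in /catalog/ or /~username/ is never requested by a well-behaved spider.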

I don't want to set the world on fire—I just want to start a flame in your heart.

 

osCommerce Contributions:

Class cc_show() v1.0 – Show Credit Cards, Gateways

More Product Weight v1.0


THIS BOARD IS SUCH A MESS! Always lets me edit and then says "You're not allowed to"...

 


In your case, it should be:

 

https://st15.startlogic.com/robots.txt

 

and probably contain something like:

# ChrisW123 robots.txt

# Currently disallow all shop stuff to the Google Image bot
User-agent: Googlebot-Image
Disallow: /cgi-bin/
Disallow: /usage/
Disallow: /~imagecri/

# ALL search engine spiders/crawlers (put at end of file)
User-agent: *
Disallow: /cgi-bin/
Disallow: /~imagecri/admin/
Disallow: /~imagecri/cache/
Disallow: /~imagecri/download/
Disallow: /~imagecri/images/
Disallow: /~imagecri/includes/
Disallow: /~imagecri/pub/
Disallow: /~imagecri/account.php
Disallow: /~imagecri/advanced_search.php
Disallow: /~imagecri/checkout_shipping.php
Disallow: /~imagecri/create_account.php
Disallow: /~imagecri/login.php
Disallow: /~imagecri/password_forgotten.php
Disallow: /~imagecri/popup_image.php
Disallow: /~imagecri/shopping_cart.php
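
If you want to check a file like this before uploading it, Python's standard urllib.robotparser can evaluate it offline. This is only a sketch: the rules below are an abbreviated copy of the suggestion above, SomeBot is a made-up crawler name, and Python's parser is not guaranteed to match every search engine's behaviour.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the suggested file (not the full list of rules).
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /cgi-bin/
Disallow: /usage/
Disallow: /~imagecri/

User-agent: *
Disallow: /cgi-bin/
Disallow: /~imagecri/admin/
Disallow: /~imagecri/checkout_shipping.php
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
BASE = "https://st15.startlogic.com/~imagecri/"

# "SomeBot" is a hypothetical crawler that falls under the "*" record.
for agent, page in [
    ("SomeBot", "checkout_shipping.php"),
    ("SomeBot", "index.php"),
    ("Googlebot-Image", "index.php"),
]:
    print(agent, page, rules.can_fetch(agent, BASE + page))
```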

 

(If at all possible, I'd try to get rid of the "~imagecri"; customers assume this is a user directory on a free web host, which is not good for your reputation.)

 

Hope that helps,

 

Matthias


Just did some little checks, ChrisW123, don't be alarmed.

 

GOOD that you have secured your directories. You probably also want to take the "admin" part out of your admin directory's name.

 

Always keep in mind (everybody) that "robots.txt" must NOT be mistaken for a SECURITY measure! It actually helps "hackers" find potentially insecure paths/files.

 

It is ONLY a method to help "well-behaved" spiders along their way (THEY don't want to waste energy on worthless content, YOU don't want to waste bandwidth), and even this is disregarded by some "not-so-well-behaved" (i.e., malicious) spiders...

 

Btw, you (or your provider) will get about a zillion 404 log entries as long as there's no 'robots.txt' in the web root—almost ALL spiders explicitly check that file and on your site a '404' page comes up.

 

Moral: ALWAYS put a 'robots.txt' in place if you run a web server, even if it is one of the simplest possible:

 

ALLOW EVERYTHING:

User-agent: *
Disallow:

 

DENY EVERYTHING:

User-agent: *
Disallow: /


Oops... to answer your original question:

 

Google is actually one of the most "well-behaved". They will ALWAYS honor 'robots.txt' and even try to interpret a 'badly written' robots.txt in your favour, e.g.:

 

User-agent: *
Disallow: 

User-agent: Googlebot
Disallow: /

Now WHAT would YOU do if you were 'Googlebot'?

 

Right. Spider the whole site! (A crawler that simply takes the FIRST record that applies to it would do exactly that.)

 

Google actually parses the rest of the file and will NOT spider your site in this case.
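
For what it's worth, the robots.txt parser in Python's standard library handles this file the same way: a record that names the crawler wins over the `*` record, whatever their order in the file. A minimal sketch, where OtherBot and the www.mydomain.com URL are placeholders rather than a real crawler or site:

```python
from urllib.robotparser import RobotFileParser

# The "badly ordered" file from above: the catch-all record comes first.
ROBOTS_TXT = """\
User-agent: *
Disallow:

User-agent: Googlebot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

URL = "http://www.mydomain.com/index.php"

# The Googlebot record is consulted before the "*" fallback.
print("Googlebot:", parser.can_fetch("Googlebot", URL))
# A crawler with no record of its own falls back to "*" (empty Disallow).
print("OtherBot:", parser.can_fetch("OtherBot", URL))
```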

 

Also, they have recently started 'accepting' non-standard syntax:

User-agent: *
Disallow: /jane
Allow: /john

Keep in mind: there IS NO 'Allow' in the original standard! Google might honor it nevertheless (and ONLY Google).

 

There is also a recent 'experimental' feature, again ONLY on Google, that allows 'wildcarding' (which is NOT normally possible), so you COULD write:

User-agent: Googlebot
Disallow: *.cgi

Still, no other engine supports that, so to keep things simple it is better to stick to the 'official' rules, which would make this look more like:

User-agent: *
Disallow: /cgi-bin/
Disallow: /secret/secret.cgi
Disallow: /catalog/special/another_secret.cgi
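
The reason the explicit list is the safe choice: under the 'official' rules a Disallow value is just a path prefix, so a crawler that knows nothing about wildcards sees "*.cgi" as a literal string that never matches. The helper below, is_blocked, is a deliberately simplified stand-in written for this illustration and not any search engine's actual code.

```python
def is_blocked(path: str, disallow_rules: list[str]) -> bool:
    """Original-style matching: a rule blocks every path that starts with it.

    An empty rule blocks nothing, matching the meaning of a bare "Disallow:".
    """
    return any(rule and path.startswith(rule) for rule in disallow_rules)


official = [
    "/cgi-bin/",
    "/secret/secret.cgi",
    "/catalog/special/another_secret.cgi",
]
wildcard = ["*.cgi"]

# The explicit prefixes block the script for every prefix-matching crawler.
print(is_blocked("/secret/secret.cgi", official))
# A crawler without wildcard support never matches the literal "*.cgi".
print(is_blocked("/secret/secret.cgi", wildcard))
```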

 

Sorry. Can't help writing books every time... :-)



Bob, thank you! This was driving me crazy, I couldn't tell what was wrong. I'll fix it right now and test later (can't test from work).


Actually my ROBOTS.TXT is already at www.imagecritique.com/robots.txt. In my FTP program, the actual folder name is "/public_html". This is also where index.php, etc. are, and it is in effect my "catalog" folder.

 

The reason it appears to be in st15.startlogic.com/~imagecri/, like in the Google result above, is because that's the URL that is used by my webhost when a secure page such as checkout_shipping.php is called. With my webhost they use this different server (?) when a secure page is required. Not sure why or if that has something to do with the problem.

 

So it seems that Google should have started at www.imagecritique.com, found ROBOTS.TXT, seen that checkout_shipping.php is disallowed, and not tried to crawl it?

 

Do I need to move ROBOTS.TXT one directory higher? I have a "higher" folder which just has stuff like ".qmail-chris", etc., in it. Just a few files. Should it be there instead?

 

Maybe there's something wrong with ROBOTS.TXT? Here it is:

 

User-agent: *
Disallow: /_private/
Disallow: /vti_bin/
Disallow: /vti_cnf/
Disallow: /vti_log/
Disallow: /vti_pvt/
Disallow: /vti_txt/
Disallow: /cgi-bin/
Disallow: /admin/
Disallow: /download/
Disallow: /images/
Disallow: /font/
Disallow: /pub/
Disallow: /account.php
Disallow: /account_edit.php
Disallow: /account_edit_process.php
Disallow: /account_history.php
Disallow: /account_history_info.php
Disallow: /account_newsletters.php
Disallow: /account_notifications.php
Disallow: /account_password.php
Disallow: /add_checkout_success.php
Disallow: /address_book.php
Disallow: /address_book_process.php
Disallow: /advanced_search.php
Disallow: /advanced_search_result.php
Disallow: /checkout_address.php
Disallow: /checkout_confirmation.php
Disallow: /checkout_payment.php
Disallow: /checkout_payment_address.php
Disallow: /checkout_process.php
Disallow: /checkout_shipping.php
Disallow: /checkout_shipping_address.php
Disallow: /checkout_success.php
Disallow: /contact_us.php
Disallow: /cookie_usage.php
Disallow: /create_account.php
Disallow: /create_account_process.php
Disallow: /create_account_success.php
Disallow: /download.php
Disallow: /login.php
Disallow: /logoff.php
Disallow: /info_shopping_cart.php
Disallow: /ipn.php
Disallow: /password_forgotten.php
Disallow: /popup_coupon_help.php
Disallow: /popup_image.php
Disallow: /popup_paypal.php
Disallow: /popup_search_help.php
Disallow: /product_reviews_write.php 
Disallow: /product_thumb.php
Disallow: /redirect.php
Disallow: /shopping_cart.php
Disallow: /ssl_check.php
Disallow: /tell_a_friend.php
Disallow: /wishlist.php
Disallow: /wishlist_email.php
Disallow: /wishlist_help.php


Well, I've seen it already ;-)

 

The problem (which probably everyone using shared SSL proxying has) is that robots.txt must APPEAR to the search engine (SE) to be in the root of the host it is crawling. The SE knows nothing about your internal structure; it sees only URIs.

 

I really have to check that for my site, there might be the same problem (when going secure).

 

For everything below 'www.imagecritique.com', the corresponding 'robots.txt' must be

 

www.imagecritique.com/robots.txt

 

whereas everything that can be found below 'st15.startlogic.com' has to have a 'robots.txt' at

 

st15.startlogic.com/robots.txt

 

(which they don't have, so everything below it, e.g. 'st15.startlogic.com/~imagecri/', might be spidered IF there is a link to that area somewhere ON THE WEB)

 

For understandable reasons, you CANNOT specify something like

 

Disallow: st15.startlogic.com/~imagecri/whatever.php

 

in your 'www.imagecritique.com/robots.txt'. Hmmmmm.
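
To illustrate why that cannot work: Disallow values are compared against the path of the URL only, so a value that starts with a host name never matches anything. A small sketch using Python's standard parser, which at least behaves this way (SomeBot is a made-up crawler name, and other parsers may treat such a line differently):

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    # Host name inside the rule: read as an odd path prefix, not as a host.
    "Disallow: st15.startlogic.com/~imagecri/whatever.php",
])

secure_url = "https://st15.startlogic.com/~imagecri/whatever.php"

# The path being checked is "/~imagecri/whatever.php", which does not start
# with "st15.startlogic.com/...", so the rule has no effect.
print(parser.can_fetch("SomeBot", secure_url))
```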

 

If your provider won't help you there, I don't know what to do (except getting your own cert). The problem is, the (poor) providers can't generalize: some users might WANT to be indexed, some NOT. And I wouldn't be sure that YOU want a complete exclusion of your secure pages.

 

If you wanted this, you might ask them to put a

 

User-agent: *

Disallow: /~imagecri/

 

in THEIR 'st15.startlogic.com/robots.txt'.

 

This (general) problem apparently needs to be thought through, since it affects all of us who use "shared SSL". Thanks for bringing it to light.

 

Any valid comments, anybody?


Bob, thank you!  This was driving me crazy, I couldn't tell what was wrong.  I'll fix it right now and test later (can't test from work).

 

AHHHHHHHH - much better now :thumbsup:


I see what you're saying... Google probably just found st15.start.../~imagecri on its own and started spidering.

 

OK, so if I can't put that disallow in my own robots.txt, and say my provider won't add the lines above to theirs, couldn't I do this:

 

Add specific disallows to the pages themselves? I think I've seen meta tags for doing this, where you can disallow from the page code itself? I don't remember exactly what the syntax is, but I think I've seen it. So I would add these to all my pages. Can this be done?


Sure, and I'm so sad you can't use my new 'universal' HTC yet... (it has something called 'EXTRA_META_TAGS'...)

 

Well, for the time being, use

<meta name="robots" content="noindex,nofollow">

within the <head> part of your pages.

 

These are valid combinations:

<meta name="robots" content="noindex,nofollow">

(don't index THIS page and DON'T follow the links)

 

<meta name="robots" content="noindex,follow">

(don't index THIS page but FOLLOW the links)

 

<meta name="robots" content="index,nofollow">

(INDEX this page but DON'T follow the links)

 

<meta name="robots" content="index,follow">

(INDEX this page and FOLLOW the links)
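
After adding the tags, you could check what a page actually sends with a few lines of Python. This is a rough sketch using only the standard html.parser module; the class RobotsMetaParser and the function robots_policy are names invented for this example, and a page with no robots meta tag is treated as "index,follow", which is the usual default.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of a <meta name="robots"> tag, if present."""

    def __init__(self):
        super().__init__()
        self.directives = None  # stays None when no robots meta tag is found

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives = {
                d.strip().lower() for d in content.split(",") if d.strip()
            }


def robots_policy(html: str) -> dict:
    """Return whether the page may be indexed and its links followed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives or set()
    # Anything not explicitly forbidden is allowed.
    return {"index": "noindex" not in found, "follow": "nofollow" not in found}


page = (
    '<html><head><meta name="robots" content="noindex,follow">'
    "</head><body></body></html>"
)
print(robots_policy(page))
```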

 

Hope it works out for you!


Hi Chris, I just noticed Googlebot has done the same thing to me. But if you click the link, it doesn't take you to checkout_shipping.php; it takes you to login.php. Just wanted to add to your mystery :rolleyes: Actually, I think it is because checkout_shipping.php is secure and redirects to login.

Happy Holidays!

Log Cabin Fever Gifts


Yep, that is what's happening. Well, I made the changes that mhormann recommended above by adding the robots meta tags to those pages. I also noticed that I had added my Meta Tags Controller code to all those pages! But it doesn't make sense to use it for those, and this is probably why the search engine was giving them "weight" in determining whether they should be listed. So I removed the contrib code from those pages and added the "noindex,nofollow" items to them.

 

I also removed the contrib code from a few pages I DO want SE's to index such as links.php, and other informational pages, and customized the Title, Keywords, Description tags for those pages by hand.

 

Now I only use the Meta Tags Controller on index.php, allprods.php, product_info.php, and a couple others.

 

All of this should make for more relevant links on Google and other search engines. :)

 

Thanks everyone for your great ideas!


Archived

This topic is now archived and is closed to further replies.
