ChrisW123 Posted December 20, 2004

I'm confused. I just did a keyword search in Google to look at my website listing, and it appears that they have indexed pages that I have disallowed in robots.txt, such as checkout_shipping.php in the example below:

    ImageCritique Photography - Online Image Download of Cheap Stock ...
    Online image downloads of cheap stock photos. We have ... photography needs. ImageCritique Photography - Online Image Download, Cheap Stock Photos, ...
    https://st15.startlogic.com/~imagecri/checkout_shipping.php - 23k - Dec 19, 2004

What would cause this? In robots.txt (in my root, the catalog folder) I have:

    User-agent: *
    ....
    Disallow: /checkout_shipping.php
    ....

And WHY, of all pages, would they select checkout_shipping.php in the first place?! The result above was the 3rd item on the FIRST page of the results. So it appears they have concluded that checkout_shipping.php best fits the search phrase I used. But why? Why not my home page? The content, title, and keywords on my home page match the search phrase better than the checkout_shipping.php page does. So I'm wondering why they would use it. Any ideas? Is it just a GoogleBot mystery? :)

boxtel Posted December 20, 2004

Quote:
    In robots.txt (in my Root (Catalog folder)) I have: User-agent: * .... Disallow: /checkout_shipping.php ....

You do not have a checkout_shipping.php in that listing; you have a ~imagecri/checkout_shipping.php.

Treasurer MFC

TCwho Posted December 20, 2004

???? Now I'm confused.

Drop_Shadow How Did You Hear About Us Email HTML Order Link ---- GMT -5:00

bobg7 Posted December 20, 2004

I'm not sure what's going on with your issue, but when I went to your webpage using Firefox, the background was white and the text was all but invisible (Royalty Free Online Image Download Galleries). Check your stylesheet.

Currently:

    BODY { color: #F2F2F2; background: #121015 margin: 0px;

Should be:

    BODY { color: #F2F2F2; background: #121015; margin: 0px;

It looks like you missed the ";" after the background value, but it's an easy fix.

Bob G.

Installed Contributions: CCGV, Close Popup, Dynamic Meta Tags, Easy Populate, Froogle Data Feeder, Google Position, Infobox Header Entire Row, Live Support for OSC, PayPal Seal with CC images, Report_m Sales, Shop by Price Revised, SQL Updater, Who's Online Enhancement, Footer, GNA EP Assistant and still going.

mhormann Posted December 20, 2004

"robots.txt" can ONLY be in the SERVER's root, never in a "catalog" root, i.e.

    www.mydomain.com/robots.txt

but NOT

    www.otherdomain.com/~username/robots.txt
    www.mydomain.com/catalog/robots.txt

or the like! See my Help on 'robots.txt'.

I don't want to set the world on fire, I just want to start a flame in your heart.
osCommerce Contributions: Class cc_show() v1.0 – Show Credit Cards, Gateways; More Product Weight v1.0

mhormann Posted December 20, 2004

THIS BOARD IS SUCH A MESS! It always lets me edit and then says "You're not allowed to"... So here is the rest of my previous post.

In your case, the file should be:

    https://st15.startlogic.com/robots.txt

and probably contain something like:

    # ChrisW123 robots.txt

    # Currently disallow all shop stuff to the Google Image bot
    User-agent: Googlebot-Image
    Disallow: /cgi-bin/
    Disallow: /usage/
    Disallow: /~imagecri/

    # ALL search engine spiders/crawlers (put at end of file)
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /~imagecri/admin/
    Disallow: /~imagecri/cache/
    Disallow: /~imagecri/download/
    Disallow: /~imagecri/images/
    Disallow: /~imagecri/includes/
    Disallow: /~imagecri/pub/
    Disallow: /~imagecri/account.php
    Disallow: /~imagecri/advanced_search.php
    Disallow: /~imagecri/checkout_shipping.php
    Disallow: /~imagecri/create_account.php
    Disallow: /~imagecri/login.php
    Disallow: /~imagecri/password_forgotten.php
    Disallow: /~imagecri/popup_image.php
    Disallow: /~imagecri/shopping_cart.php

(If at all possible, I'd try to get rid of the "~imagecri"; customers assume this is a user directory at a free web host, which is not good for your reputation.)

Hope that helps,
Matthias

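The placement rule described above can be sketched in a few lines of Python. This is only an illustration using the standard library: the helper name `robots_txt_url` is made up for this example, and the `http://` scheme for www.imagecritique.com is an assumption. It shows that a crawler derives the robots.txt location from the scheme and host of the page it wants to fetch and ignores any directory in the path.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url.

    Hypothetical helper: keeps only the scheme and host (plus port, if any)
    and replaces the whole path with /robots.txt.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    for url in (
        "http://www.imagecritique.com/index.php",  # scheme assumed
        "https://st15.startlogic.com/~imagecri/checkout_shipping.php",
    ):
        print(url, "->", robots_txt_url(url))
```

Under this sketch, a file placed at /~imagecri/robots.txt is never consulted, because the path component is discarded before the lookup.
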
mhormann Posted December 20, 2004

I just did some little checks, ChrisW123, so don't be alarmed. It's GOOD that you have secured your directories. You probably also want to take the "admin" part out of your admin directory's name.

Always keep in mind (everybody) that "robots.txt" must NOT be mistaken for a SECURITY measure! It actually helps "hackers" find potentially insecure paths and files. It is ONLY a method to help "well-behaved" spiders along their way (THEY don't want to waste energy on worthless content, YOU don't want to waste bandwidth), and even this is disregarded by some "not-so-well-behaved" (i.e., malicious) spiders...

Btw, you (or your provider) will get about a zillion 404 log entries as long as there's no 'robots.txt' in the web root. Almost ALL spiders explicitly check for that file, and on your site a '404' page comes up.

Moral: ALWAYS put a 'robots.txt' in place if you run a web server, even if it is one of the simplest possible.

ALLOW EVERYTHING:

    User-agent: *
    Disallow:

DENY EVERYTHING:

    User-agent: *
    Disallow: /

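To see what the two minimal files above mean to a crawler, here is a small sketch using Python's standard `urllib.robotparser`. It reflects how that particular parser reads the files, not necessarily how any given search engine does; "ExampleBot" is a made-up crawler name, and `parse_rules` is a helper invented for this example.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text: str) -> RobotFileParser:
    """Build a parser from robots.txt text, without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# The two minimal files from the post above.
ALLOW_EVERYTHING = "User-agent: *\nDisallow:\n"
DENY_EVERYTHING = "User-agent: *\nDisallow: /\n"

allow_all = parse_rules(ALLOW_EVERYTHING)
deny_all = parse_rules(DENY_EVERYTHING)

if __name__ == "__main__":
    # "ExampleBot" is a hypothetical user agent used only for illustration.
    path = "/checkout_shipping.php"
    print("allow-everything file:", allow_all.can_fetch("ExampleBot", path))
    print("deny-everything file: ", deny_all.can_fetch("ExampleBot", path))
```

An empty `Disallow:` value blocks nothing, while `Disallow: /` matches every path because every path starts with "/".
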
mhormann Posted December 20, 2004

Oops... to answer your original question: Google is actually one of the most "well-behaved" spiders. They will ALWAYS honor 'robots.txt' and even try to interpret a 'badly written' robots.txt in your favour, e.g.:

    User-agent: *
    Disallow:
    User-agent: Googlebot
    Disallow: /

Now WHAT would YOU do if you were 'Googlebot'? Right: spider the whole site! (Which would be in accordance with the rules, since they state 'take the FIRST rule that applies to you'.) Google actually parses the rest and will NOT spider your site in this case.

Also, they lately started kind of 'accepting' bad syntax:

    User-agent: *
    Disallow: /jane
    Allow: /john

Keep in mind: there IS NO 'Allow'! Google might honor it nevertheless (and ONLY they do).

There is also a recent 'developmental' feature, also ONLY on Google, that allows 'wildcarding' (which is NOT normally possible), so you COULD write:

    User-agent: Googlebot
    Disallow: *.cgi

Still, no other engine uses that, so to keep it simple it is better to stick with the 'official' rules, which would probably look more like this:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /secret/secret.cgi
    Disallow: /catalog/special/another_secret.cgi

Sorry. I can't help writing books every time... :-)

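The record-selection question in the first example above can be tried out with Python's standard `urllib.robotparser`. This sketch only shows how that one parser resolves the file; it says nothing authoritative about Google's own implementation. "ExampleBot" and the `/index.php` path are made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# The "badly written" file from the post: the catch-all record comes first.
ROBOTS_TXT = """\
User-agent: *
Disallow:

User-agent: Googlebot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# This parser keeps the "*" record as a fallback and checks records that
# name a specific agent first, regardless of their order in the file.
googlebot_allowed = parser.can_fetch("Googlebot", "/index.php")
other_allowed = parser.can_fetch("ExampleBot", "/index.php")  # hypothetical bot

if __name__ == "__main__":
    print("Googlebot may fetch /index.php: ", googlebot_allowed)
    print("ExampleBot may fetch /index.php:", other_allowed)
```

With this parser the Googlebot-specific record wins for Googlebot even though it appears second, which matches the behaviour the post attributes to Google.
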
ChrisW123 Posted December 20, 2004

Quote:
    ....using Firefox, the background is white and the text is all but invisible (Royalty Free Online Image Download Galleries). Check your stylesheet. Should be: BODY { color: #F2F2F2; background: #121015; margin: 0px; Looks like you missed the ";" but it's an easy fix.

Bob, thank you! This was driving me crazy; I couldn't tell what was wrong. I'll fix it right now and test later (can't test from work).

ChrisW123 Posted December 20, 2004

Quote:
    "robots.txt" can ONLY be in the SERVER's root, never in a "catalog" root, i.e. www.mydomain.com/robots.txt but NOT www.otherdomain.com/~username/robots.txt www.mydomain.com/catalog/robots.txt

Actually, my robots.txt is already at www.imagecritique.com/robots.txt. In my FTP program, the actual folder name is "/public_html". This is also where index.php etc. live, and it is my "catalog" folder in effect.

The reason it appears to be in st15.startlogic.com/~imagecri/, as in the Google result above, is that this is the URL my webhost uses when a secure page such as checkout_shipping.php is called. My webhost uses this different server (?) when a secure page is required. I'm not sure why, or whether that has something to do with the problem.

So it seems that Google should have started at www.imagecritique.com, found robots.txt, seen that checkout_shipping.php is disallowed, and not tried to browse it?

Do I need to move robots.txt one directory higher? I have a "higher" folder which just has stuff like ".qmail-chris" in it, just a few files. Should it be there instead?

Maybe there's something wrong with robots.txt?
Here it is:

    User-agent: *
    Disallow: /_private/
    Disallow: /vti_bin/
    Disallow: /vti_cnf/
    Disallow: /vti_log/
    Disallow: /vti_pvt/
    Disallow: /vti_txt/
    Disallow: /cgi-bin/
    Disallow: /admin/
    Disallow: /download/
    Disallow: /images/
    Disallow: /font/
    Disallow: /pub/
    Disallow: /account.php
    Disallow: /account_edit.php
    Disallow: /account_edit_process.php
    Disallow: /account_history.php
    Disallow: /account_history_info.php
    Disallow: /account_newsletters.php
    Disallow: /account_notifications.php
    Disallow: /account_password.php
    Disallow: /add_checkout_success.php
    Disallow: /address_book.php
    Disallow: /address_book_process.php
    Disallow: /advanced_search.php
    Disallow: /advanced_search_result.php
    Disallow: /checkout_address.php
    Disallow: /checkout_confirmation.php
    Disallow: /checkout_payment.php
    Disallow: /checkout_payment_address.php
    Disallow: /checkout_process.php
    Disallow: /checkout_shipping.php
    Disallow: /checkout_shipping_address.php
    Disallow: /checkout_success.php
    Disallow: /contact_us.php
    Disallow: /cookie_usage.php
    Disallow: /create_account.php
    Disallow: /create_account_process.php
    Disallow: /create_account_success.php
    Disallow: /download.php
    Disallow: /login.php
    Disallow: /logoff.php
    Disallow: /info_shopping_cart.php
    Disallow: /ipn.php
    Disallow: /password_forgotten.php
    Disallow: /popup_coupon_help.php
    Disallow: /popup_image.php
    Disallow: /popup_paypal.php
    Disallow: /popup_search_help.php
    Disallow: /product_reviews_write.php
    Disallow: /product_thumb.php
    Disallow: /redirect.php
    Disallow: /shopping_cart.php
    Disallow: /ssl_check.php
    Disallow: /tell_a_friend.php
    Disallow: /wishlist.php
    Disallow: /wishlist_email.php
    Disallow: /wishlist_help.php

mhormann Posted December 20, 2004

Well, I've seen it already ;-)

The problem (which probably everyone using shared SSL proxying has) is that the file must APPEAR to the search engine to be in the root. The SE knows nothing about your internal structures; it sees only URIs. I really have to check that for my own site; there might be the same problem when going secure.

For everything below 'www.imagecritique.com', the corresponding 'robots.txt' must be

    www.imagecritique.com/robots.txt

whereas everything that could be found below 'st15.startlogic.com' has to have a 'robots.txt' at

    st15.startlogic.com/robots.txt

(which they don't have, and so everything below it, i.e. 'st15.startlogic.com/~imagecri/', might be spidered IF there is a link to that area somewhere IN THE WWW).

For understandable reasons, you CANNOT specify something like

    Disallow: st15.startlogic.com/~imagecri/whatever.php

in your 'www.imagecritique.com/robots.txt'.

Hmmmmm. If your provider won't help you there, I don't know what to do (except going for your own cert). The problem is, the (poor) providers can't generalize: some users might WANT to be indexed, some NOT. And I wouldn't be sure whether YOU want a complete exclusion of your secure pages. If you do, you might ask them to put

    User-agent: *
    Disallow: /~imagecri/

in THEIR 'st15.startlogic.com/robots.txt'.

This (general) problem apparently needs to be thought over, since it affects all of us who use "shared SSL". Thanks for bringing it to light. Any valid comments, anybody?

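The per-host scoping described above can be sketched in Python as follows. This is a simplified model rather than any search engine's real logic: `RULES_BY_HOST`, `build` and `may_crawl` are names invented for this example, the rule set is a two-line excerpt of the posted file, the `http://` scheme for www.imagecritique.com is assumed, and "ExampleBot" is a made-up crawler.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Two-line excerpt of the rules served at www.imagecritique.com/robots.txt.
WWW_RULES = """\
User-agent: *
Disallow: /admin/
Disallow: /checkout_shipping.php
"""


def build(text: str) -> RobotFileParser:
    """Parse robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Rules are looked up per host. st15.startlogic.com serves no robots.txt
# in this scenario, so it has no entry here.
RULES_BY_HOST = {"www.imagecritique.com": build(WWW_RULES)}


def may_crawl(url: str, agent: str = "ExampleBot") -> bool:
    """Return True if the (hypothetical) crawler may fetch url."""
    rules = RULES_BY_HOST.get(urlsplit(url).hostname)
    if rules is None:
        # No robots.txt known for this host: nothing is disallowed.
        return True
    return rules.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.imagecritique.com/checkout_shipping.php",
        "https://st15.startlogic.com/~imagecri/checkout_shipping.php",
    ):
        print(url, "->", "crawl" if may_crawl(url) else "skip")
```

The same page is skipped under its www.imagecritique.com address but crawlable under the shared SSL address, because the second host has no rules of its own.
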
bobg7 Posted December 20, 2004

Quote:
    Bob, thank you! This was driving me crazy, I couldn't tell what was wrong. I'll fix it right now and test later (can't test from work).

AHHHHHHHH - much better now :thumbsup:

ChrisW123 Posted December 21, 2004

Quote:
    If you wanted this, you might ask them to put a User-agent: * Disallow: /~imagecri/ in THEIR 'st15.startlogic.com/robots.txt'.

I see what you're saying... Google probably just found st15.start.../~imagecri on its own and started spidering.

OK, so if I can't put that disallow in my robots.txt, and say my provider won't add the lines above, couldn't I add specific disallows to the pages themselves? I think I've seen meta tags for doing this, where you can disallow from within the page code itself. I don't remember exactly what the syntax is, but I think I've seen it. So I would add these to all my pages. Can this be done?

mhormann Posted December 21, 2004

Sure, and I'm so sad you can't use my new 'universal' HTC yet... (it has something called 'EXTRA_META_TAGS'...)

Well, for the time being, use

    <meta name="robots" content="noindex,nofollow">

within the <head> part of your pages. These are the valid combinations:

    <meta name="robots" content="noindex,nofollow">   (don't index THIS page and DON'T follow the links)
    <meta name="robots" content="noindex,follow">     (don't index THIS page but FOLLOW the links)
    <meta name="robots" content="index,nofollow">     (INDEX this page but DON'T follow the links)
    <meta name="robots" content="index,follow">       (INDEX this page and FOLLOW the links)

Hope it works out for you!

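For anyone who wants to check that the tag actually ended up in a page's output, here is a small sketch using Python's standard `html.parser`. The class and function names (`RobotsMetaParser`, `robots_directives`) are invented for this example, and the HTML string is a stand-in for a real page, not the actual shop output.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Split "noindex,nofollow" into individual lower-case directives.
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


if __name__ == "__main__":
    # Stand-in page for illustration only.
    page = (
        "<html><head><title>Checkout</title>"
        '<meta name="robots" content="noindex,nofollow">'
        "</head><body></body></html>"
    )
    print(sorted(robots_directives(page)))
```

Unlike robots.txt, this tag travels with the page itself, so it applies no matter which hostname the page is served under.
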
ChrisW123 Posted December 21, 2004

Quote:
    Hope it works out for you!

Excellent, I thought I'd seen those tags before. Great explanation... Thanks, I'll try it!

Logcbnfvr Posted December 22, 2004

Quote:
    I just did a keyword search in Google to take a look at my website listing and it appears that they have indexed pages that I have disallowed in Robots.txt, checkout_shipping.php ... Is it just a GoogleBot mystery? :)

Hi Chris, I just noticed that Googlebot has done the same thing to me. But if you click the link, it doesn't take you to checkout_shipping.php; it takes you to login.php. Just wanted to add to your mystery :rolleyes:

Actually, I think it is because checkout_shipping.php is secure and redirects to login.

Happy Holidays!

Log Cabin Fever Gifts

ChrisW123 Posted December 22, 2004

Quote:
    But if you click the link it doesn't take you to checkout_shipping.php it takes you to the login.php. ... Actually I think it is because the checkout_shipping.php is secure and redirecting to login.

Yep, that is what's happening.

Well, I made the changes that mhormann recommended above by adding the robots meta tags to those pages. I also noticed that I had added my Meta Tags Controller code to all those pages! It doesn't make sense to use it for those, and this is probably why the search engine was giving them "weight" in determining whether they should be listed. So I removed the contrib code from those pages and added the "noindex,nofollow" tag to them.

I also removed the contrib code from a few pages I DO want SEs to index, such as links.php and other informational pages, and customized the Title, Keywords, and Description tags for those pages by hand. Now I only use the Meta Tags Controller on index.php, allprods.php, product_info.php, and a couple of others.

All of this should make for more relevant links on Google and other search engines. :) Thanks, everyone, for your great ideas!

Archived
This topic is now archived and is closed to further replies.