xcelprinting.com Posted March 6, 2008
I am trying to figure out which areas spiders should be allowed to crawl, and which to disallow so that folders with sensitive or confidential data are not crawled. Can someone help and tell me which folders I should disallow and which I should allow? I want the catalog itself, and all other links or areas across my site that are intended to be viewed by the user, to be crawled and indexed. Thanks in advance for your help.
Jon Torres, President
♥FWR Media Posted March 6, 2008
Well, I'm not saying it's right, but this is what I use at the moment (I never really looked into it much):
User-agent: *
Disallow: /MY ADMIN FOLDER/
Disallow: /account.php
Disallow: /advanced_search.php
Disallow: /checkout_shipping.php
Disallow: /create_account.php
Disallow: /cookie_usage.php
Disallow: /login.php
Disallow: /password_forgotten.php
Disallow: /popup_image.php
Disallow: /shopping_cart.php
Disallow: /product_reviews_write.php
Ultimate SEO Urls 5 PRO - Multi Language Modern, Powerful SEO Urls KissMT Dynamic SEO Meta & Canonical Header Tags KissER Error Handling and Debugging KissIT Image Thumbnailer Security Pro - Querystring protection against hackers ( a KISS contribution ) If you found my post useful please click the "Like This" button to the right. Please only PM me for paid work.
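To see how a well-behaved crawler would read rules like these, here is a minimal sketch using Python's standard `urllib.robotparser`. The `ROBOTS_TXT` string is a shortened, illustrative copy of the list above, and `is_allowed` is a helper name made up for this example. It only models compliant robots; nothing here stops a crawler that ignores robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Shortened, illustrative copy of the rules posted above (not the full list).
ROBOTS_TXT = """\
User-agent: *
Disallow: /account.php
Disallow: /login.php
Disallow: /shopping_cart.php
"""


def is_allowed(robots_txt, path, agent="*"):
    """Return True if a compliant crawler may fetch `path` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Paths that are not listed under Disallow are allowed by default.
    for path in ("/login.php", "/index.php", "/product_info.php"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, path) else "disallowed"
        print(path, verdict)
```

Anything that does not match a Disallow line counts as allowed, so the file works as a list of exceptions and there is no need to name every page you want indexed.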
xcelprinting.com Posted March 6, 2008
So anything not mentioned is going to be crawled? I guess most of us have different folders, so could you tell me which ones in my catalog I should allow to be indexed? That way I can disallow the rest. Thanks for your help.
♥FWR Media Posted March 6, 2008
"So anything not mentioned is going to be crawled? I guess most of us have different folders, so could you tell me which ones in my catalog I should allow to be indexed? That way I can disallow the rest."
I think you may cause yourself problems that way. We need the bots, so I wouldn't focus on excluding them. What's the driver for this?
♥FWR Media Posted March 6, 2008
Also bear in mind that if "prevent spider sessions" is set to true in admin > sessions (which it MUST be, by the way), then all secure pages are excluded by default (not that a bot could log in anyway).
xcelprinting.com Posted March 6, 2008
"I think you may cause yourself problems that way. We need the bots, so I wouldn't focus on excluding them. What's the driver for this?"
Sorry, I meant folders, not the bots. What I meant to ask was which folders I should allow to be indexed, so that I can go and disallow the rest. I'm looking at my host's file manager in cPanel trying to figure it out, but there are so many folders and directories it's crazy.
xcelprinting.com Posted March 6, 2008
In my cPanel, if I go up all levels, these are the directories you see. There are of course directories within those directories, but this is the first level:
Directories: .cpanel .cpanel-datastore .fantasticodata .htpasswds .sqmaildata .trash access-logs cgi-bin etc mail public_ftp public_html ssl tmp www
Files: .bash_logout .bash_profile .bashrc .canna .contactemail .emacs .ftpquota .lastlogin .rnd .zshrc
Jack_mcs Posted March 7, 2008
No one from the web, not even search engines, should be able to get to any directory or file above public_html. If they can, the server is not set up properly. As for what to limit below public_html, that is decided by what you don't want listed. For example, some sites don't want their images listed, so they disallow the images directory. Generally speaking, if the file or directory is not something you want anyone to see, it should be disallowed.
Jack
Support Links: For Hire: Contact me for anything you need help with for your shop: upgrading, hosting, repairs, code written, etc. All of My Addons Get the latest versions of my addons Recommended SEO Addons
germ Posted March 7, 2008
You shouldn't have to exclude any folder that isn't referenced (linked to) in any of your HTML/PHP code, such as your /admin folder. If it's not linked to in any of your osC catalog pages, robots don't "guess" what folders you have. They just follow links.
And keep this in mind: just because you ask robots to "stay out" doesn't mean they have to comply. I used to run a site where I had a trap set for "bad robots". In the robots.txt file I excluded a folder that wasn't referenced in any of the files on the site. If a robot disobeyed the directive in the robots.txt file and went into that folder, the site recorded its IP address and executed a little PHP script that banned the IP address from the site altogether. I ran that site for about two years, and you'd probably be surprised at the list of banned IP addresses I accumulated. :lol:
If I suggest you edit any file(s) make a backup first - I'm not perfect and neither are you. "Given enough impetus a parallelogramatically shaped projectile can egress a circular orifice." - Me - "Headers already sent" - The definitive help "Cannot redeclare ..." - How to find/fix it SSL Implementation Help Like this post? "Like" it again over there >
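The trap described above can be sketched roughly as follows. The original used a PHP script whose code was not posted, so this is a Python illustration of the idea only, not that script. `TRAP_PREFIX`, `handle_request`, and the in-memory `banned` set are all hypothetical; a real site would more likely persist the list and enforce the ban in the web server configuration.

```python
# Hypothetical trap folder: disallowed in robots.txt and linked from nowhere,
# so only a robot that reads robots.txt and ignores it should ever request it.
TRAP_PREFIX = "/bot-trap/"


def handle_request(path, ip, banned):
    """Return an HTTP-style status code, banning any IP that enters the trap.

    `banned` is a set of IP address strings that this function may add to.
    """
    if ip in banned:
        return 403  # already banned: refuse everything
    if path.startswith(TRAP_PREFIX):
        banned.add(ip)  # the visitor ignored robots.txt, so record and ban it
        return 403
    return 200


if __name__ == "__main__":
    banned = set()
    print(handle_request("/index.php", "203.0.113.5", banned))
    print(handle_request("/bot-trap/secret.html", "203.0.113.5", banned))
    print(handle_request("/index.php", "203.0.113.5", banned))
```

The trap only works because the folder is named in robots.txt and nowhere else, which is also why listing a genuinely sensitive folder there tells bad robots exactly where to look.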
xcelprinting.com Posted March 7, 2008
Jim, I am not very savvy on the technical side of things, and I don't want to mess anything up or expose anything either. From the list of first-level directories I posted above, I assume the catalog, or the directories I should have indexed, are in public_html. Is that right? If someone could help me step by step, let me know what you need; I can post the other directory levels here. I didn't set up my site (I had someone else do it), so I don't want to allow a certain directory and then find out there was a file in it that shouldn't be indexed. If anyone is willing to give me a little time and help me out, PM me, or let me know if you have messenger or Skype. Your help is much appreciated, as I never had robots.txt set up and have been missing out on whatever benefits it could have brought me.
germ Posted March 7, 2008
I think what Robert posted would suffice, but I wouldn't mention the Admin folder. And your osC site is in your /shop folder, so yours would read:
User-agent: *
Disallow: /shop/account.php
Disallow: /shop/advanced_search.php
Disallow: /shop/checkout_shipping.php
Disallow: /shop/create_account.php
Disallow: /shop/cookie_usage.php
Disallow: /shop/login.php
Disallow: /shop/password_forgotten.php
Disallow: /shop/popup_image.php
Disallow: /shop/shopping_cart.php
Disallow: /shop/product_reviews_write.php
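If the shop lives in a subfolder, every path needs that folder as a prefix, as in the list above. A small sketch of generating the file from a page list, so the prefix is applied consistently, might look like this. `build_robots_txt` and `PAGES` are illustrative names, and the page list is a shortened copy of the one above.

```python
# Shortened, illustrative copy of the pages listed above.
PAGES = ["account.php", "login.php", "shopping_cart.php"]


def build_robots_txt(prefix, pages):
    """Build robots.txt text that disallows each page under `prefix`.

    `prefix` is the folder the shop is installed in ("shop" here);
    pass an empty string when the shop sits at the site root.
    """
    base = prefix.strip("/")
    root = f"/{base}" if base else ""
    lines = ["User-agent: *"]
    for page in pages:
        lines.append(f"Disallow: {root}/{page.lstrip('/')}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt("shop", PAGES), end="")
```

Note that robots.txt itself still has to be served from the site root (for example /robots.txt), not from inside /shop, for crawlers to find it.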
xcelprinting.com Posted March 7, 2008
If I click on the public_html directory in cPanel, these are the directories and files listed. My concern is whether I have to disallow every one of these, and whether they will be exposed if I don't:
Directories: Artwork _fpclass _private _vti_bin _vti_cnf _vti_log _vti_pvt _vti_txt blog cgi-bin hp images shop
Files: .htaccess Thumbs.db XCEL.htm XcelPrintLogo_300dpi.JPG XcelPrintLogo_300dpi_small.JPG _vti_inf.html animate.js emailblast.html emailpromo.htm googlea5603dff495d916e.html index.html postinfo.html realtorpg.html samples.html samples2.html site.jpg
When I click into "shop", this is what is in it. As you can see, there are items in there that shouldn't be indexed; this is where I need your help:
Directories: Swish_files _vti_cnf admin artwork download helpcenter images includes index_files mail portfolio temp templates
Files: .htaccess Thumbs.db about.html account.php account_edit.php account_history.php account_history_info.php account_newsletters.php account_notifications.php account_password.php address_book.php address_book_process.php advanced_search.php advanced_search_result.php all_products.php applicationguidelines.html blank.asp blank.html brokers.html brokersold.html checkout_confirmation.php checkout_payment.php checkout_payment_address.php checkout_payment_old.php checkout_process.php checkout_shipping.php checkout_shipping_address.php checkout_success.php comingsoon.html commonerrors.html conditions.php contact_us.php cookie_usage.php create_account.php create_account_success.php cvv_help.php design.html designquestionnaire.html designrates.html download.php error_log file_uploads.php fileprep.html folding.html header.php header2.php helpcenter.html incalc.htm index.html index.php indexcopy.html info_shopping_cart.php instruction_guidelines.html links.html login.php logoff.php mainbanner2.sbk mainbanner2.swf newsletter.php password_forgotten.php popup_image.php popup_search_help.php popup_tracker.php popup_tracking_header.htm portfolio-2.html portfolio.html pricematch.html privacy.html privacy.php product_info.php product_reviews.php product_reviews_info.php product_reviews_write.php products_new.php proof.html realtors.html redirect.php reviews.php shipping.php shopping_cart.php signs.html sitemap.html specials.php ssl_check.php stylesheet.css submitdesigndetails.html tell_a_friend.php templates.html terms.html testimonials.html thankyou.html tracking.php turnaround.html undercalc.htm upload.html usefullinks.htm we_design.html
germ Posted March 7, 2008
You only have to exclude things that are linked to from pages in your osC shop and that you'd rather not have indexed by legitimate robots. The only way you can "screw things up" (IMHO) is by excluding things that you don't have links to. Because robots.txt is publicly readable, listing them exposes them and gives the "bad robots" access to things they normally wouldn't find.
And it's not a mortal sin if you accidentally leave something out. At least I don't think so. ;) It's just a way to keep links to things you'd rather not have displayed out of search engine results. Personally, I think the list Robert posted, and I modified slightly, is enough.
xcelprinting.com Posted March 7, 2008
Jim, so if I don't disallow all those things that I listed, will they be crawled?
germ Posted March 7, 2008
"So if I don't disallow all those things that I listed, will they be crawled?"
Good robots don't crawl anything you don't have links to in legitimate pages, or anything you have excluded in robots.txt. Bad (or good) robots can't crawl anything they don't know is there. It's that simple. ;)
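As a rough illustration of "they just follow links", here is a sketch that pulls the link targets out of a page's HTML using Python's standard `html.parser`. `LinkCollector` and `discoverable_links` are made-up names, and the sample HTML is invented. Real crawlers can also learn URLs from other sources, such as links on other sites or sitemaps, so treat this as a simplification.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def discoverable_links(html):
    """Return the link targets a crawler could follow from this page."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    # Invented sample page: /shop/admin/ exists on disk but is never linked.
    page = '<a href="/shop/index.php">Shop</a> <a href="/shop/specials.php">Specials</a>'
    print(discoverable_links(page))
```

A folder that never appears in any href on the site, such as an admin folder, does not show up in this list, which is the point being made above.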
Guest Posted March 7, 2008
I just added to the robots.txt contribution tonight. It is the usual stuff, but I made it slightly harder to find the images folder and very hard to find the admin folder. It is really simple.
This topic is now archived and is closed to further replies.