ROBOTS.TXT


xcelprinting.com

Posted

I am trying to work out which areas of my site spiders should be allowed to crawl, and which to disallow so that folders with sensitive or confidential data are not crawled. Can someone tell me which folders I should disallow and which I should allow? I want the catalog itself, and every other link or area of my site that is intended for users, to be crawled and indexed.

 

Thanks in advance for your help.

Jon Torres

President

Posted

Well, I'm not saying it's right, but this is what I use at the moment (I never really looked into it much):

 

User-agent: *
Disallow: /MY ADMIN FOLDER/
Disallow: /account.php
Disallow: /advanced_search.php
Disallow: /checkout_shipping.php
Disallow: /create_account.php
Disallow: /cookie_usage.php
Disallow: /login.php
Disallow: /password_forgotten.php
Disallow: /popup_image.php
Disallow: /shopping_cart.php
Disallow: /product_reviews_write.php

Posted

So anything not mentioned is going to be crawled?

 

I guess most of us have different folders, so could you tell me which ones in my catalog I should allow to be indexed? That way I can disallow the rest.

 

Thanks for your help.

 

Jon Torres

President

Posted

I think you may cause yourself problems that way. We need the bots, so I wouldn't focus on excluding them.

 

What's the driver for this?

Posted

Also bear in mind that if Prevent Spider Sessions is set to true in admin>sessions (which it MUST be, by the way), then all secure pages are excluded by default. Not that a bot could log in anyway.

Posted

Sorry, I meant folders, not the bots. What I meant to ask was which folders I should allow to be indexed, so that I can disallow the rest.

 

I'm looking through the file manager in my host's cPanel trying to figure it out, but there are so many folders and directories that it's crazy.

Jon Torres

President

Posted

In my cPanel, if I go up to the top level, these are the directories I see. There are of course more directories inside them, but this is the first level:

.cpanel

.cpanel-datastore

.fantasticodata

.htpasswds

.sqmaildata

.trash

access-logs

cgi-bin

etc

mail

public_ftp

public_html

ssl

tmp

www

Create New File

.bash_logout

.bash_profile

.bashrc

.canna

.contactemail

.emacs

.ftpquota

.lastlogin

.rnd

.zshrc

Jon Torres

President

Posted

No one from the web, not even search engines, should be able to get to any directory or file above public_html. If they can, the server is not set up properly. As for what to limit below public_html, that is decided by what you don't want listed. For example, some sites don't want their images listed, so they disallow the images directory. Generally speaking, if a file or directory is not something you want anyone to see, it should be disallowed.

 

Jack

Posted

You shouldn't have to exclude any folder that isn't referenced (linked to) in any of your HTML/PHP code, such as your /admin folder.

 

If it's not linked to in any of your osC catalog pages, they don't "guess" what folders you have. They just follow links.

 

And keep this in mind: just because you ask robots to "stay out" doesn't make it mandatory.

 

I used to run a site where I had a trap set for "bad robots".

 

In the robots.txt file I excluded a folder that wasn't referenced in any of the files on the site.

 

If the robot disobeyed the directive in the robots.txt file and went into that folder, it recorded their IP address and executed a little PHP script that banned the IP address from the site altogether.

 

I ran that site for about two years, and you'd probably be surprised at the list of banned IP addresses I accumulated.

:lol:
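
For anyone curious what such a trap might look like, here is a rough sketch. It is not the PHP script described above; it is a swapped-in, offline variant in Python that scans access-log lines for requests to the trap folder. The folder name `/bot-trap/`, the sample log lines, and the assumption that the server writes Common Log Format are all made up for illustration.

```python
import re

# Hypothetical trap folder: disallowed in robots.txt and linked from nowhere.
TRAP_PREFIX = "/bot-trap/"

# Matches the start of a Common Log Format line and captures the
# client IP and the requested path.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD|POST) (\S+)')


def find_trap_visitors(log_lines, trap_prefix=TRAP_PREFIX):
    """Return the sorted, de-duplicated IPs that requested anything under the trap."""
    offenders = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match.group(2).startswith(trap_prefix):
            offenders.add(match.group(1))
    return sorted(offenders)


# Invented sample lines using documentation-only IP ranges.
sample = [
    '203.0.113.7 - - [10/Oct/2009:13:55:36 -0700] "GET /bot-trap/index.html HTTP/1.0" 200 512',
    '198.51.100.4 - - [10/Oct/2009:13:56:01 -0700] "GET /shop/index.php HTTP/1.0" 200 2326',
    '203.0.113.7 - - [10/Oct/2009:13:57:12 -0700] "GET /bot-trap/ HTTP/1.0" 200 512',
]
print(find_trap_visitors(sample))
```

The resulting list could then be fed into whatever blocking mechanism the host offers; how to do that is outside this sketch.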

If I suggest you edit any file(s) make a backup first - I'm not perfect and neither are you.

 

"Given enough impetus a parallelogramatically shaped projectile can egress a circular orifice."

- Me -

 

"Headers already sent" - The definitive help

 

"Cannot redeclare ..." - How to find/fix it

 

SSL Implementation Help

 

Like this post? "Like" it again over there >

Posted

Jim,

I am not very savvy on the technical side of things, and I don't want to mess anything up or expose anything either.

 

Going by the list of first-level directories I posted above, I assume the catalog side of things, meaning the directories I should have indexed, is in public_html. Is that right?

 

If someone could help me step by step, let me know what you need. I can post the other directory levels here if that helps to figure this out properly. I didn't set up my site (I had someone else do it), so I don't want to allow a certain directory and then find out there were files in it that shouldn't be indexed. If anyone is willing to give me a little time and help me out, PM me, or let me know if you have messenger or Skype.

 

Your help is much appreciated, as I never had robots.txt set up and have been missing out on whatever benefits it could have brought me.

Jon Torres

President

Posted

I think what Robert posted would suffice, but I wouldn't mention the Admin folder.

 

And your osC site is in your /shop folder, so yours would read:

 

User-agent: *
Disallow: /shop/account.php
Disallow: /shop/advanced_search.php
Disallow: /shop/checkout_shipping.php
Disallow: /shop/create_account.php
Disallow: /shop/cookie_usage.php
Disallow: /shop/login.php
Disallow: /shop/password_forgotten.php
Disallow: /shop/popup_image.php
Disallow: /shop/shopping_cart.php
Disallow: /shop/product_reviews_write.php
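
If you want to double-check what a rule set like this actually blocks before uploading it, something along these lines may help. It is only a sketch using Python's standard `urllib.robotparser`; the helper name `is_allowed` and the agent name `ExampleBot` are made up for illustration, and real search engines may interpret rules slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the post above; the /shop prefix is specific to this site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /shop/account.php
Disallow: /shop/advanced_search.php
Disallow: /shop/checkout_shipping.php
Disallow: /shop/create_account.php
Disallow: /shop/cookie_usage.php
Disallow: /shop/login.php
Disallow: /shop/password_forgotten.php
Disallow: /shop/popup_image.php
Disallow: /shop/shopping_cart.php
Disallow: /shop/product_reviews_write.php
"""


def is_allowed(robots_txt, path, agent="ExampleBot"):
    """Return True if a well-behaved robot may fetch `path` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Hypothetical spot checks against pages from the shop folder.
for page in ("/shop/login.php", "/shop/product_info.php", "/shop/account_edit.php"):
    print(page, "allowed" if is_allowed(ROBOTS_TXT, page) else "blocked")
```

Rules are matched as path prefixes, so `/shop/account.php` does not cover `/shop/account_edit.php`. Whether that matters is a judgment call, since those account pages sit behind the login anyway.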


Posted

If I click on the public_html directory in cPanel, these are the directories and files listed. My concern is whether I have to disallow every one of these, and whether they will be exposed if I don't:

Artwork

_fpclass

_private

_vti_bin

_vti_cnf

_vti_log

_vti_pvt

_vti_txt

blog

cgi-bin

hp

images

shop

Create New File

.htaccess

Thumbs.db

XCEL.htm

XcelPrintLogo_300dpi.JPG

XcelPrintLogo_300dpi_small.JPG

_vti_inf.html

animate.js

emailblast.html

emailpromo.htm

googlea5603dff495d916e.html

index.html

postinfo.html

realtorpg.html

samples.html

samples2.html

site.jpg

 

 

 

When I click into "shop", this is what is in it. As you can see, there are items in there that shouldn't be indexed. This is where I need your help:

 

Swish_files

_vti_cnf

admin

artwork

download

helpcenter

images

includes

index_files

mail

portfolio

temp

templates

Create New File

.htaccess

Thumbs.db

about.html

account.php

account_edit.php

account_history.php

account_history_info.php

account_newsletters.php

account_notifications.php

account_password.php

address_book.php

address_book_process.php

advanced_search.php

advanced_search_result.php

all_products.php

applicationguidelines.html

blank.asp

blank.html

brokers.html

brokersold.html

checkout_confirmation.php

checkout_payment.php

checkout_payment_address.php

checkout_payment_old.php

checkout_process.php

checkout_shipping.php

checkout_shipping_address.php

checkout_success.php

comingsoon.html

commonerrors.html

conditions.php

contact_us.php

cookie_usage.php

create_account.php

create_account_success.php

cvv_help.php

design.html

designquestionnaire.html

designrates.html

download.php

error_log

file_uploads.php

fileprep.html

folding.html

header.php

header2.php

helpcenter.html

incalc.htm

index.html

index.php

indexcopy.html

info_shopping_cart.php

instruction_guidelines.html

links.html

login.php

logoff.php

mainbanner2.sbk

mainbanner2.swf

newsletter.php

password_forgotten.php

popup_image.php

popup_search_help.php

popup_tracker.php

popup_tracking_header.htm

portfolio-2.html

portfolio.html

pricematch.html

privacy.html

privacy.php

product_info.php

product_reviews.php

product_reviews_info.php

product_reviews_write.php

products_new.php

proof.html

realtors.html

redirect.php

reviews.php

shipping.php

shopping_cart.php

signs.html

sitemap.html

specials.php

ssl_check.php

stylesheet.css

submitdesigndetails.html

tell_a_friend.php

templates.html

terms.html

testimonials.html

thankyou.html

tracking.php

turnaround.html

undercalc.htm

upload.html

usefullinks.htm

we_design.html

Jon Torres

President

Posted

You only have to exclude things linked to from pages in your osC shop that you'd rather not get indexed by legitimate robots.

 

The only way you can "screw things up" (IMHO) is by excluding things that you don't have links to. Since robots.txt itself is publicly readable, listing them there exposes them and gives the "bad robots" access to things they normally wouldn't find.

 

And it's not a mortal sin if you accidentally leave something out.

 

At least I don't think so.

;)

 

It's just a way to keep links to things you'd rather not get displayed out of search engine results.

 

Personally, I think the list Robert posted, and I modified slightly, is enough.
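
To illustrate the point about exposing unlinked folders: anyone can request robots.txt and read it, so every Disallow line doubles as a hint about what exists. The sketch below pulls those paths out of a rules file with plain string handling. The function name `disallowed_paths` and the `/secret-admin/` entry are made up for the example.

```python
def disallowed_paths(robots_txt):
    """Return every non-empty Disallow path in a robots.txt body, in file order."""
    paths = []
    for raw in robots_txt.splitlines():
        # Drop trailing comments, then split "Field: value".
        line = raw.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical rules file that names a folder nothing on the site links to.
example = """\
User-agent: *
Disallow: /secret-admin/   # not linked anywhere, but now advertised here
Disallow: /shop/login.php
"""
print(disallowed_paths(example))
```

A robot that ignores the rules gets the same list, which is why leaving an unlinked admin folder out of the file is arguably the safer choice.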


Posted
So all those things that I listed, if I don't disallow them, will they be crawled?

Good robots don't crawl anything you don't have links to in legitimate pages, or anything you have excluded in robots.txt.

 

Bad (or good) robots can't crawl anything they don't know is there.

 

It's that simple.

;)
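
As a rough picture of "they just follow links", here is a minimal sketch of the discovery step using Python's standard `html.parser`. The class and function names and the example.com URLs are made up, and real crawlers can also learn about URLs from other sites, sitemaps and the like, so treat it as a simplification.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URLs of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def discover(base_url, html):
    """Return the URLs a link-following robot would learn from this page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical catalog page: it links to a product and the login page,
# but nothing on it mentions the admin folder.
page = (
    '<a href="product_info.php?products_id=1">Business cards</a> '
    '<a href="/shop/login.php">Login</a>'
)
print(discover("http://example.com/shop/index.php", page))
```

Nothing in that output points at an admin folder, because nothing on the page links to one.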


Posted

I just added to the robots.txt contribution tonight. It is the usual stuff, but I made it slightly harder to find the images folder and very hard to find the admin folder. It is really simple.

Archived

This topic is now archived and is closed to further replies.
