Guest Posted February 18, 2004
Hello guys,

I have noticed that google.com.au has updated my listing, but it only seems to have updated index.php (.html with the SEF add-on). I looked in my logs and found that Googlebot v2 had visited my site; however, according to my log it ended up at spiders.txt and then left.

Can anyone help me find out why Googlebot did not index my whole site, or even part of it? I have "Prevent Spider Sessions" turned on in Admin.

Here is a copy of spiders.txt, exactly as it is. Any help would be much appreciated.

$Id: spiders.txt,v 1.2 2003/05/05 17:58:17 dgw_ Exp $ almaden.ibm.com appie 1.1 architext ask jeeves asterias2.0 augurfind baiduspider bannana_bot bdcindexer crawler crawler@fast docomo fast-webcrawler fluffy the spider frooglebot geobot googlebot gulliver henrythemiragorobot ia_archiver infoseek kit_fireball lachesis lycos_spider mantraagent mercator moget/1.0 muscatferret nationaldirectory-webspider naverrobot ncsa beta netresearchserver ng/1.0 osis-project polybot pompos scooter seventwentyfour sidewinder sleek spider slurp/si [email protected] steeler/1.3 szukacz t-h-u-n-d-e-r-s-t-o-n-e teoma turnitinbot ultraseek vagabondo voilabot w3c_validator zao/0 zyborg/1.0
wizardsandwars Posted February 18, 2004
How long has it been? It can easily take about 3 months to get all of your pages indexed.

-------------------------------------------------------------------------
NOTE: As of Oct 2006, I'm not as active in this forum as I used to be, but I still work with osC quite a bit. If you have a question about any of my posts here, your best bet is to contact me through either email or PM in my profile, and I'll be happy to help.
Guest Posted February 18, 2004
OK, so this visit does not mean Googlebot will crawl the whole site. I was just concerned that maybe spiders.txt or something else was telling Google not to index my entire site.
user99999999 Posted February 18, 2004
Hi,

You should create a 'robots.txt' file that excludes spiders from your includes directory. There is no useful information in there, and you wasted Googlebot's time looking at some buttons with different languages on them. Just search for 'robots.txt' in Google. Plus, there are many more things you need to do.
Guest Posted February 18, 2004
Can you please help me with the robots.txt? Where do I source the content, and where do I put it (/includes/)? Any support would be much appreciated. Is the reason it did not index my site that I have no robots.txt?
♥yesudo Posted February 18, 2004
I think it goes in the root directory of your store.

http://www.google.co.uk/search?q=robots.tx...F-8&hl=en&meta=

Your online success is Paramount.
Guest Posted February 18, 2004
Hello.

I now have a robots.txt file, and once again Googlebot came; see below.

Visits from Googlebot
Googlebot
18/Feb/2004:18:27:48 64.68.82.169 /robots.txt "Googlebot/2.1
18/Feb/2004:18:27:49 64.68.82.169 / "Googlebot/2.1

You see how it stopped at robots.txt. What have I done wrong? I really need Google to index me, and it seems to keep stopping.

This is my robots.txt:

User-agent: *
Disallow: /chat/
Disallow: /live_support/
Disallow: /admin/

Please, any help from you SEOs would be much appreciated.
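A quick way to check whether rules like these could be stopping a crawler is to run them through a robots.txt parser. The sketch below is only an illustration, not anything from osCommerce or Google: it uses Python's standard `urllib.robotparser`, the helper name `allowed` is made up for this example, and real crawlers may interpret rules slightly differently.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules quoted in the post above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /chat/
Disallow: /live_support/
Disallow: /admin/
"""


def allowed(path, agent="Googlebot/2.1"):
    """Return True if the rules above let `agent` fetch `path`.

    `allowed` is a hypothetical helper name used only for this sketch.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Print the verdict for the home page, a catalog page and two blocked areas.
    for path in ("/", "/index.php", "/admin/login.php", "/chat/"):
        print(path, allowed(path))
```

If the home page and catalog pages come back as allowed, then the rules themselves are not what is keeping a well-behaved bot away from the store.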
wizardsandwars Posted February 18, 2004
First of all, you don't have to have a robots.txt file to be indexed by Google, or any other search engine.

Second, while there are thousands of things you can do to increase your rankings once you are fully indexed, you haven't done anything wrong as far as being fully parsed goes.

No one ever said that Google was supposed to come to your site and parse through all of your pages at once. Typically, this is not the way the Googlebots behave. For more information about this subject, you should research search engine spiders.
peterr Posted February 19, 2004
Hi,

Can someone please explain why the file spiders.txt is in the /catalog/includes/ path, whilst the file robots.txt is in the /catalog/ path? Wouldn't spiders look in the /catalog/ path for spiders.txt?

Peter
wizardsandwars Posted February 19, 2004
Because they are two different files that serve two different functions.

robots.txt will keep spiders from indexing directories or files that you specify.

spiders.txt is part of the spider session killer, which keeps the defined spider user agents from getting a session ID in the URL, so that session IDs don't end up in the search engine index.
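To make the spiders.txt side concrete, here is a minimal sketch of how such a check could work. It assumes the store compares each spiders.txt entry against the lowercased User-Agent header as a plain substring; the thread does not show osCommerce's actual code, so treat the function name `is_known_spider` and the matching rule as assumptions.

```python
def is_known_spider(user_agent, spider_entries):
    """Return True if any entry occurs inside the user agent, ignoring case.

    Hypothetical helper: it assumes simple substring matching, which the
    thread itself does not confirm.
    """
    agent = user_agent.lower()
    return any(
        entry.strip().lower() in agent
        for entry in spider_entries
        if entry.strip()  # skip blank lines
    )


# A few entries taken from the spiders.txt listings quoted in this thread.
SPIDER_ENTRIES = ["googlebot", "msnbot", "zyborg/1.0"]

if __name__ == "__main__":
    print(is_known_spider("Googlebot/2.1 (+http://www.googlebot.com/bot.html)", SPIDER_ENTRIES))
    print(is_known_spider("Mozilla/4.0 (compatible; MSIE 6.0)", SPIDER_ENTRIES))
```

Under that assumption, a plain entry such as googlebot would also cover a versioned agent string such as Googlebot/2.1, so version numbers would not need to be tracked in the file.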
peterr Posted February 19, 2004
Hi Chris,

Thanks for explaining that. Just to check that I have digested it properly. :)

1. robots.txt is for spiders/bots and is not used by osC as such (i.e. a browser user who is visiting your osC store/site will not use robots.txt).
2. spiders.txt is only for osC's internal controls and is not used by spiders/bots directly. It is used by osC (if you turn spider session IDs off in the admin section) to determine which spider user agents should not get a session ID.

Hope I have it right?

Peter
wizardsandwars Posted February 19, 2004
Correct. :D
Guest Posted February 23, 2004
Hi again.

Since I first posted, Googlebot has visited me every day. However, it NEVER gets past robots.txt. Is this normal? Surely it should have started looking through my site by now.

I have some good titles; for example, if someone googles "koala computers" I come up number one. But if they search for "Cheap computers", "Unlimited ADSL", "$49 adsl", etc., nothing comes up for me. I have those phrases in the title, meta tags, content and many pages.

Does any SEO know if I have missed something? Will Google fully index me? If you are a pro SEO then I am willing to pay you to get me ranked. Right now I am using pay per click on Google, which works very well and brings many sales, but I don't want to keep paying for it.

PLEASE help me get indexed. Email me on [email protected] with an SEO quote if you think you can get me ranked high.
user99999999 Posted February 23, 2004
Plus you would want to keep them out of your includes dir.

Disallow: /catalog/includes/
wizardsandwars Posted February 25, 2004
If your site comes up for "koala computers", then Googlebot *has* already been past your robots.txt.

The 'cheap computers' keyphrase you mention is *extremely* competitive, with over 5 million sites returned. You can optimize until you are blue in the face, and you'll never even sniff the top 50 pages.

'Unlimited ADSL' should be doable with the proper SEO. You'll notice that you also come up first for the search "Unlimited ADSL koala".

So the problem isn't necessarily that the spiders haven't indexed you; it's that you haven't optimized the site for the key phrases you want, or the key phrases you want are too competitive for you to have a realistic shot at getting anywhere near the top.

Needless to say, in either event, you're in the wrong forum.
peterr Posted March 11, 2004
Hi,

I'm somewhat mystified by the web server logs, as the spider/bot called 'msnbot' is showing up with session IDs on all its requests. Here is the setup:

1. Admin | Sessions | Prevent Spider Sessions | True

2. Contents of /catalog/robots.txt

User-agent: *
Disallow: /images/
Disallow: /includes/

3. Contents of /catalog/includes/spiders.txt

$Id: spiders.txt,v 1.2 2003/05/05 17:58:17 dgw_ Exp $ almaden.ibm.com appie 1.1 architext ask jeeves asterias2.0 augurfind baiduspider bannana_bot bdcindexer crawler crawler@fast docomo fast-webcrawler fluffy the spider frooglebot geobot gigabot googlebot gulliver henrythemiragorobot ia_archiver infoseek kit_fireball lachesis lycos_spider mantraagent mercator moget/1.0 msnbot muscatferret nationaldirectory-webspider naverrobot ncsa beta netresearchserver ng/1.0 osis-project polybot pompos scooter seventwentyfour sidewinder sleek spider slurp/si [email protected] steeler/1.3 szukacz t-h-u-n-d-e-r-s-t-o-n-e teoma turnitinbot ultraseek vagabondo voilabot w3c_validator zao/0 zyborg/1.0

4. Part of the web server logs .........
65.54.188.30 - - [04/Mar/2004:20:06:04 -0500] "GET /robots.txt HTTP/1.0" 200 54 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"
65.54.188.30 - - [04/Mar/2004:20:06:04 -0500] "GET / HTTP/1.0" 200 27141 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"
65.54.188.30 - - [04/Mar/2004:23:32:13 -0500] "GET /index.php?cPath=1&osCsid=5ddbfe3b41ade4da1ce255170335ac6e HTTP/1.0" 200 24579 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"
65.54.188.30 - - [04/Mar/2004:23:32:18 -0500] "GET /shopping_cart.php?osCsid=5ddbfe3b41ade4da1ce255170335ac6e HTTP/1.0" 200 20324 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"
65.54.188.30 - - [04/Mar/2004:23:32:45 -0500] "GET /index.php?cPath=6&osCsid=5ddbfe3b41ade4da1ce255170335ac6e HTTP/1.0" 200 24183 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"
65.54.188.30 - - [04/Mar/2004:23:33:10 -0500] "GET /index.php?cPath=25&osCsid=5ddbfe3b41ade4da1ce255170335ac6e HTTP/1.0" 200 19121 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"
65.54.188.30 - - [04/Mar/2004:23:34:23 -0500] "GET /specials.php?osCsid=5ddbfe3b41ade4da1ce255170335ac6e HTTP/1.0" 200 19613 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"
65.54.188.30 - - [04/Mar/2004:23:34:30 -0500] "GET /reviews.php?osCsid=5ddbfe3b41ade4da1ce255170335ac6e HTTP/1.0" 200 19534 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"

I noticed the file format for robots.txt and spiders.txt is Unix (LF only); is that correct?

Surely I wouldn't have to add msnbot/0.11 to the spiders.txt file, that is, continually monitor the web logs to see whether new version numbers are out?

Would the exclusion of the '/includes/' path in robots.txt have anything to do with it? Hmmm, I don't see how: spiders.txt is only for osC usage and isn't used by bots, and in fact a traversal of /includes/ is not possible.

Other things that may help solve the mystery ...........
/catalog/.htaccess

# $Id: .htaccess,v 1.3 2003/06/12 10:53:20 hpdl Exp $
#
# This is used with Apache WebServers
#
# For this to work, you must include the parameter 'Options' to
# the AllowOverride configuration
#
# Example:
#
# <Directory "/usr/local/apache/htdocs">
# AllowOverride Options
# </Directory>
#
# 'All' with also work. (This configuration is in the
# apache/conf/httpd.conf file)

# The following makes adjustments to the SSL protocol for Internet
# Explorer browsers

<IfModule mod_setenvif.c>
<IfDefine SSL>
SetEnvIf User-Agent ".*MSIE.*" \
nokeepalive ssl-unclean-shutdown \
downgrade-1.0 force-response-1.0
</IfDefine>
</IfModule>

# Fix certain PHP values
# (commented out by default to prevent errors occuring on certain
# servers)

#<IfModule mod_php4.c>
# php_value session.use_trans_sid 0
# php_value register_globals 1
#</IfModule>

/catalog/includes/.htaccess

# $Id: .htaccess,v 1.4 2001/04/22 20:30:03 dwatkins Exp $
#
# This is used with Apache WebServers
# The following blocks direct HTTP requests in this directory recursively
#
# For this to work, you must include the parameter 'Limit' to the AllowOverride configuration
#
# Example:
#
#<Directory "/usr/local/apache/htdocs">
# AllowOverride Limit
#
# 'All' with also work. (This configuration is in your apache/conf/httpd.conf file)
#
# This does not affect PHP include/require functions
#
# Example: http://server/catalog/includes/application_top.php will not work

<Files *.php>
Order Deny,Allow
Deny from all
</Files>

I did a phpinfo(), and:

session.use_trans_sid is On
register_globals is On

Thanks,
Peter
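For anyone wanting to run the same check on their own logs, the sketch below scans access-log lines in the format quoted above and lists the requests from a given bot that carry an osCsid parameter. It is only an illustration: the regular expression is written to fit the sample lines in this post, and the names `LOG_LINE` and `session_id_hits` are made up for the example.

```python
import re

# Pattern for the log format shown in the post above (an assumption that
# other servers may not match): ip, identity, user, [time], "request",
# status, size, "referer", "user agent".
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def session_id_hits(lines, agent_fragment):
    """Return the requested paths containing osCsid for the matching agent."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # ignore lines that do not fit the expected format
        is_bot = agent_fragment.lower() in match["agent"].lower()
        if is_bot and "osCsid=" in match["path"]:
            hits.append(match["path"])
    return hits


# Two of the log lines quoted in the post above.
SAMPLE = [
    '65.54.188.30 - - [04/Mar/2004:20:06:04 -0500] "GET /robots.txt HTTP/1.0" '
    '200 54 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"',
    '65.54.188.30 - - [04/Mar/2004:23:32:13 -0500] '
    '"GET /index.php?cPath=1&osCsid=5ddbfe3b41ade4da1ce255170335ac6e HTTP/1.0" '
    '200 24579 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"',
]

if __name__ == "__main__":
    for path in session_id_hits(SAMPLE, "msnbot"):
        print(path)
```

Note that this only shows which URLs the bot requested. It cannot tell whether the store handed out those session IDs on this visit or whether the bot was following links it had collected earlier.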
peterr Posted March 11, 2004
Hi,

Maybe I have to add the spiders/bots with their version numbers? I have noticed that for nearly a month now, all 'Googlebot' does is:

64.68.82.172 - - [01/Mar/2004:03:59:37 -0500] "GET /robots.txt HTTP/1.0" 200 54 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.82.172 - - [01/Mar/2004:03:59:39 -0500] "GET / HTTP/1.0" 200 24469 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.82.58 - - [01/Mar/2004:04:10:23 -0500] "GET /robots.txt HTTP/1.0" 200 54 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.82.58 - - [01/Mar/2004:04:10:24 -0500] "GET / HTTP/1.0" 200 24487 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

... and the entry for 'Googlebot' in /includes/spiders.txt is:

googlebot

So wouldn't this seem to suggest that the entry for 'Googlebot' should be:

Googlebot/2.1

and then it would spider the whole site, not just a page or two?

Thanks,
Peter
peterr Posted March 11, 2004
Hi,

Should the php.ini setting "session.use_trans_sid" be On or Off? From a phpinfo(), my setting is:

session.use_trans_sid is On

Yet the .htaccess in the /catalog/ path of osC MS-2 indicates that it should be Off, as follows:

#<IfModule mod_php4.c>
# php_value session.use_trans_sid 0
# php_value register_globals 1
#</IfModule>

From http://www.php.net/manual/en/ref.session.p...n.use-trans-sid

session.use_trans_sid boolean

session.use_trans_sid whether transparent sid support is enabled or not. Defaults to 0 (disabled).

Note: For PHP 4.1.2 or less, it is enabled by compiling with --enable-trans-sid. From PHP 4.2.0, trans-sid feature is always compiled. URL based session management has additional security risks compared to cookie based session management. Users may send a URL that contains an active session ID to their friends by email or users may save a URL that contains a session ID to their bookmarks and access your site with the same session ID always, for example.

Pulling my hair out on this one!! :D

Peter
peterr Posted March 15, 2004
Hi Dave,

"Plus you would want to keep them out of your includes dir. Disallow: /catalog/includes/"

When I saw what you said, I thought it was a good idea, so I went and added the above 'disallow' to 'robots.txt'. However, now that I have had some time to research spiders/bots further, I don't think it is such a good idea.

1. If you try http://yourdomainname/includes you should get a "403" (forbidden) message anyway. :D
2. By adding the "/includes/" path to "robots.txt", you are supplying any spider/bot with path/folder information that it would not otherwise have been able to know about.

I think I'll take it out. :)

Peter
user99999999 Posted March 15, 2004
Usually I put (one) robots.txt in the root dir. So if your store is in catalog, then 'Disallow: /catalog/includes/' would be right.

65.54.188.30 - - [04/Mar/2004:23:32:18 -0500] "GET /shopping_cart.php?osCsid=5ddbfe3b41ade4da1ce255170335ac6e HTTP/1.0" 200 20324 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"

This, to me, says your store is in the root directory, like so:

http://mydomain.com/shopping_cart.php

But all the files you mentioned said your store was in the catalog dir. In that case the log should look like this:

GET /catalog/shopping_cart.php?osCsid...

And you would reach your store like this:

http://mydomain.com/catalog/shopping_cart.php

Do you have two installed?
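The point about matching the Disallow path to where the store actually lives can be checked with a few lines. This is a rough sketch using Python's standard `urllib.robotparser`; the helper name `blocked` is hypothetical, and the icon path is the one mentioned elsewhere in this thread.

```python
from urllib.robotparser import RobotFileParser

# Rule as it would be written for a store installed under /catalog/.
RULES = """\
User-agent: *
Disallow: /catalog/includes/
"""


def blocked(path, agent="msnbot/0.11"):
    """Return True if RULES forbid `agent` from fetching `path` (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return not parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Store under /catalog/: the rule applies to this path.
    print(blocked("/catalog/includes/languages/images/icon.gif"))
    # Store in the site root: the same rule does not cover this path.
    print(blocked("/includes/languages/images/icon.gif"))
```

Disallow rules are matched as prefixes of the requested path, so a rule written for /catalog/includes/ does nothing for a store served from the site root, and vice versa.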
user99999999 Posted March 15, 2004
"2. By adding the "/includes/" path in "robots.txt", you are supplying any spider/bot with path/folder information, that they would not have otherwise been able to know about."

They know about it because your language images are in there.

http://www.mydomain.com/catalog/includes/l...images/icon.gif
peterr Posted March 15, 2004
Hi,

The point I'm _trying_ to make is that the following lines in robots.txt:

Disallow: /catalog/includes/

OR

Disallow: /includes/

are NOT needed, and are in fact redundant, because:

1. No spider/bot or web user should be able to get to the "/includes/" path anyway (i.e. permissions).
2. Supplying the "/includes/" path will compromise your security, because normally (due to path permissions) no one can 'see' that path. There is a significant amount of information in /includes/ that you wouldn't want anyone to view.
3. So why tell them it is there? It only makes a hacker's job easier. :D

Peter
user99999999 Posted March 15, 2004
That is all great, but the fact is no hacker will ever bother to look at your robots.txt file. This file is simply a no-trespassing sign for spiders/bots.

I'll quote myself:

"They know about it because your language images are in there. http://.../catalog/includes/languages/images/icon.gif"

And restate it: everyone can simply see the info you want to hide with robots.txt.

http://.../catalog/includes/languages/images/icon.gif
user99999999 Posted March 15, 2004
And just to stay on topic here: I think catalog/includes/spiders.txt should be renamed to spiders.php so that it falls under the .htaccess rules.
EricK Posted March 20, 2004
Is it possible to prevent browser viewing of your robots.txt file, while still allowing bots to use the file?

I moved /admin/ to /supersecretadmin/, then included

Disallow: /supersecretadmin/

in robots.txt ... but because anyone can view robots.txt, this defeats my original purpose. :(
This topic is now archived and is closed to further replies.