
osCommerce

The e-commerce.

Some info needed please


Guest


Posted

Hello guys,

 

I have noticed that google.com.au has updated my listing with them however it only seemed to update the index.php (.html with SEF addon)

 

I looked in my logs and found googlebot v2 had visited my site, however he ended up at spiders.txt according to my log and then left.

 

Can anyone help me find out why the googlebot did not index all, or even part, of my site?

I have "Prevent Spider Sessions" turned on in Admin.

 

Here is a copy of that file spiders.txt exactly as it is

 

Any help would be much appreciated.

$Id: spiders.txt,v 1.2 2003/05/05 17:58:17 dgw_ Exp $
almaden.ibm.com
appie 1.1
architext
ask jeeves
asterias2.0
augurfind
baiduspider
bannana_bot
bdcindexer
crawler
crawler@fast
docomo
fast-webcrawler
fluffy the spider
frooglebot
geobot
googlebot
gulliver
henrythemiragorobot
ia_archiver
infoseek
kit_fireball
lachesis
lycos_spider
mantraagent
mercator
moget/1.0
muscatferret
nationaldirectory-webspider
naverrobot
ncsa beta
netresearchserver
ng/1.0
osis-project
polybot
pompos
scooter
seventwentyfour
sidewinder
sleek spider
slurp/si
slurp@inktomi
steeler/1.3
szukacz
t-h-u-n-d-e-r-s-t-o-n-e
teoma
turnitinbot
ultraseek
vagabondo
voilabot
w3c_validator
zao/0
zyborg/1.0

Posted

How long has it been?

 

It can easily take about 3 months to get all of your pages indexed.

-------------------------------------------------------------------------------------------------------------------------

NOTE: As of Oct 2006, I'm not as active in this forum as I used to be, but I still work with osC quite a bit.

If you have a question about any of my posts here, your best bet is to contact me though either Email or PM in my profile, and I'll be happy to help.

Posted

Ok so this visit does not mean googlebot will do the whole site.

I was just concerned that maybe the spiders.txt or something was telling google not to index my entire site.

Posted

Hi,

 

You should create a 'robots.txt' file that excludes spiders from your includes directory. There is no useful information in there, and you wasted Googlebot's time looking at some buttons with different languages on them.

 

Just search for 'robots.txt' in google.

 

Plus there are many more things you need to do.

Posted

Can you please help me with the robots.txt?

Where do I source the content, and where do I put it (/includes/)?

 

Any support would be much appreciated

 

Is the reason it did not index my site that I have no robots.txt?

Posted

Hello.

I now have a robots.txt file.

And once again the googlebot came, see below.

 

Visits from Googlebot    Googlebot 
18/Feb/2004:18:27:48 64.68.82.169 /robots.txt "Googlebot/2.1 
18/Feb/2004:18:27:49 64.68.82.169 / "Googlebot/2.1

 

You see how it stopped after robots.txt and the home page.

What have I done wrong? I really need Google to index me and it seems to keep stopping.

 

This is my robots.txt

 

User-agent: *
Disallow: /chat/
Disallow: /live_support/
Disallow: /admin/

 

Please, any help from you SEOs would be much appreciated.
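
For what it's worth, the rules posted above can be checked locally. The following is a minimal sketch using Python's standard `urllib.robotparser`; the host `example.com` and the helper names `build_parser` and `is_allowed` are placeholders made up for illustration, and the sketch only shows what the rules permit, not what Googlebot will actually choose to crawl.

```python
from urllib import robotparser

# The robots.txt rules posted above, copied verbatim.
ROBOTS_TXT = """\
User-agent: *
Disallow: /chat/
Disallow: /live_support/
Disallow: /admin/
"""


def build_parser(text):
    """Parse robots.txt text into a RobotFileParser (no network access)."""
    parser = robotparser.RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def is_allowed(parser, user_agent, path):
    """Return True if the rules let `user_agent` fetch `path`."""
    # example.com is a placeholder host; only the path matters here.
    return parser.can_fetch(user_agent, "http://example.com" + path)


parser = build_parser(ROBOTS_TXT)
for path in ("/", "/index.php", "/admin/", "/chat/room.php"):
    print(path, is_allowed(parser, "Googlebot/2.1", path))
```

Under these rules the home page and `/index.php` are allowed and only the three listed directories are blocked, so the file itself is not what stops the crawl.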

Posted

First of all, you don't have to have a robots.txt file to be indexed by Google, or any other search engine.

 

Second, while there are thousands of things you can do to increase your rankings once you are fully indexed, you haven't done anything wrong as far as being fully parsed yet.

 

No one ever said that google was supposed to come to your site and parse through all of your pages at once. Typically, this is not the way the googlebots behave.

 

For more information about this subject, you should research search engine spiders.


Posted

Hi,

 

Can someone please explain why the file spiders.txt is in /catalog/includes/ path, whilst the file robots.txt is in the /catalog/ path ??

 

Wouldn't spiders look in the /catalog/ path for spiders.txt ??

 

Peter

Posted

Because they are two different files that serve two different functions.

 

robots.txt will keep spiders from indexing directories or files that you specify.

 

spiders.txt is a part of the spider session killer that will keep defined spider user agents from having a session id in the url so that the session ids don't end up in the search engine index.
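
A minimal sketch of that matching in Python, assuming the check is a case-insensitive substring comparison of the User-Agent header against each line of spiders.txt (this is an assumption about how osC does it; check the code in your own install). The function names `load_spiders` and `is_spider` are made up for illustration, and skipping the `$Id` header line is a choice of this sketch.

```python
def load_spiders(text):
    """Return the non-empty entries of a spiders.txt-style file, lowercased."""
    entries = []
    for line in text.splitlines():
        entry = line.strip().lower()
        # Skip blank lines and the CVS "$Id: ... $" header line
        # (skipping the header is a choice made in this sketch).
        if entry and not entry.startswith("$id"):
            entries.append(entry)
    return entries


def is_spider(user_agent, spiders):
    """True if any spiders.txt entry occurs anywhere in the user agent string."""
    ua = user_agent.lower()
    return any(entry in ua for entry in spiders)


# A few entries taken from the spiders.txt shown in this thread.
SPIDERS_TXT = "$Id: spiders.txt,v 1.2 2003/05/05 17:58:17 dgw_ Exp $\ngooglebot\nslurp/si\nteoma\n"
spiders = load_spiders(SPIDERS_TXT)

print(is_spider("Googlebot/2.1 (+http://www.googlebot.com/bot.html)", spiders))
print(is_spider("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)", spiders))
```

If the match really is a substring test like this, a plain `googlebot` entry would already cover `Googlebot/2.1`, and version numbers would not need to be listed.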


Posted

Hi Chris,

 

Thanks for explaining that. Just so that I have digested it properly. :)

 

1. robots.txt is for spiders/bots and is not used by osC, as such (i.e. a browser user who is visiting your osC store/site will not use robots.txt )

 

2. spiders.txt is only for osC internal controls and is not used by spiders/bots directly. It is used by osC (if you turn spider session IDs off in the admin section) to indicate which spider user agents should have the session ID turned off.

 

Hope I have it right ??

 

Peter

Posted

Correct.

 

:D


Posted

Hi again.

 

Since I first posted googlebot has visited me every day.

However it NEVER passes robots.txt

 

Is this normal?

Surely it is supposed to start looking in my site by now.

I have some good titles. For example, if someone does a Google search for "koala computers", I come up number one.

 

But if they search for "Cheap computers", "Unlimited ADSL", "$49 adsl", etc., nothing comes up for me.

I have those phrases in the title, meta tags, and content of many pages.

 

Does any SEO know if I have missed something? Will Google fully index me?

 

If you are a pro SEO then I am willing to pay you to get me ranked. Right now I am using pay per click on Google, which works very well and brings many sales, however I don't want to keep paying for this.

 

PLEASE help me get indexed.

Email me on [email protected] with SEO quote if you think you can get me ranked high.

Posted

If your site comes up with "koala computers", then it already *has* been past your robots.txt.

 

The 'cheap computers' keyphrase you mention is *extremely* competitive, at over 5 million sites returned. You can optimize until you are blue in the face, and you'll never even sniff the top 50 pages.

 

The Unlimited ADSL should be doable with the proper SEO. You'll notice that you also come up first with the search "Unlimited ADSL koala"

 

So, the problem isn't necessarily that the spiders haven't indexed you; it's that you haven't optimized the site for the key phrases you want, or the key phrases you want are too competitive to have a realistic shot at getting anywhere near the top.

 

Needless to say, in either event, you're in the wrong forum.


  • 3 weeks later...
Posted

Hi,

 

I'm somewhat mystified by the web server logs, as the spider/bot called 'msnbot' is showing session IDs in its requests ???

 

Here is the setup:

 

1. Admin | Sessions | Prevent Spider Sessions | True

 

2. Contents of /catalog/robots.txt

 

User-agent: *
Disallow: /images/
Disallow: /includes/

 

3. Contents of /catalog/includes/spiders.txt

 

$Id: spiders.txt,v 1.2 2003/05/05 17:58:17 dgw_ Exp $
almaden.ibm.com
appie 1.1
architext
ask jeeves
asterias2.0
augurfind
baiduspider
bannana_bot
bdcindexer
crawler
crawler@fast
docomo
fast-webcrawler
fluffy the spider
frooglebot
geobot
gigabot
googlebot
gulliver
henrythemiragorobot
ia_archiver
infoseek
kit_fireball
lachesis
lycos_spider
mantraagent
mercator
moget/1.0
msnbot
muscatferret
nationaldirectory-webspider
naverrobot
ncsa beta
netresearchserver
ng/1.0
osis-project
polybot
pompos
scooter
seventwentyfour
sidewinder
sleek spider
slurp/si
slurp@inktomi
steeler/1.3
szukacz
t-h-u-n-d-e-r-s-t-o-n-e
teoma
turnitinbot
ultraseek
vagabondo
voilabot
w3c_validator
zao/0
zyborg/1.0

 

4. Part of the web server logs .........

 

65.54.188.30 - - [04/Mar/2004:20:06:04 -0500] "GET /robots.txt HTTP/1.0" 200 54 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"
65.54.188.30 - - [04/Mar/2004:20:06:04 -0500] "GET / HTTP/1.0" 200 27141 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"
65.54.188.30 - - [04/Mar/2004:23:32:13 -0500] "GET /index.php?cPath=1&osCsid=5ddbfe3b41ade4da1ce255170335ac6e HTTP/1.0" 200 24579 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"
65.54.188.30 - - [04/Mar/2004:23:32:18 -0500] "GET /shopping_cart.php?osCsid=5ddbfe3b41ade4da1ce255170335ac6e HTTP/1.0" 200 20324 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"
65.54.188.30 - - [04/Mar/2004:23:32:45 -0500] "GET /index.php?cPath=6&osCsid=5ddbfe3b41ade4da1ce255170335ac6e HTTP/1.0" 200 24183 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"
65.54.188.30 - - [04/Mar/2004:23:33:10 -0500] "GET /index.php?cPath=25&osCsid=5ddbfe3b41ade4da1ce255170335ac6e HTTP/1.0" 200 19121 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"
65.54.188.30 - - [04/Mar/2004:23:34:23 -0500] "GET /specials.php?osCsid=5ddbfe3b41ade4da1ce255170335ac6e HTTP/1.0" 200 19613 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"
65.54.188.30 - - [04/Mar/2004:23:34:30 -0500] "GET /reviews.php?osCsid=5ddbfe3b41ade4da1ce255170335ac6e HTTP/1.0" 200 19534 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"
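
To pick entries like these out of a larger log, something along the following lines may help. It is a minimal sketch that assumes the log is in Apache's combined format, as the lines above appear to be; the regular expression and the function name `requests_with_session_id` are made up for illustration.

```python
import re

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def requests_with_session_id(lines, agent_substring, param="osCsid"):
    """Yield (time, url) for requests by a matching agent whose URL carries `param`."""
    needle = agent_substring.lower()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # not a combined-format line; ignore it
        agent = match.group("agent").lower()
        url = match.group("url")
        if needle in agent and (param + "=") in url:
            yield match.group("time"), url


# Two of the log lines quoted above.
SAMPLE = [
    '65.54.188.30 - - [04/Mar/2004:20:06:04 -0500] "GET / HTTP/1.0" 200 27141 '
    '"-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"',
    '65.54.188.30 - - [04/Mar/2004:23:32:13 -0500] '
    '"GET /index.php?cPath=1&osCsid=5ddbfe3b41ade4da1ce255170335ac6e HTTP/1.0" '
    '200 24579 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"',
]

hits = list(requests_with_session_id(SAMPLE, "msnbot"))
for time, url in hits:
    print(time, url)
```

Only the second sample line is reported, since the request for `/` carries no session ID.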

 

I noticed the file format for robots.txt and spiders.txt is Unix (LF only), is that correct ??

 

Surely I wouldn't have to add:

 

msnbot/0.11

 

to the spiders.txt file, that is, have to continually monitor the web logs and see if there are new version numbers out ??

 

Would the exclusion of the '/includes/' path in robots.txt have anything to do with it? Hmm, I don't see how: it is only for osC usage, the file spiders.txt isn't used by bots, and in fact a traverse of /includes/ is not possible.

 

Other things that may help solve the mystery ...........

 

/catalog/.htaccess

 

# $Id: .htaccess,v 1.3 2003/06/12 10:53:20 hpdl Exp $
#
# This is used with Apache WebServers
#
# For this to work, you must include the parameter 'Options' to
# the AllowOverride configuration
#
# Example:
#
# <Directory "/usr/local/apache/htdocs">
#   AllowOverride Options
# </Directory>
#
# 'All' with also work. (This configuration is in the
# apache/conf/httpd.conf file)

# The following makes adjustments to the SSL protocol for Internet
# Explorer browsers

<IfModule mod_setenvif.c>
 <IfDefine SSL>
   SetEnvIf User-Agent ".*MSIE.*" \
            nokeepalive ssl-unclean-shutdown \
            downgrade-1.0 force-response-1.0
 </IfDefine>
</IfModule>

# Fix certain PHP values
# (commented out by default to prevent errors occuring on certain
# servers)

#<IfModule mod_php4.c>
#  php_value session.use_trans_sid 0
#  php_value register_globals 1
#</IfModule>

 

/catalog/includes/.htaccess

 

# $Id: .htaccess,v 1.4 2001/04/22 20:30:03 dwatkins Exp $
#
# This is used with Apache WebServers
# The following blocks direct HTTP requests in this directory recursively
#
# For this to work, you must include the parameter 'Limit' to the AllowOverride configuration
#
# Example:
#
#<Directory "/usr/local/apache/htdocs">
#  AllowOverride Limit
#
# 'All' with also work. (This configuration is in your apache/conf/httpd.conf file)
#
# This does not affect PHP include/require functions
#
# Example: http://server/catalog/includes/application_top.php will not work

<Files *.php>
Order Deny,Allow
Deny from all
</Files>

 

Did a phpinfo(), and ..............

 

session.use_trans_sid is On

register_globals is On

 

Thanks,

 

Peter

Posted

Hi,

 

Maybe I have to add any spiders/bots with the version numbers ? I have noticed for nearly a month now, all 'Googlebot' does is

 

64.68.82.172 - - [01/Mar/2004:03:59:37 -0500] "GET /robots.txt HTTP/1.0" 200 54 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.82.172 - - [01/Mar/2004:03:59:39 -0500] "GET / HTTP/1.0" 200 24469 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.82.58 - - [01/Mar/2004:04:10:23 -0500] "GET /robots.txt HTTP/1.0" 200 54 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.82.58 - - [01/Mar/2004:04:10:24 -0500] "GET / HTTP/1.0" 200 24487 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

 

... and the entry for 'Googlebot' in /includes/spiders.txt is .........

 

googlebot

 

.. so wouldn't this seem to suggest that the entry for 'Googlebot' should be:

 

Googlebot/2.1

 

and then it would spider the whole site, not just a page or two ??

 

Thanks,

 

Peter

Posted

Hi,

 

Should the php.ini setting "session.use_trans_sid" be On or Off ??

 

From a phpinfo(), my setting is:

 

session.use_trans_sid is On

 

yet from the osC MS-2, the .htaccess in /catalog/ path indicates that it should be Off, as follows:

 

#<IfModule mod_php4.c>
#  php_value session.use_trans_sid 0
#  php_value register_globals 1
#</IfModule>

 

From http://www.php.net/manual/en/ref.session.p...n.use-trans-sid

 

session.use_trans_sid  boolean

 

    session.use_trans_sid whether transparent sid support is enabled or not. Defaults to 0 (disabled).

 

        Note: For PHP 4.1.2 or less, it is enabled by compiling with --enable-trans-sid. From PHP 4.2.0, trans-sid feature is always compiled.

 

        URL based session management has additional security risks compared to cookie based session management. Users may send a URL that contains an active session ID to their friends by email or users may save a URL that contains a session ID to their bookmarks and access your site with the same session ID always, for example.
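
To make the point about session IDs in URLs concrete, here is a minimal Python sketch that strips the `osCsid` parameter seen in the logs earlier in this thread, showing that URLs differing only in the session ID refer to the same page. The function name `strip_session_id` is made up for illustration; this is not something osC itself provides.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_session_id(url, param="osCsid"):
    """Return `url` with the session-ID query parameter removed."""
    parts = urlsplit(url)
    # Keep every query parameter except the session ID, preserving order.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(query), parts.fragment)
    )


print(strip_session_id("/index.php?cPath=1&osCsid=5ddbfe3b41ade4da1ce255170335ac6e"))
print(strip_session_id("/shopping_cart.php?osCsid=5ddbfe3b41ade4da1ce255170335ac6e"))
```

A search engine that stores the unstripped form ends up holding a live session ID for whoever follows the link, which is the problem the spider session setting is meant to avoid.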

 

Pulling my hair out on this one !! :D

 

Peter

Posted

Hi Dave,

 

Plus you would want to keep them out of your includes dir.

 

Disallow: /catalog/includes/

 

When I saw what you said, I thought it was a good idea, so I went and added the above 'disallow' to 'robots.txt'.

 

However, now having had some time to research more on spiders/bots, etc., I don't think it is such a good idea.

 

1. If you try http://yourdomainname/includes

 

... you should get a "403" message, .. forbidden, anyway. :D

 

2. By adding the "/includes/" path in "robots.txt", you are supplying any spider/bot with path/folder information, that they would not have otherwise been able to know about.

 

I think I'll take it out. :)

 

Peter

Posted

Usually I put (one) robots.txt in the root dir. So if your store is in catalog then this 'Disallow: /catalog/includes/' would be right.

 

65.54.188.30 - - [04/Mar/2004:23:32:18 -0500] "GET /shopping_cart.php?osCsid=5ddbfe3b41ade4da1ce255170335ac6e HTTP/1.0" 200 20324 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"

 

This to me says your store is in the root directory like so

 

http://mydomain.com/shopping_cart.php

 

But all the files you mentioned said your store was in the catalog dir.

 

So the log should look like this

 

GET /catalog/shopping_cart.php?osCsid...

 

And you would get your store like this

 

http://mydomain.com/catalog/shopping_cart.php

 

Do you have two installed?
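
The effect of the path prefix can be sketched with Python's standard `urllib.robotparser`. This is a minimal illustration only: the host `example.com`, the helper name `allowed`, and the two rule sets are placeholders, and it assumes a single robots.txt served from the site root.

```python
from urllib import robotparser


def allowed(rules, path, agent="msnbot/0.11"):
    """Return True if `rules` (robots.txt text) let `agent` fetch `path`."""
    parser = robotparser.RobotFileParser()
    parser.parse(rules.splitlines())
    # example.com is a placeholder host; only the path is matched.
    return parser.can_fetch(agent, "http://example.com" + path)


# Rule written for a store installed under /catalog/.
CATALOG_RULES = "User-agent: *\nDisallow: /catalog/includes/\n"
# Rule written for a store installed in the site root.
ROOT_RULES = "User-agent: *\nDisallow: /includes/\n"

icon = "/includes/languages/images/icon.gif"
print(allowed(CATALOG_RULES, icon))               # store in root, rule for /catalog/
print(allowed(ROOT_RULES, icon))                  # store in root, rule for root
print(allowed(CATALOG_RULES, "/catalog" + icon))  # store in /catalog/, matching rule
```

Disallow lines are compared against the start of the requested path, so a rule for `/catalog/includes/` does nothing for a store whose files are served from `/includes/`.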

Posted

Hi,

 

The point I'm _trying_ to make is the following in robots.txt

 

Disallow: /catalog/includes/

 

OR ...........

 

Disallow: /includes/

 

.. are NOT needed, and are in fact redundant, because ........

 

1. Any spider/bot or web user should not be able to get to the "/includes/" path anyway. (i.e. permissions)

 

2. Supplying the "/includes/" path will compromise your security, because normally (due to path permissions), no-one can 'see' that path. There is a significant amount of information in /includes/ that you wouldn't want anyone to see/view.

 

3. .......so, why tell them it is there? It only makes a hacker's job easier. :D

 

Peter

Posted

That is all great, but the fact is no hacker will ever bother to look at your robots.txt file.

 

This file is simply a no trespassing sign for spiders/bots.

 

I'll quote myself

 

They know about it because your language images are in there.

 

http://.../catalog/includes/languages/images/icon.gif

 

And restate it...

 

Everyone can simply see the info you want to hide with robots.txt.

 

http://.../catalog/includes/languages/images/icon.gif

Posted

Is it possible to prevent browser viewing of your robots.txt file, while still allowing bots to use the file?

 

I moved /admin/ to /supersecretadmin/, then included

 

Disallow: /supersecretadmin/

 

in robots.txt ...

 

... because anyone can view robots.txt, this defeats my original purpose. :(

Archived

This topic is now archived and is closed to further replies.
