osCommerce

The e-commerce.

Spider Help


masat

Posted

How do you determine what to put in the spider list to kill the SID when the site is indexed?

 

Here is what my visited list shows:

 

Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp). March 15, 2004, 4:23 am http://coppercolander.com/advanced_search....6a8bb57ec4a5d69

 

I am guessing it is Yahoo! Slurp, but could someone verify this for me?

 

Also, I have noticed that many of the search engine robots seem to get hung up at the advanced_search.php link. For example, in the above report that spider put about 200 entries into my visit report and never went any further.

 

I have been working on a contrib that produces the above report, and it seems to have some drawbacks. I've been trying to put the install together so it is somewhat smooth. I'll try to get it posted soon because I would really like some help with it.

 

Tim

*edit* - URL

How do you know when you know what you want to do for the rest of your life?

Posted

Thanks, Peter, for your reply, but I don't see where that thread answered the question, unless you're suggesting I place a disallow in the robots.txt file, which I currently do not use anyway. I will implement the robots file later.

 

I currently have these bots listed in the /includes/spiders.txt file:

 

sleek spider

slurp/si

[email protected]

slurp

 

It appears to me that since Yahoo acquired Inktomi a while back, they are now using the Slurp bot to index sites, and the 'slurp' entry I have in the /includes/spiders.txt file is not keeping this bot from generating SIDs. But I don't know what to place in the /includes/spiders.txt file to prevent SID generation for this particular flavor of Slurp.

 

I am kind of curious about how yahoo is using these indexing visits also.

 

Tim

Posted

Hi,

 

Do you have the string 'yahoo' defined in spiders.txt ?

 

When there was discussion about msnbot (it started crawling and no one knew; too late, it got the SIDs), I added:

 

msnbot

 

and then noticed from my web server logs that the spider was called 'msnbot/0.11', so (panic, panic) I added:

 

msnbot/0.11

 

to spiders.txt, only to find out later, after looking at the osC code, that any partial string works like a wildcard, so for msnbot only:

 

msnbot

 

is needed. Since the first distinctive string in your log entry was 'yahoo' (case in the user agent is not important, but don't use uppercase in your spiders.txt, I _think_), I'd place the string 'yahoo' in your spiders.txt; it's certainly not going to do any harm. The trouble is that you will have to wait until the next crawl to get rid of the SIDs.
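
To illustrate the partial-match point, here is a minimal sketch of the same kind of substring test. The helper name ua_matches() is made up for this example and is not part of osCommerce.

```php
<?php
// Hypothetical helper (not osC code): true when $entry occurs anywhere
// in the lowercased user agent, mirroring the strpos() test in osC.
function ua_matches($user_agent, $entry) {
  return strpos(strtolower($user_agent), trim($entry)) !== false;
}

$ua = 'msnbot/0.11 (+http://search.msn.com/msnbot.htm)';

var_dump(ua_matches($ua, 'msnbot'));      // the short entry matches
var_dump(ua_matches($ua, 'msnbot/0.11')); // the longer entry also matches, but is redundant
var_dump(ua_matches($ua, 'yahoo'));       // no match for an unrelated entry
```

So one short, lowercase entry per spider family should be enough, assuming the spider keeps that string somewhere in its user agent.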

 

Peter

Posted

Hi,

 

Silly question I know, but you do have the following set:

 

Admin | Config | Sessions | Prevent Spider Sessions | True

 

Peter

Posted

Hi,

 

When I was concerned about spiders and how osC would handle them, I made up this small PHP script, and forced the user_agent to whatever I need to test for.

 

<?php
// include the domain checking functions  (osCommerce code)
 require('includes/application_top.php');
 
echo $session_started;
echo '<br>';
echo $user_agent;
echo '<br>';
echo $spider_flag;
echo '<br>';
echo isset($i, $spiders[$i]) ? $spiders[$i] : '';  // empty unless a spider entry has matched
echo '<br>';
echo $SID;
echo '<br>';

// start the session  --  osCommerce code
 $session_started = false;
 if (SESSION_FORCE_COOKIE_USE == 'True') {
   tep_setcookie('cookie_test', 'please_accept_for_session', time()+60*60*24*30, $cookie_path, $cookie_domain);

   if (isset($HTTP_COOKIE_VARS['cookie_test'])) {
     tep_session_start();
     $session_started = true;
   }
 } elseif (SESSION_BLOCK_SPIDERS == 'True') {
   //$user_agent = strtolower(getenv('HTTP_USER_AGENT'));
   //$user_agent = strtolower("Googlebot/2.1 (+http://www.googlebot.com/bot.html)");
   $user_agent = strtolower("msnbot/0.11 (+http://search.msn.com/msnbot.htm)");
   $spider_flag = false;

   if (tep_not_null($user_agent)) {
     $spiders = file(DIR_WS_INCLUDES . 'spiders.txt');

     for ($i=0, $n=sizeof($spiders); $i<$n; $i++) {
       if (tep_not_null($spiders[$i])) {
         if (is_integer(strpos($user_agent, trim($spiders[$i])))) {
           $spider_flag = true;
           break;
         }
       }
     }
   }

   if ($spider_flag == false) {
     tep_session_start();
     $session_started = true;
   }
 } else {
   tep_session_start();
   $session_started = true;
 }

// set SID once, even if empty
 $SID = (defined('SID') ? SID : '');

echo $session_started;
echo '<br>';
echo $user_agent;
echo '<br>';
echo $spider_flag;
echo '<br>';
echo isset($i, $spiders[$i]) ? $spiders[$i] : '';  // the matched entry, or empty when the loop found none
echo '<br>';
echo $SID;
echo '<br>';

?>

 

For your testing, force the user_agent to be:

 

$user_agent = strtolower("Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)");

 

place the script in your 'catalog' path.

 

Peter

Posted

Hey Peter,

 

You are right it couldn't hurt. I placed yahoo in the spiders.txt and uploaded it to the site.

 

I placed your script as bot_test.php on my test site, ran it and got the following output:

 

1

 

 

 

osCsid=40d0e25000d013a326857e08fc2fedb0

1

 

 

 

osCsid=40d0e25000d013a326857e08fc2fedb0

 

What does this tell us?

 

Tim

Posted

Sorry I missed one reply there...

 

Sessions was set false so I set it true and the bot_test.php script returned:

 

1

mozilla/4.0 (compatible; msie 6.0; windows nt 5.0; crazy browser 1.0.5; alexa toolbar; .net clr 1.1.4322)

 

 

osCsid=bf423d569bee3b536898bdd463436e2b

 

mozilla/5.0 (compatible; yahoo! slurp; http://help.yahoo.com/help/us/ysearch/slurp)

1

slurp

osCsid=bf423d569bee3b536898bdd463436e2b

 

Same question... what does this tell us?

 

It looks to me like the bot_test is generating session id's

 

What say you my friend?

 

Tim

Posted

Hi,

 

Sessions was set false so I set it true and the bot_test.php script returned:

 

1

mozilla/4.0 (compatible; msie 6.0; windows nt 5.0; crazy browser 1.0.5; alexa toolbar; .net clr 1.1.4322)

 

 

osCsid=bf423d569bee3b536898bdd463436e2b

 

 

mozilla/5.0 (compatible; yahoo! slurp; http://help.yahoo.com/help/us/ysearch/slurp)

1

slurp

osCsid=bf423d569bee3b536898bdd463436e2b

 

It looks to me like the bot_test is generating session id's

 

Okay, there are 5 echo's before and 5 after the 'hard coded' user agent.

 

1. The first display is:

  • Session started
  • Your browser as the user agent
  • No spider found
  • No spider to display
  • SID is displayed

2. The second display is:

  • Session has NOT started
  • The hard coded user agent
  • Spider has been found
  • The spider name is slurp
  • SID is displayed

The PHP strpos() function will find the position of first occurrence of a string, and the string is obtained from the file 'spiders.txt'

 

It found 'slurp', simply because 's' comes before 'y' (yahoo).
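
A minimal sketch of that "first entry in the file wins" behaviour, assuming the file is read top to bottom and the loop stops at the first hit. The function name first_matching_spider() and the four-line list are made up for illustration.

```php
<?php
// Hypothetical helper: return the first spiders.txt entry found in the
// user agent, or null. It returns on the first hit, like osC's break.
function first_matching_spider($user_agent, array $spiders) {
  $user_agent = strtolower($user_agent);
  foreach ($spiders as $line) {
    $entry = trim($line);
    if ($entry !== '' && strpos($user_agent, $entry) !== false) {
      return $entry;
    }
  }
  return null;
}

$ua = 'Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)';

// Lines as file() would return them, trailing newlines included.
$spiders = array("sleek spider\n", "slurp/si\n", "slurp\n", "yahoo\n");

echo first_matching_spider($ua, $spiders), "\n"; // prints: slurp
```

Both 'slurp' and 'yahoo' occur in the user agent; 'slurp' is reported only because its line comes earlier in the list.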

 

So, you have slurp defined, and it has been recognised. The only thing you wouldn't want is the SID, of course, but that is only there because the preceding code has picked it up from your browser user agent, or from an already existing session that you have going (check your cookies).

 

I could find a 'tep_session_stop()', but did find a ' tep_session_close()' function, and a 'tep_session_destroy()' function, so try adding one of those just after the first lot of 'echos', and see if that gets rid of the SID for the hardcoded user agent.

 

Peter

Posted

Hi,

 

Oops, 5 minutes and the EDIT button disappears again.

 

I could find a 'tep_session_stop()', but did find a ' tep_session_close()' function, and a 'tep_session_destroy()' function, so ........

 

should be .........

 

I could not find a 'tep_session_stop()', but did find a ' tep_session_close()' function, and a 'tep_session_destroy()' function, so ............

 

Peter

Posted

Well, this is a bit beyond my comfort zone, but here is what I did...

 

I found in application_top.php this snippet:

 

// verify the browser user agent if the feature is enabled
 if (SESSION_CHECK_USER_AGENT == 'True') {
   $http_user_agent = getenv('HTTP_USER_AGENT');
   if (!tep_session_is_registered('SESSION_USER_AGENT')) {
     $SESSION_USER_AGENT = $http_user_agent;
     tep_session_register('SESSION_USER_AGENT');
   }

   if ($SESSION_USER_AGENT != $http_user_agent) {
     tep_session_destroy();
     tep_redirect(tep_href_link(FILENAME_LOGIN));
   }
 }

 

I removed the redirect line and pasted the snippet just before the ?> in the bot_test.php script. I saw no change in the output other than the SID changing.

 

I removed all three references to "slurp" in the spiders.txt file, and it made no difference.

 

I'm not sure why, but there are no cookies listed in my cookies folder. It may be because I'm running on localhost. Are cookies used when sessions are enabled? My understanding was that cookies are only used when sessions cannot be used for some reason.

 

Should the snippet I placed effectively kill the session?

 

Tim

Posted

TIA

 

This is the spider list I have (does anyone know where I can get an up-to-date one?):

 

acoon

ah-ha.com

ahoy

almaden.ibm.com

altavista-intranet

ananzi

anthill

appie 1.1

arachnophilia

arale

araneo

aranha

architext

aretha

arks

ask jeeves

asterias2.0

atn worldwide

atomz

augurfind

backrub

baiduspider

bannana_bot

bdcindexer

big brother

bjaaland

blackwidow

bloodhound

bradley

calif

cassandra

churl

cienciaficcion.net

cmc/0.01

collective

combine system

computingsite robi/1.0

crawler

crawler@fast

cusco

cyberspyder link test

deepindex

deweb? katalog/index

die blinde kuh

dig

digger

direct hit grabber

dittospyder

docomo

download express

dwcp (dridus' web cataloging project)

ebiness

e-collector

emacs-w3 search engine

esther

evliya celebi

ezresult

fast-webcrawler

felix ide

ferret

fetchrover

fido

fish search

fluffy

fluffy the spider

fouineur

freecrawl

frooglebot

funnelweb

gazz

gcreep

geobot

getterroboplus puu

geturl

golem

googlebot

grapnel/0.01 experiment

griffon

gromit

gulliver

 

hamahakki

harvest

havindex

henrythemiragorobot

hku www octopus

html index

html_analyzer

htmlgobble

hubater

hyper-decontextualizer

ia_archiver

ibm_planetwide

iltrovatore-setaccio

image.kapsi.net

imagelock

incywincy

infobee

informant

infoseek

infoseek sidewinder

ingrid

inktomisearch.com

inspector web

intelliagent

internet shinchakubin

ip3000

iron33

israeli-search

jack

javabee

jumpstation

katipo

kdd-explorer

kilroy

kit_fireball

kit-fireball

kototoi

labelgrabber

lachesis

larbin

legs

libwww-perl

link validator

linkscan

linkwalker

lockon

lycos

lycos_spider

mac wwwworm

magpie

mantraagent

marvin/infoseek

mattie

mediafox

mercator

merzscope

mnogosearch

moget

moget/1.0

moget/2.0

monster

moose

motor

muncher

muscatferret

mwd.search

nationaldirectory

nationaldirectory-webspider

naverrobot

nazilla

ncsa beta

nec-meshexplorer

nederland.zoek

netcarta webmap engine

netmechanic

netresearchserver

netscoop

newscan-online

ng/1.0

nhse web forager

nomad

northern light gulliver

nzexplorer

occam

open text

openfind data gatherer

orb search

osis-project

pack rat

pageboy

parasite

partnersite

patric

pegasus

pgp key agent

phantom

phpdig

picosearch

piltdownman

pimptrain.com

pinpoint

pioneer

piranha

plumtreewebaccessor

polybot

pompos

poppi

popular iconoclast

raven search

roach.smo.av.com-1.0

road runner

roadhouse

robbie

robofox

robozilla

rules

scooter

scrubby

search.aus-au.com

searchprocess

semanticdiscovery/0.1

senrigan

seventwentyfour

sg-scout

shagseeker

shai'hulud

shark

sidewinder

sift

simmany

site searcher

site valet

sitetech-rover

skymob.com

sleek

sleek spider

slurp

slurp/si

Slurp

[email protected]

snooper

speedfind

steeler/1.3

suke

suntek

supersnooper

surfnomore

sven

szukacz

tach black widow

tarantula

tcl w3

templeton

teoma

teoma_agent1

teomaagent

teomatechnologies

the peregrinator

the world wide web wanderer

the world wide web worm

t-h-u-n-d-e-r-s-t-o-n-e

titan

titin

tkwww

toutatis

t-rex

turnitinbot

ucsd

udmsearch

ultraseek

url check

vagabondo

valkyrie

verticrawl

victoria

vision-search

voilabot

voyager

w3c_validator

w3m2

w3mir

walhello appie

wallpaper

web core / roots

web hopper

web wombat

webbandit

webcatcher

webcopy

webfoot

weblayers

weblinker

weblog monitor

webmirror

webquest

webreaper

webs

websnarf

webstolperer

webvac

webwalk

webwalker

webwatch

webwombat

webzinger

wget

whatuseek winona

whizbang

whowhere

wild ferret

wire

wired digital

wwwc

xget

xift

yandex

Yahoo

Yahoo! Slurp

YahooSeeker/1.1

zao/0

zyborg

zyborg/1.0

Your online success is Paramount.

Posted

I am using a spider simulator to check my site.

 

If I set

Prevent Spider Sessions = True

and

Force Cookie Use = False .

 

I get:

 

If I set

Prevent Spider Sessions = True

and

Force Cookie Use = True .

 

I get:

 

What advantages/disadvantages are there in these options?

I'm afraid I don't fully understand the role all these options play in SEO.

Posted

Hi 'yesudo',

 

As you will see from this post:

 

http://www.oscommerce.com/forums/index.php?sho...60entry342554

 

it appears you should have the user agents in lowercase. You have the yahoo entries in mixed case, and as the user agent is converted to all lowercase characters before the search for the spider name (in the osC code), the spider name 'Yahoo' wouldn't be found. However, as the user agent usually has the URL, which would contain the string 'yahoo' somewhere, it would find a match. :D
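
One way to tidy a mixed-case list before uploading it is sketched below. This is only an illustration: normalize_spiders() is a made-up helper, not something osC ships, and the sample lines are taken from the list posted above.

```php
<?php
// Hypothetical helper (not part of osC): tidy a list of spiders.txt lines
// by trimming, lowercasing, and dropping blanks and duplicates.
function normalize_spiders(array $lines) {
  $out = array();
  foreach ($lines as $line) {
    $entry = strtolower(trim($line));
    if ($entry !== '' && !in_array($entry, $out, true)) {
      $out[] = $entry;
    }
  }
  return $out;
}

// Sample lines as file() would return them, trailing newlines included.
$lines = array("slurp\n", "Slurp\n", "\n", "Yahoo\n", "Yahoo! Slurp\n", "YahooSeeker/1.1\n");

print_r(normalize_spiders($lines));
```

Here 'Slurp' collapses into the existing 'slurp' entry and the three Yahoo lines come out in lowercase. For a real file you would pass in the result of file() on your own spiders.txt instead of the sample array.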

 

Also, you don't have 'msnbot' ?

 

Peter

 

PHP refs

 

http://au2.php.net/manual/en/function.strtolower.php

http://au2.php.net/manual/en/function.strpos.php

Posted

Hi Peter,

 

Thanx for the spot on msnbot - I think I must have deleted it accidentally at some stage - BIG mistake.

 

Apologies I got a bit confused with:

 

 However, as the user agent usually has the url , which would contain the string 'yahoo' somewhere, it would find a match.

 

Would you mind explaining differently - Many thanx.

Posted

Hi,

 

I think a lot of people got caught out with 'msnbot'; I sure did, they crawled the site before I had it in 'spiders.txt', so now the SID's _may_ turn up in their search engines (if that's what the M$ msnbot is all about).

 

Apologies I got a bit confused with:

 

 However, as the user agent usually has the url , which would contain the string 'yahoo' somewhere, it would find a match.

 

Would you mind explaining differently

 

Okay, I'm no PHP guru, so just my understanding of how this code works:

 

/catalog/includes/application_top.php - lines 167 to 203

 

// start the session
 $session_started = false;
 if (SESSION_FORCE_COOKIE_USE == 'True') {
   tep_setcookie('cookie_test', 'please_accept_for_session', time()+60*60*24*30, $cookie_path, $cookie_domain);

   if (isset($HTTP_COOKIE_VARS['cookie_test'])) {
     tep_session_start();
     $session_started = true;
   }
 } elseif (SESSION_BLOCK_SPIDERS == 'True') {
   $user_agent = strtolower(getenv('HTTP_USER_AGENT'));
   $spider_flag = false;

   if (tep_not_null($user_agent)) {
     $spiders = file(DIR_WS_INCLUDES . 'spiders.txt');

     for ($i=0, $n=sizeof($spiders); $i<$n; $i++) {
       if (tep_not_null($spiders[$i])) {
         if (is_integer(strpos($user_agent, trim($spiders[$i])))) {
           $spider_flag = true;
           break;
         }
       }
     }
   }

   if ($spider_flag == false) {
     tep_session_start();
     $session_started = true;
   }
 } else {
   tep_session_start();
   $session_started = true;
 }

// set SID once, even if empty
 $SID = (defined('SID') ? SID : '');
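
Read as a decision table, the branch order in the block above can be condensed into the sketch below. The function should_start_session() and its four boolean parameters are invented for illustration; they stand in for the two admin settings, for whether the cookie_test cookie came back, and for whether a spiders.txt entry matched.

```php
<?php
// Hypothetical condensation of the branch order in application_top.php.
// $force_cookie / $block_spiders stand in for the two admin settings,
// $has_cookie for "the cookie_test cookie came back", and $is_spider
// for "a spiders.txt entry matched the user agent".
function should_start_session($force_cookie, $block_spiders, $has_cookie, $is_spider) {
  if ($force_cookie) {
    return $has_cookie;   // the spider list is never consulted on this path
  } elseif ($block_spiders) {
    return !$is_spider;
  }
  return true;            // neither setting on: always start a session
}

var_dump(should_start_session(true,  true,  false, true));  // bool(false)
var_dump(should_start_session(false, true,  false, true));  // bool(false)
var_dump(should_start_session(false, true,  false, false)); // bool(true)
var_dump(should_start_session(false, false, false, true));  // bool(true)
```

Because of the elseif, the spider check only runs when Force Cookie Use is off. With it on, a client that does not send the cookie back gets no session either way.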

 

If someone's spiders.txt has the following:

 

....
Yahoo
Yahoo! Slurp
YahooSeeker/1.1
.....

 

then (in my understanding of) the osC PHP code, the user agent name is converted to lowercase, and we have:

 

mozilla/5.0 (compatible; yahoo! slurp; http://help.yahoo.com/help/us/ysearch/slurp)

 

as the agent name. Now, the PHP function strpos() won't find a match on "Yahoo! Slurp", because:

 

'Yahoo! Slurp' <> 'yahoo! slurp'

 

.. and it won't find a match on these agents either, ......

 

Yahoo

YahooSeeker/1.1

 

but if you added 'yahoo' (lowercase) in spiders.txt, then it should find a match on all three user agents, in fact, my understanding of these 2 PHP functions is that you should only need to place 'yahoo' in your spiders.txt

 

This is because, to me, strpos() would find a match (return a position rather than false) if the string it was trying to find was 'yahoo', and the user agent was any of the following:

 

Yahoo

Yahoo! Slurp

YahooSeeker/1.1

 

(because, remember, osC converts the user agent to all lowercase before the strpos() call.)
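
A small sketch of what strpos() actually hands back, which is presumably why the osC code wraps it in is_integer(): the result is an integer position on a match and false otherwise, and a match at position 0 is easy to misread.

```php
<?php
$ua = strtolower('YahooSeeker/1.1');

$pos = strpos($ua, 'yahoo');
var_dump($pos);              // int(0): found at the very start
var_dump($pos == false);     // bool(true): a loose check would wrongly say "not found"
var_dump(is_integer($pos));  // bool(true): the style of check osC uses
var_dump($pos !== false);    // bool(true): the equivalent strict comparison

var_dump(strpos($ua, 'msnbot')); // bool(false): no match
```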

 

You could try this code:

 

//$user_agent = strtolower(getenv('HTTP_USER_AGENT'));
$user_agent = strtolower('Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)');
$spider_flag = false;

if (tep_not_null($user_agent)) {
 $spiders = file(DIR_WS_INCLUDES . 'spiders.txt');

 for ($i=0, $n=sizeof($spiders); $i<$n; $i++) {
   if (tep_not_null($spiders[$i])) {
     if (is_integer(strpos($user_agent, trim($spiders[$i])))) {
       $spider_flag = true;
       break;
     }
   }
 }
}
echo $spider_flag;
echo $spiders[$i];

 

with using your _current_ 'spiders.txt', and then add the string 'yahoo' to the file 'spiders.txt', and run the small script again. I'm 99% certain that the first time around, the var $spider_flag will be false, the second time around , it will return true.

 

This is simply because the PHP strpos() function is case sensitive (I couldn't see any references to it being otherwise), so if the user agent gets converted to all lowercase, the only way you will get a match is for the spider names to be lowercase too.
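
The case issue can be seen in isolation with the short sketch below, which compares a mixed-case entry against the lowercased user agent. The variable names are only for this example.

```php
<?php
$ua = strtolower('Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)');

// Mixed-case entry as it might appear in spiders.txt.
$entry = 'Yahoo! Slurp';

var_dump(strpos($ua, $entry));             // bool(false): the capital letters never match
var_dump(strpos($ua, strtolower($entry))); // an integer position: lowercase matches
```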

 

My experience is application programming, definitely NOT web programming, I'm still an infant at this stuff, but the above is simply how I see the code working. Someone please correct me if I have misunderstood the use of these 2 PHP functions.

 

Hope that helps, :)

 

Peter

Archived

This topic is now archived and is closed to further replies.
