stevel Posted October 2, 2006 (Author)
Richard, the answer to your question is provided in the "readme".
Steve
Contributions: Country-State Selector, Login Page a la Amazon, Protection of Configuration, Updated spiders.txt, Embed Links with SID in Description
DeadDingo Posted October 2, 2006
Thanks. I must be going blind in my old age.
wr19026 Posted October 18, 2006
I'm using the September update of spiders.txt. shopwiki is mentioned in that file. However, it seems to be ignoring spiders.txt, as it's been sucking up a ridiculous amount of bandwidth without any return. Is there anything else I can do to make sure that shopwiki is denied access to my site?
stevel Posted October 18, 2006 (Author)
You can contact the shopwiki.com support people - they were very responsive to a question I asked a short time ago. (The shopwiki spider has an annoying habit of trying variations on URLs, truncating them at punctuation points. I was told that this was their attempt to "optimize". It gets me a lot of 404 errors.) If you truly want to deny access, you can add a "Deny from" entry to your .htaccess for their IP range (I don't know what it is offhand), but that won't stop them from trying.
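A minimal sketch of the kind of "Deny from" .htaccess entry described above. The 192.0.2. prefix is just a documentation placeholder, not ShopWiki's real range; substitute the range you see in your own access log or a WHOIS lookup.

```apache
# Hypothetical .htaccess fragment (Apache): deny a crawler by IP range.
# 192.0.2. is a placeholder prefix - replace it with the actual range.
Order Allow,Deny
Allow from all
Deny from 192.0.2.
```

Note that, as Steve says, this blocks the requests but doesn't stop the crawler from trying; each attempt still costs a little bandwidth.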
forest23 Posted October 19, 2006
[quoting wr19026] "Is there anything else that I can do to make sure that shopwiki is denied access to my site?"
I was having the same problem, but shopwiki has obeyed robots.txt since I disallowed them... so far at least!
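For anyone wanting to try the same thing, a robots.txt rule along these lines is the usual approach. "ShopWiki" is an assumed user-agent token here; check your access log for the exact name the crawler announces, since robots.txt matching is done against that token.

```
# robots.txt - ask the ShopWiki crawler not to fetch anything.
# "ShopWiki" is an assumed token; confirm it from your access log.
User-agent: ShopWiki
Disallow: /
```

This only works for crawlers that honor robots.txt, which this one apparently does.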
Phocea Posted October 26, 2006
A couple you might want to add, which browse mainly French sites:
GET /links.php HTTP/1.1" 200 71316 "-" "BIGLOTRON (Beta 2;GNU/Linux)
Comes regularly but only does a few pages.
GET /robots.txt HTTP/1.0" 200 1666 "-" "Graal (http://www.gralon.net)
Comes very often to update its directory/search engine and crawls about 500 pages at a time, resulting in a big shopping cart :)
stevel Posted October 26, 2006 (Author)
Thanks.
acb Posted November 14, 2006
Hello all, I have a spider cruising around my site: 216.113.181.67, and it seems to be identified as eBay. http://www.showmyip.com/?ip=216.113.181.67
OrgName: eBay, Inc
OrgID: EBAY
Address: 2145 Hamilton Ave
City: San Jose
StateProv: CA
PostalCode: 95008
Country: US
I have put "ebay" into spiders.txt, but it does not prevent this one from getting sessions; sometimes about 10 or 15 at a time! Any ideas what to do? Does it have another name or user-agent? Thanks for all ideas...
stevel Posted November 14, 2006 (Author)
You have not shown what the user agent string is. The IP does not help.
acb Posted November 15, 2006
It shows no user-agent in the whois entry... that is what is puzzling me. (The other bots seem to.)
stevel Posted November 19, 2006 (Author)
WHOIS entries rarely show user agent strings - in fact, I have yet to see one that does. What I was asking for was the user agent string from your access log. That is what the "Prevent Spider Sessions" feature looks at, not IPs.
Andreas2003 Posted December 6, 2006
Hi Steve, I got hits from someone with the user agent string "PycURL/7.15.5". I already have an updated version of your spider list, but do you know if this is a regular (search-engine) spider or an unfriendly one? I'm asking because I have hits from two different IP addresses (one located in Brazil and one in Saudi Arabia), and both had this same string. Thanks in advance for your opinion. Kind regards, Andreas
stevel Posted December 6, 2006 (Author)
PycURL is an implementation of the cURL library for the Python language. It is not "bad", but you can assume that anyone using PycURL is not browsing your site normally and can be treated as a spider. Add the string: pycurl to spiders.txt for now. I'll add this in the next update.
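For anyone curious how this catches "PycURL/7.15.5", here is a rough sketch (not osCommerce's actual code) of how the "Prevent Spider Sessions" check is generally understood to work: each non-blank line of spiders.txt is treated as a case-insensitive substring test against the visitor's user agent.

```python
# Sketch of spiders.txt-style user-agent matching. Assumption: entries
# are lowercased substrings tested against the lowercased user agent.
def is_spider(user_agent, spider_entries):
    """Return True if any spiders.txt entry appears in the user agent."""
    ua = user_agent.lower()
    return any(entry and entry in ua for entry in spider_entries)

entries = ["googlebot", "slurp", "pycurl"]  # trimmed lines from spiders.txt
print(is_spider("PycURL/7.15.5", entries))             # True
print(is_spider("Mozilla/4.0 (Windows NT)", entries))  # False
```

This is why a single short entry like pycurl (or curl) is enough: it matches anywhere in the string, regardless of the version number that follows.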
Andreas2003 Posted December 7, 2006
Hi Steve, thanks for your reply. Isn't pycURL already in your spiders.txt? I assumed so, because when that string showed up in my stats, I figured the entry was already there. You also said that this is not normal browsing. Does this mean someone is trying to grab something from my site with pycURL?
stevel Posted December 7, 2006 (Author)
I'm away from my files, so I am not sure if it is there. But I think "curl" is there, which would take care of this. Yes, this does indicate some sort of automated grabber.
Andreas2003 Posted December 7, 2006
If someone is grabbing something, do you know a way to stop him (robots.txt)?
stevel Posted December 7, 2006 (Author)
robots.txt will work only if the spider obeys it, and that is not likely in this case. You can block that user agent using .htaccess. Is this "visitor" causing problems for you?
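A sketch of the kind of user-agent block Steve is describing, using Apache's mod_setenvif. The "curl" pattern is a case-insensitive substring match, so it catches "PycURL/7.15.5" as well; the bad_bot variable name is arbitrary.

```apache
# Hypothetical .htaccess fragment (Apache with mod_setenvif):
# flag any request whose User-Agent contains "curl" (case-insensitive,
# so PycURL matches too), then refuse it.
SetEnvIfNoCase User-Agent "curl" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

Unlike robots.txt, this doesn't depend on the script's cooperation - the server simply answers 403.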
Andreas2003 Posted December 10, 2006
Don't know; I just want to know what a grabber will grab.
stevel Posted December 10, 2006 (Author)
"Grabber" is a term used for any kind of automated web page reader that stores copies of what it finds on the web page. A search engine spider is a grabber, but usually people reserve this term for scripts other than search engine spiders. For example, I know of a script that grabs copies of any favicons it finds on a site. For the purpose of spiders.txt, you'd like to be able to recognize non-human visitors so as to avoid assigning a session to them. Being listed in spiders.txt does NOT restrict a non-human visitor from seeing the pages on your site (other than those that require a session, such as the cart). If you have a non-human visitor that is causing you problems, such as excessive bandwidth, you have to look to other means to stop them. Well-behaved scripts do obey robots.txt, but there are many that are not well behaved (often run by individuals). For these, you have to resort to other means, such as IP and user agent blocks in .htaccess.
stevel Posted December 11, 2006 (Author)
It turns out that "curl" wasn't in spiders.txt, and it definitely needs to be - especially due to PycURL. I have updated the contrib to include this and some other strings.
NancyL7 Posted December 12, 2006
Hello, newbie here... but I'm getting a lot of hits from
74.6.86.148
74.6.66.51
74.6.73.248
74.6.86.148
74.6.87.103
and so on. The network ID says it is the Inktomi Corporation (from Domain Dossier). These connections are constant, make huge guest carts, and the connections are multiplying; I now have 6. Is this a spider? I am using your spiders_large.txt on my site (renamed, of course), but it is not preventing this. Is this an 'OK' connection, or something I should worry about? It seems to be cycling through all my products, over and over. Anyone have advice? Thanks, Nancy
stevel Posted December 12, 2006 (Author)
That's one of Yahoo's spiders. Do you have the user agent string from the access log? Typically Yahoo's spiders have "slurp" in the UA, which spiders.txt includes.
NancyL7 Posted December 12, 2006
OK, I need help finding the access.log. -Nancy
NancyL7 Posted December 12, 2006
[quoting stevel] "Typically Yahoo's spiders have "slurp" in the UA, which spiders.txt includes."
Yippee, I figured out where to find the answer! Yes, it does. Here is a paste of one of them:
74.6.74.31 - - [11/Dec/2006:15:09:38 -0600] "GET /products_new.php?action=buy_now&products_id=219&osCsid=f56ba6b26df5bafbf65ddae3118e7f88 HTTP/1.0" 302 0 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
Incidentally, the number of connections has gone down; I now only have 4, but their carts have grown - one now contains 26 items! Thanks, Nancy
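If anyone else needs to dig the user agent out of a log line like the one above, here is a small sketch. It assumes the Apache "combined" log format, where the user agent is the last quoted field; the helper name is made up for illustration.

```python
import re

# Sample line in Apache "combined" format (shortened request path).
LINE = ('74.6.74.31 - - [11/Dec/2006:15:09:38 -0600] '
        '"GET /products_new.php HTTP/1.0" 302 0 "-" '
        '"Mozilla/5.0 (compatible; Yahoo! Slurp; '
        'http://help.yahoo.com/help/us/ysearch/slurp)"')

def user_agent(line):
    """Return the last quoted field of a combined-format log line."""
    fields = re.findall(r'"([^"]*)"', line)
    return fields[-1] if fields else None

print(user_agent(LINE))
```

Running this prints the "Mozilla/5.0 (compatible; Yahoo! Slurp; ...)" string, which is exactly what the spiders.txt substring "slurp" matches against.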