stevel Posted October 2, 2006 (Author)
Richard, the answer to your question is provided in the "readme".
Steve
Contributions: Country-State Selector, Login Page a la Amazon, Protection of Configuration, Updated spiders.txt, Embed Links with SID in Description
DeadDingo Posted October 2, 2006
Thanks. I must be going blind in my old age.
wr19026 Posted October 18, 2006
I'm using the September update of spiders.txt. shopwiki is mentioned in that file. However, it seems to be ignoring spiders.txt, as it's been sucking up a ridiculous amount of bandwidth without any return. Is there anything else I can do to make sure that shopwiki is denied access to my site?
stevel Posted October 18, 2006 (Author)
You can contact the shopwiki.com support people - they were very responsive to a question I asked a short time ago. (The shopwiki spider has an annoying habit of trying variations on URLs, truncating them at punctuation points. I was told that this was their attempt to "optimize". It gets me a lot of 404 errors.) If you truly want to deny access, you can add a "Deny from" entry to your .htaccess for their IP range (I don't know what it is offhand), but that won't stop them from trying.
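A minimal sketch of the kind of "Deny from" .htaccess entry described above. The 192.0.2. prefix is just a documentation placeholder, not ShopWiki's real range; substitute the range you see in your own access log or a WHOIS lookup.

```apache
# Hypothetical .htaccess fragment (Apache): deny a crawler by IP range.
# 192.0.2. is a placeholder prefix - replace it with the actual range.
Order Allow,Deny
Allow from all
Deny from 192.0.2.
```

Note that, as Steve says, this blocks the requests but doesn't stop the crawler from trying; each attempt still costs a little bandwidth.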
forest23 Posted October 19, 2006
[quoting wr19026] "Is there anything else that I can do to make sure that shopwiki is denied access to my site?"
I was having the same problem, but shopwiki has obeyed robots.txt since I disallowed them... so far at least!
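For anyone wanting to try the same thing, a robots.txt rule along these lines is the usual approach. "ShopWiki" is an assumed user-agent token here; check your access log for the exact name the crawler announces, since robots.txt matching is done against that token.

```
# robots.txt - ask the ShopWiki crawler not to fetch anything.
# "ShopWiki" is an assumed token; confirm it from your access log.
User-agent: ShopWiki
Disallow: /
```

This only works for crawlers that honor robots.txt, which this one apparently does.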
Phocea Posted October 26, 2006
A couple you might want to add, which browse mainly French sites:
GET /links.php HTTP/1.1" 200 71316 "-" "BIGLOTRON (Beta 2;GNU/Linux)
Comes regularly but only does a few pages.
GET /robots.txt HTTP/1.0" 200 1666 "-" "Graal (http://www.gralon.net)
Comes very often to update its directory/search engine and crawls about 500 pages at a time, resulting in a big shopping cart :)
stevel Posted October 26, 2006 (Author)
Thanks.
acb Posted November 14, 2006
Hello all, I have a spider cruising around my site: 216.113.181.67, and it seems to be identified as eBay. http://www.showmyip.com/?ip=216.113.181.67
OrgName: eBay, Inc
OrgID: EBAY
Address: 2145 Hamilton Ave
City: San Jose
StateProv: CA
PostalCode: 95008
Country: US
I have put "ebay" into spiders.txt, but it does not prevent this one from getting sessions; sometimes about 10 or 15 at a time! Any ideas what to do? Does it have another name or user-agent? Thanks for all ideas...
stevel Posted November 14, 2006 (Author)
You have not shown what the user agent string is. The IP does not help.
acb Posted November 15, 2006
It shows no user-agent in the whois entry... that is what is puzzling me. (The other bots seem to.)
stevel Posted November 19, 2006 (Author)
WHOIS entries rarely show user agent strings - in fact, I have yet to see one that does. What I was asking for was the user agent string from your access log. That is what the "Prevent Spider Sessions" feature looks at, not IPs.
Andreas2003 Posted December 6, 2006
Hi Steve, I got hits from someone with the user agent string "PycURL/7.15.5". I already have an updated version of your spider list, but do you know if this is a regular (search-engine) spider or an unfriendly one? I'm asking because I have hits from two different IP addresses (one located in Brazil and one in Saudi Arabia), and both had this same string. Thanks in advance for your opinion. Kind regards, Andreas
stevel Posted December 6, 2006 (Author)
PycURL is an implementation of the cURL library for the Python language. It is not "bad", but you can assume that anyone using PycURL is not browsing your site normally and can be treated as a spider. Add the string: pycurl to spiders.txt for now. I'll add this in the next update.
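For anyone curious how this catches "PycURL/7.15.5", here is a rough sketch (not osCommerce's actual code) of how the "Prevent Spider Sessions" check is generally understood to work: each non-blank line of spiders.txt is treated as a case-insensitive substring test against the visitor's user agent.

```python
# Sketch of spiders.txt-style user-agent matching. Assumption: entries
# are lowercased substrings tested against the lowercased user agent.
def is_spider(user_agent, spider_entries):
    """Return True if any spiders.txt entry appears in the user agent."""
    ua = user_agent.lower()
    return any(entry and entry in ua for entry in spider_entries)

entries = ["googlebot", "slurp", "pycurl"]  # trimmed lines from spiders.txt
print(is_spider("PycURL/7.15.5", entries))             # True
print(is_spider("Mozilla/4.0 (Windows NT)", entries))  # False
```

This is why a single short entry like pycurl (or curl) is enough: it matches anywhere in the string, regardless of the version number that follows.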
Andreas2003 Posted December 7, 2006
Hi Steve, thanks for your reply. Isn't pycURL already in your spiders.txt? I assumed so, because when that string showed up in my stats, I figured the entry was already there. You also said that this is not normal browsing. Does this mean someone is trying to grab something from my site with pycURL?
stevel Posted December 7, 2006 (Author)
I'm away from my files, so I am not sure if it is there. But I think "curl" is there, which would take care of this. Yes, this does indicate some sort of automated grabber.
Andreas2003 Posted December 7, 2006
If someone is grabbing something, do you know a way to stop him (robots.txt)?
stevel Posted December 7, 2006 (Author)
robots.txt will work only if the spider obeys it, and that is not likely in this case. You can block that user agent using .htaccess. Is this "visitor" causing problems for you?
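A sketch of the kind of user-agent block Steve is describing, using Apache's mod_setenvif. The "curl" pattern is a case-insensitive substring match, so it catches "PycURL/7.15.5" as well; the bad_bot variable name is arbitrary.

```apache
# Hypothetical .htaccess fragment (Apache with mod_setenvif):
# flag any request whose User-Agent contains "curl" (case-insensitive,
# so PycURL matches too), then refuse it.
SetEnvIfNoCase User-Agent "curl" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

Unlike robots.txt, this doesn't depend on the script's cooperation - the server simply answers 403.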
Andreas2003 Posted December 10, 2006
Don't know; I just want to know what a grabber will grab.
stevel Posted December 10, 2006 (Author)
"Grabber" is a term used for any kind of automated web page reader that stores copies of what it finds on the web page. A search engine spider is a grabber, but usually people reserve this term for scripts other than search engine spiders. For example, I know of a script that grabs copies of any favicons it finds on a site. For the purpose of spiders.txt, you'd like to be able to recognize non-human visitors so as to avoid assigning a session to them. Being listed in spiders.txt does NOT restrict a non-human visitor from seeing the pages on your site (other than those that require a session, such as the cart). If you have a non-human visitor that is causing you problems, such as excessive bandwidth, you have to look to other means to stop them. Well-behaved scripts do obey robots.txt, but there are many that are not well behaved (often run by individuals). For these, you have to resort to other means, such as IP and user agent blocks in .htaccess.
stevel Posted December 11, 2006 (Author)
It turns out that "curl" wasn't in spiders.txt, and it definitely needs to be - especially due to PycURL. I have updated the contrib to include this and some other strings.
NancyL7 Posted December 12, 2006
Hello, newbie here... but I'm getting a lot of hits from
74.6.86.148
74.6.66.51
74.6.73.248
74.6.86.148
74.6.87.103
and so on. The network ID says it is the Inktomi Corporation (from Domain Dossier). These connections are constant, make huge guest carts, and the connections are multiplying; I now have 6. Is this a spider? I am using your spiders_large.txt on my site (renamed, of course), but it is not preventing this. Is this an 'OK' connection, or something I should worry about? It seems to be cycling through all my products, over and over. Anyone have advice? Thanks, Nancy
stevel Posted December 12, 2006 (Author)
That's one of Yahoo's spiders. Do you have the user agent string from the access log? Typically Yahoo's spiders have "slurp" in the UA, which spiders.txt includes.
NancyL7 Posted December 12, 2006
OK, I need help finding the access.log. -Nancy
NancyL7 Posted December 12, 2006
[quoting stevel] "Typically Yahoo's spiders have "slurp" in the UA, which spiders.txt includes."
Yippee, I figured out where to find the answer! Yes, it does. Here is a paste of one of them:
74.6.74.31 - - [11/Dec/2006:15:09:38 -0600] "GET /products_new.php?action=buy_now&products_id=219&osCsid=f56ba6b26df5bafbf65ddae3118e7f88 HTTP/1.0" 302 0 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
Incidentally, the number of connections has gone down; I now only have 4, but their carts have grown - one now contains 26 items! Thanks, Nancy
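If anyone else needs to dig the user agent out of a log line like the one above, here is a small sketch. It assumes the Apache "combined" log format, where the user agent is the last quoted field; the helper name is made up for illustration.

```python
import re

# Sample line in Apache "combined" format (shortened request path).
LINE = ('74.6.74.31 - - [11/Dec/2006:15:09:38 -0600] '
        '"GET /products_new.php HTTP/1.0" 302 0 "-" '
        '"Mozilla/5.0 (compatible; Yahoo! Slurp; '
        'http://help.yahoo.com/help/us/ysearch/slurp)"')

def user_agent(line):
    """Return the last quoted field of a combined-format log line."""
    fields = re.findall(r'"([^"]*)"', line)
    return fields[-1] if fields else None

print(user_agent(LINE))
```

Running this prints the "Mozilla/5.0 (compatible; Yahoo! Slurp; ...)" string, which is exactly what the spiders.txt substring "slurp" matches against.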