How to serve a 404 for pages that do not exist and contain index.php?


Biancoblu

Hi,

 

I'm trying to get non-existent pages whose URL contains index.php to return a 404.

 

Example: I want http://www.mysite.com/index.php/typo to give a 404 Not Found. Right now a URL like that just redirects to the home page, and as a result non-existent pages have started showing up in search engine results.

 

Currently I can only serve a 404 for non-existent pages WITHOUT index.php in the URL. For example, http://www.mysite.com/typo gives a 404.

I have placed

ErrorDocument 404 /404.php

in .htaccess.

 

Based on the above I tried

ErrorDocument index.php/404 /404.php

but I am given an internal server error.
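
For what it's worth, the internal server error is expected: ErrorDocument takes a three-digit HTTP status code as its first argument, so index.php/404 is not something Apache can parse. The reason http://www.mysite.com/index.php/typo does not 404 is that Apache finds index.php and hands the trailing /typo to the script as path info. A minimal sketch of one possible fix follows. It assumes PHP runs through a handler that honours AcceptPathInfo (mod_php does), that the host allows the directive in .htaccess, and that the shop does not use search-engine-safe URLs that put parameters after index.php as path segments.

```apache
# Custom page for real 404s (already in place in this thread)
ErrorDocument 404 /404.php

# Reject requests that carry extra path segments after an existing file,
# so /index.php/typo returns 404 instead of running index.php.
# Assumption: nothing on the site depends on PATH_INFO.
AcceptPathInfo Off
```

If PHP is served some other way (for example through a proxy to a separate PHP process), this directive may have no effect and a rewrite rule would be needed instead.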

 

Any help appreciated.

Thanks

 

Isabella

~ Don't mistake my kindness for weakness ~


Hi Isabella.

 

I had a similar problem. Google found hundreds of ways to spider the reviews.

It found urls like

product_reviews.php?sort=3a&page=6&products_id=836

And Google had “reviews” as my top keyword.

 

There are several ways to prevent this. Add this to your robots.txt file:

 

User-agent: Googlebot
Disallow: /typo

 

Or, to serve a 404 error use

 

header('HTTP/1.1 404 Not Found');

or

header('HTTP/1.1 410 Gone');

 

There is no contribution to solve this, which surprises me as it must be a common problem. There should be some way to tell Google to look at our sitemaps ONLY and not search for these obscure urls.

 

Regards

 

Ken


@@Ken44

 

There is no contribution to solve this

 

For duplicate content issues, did you look at http://addons.oscommerce.com/info/7163 ?

Sam

 

Remember, what you think I meant may not be what I thought I meant when I said it.

 

Contributions:

 

Auto Backup your Database, Easy way

 

Multi Images with Fancy Pop-ups, Easy way

 

Products in columns with multi buy etc etc

 

Disable any Category or Product, Easy way

 

Secure & Improve your account pages et al.


@@Ken44

 

 

Or, to serve a 404 error use

 

header('HTTP/1.1 404 Not Found');

or

header('HTTP/1.1 410 Gone');

 

 

Hi Ken

 

Thanks for replying. The above would work for a page that exists, but how does one deal with pages that don't exist in the first place? As for robots.txt, I can add lines for the wrong URLs I found in Google, but that will never cover all of them, as the possibilities are endless.

 

That's why I was thinking of doing it through .htaccess, by serving a 404 for every non-existent page with index.php in the URL. It already works for pages that DON'T contain index.php, so it must be feasible, but the code I tried is obviously not correct.

 

Any experts on htaccess syntax, please help.

 

@@spooks

Hi Sam, thanks for the link to your addon. However, my problem is not duplicate content; it's preventing Google from listing non-existent pages that redirect to the home page, so I'm not sure that addon would help.

~ Don't mistake my kindness for weakness ~


problem is not duplicate content

I'm aware of that, but Ken was describing an issue that is duplicate content and saying there was no solution, when there is.

 

Re your issue: I find you need to adjust your .htaccess as the issues arise. Sending these URLs to a 404 is a mistake, as Google will complain about that.

 

My approach is this: the link often contains a remnant of the original page name, so I use .htaccess to pass it to a PHP page that searches for a partial match and then does a 301 to the match. If there is no match, it goes to index.

 

For example, some pages publish links copied from search results, so they are truncated and end with two dots (..). To match those I use:

 

RewriteRule ^([^\.]+)\.\.$ /redirect.php?page=$1 [L]

Sam

 


Hi.

 

I currently have KISS header tags with canonical links installed, and I have your code installed on my test site. They both work fine; however, canonical links do nothing to prevent Google from ‘inventing’ new links, and they do not prevent Google from spidering empty review pages.

 

Canonical url seems to be just a suggestion to Google.

 

Google does not automatically remove non-canonical urls from its database, hence the reason that ‘reviews’ became my top keyword.

 

The way I resolved it was to add header('HTTP/1.1 410 Gone'); to my empty review pages, and this restored my keywords.

 

Regards

 

Ken


Hi Isabella.

 

One solution is to install a canonical contribution and then compare the current url to the canonical url and then write some code to redirect Google if the url is not the canonical.

 

Regards

 

Ken


@@Ken44 My experience with the canonical tag is that it does work, but it is not instant: it can take maybe a month before Google reflects the tag in its data center.

 

I have not seen any case to date where the tag was ignored; it was Google's idea, after all.

Sam

 


Hi Isabella.

 

One solution is to install a canonical contribution and then compare the current url to the canonical url and then write some code to redirect Google if the url is not the canonical.

 

Regards

 

Ken

 

@@Ken44

 

Hi Ken

 

I too use KISS header tags and can confirm what you said in the other post: it does not prevent search engines from "inventing" new links.

 

My actual problem is that someone posted some wrong URLs of mine on a forum, for example http://www.mysite.com/index.php/wrongpage. The forum got crawled and Google listed the URLs because they don't give a 404 when clicked. If they did, Google wouldn't have listed them in the first place.

 

@@spooks

 

Sam, would the rewrite rule you posted help for my problem?

~ Don't mistake my kindness for weakness ~


Hi

 

Google has found these on my site

 

product_reviews.php?cPath=1_7&products_id=222

product_reviews.php?cPath=5&products_id=222

product_reviews.php?amp;products_id=222

 

and hundreds of similar ones. What’s more, these products do not even have any reviews.

 

If I wish to redirect all these pages to the review page for product 222, then the code would be:

 

<?php

header('Location: http://www.test.com/product_reviews.php?products_id=222');

?>

 

Would this be a better solution?
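
For comparison, the same idea can be sketched at the server level without hard-coding the product id. Note that header('Location: ...') in PHP sends a 302 by default, while the rule below sends a 301. This is a minimal, untested sketch that assumes mod_rewrite is enabled, the .htaccess sits in the shop root, and dropping cPath does not change what the reviews page shows. Any other parameters, such as a session id, are dropped too.

```apache
RewriteEngine On

# Only act when the query string carries a cPath parameter ...
RewriteCond %{QUERY_STRING} (?:^|&)cPath= [NC]
# ... and capture the numeric products_id into %1.
RewriteCond %{QUERY_STRING} (?:^|&)products_id=(\d+) [NC]
# 301 to the same script with products_id as the only parameter.
# The redirected URL has no cPath, so the rule does not loop.
RewriteRule ^product_reviews\.php$ /product_reviews.php?products_id=%1 [R=301,L]
```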

 

 

Regards

 

Ken


I wouldn't do that. What's the canonical for the page? If it's set to ignore everything but products_id, then that would work.

 

@@Biancoblu I'm sorry to disagree, I've found google can list links that 404, that's why I would never use that as a method to control them.

Sam

 


If you have URLs that no longer exist, it is better to 301 them or 410 them. A 404 means the page is missing for an unexplained reason, so Google will keep coming back to look for the URL indefinitely.

 

There are cases where the SE will ignore the canonical directive, when it is not properly being used for instance.
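
As a minimal sketch of the 301 and 410 options in .htaccess, using mod_alias: the file names below are made-up placeholders, not real osCommerce pages, and the host must allow Redirect in .htaccess.

```apache
# 301: the page moved; send visitors and crawlers to the new location.
# /old_page.php and /new_page.php are hypothetical names.
Redirect permanent /old_page.php http://www.mysite.com/new_page.php

# 410: the page was removed on purpose and has no replacement.
# /discontinued_page.php is a hypothetical name.
Redirect gone /discontinued_page.php
```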


@@spooks @@Hotclutch

 

@@Biancoblu I'm sorry to disagree, I've found google can list links that 404, that's why I would never use that as a method to control them.

 

My problem is with pages that NEVER existed, not pages that moved, or pages that were removed, but pages that never existed.

 

I'm following Google's guidelines on how to deal with non-existing pages, they say:

 

"Usually, when someone requests a page that doesn’t exist, a server will return a 404 (not found) error. This HTTP response code clearly tells both browsers and search engines that the page doesn’t exist. As a result, the content of the page (if any) won’t be crawled or indexed by search engines. We recommend that you always return a 404 (Not found) or a 410 (Gone) response code in response to a request for a non-existing page."

 

Based on Google's guidelines, I return a 404 for pages that never existed, but that only works for URLs directly under http://www.mysite.com; it doesn't work for URLs that contain index.php.

 

So, http://www.mysite.com/tttokjhgjjuo returns a 404 which is what I want

 

BUT

 

for example http://www.mysite.com/index.php/tthhgdfshklgioo doesn't return a 404

~ Don't mistake my kindness for weakness ~


The problem lies within your script. If a link NEVER existed like you say, then there is no way that it can end up in the index.

 

In my opinion a 404 error is undesirable, something that you want to attend to, either by way of a 301 redirect, or a 410 response.


@@Hotclutch

 

The link never existed on my site, but someone posted it on a forum to make the point that my non-existent URLs don't return a 404. Google crawled that forum and indexed the link because it took it as valid.

 

I have tested this on several osCommerce shops and all behave the same: pages like http://www.site.com/index.php/tthhgdfshklgio return a 200 OK response.

 

 

Try this, it says 404 page not found: http://demo.oscommerce.com/gaga

 

Now try this; the page returns a 200 OK response: http://demo.oscommerce.com/index.php/gaga

~ Don't mistake my kindness for weakness ~


@@Biancoblu

 

I'm no htaccess guru, but what of this method:

 

RewriteCond %{THE_REQUEST} ^GET\ /.*\;.*\ HTTP/  
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.*)index\.php(.*)\ HTTP/ [NC]
RewriteCond %{QUERY_STRING} !^?(.+)$  
RewriteRule .* do something

 

What that should do (not tested):

 

Where it's an external request (i.e. not from your site), the page requested is index.php, and the request contains a query string that does not start with ?, then do whatever you replace 'do something' with.

 

Of course if you already use a rewriter that removes the ? you'll need to find another test.
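
Along the same lines, here is a minimal sketch of a rule that answers with a 404 for anything requested as /index.php/something. It is untested against a live shop and assumes mod_rewrite is enabled, the .htaccess sits in the shop root, and no legitimate URL uses path segments after index.php (search-engine-safe URLs would). One detail worth knowing: %{QUERY_STRING} never includes the leading ?, so the sketch matches on the path instead.

```apache
RewriteEngine On

# In .htaccess the pattern sees the path relative to this directory,
# including any trailing path info: "index.php/typo" matches,
# "index.php" on its own (with or without a query string) does not.
# A non-3xx status in the R flag makes Apache return that status and
# ignore the substitution, so the ErrorDocument 404 page is served.
RewriteRule ^index\.php/ - [R=404,L]
```

Because the pattern only matches paths that start with index.php/, the internal request for /404.php that Apache makes to build the error page is not caught by the rule.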

Sam

 


Archived

This topic is now archived and is closed to further replies.
