
Google Duplicate Content with Strange cPath Variable Indexed


clustersolutions


In our Google duplicate-content report we can see that Google has indexed some strange cPath variables, e.g. www.xyz.com/abc.html?cPath=24_0_21, www.xyz.com/abc.html?cPath=23_0_53, etc. The problem is that we don't use subcategories, so I have no idea where the bot found these links and cPath values (unless they come from external sites). Any ideas? Would you mind sharing your thoughts? Thanks!

 

Tim


I have the same problem using the USU5 Pro contrib. It seems that Google started to index these URLs, and unfortunately the osCommerce code does not return a 404 status code for them. I think osCommerce needs to check in code whether the cPath value is correct and, if it is not, return a 404 error.


USU5 Pro, is that Ultimate SEO 5 by Chemo? We installed that years ago, and after so many other packages since then we just can't pinpoint the cause anymore. If USU5 is definitely the cause, maybe it is a better idea to fix that. For now we added a checker that does a 301 redirect if a product and its cPath don't match.


Well, if anyone can help pinpoint the root of the bug, I can help get it fixed. As with all open source, sometimes a bandage is a good quick fix. The subroutine below checks whether the URL cPath parameter is valid by comparing it with the category paths stored in the database. We have a package that validates our SEO URLs; we perform the check there, and if it returns false we do a 301 redirect to the URL without the cPath variable (also look into rel="canonical", which we use where it is the better solution). Hope this is helpful.

 

/*
Copyright © 2011 clustersolutions.net
Released under the GNU General Public License.
Please give credit where credit is due.
*/

// Validate the URL cPath parameter against the categories the product belongs to.
// Returns true when there is nothing to check or the cPath matches, false otherwise.
function tep_validate_url_cpath() {
  global $HTTP_GET_VARS, $products_id;

  if (isset($HTTP_GET_VARS['cPath']) && tep_not_null($products_id)) {
    $valid_paths = array();
    $prod_cat_check_query = tep_db_query("select categories_id from " . TABLE_PRODUCTS_TO_CATEGORIES . " where products_id = '" . (int)$products_id . "'");
    while ($prod_cat_check = tep_db_fetch_array($prod_cat_check_query)) {
      // Walk up from the product's category to the top level (parent_id = 0).
      $path = array();
      $category_id = (int)$prod_cat_check['categories_id'];
      // in_array() guards against an endless loop if the category data is circular
      while ($category_id != 0 && !in_array($category_id, $path)) {
        $path[] = $category_id;
        $path_check_query = tep_db_query("select parent_id from " . TABLE_CATEGORIES . " where categories_id = '" . $category_id . "'");
        $path_check = tep_db_fetch_array($path_check_query);
        $category_id = ($path_check ? (int)$path_check['parent_id'] : 0);
      }
      $valid_paths[] = implode('_', array_reverse($path));
    }
    return in_array($HTTP_GET_VARS['cPath'], $valid_paths, true);
  }

  return true;
}
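
For the redirect step mentioned above (sending the visitor to the same URL without the cPath variable), here is a minimal sketch. The helper name strip_cpath_from_url() is hypothetical and not part of osCommerce or any contrib; it only uses PHP's built-in parse_url(), parse_str() and http_build_query(), and it assumes the other query parameters should be kept.

```php
<?php
// Hypothetical helper (not an osCommerce function): rebuild a URL without
// its cPath parameter so it can serve as the target of a 301 redirect.
// Any #fragment is dropped, since it is never sent to the server anyway.
function strip_cpath_from_url(string $url): string {
    $parts = parse_url($url);
    if ($parts === false) {
        return $url; // leave seriously malformed URLs untouched
    }

    // Split the query string into an array and remove cPath only.
    $query = array();
    if (isset($parts['query'])) {
        parse_str($parts['query'], $query);
    }
    unset($query['cPath']);

    // Reassemble scheme, host, port, path and the remaining parameters.
    $result = '';
    if (isset($parts['scheme'])) {
        $result .= $parts['scheme'] . '://';
    }
    if (isset($parts['host'])) {
        $result .= $parts['host'];
    }
    if (isset($parts['port'])) {
        $result .= ':' . $parts['port'];
    }
    $result .= isset($parts['path']) ? $parts['path'] : '/';
    if (count($query) > 0) {
        $result .= '?' . http_build_query($query);
    }
    return $result;
}
```

When the validation fails, the returned URL could be sent with PHP's header('Location: ' . $target, true, 301) followed by an exit. Where exactly to hook that in depends on your SEO URL package, so treat this as a starting point only.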


  • 2 weeks later...

I probably wouldn't do it there, as that function just parses the cPath variable and returns it as an array. I would do it where the URL check is done. You should probably also have the SEO URL validation contrib installed, as going without it causes SEO problems too. We did that way back and it was beneficial. Good luck. Tim


  • 3 weeks later...

Assuming abc.html is actually providing the category information, then links with cPath parameters are probably redundant and will be viewed by Google as a duplicate page. You should try using the CANONICAL link:

 

http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html

 

which will allow Google to index only the correct pages
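
As a rough illustration of the canonical link idea, here is a small sketch. The function name canonical_link_tag() is made up for this example, and it assumes (as in the case above) that the path alone, e.g. abc.html, identifies the page, so the whole query string can be dropped.

```php
<?php
// Hypothetical helper: build a rel="canonical" tag that points every
// "?cPath=..." variant of a page at the same parameter-free URL.
// Assumption: the path alone identifies the page.
function canonical_link_tag(string $requestUrl): string {
    // Keep everything before the first "?" as the preferred URL.
    $preferred = explode('?', $requestUrl, 2)[0];

    // Escape the URL before placing it inside an HTML attribute.
    return '<link rel="canonical" href="'
        . htmlspecialchars($preferred, ENT_QUOTES, 'UTF-8')
        . '" />';
}
```

The tag belongs in the head section of the page, so both abc.html?cPath=24_0_21 and abc.html?cPath=23_0_53 would declare abc.html as the preferred version.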


  • 3 weeks later...

Hello,

 

I'm trying to use this function and changed the code in application_top.php to the following:

 

// calculate category path
  if (isset($_GET['cPath'])) {
    $cPath = $_GET['cPath'];
  } elseif (isset($_GET['products_id']) && !isset($_GET['manufacturers_id'])) {
    $cPath = tep_get_product_path($_GET['products_id']);
  } else {
    $cPath = '';
  }

  if (tep_validate_url_cpath($cPath) === false) {
    header('HTTP/1.1 404 Not Found');
    echo '<h1>404 Not Found</h1>';
    tep_exit();
  } else {
    if (tep_not_null($cPath)) {
      $cPath_array = tep_parse_category_path($cPath);
      $cPath = implode('_', $cPath_array);
      $current_category_id = $cPath_array[(sizeof($cPath_array)-1)];
    } else {
      $current_category_id = 0;
    }
  }

 

But the function always returns true and it goes to the 404.

I'm using a canonical contribution, but it also returns these duplicate-content category URLs as canonical links, because this is part of the osCommerce core.

I think that this is the right place to put the check.


  • 1 month later...

I'm bumping this topic to ask if someone has resolved this duplicate content issue.

Unfortunately, I tried several canonical URL contribs as well as Ultimate SEO URLs 5 Pro, but they cannot resolve this problem.

Google Webmaster Tools reported up to 2000 duplicated content pages with random cPath values in the URL.


To acidvertigo:

 

Try to add Disallow: /*?* to robots.txt to remove the duplicate issue.

 

One SEO problem still remains: links to product pages from category pages end with .html?cPath...

Can anyone suggest how to fix the categories template so the links end in ".html"?


I'm trying this code in application_top.php:

<code>

// cPath values that Google reported as duplicates
$duplicate = array('52_260','288_380_186','288_380_2_186','288_504_2_186','504_22_301_77','47_2_114','3_47_22_301','70_34_544_514','288_504_34_389',
                   '70_34_389','34_389','288_380_389','369_546','288_52_260_531','70_537_160_479','288_380_34_474_491','288_504_34_474_491','70_537_34_474_491','288_504_70_537_560','288_504_443_444',
                   '380_34_474_602');

// isset() avoids a notice on pages requested without a cPath parameter
if (isset($_GET['cPath']) && in_array($_GET['cPath'], $duplicate, true)) {
  header("HTTP/1.1 404 Not Found");
  echo "<h1>404 Not Found</h1>";
  unset($duplicate);
  tep_exit();
}

</code>


If I'm not mistaken, that should work.

 

Here is what I found in the Google Webmaster Central help:

  • To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):
    User-agent: Googlebot
    Disallow: /*?

BTW, any suggestions regarding fixing the links on category pages? I really need to get rid of "?cPath" in links to products from categories.


I found this function to get the full category path in this topic:

 

http://www.oscommerce.com/forums/topic/377890-how-to-get-full-cat-path/

 

function get_full_cat_from_cPath($zipote)
{
  $query_trabajo_1 = tep_db_query("SELECT `parent_id` FROM `categories` WHERE `categories_id` = '" . (int)$zipote . "'");
  $land = tep_db_fetch_array($query_trabajo_1);
  $cat_completa = (int)$zipote;
  // array keys must be quoted: $land['parent_id'], not $land[parent_id]
  while ($land && (int)$land['parent_id'] != 0) {
    // prepend the parent first, then move up one level
    $cat_completa = (int)$land['parent_id'] . '_' . $cat_completa;
    $query_ciclica = tep_db_query("SELECT `parent_id` FROM `categories` WHERE `categories_id` = '" . (int)$land['parent_id'] . "'");
    $land = tep_db_fetch_array($query_ciclica);
  }
  return (string)$cat_completa;
}

 

I put this in general.php but I cannot make it work. If this function can return the full category path, it can be compared with the cPath in the current URL, and if they don't match, give a 301 redirect or a 404 error code.

Please let me know if this is a good place to start and how I can make this function work.
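
To show the path-building idea without a database, here is a small self-contained sketch. It is only an illustration: full_cat_path() and the $parents array (categories_id => parent_id, with 0 meaning top level) are made-up stand-ins for the categories table, and the nullable return type needs PHP 7.1 or later.

```php
<?php
// Hypothetical stand-in for the categories table:
// $parents maps categories_id => parent_id, where 0 means "top level".
function full_cat_path(int $categoryId, array $parents): ?string {
    $path = array();
    $seen = array();
    $current = $categoryId;

    while ($current !== 0) {
        // Stop on an unknown category or on a loop in the parent chain.
        if (!isset($parents[$current]) || isset($seen[$current])) {
            return null;
        }
        $seen[$current] = true;
        array_unshift($path, $current);   // parents end up on the left
        $current = (int)$parents[$current];
    }

    return count($path) > 0 ? implode('_', $path) : null;
}
```

With $parents = array(160 => 0, 479 => 160), calling full_cat_path(479, $parents) gives '160_479', which is the value the cPath in the URL could then be compared with.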


@@alarm_seo -- I probably should have qualified that. Your solution will work for Google if they really do read the robots.txt file that way. However, it does not meet the standard, so other search engines probably won't read it that way. So, you can use that code in a robots.txt block that is for Google only, but probably not in a general block.

 

The cPath in a URL is used to provide the navigation in the categories box. You can rewrite it to something else, but the category information must still be in the link somewhere for the navigation to work.

 

Since you quote Google on robots.txt, why don't you read this Google help page? I'll quote the relevant sentence:

It's much safer to serve us the original dynamic URL and let us handle the problem of detecting and avoiding problematic parameters.

I suggest you stop wasting time trying to fix a broken URL rewriter that won't do you any good, and start spending time on things that actually will help your search engine ranking.

 

Regards

Jim



@@clustersolutions and @@All I have modified the previous function in general.php as follows:

 

function get_full_cat_from_cPath($zipote)
{
  $query1 = tep_db_query("SELECT parent_id FROM categories WHERE categories_id = '" . (int)$zipote . "'");
  $land = tep_db_fetch_array($query1);
  $cat_completa = $zipote;
  // array keys must be quoted: $land['parent_id'], not $land[parent_id]
  while ($land['parent_id'] != 0) {
    // the category has a parent: redirect to the default page and stop
    tep_redirect(tep_href_link(FILENAME_DEFAULT));
    tep_exit();
    $cat_completa = $land['parent_id'] . '_' . $cat_completa; // never reached
  }
  return $cat_completa;
}

 

Calling this function in index.php redirects to the default page. UNFORTUNATELY, it only works for the categories where the parent_id is not set.

For example, if the original cPath=160_479, I go to the correct page; calling only cPath=479 redirects to the default page (removing, in my case, some hundreds of duplicate pages). But if I call cPath=1_479 (1 is an existing parent_id), this code does not make the redirect.

P.S. In my Webmaster Tools I have duplicate content for URLs with 8 concatenated cPath values like 8_256_47_48_8_78_54_132, and it is still growing!
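
One way to catch cases like 1_479 is to check every link in the chain and not only the last category. The sketch below is just an illustration under assumptions: is_valid_cpath() and the $parents array (categories_id => parent_id, 0 for top level) are hypothetical stand-ins, and in a real shop the parent IDs would come from the categories table.

```php
<?php
// Hypothetical check: a cPath is valid only if it starts at a top-level
// category and every following ID is a direct child of the one before it.
// $parents maps categories_id => parent_id (0 = top level).
function is_valid_cpath(string $cPath, array $parents): bool {
    // Only positive integers joined by underscores, e.g. "160_479".
    if (!preg_match('/^[1-9][0-9]*(_[1-9][0-9]*)*$/', $cPath)) {
        return false;
    }

    $expectedParent = 0; // the first ID must be a top-level category
    foreach (explode('_', $cPath) as $id) {
        $id = (int)$id;
        if (!isset($parents[$id]) || (int)$parents[$id] !== $expectedParent) {
            return false;
        }
        $expectedParent = $id;
    }
    return true;
}
```

With $parents = array(1 => 0, 160 => 0, 479 => 160), the path 160_479 passes, while 479, 1_479 and 24_0_21 are all rejected. Long concatenated paths with repeated or unrelated IDs fail at the first link that does not match.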


Hi!

 

In my opinion it would be better to catch this in application_top. The tep_parse_category_path() function is a good place for it.

 

 if (tep_not_null($cPath)) {
   $cPath_array = tep_parse_category_path($cPath);

 

So the check can be done inside tep_parse_category_path(), which is the main built-in function for this.

////
// Parse and secure the cPath parameter values
  function tep_parse_category_path($cPath) {
// make sure the category IDs are integers
    $cPath_array = array_map('tep_string_to_int', explode('_', $cPath));

// make sure no duplicate category IDs exist which could lock the server in a loop
    $tmp_array = array();
    $n = sizeof($cPath_array);
    for ($i=0; $i<$n; $i++) {
      if (!in_array($cPath_array[$i], $tmp_array)) {
        $tmp_array[] = $cPath_array[$i];
      }
    }

/*** Suggested place to validate the cPath string ***/
    return $tmp_array;
  }

This problem may persist on all pages that use cPath.

:blink:


Or something that controls the $tree array in bm_categories here:

 

function getData() {
  global $categories_string, $tree, $languages_id, $cPath, $cPath_array;
  $categories_string = '';
  $tree = array();
  $categories_query = tep_db_query("select c.categories_id, cd.categories_name, c.parent_id from " . TABLE_CATEGORIES . " c, " . TABLE_CATEGORIES_DESCRIPTION . " cd where c.parent_id = '0' and c.categories_id = cd.categories_id and cd.language_id='" . (int)$languages_id ."' order by sort_order, cd.categories_name");
  while ($categories = tep_db_fetch_array($categories_query))  {
    $tree[$categories['categories_id']] = array('name' => $categories['categories_name'],
											    'parent' => $categories['parent_id'],
											    'level' => 0,
											    'path' => $categories['categories_id'],
											    'next_id' => false);
    if (isset($parent_id)) {
	  $tree[$parent_id]['next_id'] = $categories['categories_id'];
    }
    $parent_id = $categories['categories_id'];
    if (!isset($first_element)) {
	  $first_element = $categories['categories_id'];
    }
  }
  if (tep_not_null($cPath)) {
    $new_path = '';
    // foreach replaces reset()/each(); each() was removed in PHP 8
    foreach ($cPath_array as $key => $value) {
	  unset($parent_id);
	  unset($first_id);
	  $categories_query = tep_db_query("select c.categories_id, cd.categories_name, c.parent_id from " . TABLE_CATEGORIES . " c, " . TABLE_CATEGORIES_DESCRIPTION . " cd where c.parent_id = '" . (int)$value . "' and c.categories_id = cd.categories_id and cd.language_id='" . (int)$languages_id ."' order by sort_order, cd.categories_name");
	  if (tep_db_num_rows($categories_query)) {
	    $new_path .= $value;
	    while ($row = tep_db_fetch_array($categories_query)) {
		  $tree[$row['categories_id']] = array('name' => $row['categories_name'],
											   'parent' => $row['parent_id'],
											   'level' => $key+1,
											   'path' => $new_path . '_' . $row['categories_id'],
											   'next_id' => false);
		  if (isset($parent_id)) {
		    $tree[$parent_id]['next_id'] = $row['categories_id'];
		  }
		  $parent_id = $row['categories_id'];
		  if (!isset($first_id)) {
		    $first_id = $row['categories_id'];
		  }
		  $last_id = $row['categories_id'];
	    }
	    $tree[$last_id]['next_id'] = $tree[$value]['next_id'];
	    $tree[$value]['next_id'] = $first_id;
	    $new_path .= '_';
	  } else {
	    break;
	  }
    }
  }

 

This code outputs the array as follows:

 

Array
(
[1] => Array
(
[name] => Hardware
[parent] => 0
[level] => 0
[path] => 1
[next_id] => 17
)
[2] => Array
(
[name] => Software
[parent] => 0
[level] => 0
[path] => 2
[next_id] => 3
)
[3] => Array
(
[name] => DVD Movies
[parent] => 0
[level] => 0
[path] => 3
[next_id] =>
)
[17] => Array
(
[name] => CDROM Drives
[parent] => 1
[level] => 1
[path] => 1_17
[next_id] => 4
)
[4] => Array
(
[name] => Graphics Cards
[parent] => 1
[level] => 1
[path] => 1_4
[next_id] => 8
)
[8] => Array
(
[name] => Keyboards
[parent] => 1
[level] => 1
[path] => 1_8
[next_id] => 16
)
[16] => Array
(
[name] => Memory
[parent] => 1
[level] => 1
[path] => 1_16
[next_id] => 9
)
[9] => Array
(
[name] => Mice
[parent] => 1
[level] => 1
[path] => 1_9
[next_id] => 6
)
[6] => Array
(
[name] => Monitors
[parent] => 1
[level] => 1
[path] => 1_6
[next_id] => 5
)
[5] => Array
(
[name] => Printers
[parent] => 1
[level] => 1
[path] => 1_5
[next_id] => 7
)
[7] => Array
(
[name] => Speakers
[parent] => 1
[level] => 1
[path] => 1_7
[next_id] => 2
)
)

 

But unfortunately, on the duplicated pages this array is still valid with all the duplicate values. I cannot find anything to check whether this array is the good one or the duplicated one. :unsure:


Google webmaster tools reported up to 2000 duplicated content pages with random cpath in the url.

Are you sure Google is reporting 2000 duplicate content pages and not 2000 pages with possible duplicate meta tags? The latter is very common and harmless, assuming the tags are set up correctly.



  • 7 months later...
  • 8 months later...

I know it's an old topic, but since it's the first result on Google, I'm going to give the solution. If you're running r205, you should be able to fix the problem by replacing a single file.

 

http://www.oscommerce.com/forums/topic/336702-ultimate-seo-urls-5-by-fwr-media/page__st__3020#entry1576463

 

Why this is not part of the contribution you can download (or just uploaded as a single file), I don't know, but FWR Media has been informed and will hopefully update the contribution soon, maybe with even more fixes.


Archived

This topic is now archived and is closed to further replies.
