I'm looking at someone's site for them, and from the structure I would expect them to have a canonical issue, as there are multiple ways to navigate to the same page. However, only 1 copy of the page has been indexed.
Using seoquake, all versions appear to have been cached, but all on the same day. Click to view the cached version of any of them and the cache of the main version is shown instead.
No rel=canonical in code (well, actually there is, but all versions point to self, even if you make up your own URL). So how does G know that they are the same page?
* They have no robots.txt
* No redirecting happening
* URLs are full unique URLs, not just different GET vars
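For anyone wanting to repeat the "canonical points to self" check without eyeballing the source, here is a minimal sketch using only the Python standard library. The function names and the example.com URLs are made up for illustration, and it only looks at the link tag in the HTML, nothing else.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel can hold several space-separated values, so split it
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonicals.append(attr["href"])


def canonical_of(html):
    """Return the first canonical URL declared in the HTML, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals[0] if finder.canonicals else None


def is_self_referencing(page_url, html):
    """True when the page's canonical tag names the page's own URL."""
    return canonical_of(html) == page_url
```

Feed it the source of each duplicate along with the URL it was fetched from. If every copy comes back as self-referencing, the tag is not what is tying the pages together. The comparison is an exact string match, so trailing slashes or http/https differences would need normalising first.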
>>was the idea that Google must take into account CMS structures when crawling and indexing websites, so using something so popular should mean you get ranked properly.
That is not an unreasonable premise either.
I agree with the idea above. Don't think it applies in this case though: their CMS is not a widely used one. Doesn't make any sense to me at the moment. Either I am missing an obvious point or one of my tools is giving me a bad result.
Most obvious would be that there was a robots.txt which someone has since removed?
Failing that - do they have a webmaster account and have you checked the settings?
I'd love it if Google started sorting canonicals automatically, but I'm pretty confident they don't at the moment. So there clearly is *something* that you're not seeing. The question really is how it's hiding from you :)
Can't think what they could do in WMT that would cause this. Remove URLs shouldn't cause the cached versions to be linked. URL parameters wouldn't help, as the URLs are in the format var1/var2/pagename rather than pagename?v1=xx&v2=gg. Don't have WMT access at the moment, but am thinking I could almost certainly guess it in under 3 attempts.
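To show the difference between the two URL shapes, a quick sketch with Python's urllib.parse. The example.com host and the `describe` helper are made up for illustration. The point is just that the path-style URL has an empty query string, so there are no parameters for a parameter-handling setting to act on.

```python
from urllib.parse import urlsplit, parse_qs


def describe(url):
    """Split a URL into its path segments and its query-string parameters."""
    parts = urlsplit(url)
    return {
        "path_segments": [seg for seg in parts.path.split("/") if seg],
        "query_params": parse_qs(parts.query),
    }


# Path-style URL, as on the site in question: no query parameters at all
path_style = describe("http://example.com/var1/var2/pagename")

# Query-style URL, the shape that URL parameter handling is aimed at
query_style = describe("http://example.com/pagename?v1=xx&v2=gg")
```

Here `path_style["query_params"]` comes back empty, while `query_style["query_params"]` holds v1 and v2.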
Going to re-check all my facts before I get clever on this.
This is annoying me. Can't even remember why I need to understand this now, but I know I do!
So - two pages. Same content, different urls. Theoretically you could have a near infinite number with the same content - however let's think of just 2.
- Page A is indexed. Page B isn't
- If I view the cache of page B then it actually reports that it is showing the cache of page A
- There is no robots.txt
- Both pages have a canonical tag, however this points to self (in A it points to A, in B it points to B)
- Nothing has been set up in WMT. They are registered, but there are no sitemaps, excluded URLs, URL parameter handling etc.
- Response headers from both pages are just status 200
- Response headers are also status 200 when the user agent is changed to Googlebot
- Using the webmaster tools "fetch as googlebot" option, everything looks exactly as I see it
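The two header checks in the list above can be scripted roughly like this. It is only a sketch using the Python standard library: the user agent strings, the function names and the choice of headers worth a second look (Location, Link, X-Robots-Tag, Vary) are my assumptions, not anything confirmed about this site. Redirects are deliberately not followed, so a 301 or 302 shows up as itself rather than as the 200 of the page it lands on.

```python
import urllib.error
import urllib.request

# Assumed user agent strings for the comparison
BROWSER_UA = "Mozilla/5.0"
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Headers that can carry redirect, canonical or indexing hints
SIGNAL_HEADERS = ("location", "link", "x-robots-tag", "vary")


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stops urllib following redirects so the original status is visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError for the 3xx response


def fetch_status(url, user_agent):
    """Return (status code, headers dict) for one request, without following redirects."""
    opener = urllib.request.build_opener(NoRedirect)
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener.open(req, timeout=10) as resp:
            return resp.status, dict(resp.headers.items())
    except urllib.error.HTTPError as err:
        return err.code, dict(err.headers.items())


def hidden_signals(headers):
    """Pick out the headers worth a second look, with lower-cased names."""
    return {k.lower(): v for k, v in headers.items() if k.lower() in SIGNAL_HEADERS}


def report(urls):
    """Print status and any signal headers for each URL under both user agents."""
    for url in urls:
        for label, agent in (("browser", BROWSER_UA), ("googlebot", GOOGLEBOT_UA)):
            status, headers = fetch_status(url, agent)
            print(url, label, status, hidden_signals(headers))
```

Calling `report([...])` with the URLs of page A and page B prints one line per URL and user agent. Bear in mind that changing the user agent string only catches cloaking keyed on that string, not anything keyed on the requesting IP address.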
At the moment the only thing that makes sense is gurtie's theory that something was in place, but isn't any more. However, that is quite a coincidence. Hate mysteries.
ok - if page b isn't indexed, how are you viewing the cache?
(i swear I am being a fucktard here - but can't see it).
>>ok - if page b isn't indexed, how are you viewing the cache?
Initially I navigated to page B, hit the cache button in seoquake, then clicked the date to view it. I then decided that seoquake was the obvious source of fuckwittery (couldn't be me), so installed the google toolbar and did the same. Same result.
I then thought it was maybe just a dumb weird thing that they were doing on uncached pages at the moment, so tried it on some generated URLs, but they behave properly.
This surely is strange: a page that hasn't been indexed can't appear in the cache results of Google...
Yes, yes! A page that hasn't been indexed CANNOT appear in the cache results of Google!!!
Thanks a lot for agreeing with my point of view. :)




Because, since Google started using rel=canonballs, they have learnt from sites that use it to determine more accurately which pages are really unique and which are just duplicated by a CMS, and so know what to look for and which files / URLs should be indexed.
Total guess.
One of the reasons why I stuck with Wordpress (other than it being free and so easy to use that even I can do it) was the idea that Google must take into account CMS structures when crawling and indexing websites, so using something so popular should mean you get ranked properly.
Another stab in the dark of course.
Clutching at straws still!