There was a small kerfuffle when Gary Illyes's comment that “60% of the internet is duplicate” was taken by some to mean that 60% of the internet is content scraped and regurgitated by ne’er-do-wells (some of it is). The short answer is that it isn’t, but a headline like that doesn’t leave much room for context.
There is an awful lot of duplication from unredirected domain variants: http / https / www / non-www – Google counts those as four separate URLs. There is also a great deal from query-string and session/user-ID URLs which make it into the crawl list and need to be de-duplicated. Still more comes from country variants which don’t use hreflang, and yet more from content aggregators.
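To illustrate how those variants pile up, here is a minimal sketch in Python (my own example, not anything Google publishes) of how the same page can surface under several URLs and how a crawler might canonicalise them before de-duplicating:

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalise(url: str) -> str:
    """Collapse common duplicate-producing variants of a URL:
    scheme (http vs https), www vs non-www, trailing slashes and
    tracking/session query strings."""
    parts = urlsplit(url)
    host = parts.netloc.lower().removeprefix("www.")
    path = parts.path.rstrip("/") or "/"
    # Force https and drop the query string and fragment entirely --
    # a real crawler would be far more selective about which parameters it strips.
    return urlunsplit(("https", host, path, "", ""))

variants = [
    "http://example.com/page?sessionid=123",
    "https://example.com/page",
    "http://www.example.com/page/",
    "https://www.example.com/page?utm_source=news",
]
print({canonicalise(u) for u in variants})  # one canonical URL, not four
```

Four crawlable URLs, one piece of content – which is exactly why the raw duplication figure looks so high.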
Google uses checksums, or hashes, to spot duplicate content. As a very crude example: the number of bytes, multiplied by the number of href tags, multiplied by the number of heading tags, would give a fairly unique number even across billions of documents. Google’s hash will be far more complex than that, but it is a simple and effective way of identifying content which is likely to be the same. And when you drill down, excluding headers, footers, sidebars and so on, and apply the same methodology, it becomes very straightforward to spot content duplicated across different sites.
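As a toy illustration of that crude fingerprint idea (my own sketch, not Google’s actual algorithm), here is what reducing a page to one comparable number might look like:

```python
import re

def crude_fingerprint(html: str) -> int:
    """Crude duplicate fingerprint in the spirit of the example above:
    byte count x number of href tags x number of heading tags.
    Google's real hashing is far more sophisticated, but the idea of
    reducing a page to a single comparable number is the same."""
    size = len(html.encode("utf-8"))
    hrefs = len(re.findall(r"<a\s[^>]*href=", html, flags=re.I))
    headings = len(re.findall(r"<h[1-6][\s>]", html, flags=re.I))
    # The +1 guards stop a page with no links or headings collapsing to zero.
    return size * (hrefs + 1) * (headings + 1)

page_a = "<h1>Title</h1><p>Some text</p><a href='/x'>link</a>"
page_b = "<h1>Title</h1><p>Some text</p><a href='/x'>link</a>"
print(crude_fingerprint(page_a) == crude_fingerprint(page_b))  # True -> likely duplicates
```

Strip out the boilerplate (headers, footers, sidebars) before fingerprinting and the same comparison works across different sites, not just within one.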
So the lesson here is that an attention-grabbing statistic in a presentation can get a little out of hand unless you add a dab of context to the shoutline. Worth remembering when creating your own presentations.
Contact me to discuss duplicated content issues and how Google works.