Google: Content Stitching Or Quilting Is Not Near Duplicate Content

Jun 21, 2017 - 8:11 am


Dawn Anderson followed up with Google's Gary Illyes on what counts as near-duplicate content, asking whether content stitching and quilting falls under the same definition. As Dawn suspected, Gary said it does not. On Twitter, Dawn asked "'Content stitching / quilting'... this is not the same as near-duplicate as defined in ur prev tweet?" and Gary confirmed that she is correct.


Dawn then sent me some more technical background on this. She said that Marc Najork, who is now at Google, wrote a paper on the topic while at Microsoft, titled "Detecting Quilted Web Pages at Scale." Here is the abstract:

Web-based advertising and electronic commerce, combined with the key role of search engines in driving visitors to ad-monetized and e-commerce web sites, has given rise to the phenomenon of web spam: web pages that are of little value to visitors, but that are created mainly to mislead search engines into driving traffic to target web sites. A large fraction of spam web pages is automatically generated, and some portion of these pages is generated by stitching together parts (sentences or paragraphs) of other web pages. This paper presents a scalable algorithm for detecting such “quilted” web pages. Previous work by the author and his collaborators introduced a sampling-based algorithm that was capable of detecting some, but by far not all quilted web pages in a collection. By contrast, the algorithm presented in this work identifies all quilted web pages, and it is scalable to very large corpora. We tested the algorithm on the half-billion page English-language subset of the ClueWeb09 collection, and evaluated its effectiveness in detecting web spam by manually inspecting small samples of the detected quilted pages. This manual inspection guided us in iteratively refining the algorithm to be more efficient in detecting real-world spam.
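To make the distinction between near-duplicate and quilted pages concrete, here is a minimal toy sketch in Python. It is not the algorithm from the paper, which is built to scale to very large corpora. This version simply compares sentence sets pairwise, and the function names, the sample pages and the 0.6 threshold are all assumptions made up for illustration. The idea is that a near-duplicate page overlaps heavily with one other page, while a quilted page is mostly borrowed sentences drawn from several different pages.

```python
import re

# Hypothetical toy corpus; page ids and text are invented for illustration.
corpus = {
    "a": "Cats sleep most of the day. They hunt at dusk.",
    "b": "Green tea contains caffeine. It is brewed below boiling.",
    "c": "Cats sleep most of the day. They hunt at dusk. Kittens play often.",
    "q": ("Cats sleep most of the day. Green tea contains caffeine. "
          "They hunt at dusk. It is brewed below boiling."),
}


def sentence_set(text):
    """Split text into a set of lowercased, whitespace-normalized sentences."""
    parts = re.split(r"[.!?]+", text.lower())
    return {" ".join(p.split()) for p in parts if p.strip()}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(page_id, corpus, threshold=0.6):
    """Label a page as 'near-duplicate', 'quilted' or 'other' (toy heuristic)."""
    own = sentence_set(corpus[page_id])
    others = {pid: sentence_set(text)
              for pid, text in corpus.items() if pid != page_id}

    # Near-duplicate: high overall similarity with one single other page.
    if any(jaccard(own, s) >= threshold for s in others.values()):
        return "near-duplicate"

    # Quilted: most sentences appear elsewhere, spread over two or more pages.
    sources = {pid for pid, s in others.items() if own & s}
    borrowed = {sent for sent in own
                if any(sent in s for s in others.values())}
    coverage = len(borrowed) / len(own) if own else 0.0
    if coverage >= threshold and len(sources) >= 2:
        return "quilted"
    return "other"
```

In this toy corpus, page "q" is assembled entirely from sentences found on other pages but does not closely match any single one of them, so the sketch labels it as quilted, while page "c" is labeled a near-duplicate because it mostly repeats page "a". The pairwise comparison here is quadratic in the number of pages, so it only works as an illustration on tiny inputs.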

There is no doubt Google and other search engines are onto this type of behavior, but it is always nice to point to research papers when we can. Thanks, Dawn.

Forum discussion at Twitter.

 
