Sunday, September 07, 2008

Detecting document near-duplications in real time

Crawling the web for news, I found a few interesting things about its topology, and mainly about news on the web.
One of the headaches in serving news is dealing with duplications. There are a lot of near-duplicate news articles on the web; looking at our data, it is somewhere between 20% and 40% of the articles we harvest. By duplicates I mean articles with significant overlap between them, where a human would notice they are actually the same article that may have gone through some editing.
Both blogs and professional news providers have duplications and we treat them the same in this respect.
The most common reason for article duplication is a press release or an article posted by one of the major news brokers. These articles are often published word for word by news providers, with some minor editing, added images and links, etc.

There are many ways to deal with this problem, and it depends a lot on what kind of service you are writing. At any point in time, we have millions of potential news articles our search can come across.
On the performance side:

  • Latency: we have up to 10 milliseconds to check the result set (about a hundred articles) for duplication when serving the results.
  • Throughput: tens of thousands of transactions per minute per server at peak time.

To make it clear, our problem is not to find all near duplications. We just wish to find near duplications among the articles we serve, but it must be very fast. We might return 100 articles in one set; comparing all of them to each other takes about 10K comparisons.
Some of the conventional methods for detecting near duplications that I know of are shingling and matching term-frequency vectors. Shingling is great and the most accurate, but it is expensive: you can't take all the articles and compare them, not to mention keeping them all in memory. Creating the shingles takes time, and in large documents there may be many of them. Term-frequency vectors may be less accurate for this purpose and have similar caveats.

The system is divided into two stages.

Offline: creating signatures, which are like a set of shingles (see the sketch after this list).
  1. Breaking the article into natural sentences. Doing that instead of using fixed word N-grams reduces the number of shingles.
  2. Stemming and lowercasing the content, and removing short words, punctuation, etc.
  3. Computing a numeric value for each shingle using CRC32. We treat it as a unique numeric value for the sentence, accepting extremely rare collisions.
  4. Sorting the set of shingles into a list and trimming it to a fixed size. We found that in large documents it is enough to use only a subset to determine near duplications, as long as we are consistent with the sample we take. Using the shingles with the minimum values ensures we always take the same sample set.
  5. Attaching the short list of numbers to the article.
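
Here is a minimal sketch of that offline step in Java (a language choice for the sketch, not something stated in the post). The sentence splitting is naive, stemming is left out, and the signature size and minimum word length are illustrative assumptions, not numbers from our system.

```java
import java.nio.charset.StandardCharsets;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.zip.CRC32;

// Sketch of the offline signature step. SIGNATURE_SIZE and MIN_WORD_LENGTH are
// illustrative guesses, the sentence split is naive, and stemming is omitted.
public class SignatureBuilder {

    private static final int SIGNATURE_SIZE = 50;  // assumed, not from the post
    private static final int MIN_WORD_LENGTH = 3;  // assumed

    public static long[] buildSignature(String article) {
        // 1. Break the article into natural sentences.
        String[] sentences = article.split("[.!?]+");

        SortedSet<Long> shingles = new TreeSet<>();
        for (String sentence : sentences) {
            // 2. Lowercase, drop punctuation and short words (stemming left out).
            StringBuilder normalized = new StringBuilder();
            for (String word : sentence.toLowerCase().split("\\W+")) {
                if (word.length() >= MIN_WORD_LENGTH) {
                    normalized.append(word).append(' ');
                }
            }
            if (normalized.length() == 0) {
                continue;
            }
            // 3. CRC32 of the normalized sentence is the shingle's numeric value.
            CRC32 crc = new CRC32();
            crc.update(normalized.toString().getBytes(StandardCharsets.UTF_8));
            shingles.add(crc.getValue());
        }

        // 4. Keep only the smallest SIGNATURE_SIZE values: a consistent sample.
        long[] signature = new long[Math.min(SIGNATURE_SIZE, shingles.size())];
        int i = 0;
        for (long value : shingles) {
            if (i == signature.length) {
                break;
            }
            signature[i++] = value;
        }
        // 5. The sorted array is the signature attached to the article.
        return signature;
    }
}
```

The TreeSet keeps the shingle values sorted, so taking its first values gives the minimum-value sample described in step 4.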

Realtime: finding duplications in a set of a hundred documents. As you understand by now, I actually compare the shingles.
  1. Grab the sorted integer arrays representing the shingles of the two documents you compare.
  2. Compare the values of the two vectors and advance the pointers as you go. Since the vectors are sorted, you don't need to compare each value in one vector to all values in the other. Comparing sorted vectors this way is very fast!
The threshold of shingle overlap is known in advance and the result should be boolean, i.e. once we know it's a near duplication, we don't care how close it is. You can do lots of optimizations using this knowledge. For example, if the duplication threshold is 50% similarity and you have processed 60% of the terms and they all matched (or none of them did), there is no sense in continuing.
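
A minimal sketch of that comparison, assuming the signatures are the sorted arrays produced by the offline step and that the threshold is a fraction of the shorter signature (that interpretation is my assumption); it combines the merge-style walk with the early exit:

```java
// Sketch of the realtime check over two signatures from the offline step.
// The threshold is interpreted as a fraction of the shorter signature; that
// interpretation is an assumption, not something the post specifies.
public class SignatureComparator {

    public static boolean isNearDuplicate(long[] a, long[] b, double threshold) {
        int shorter = Math.min(a.length, b.length);
        int required = (int) Math.ceil(shorter * threshold);

        int i = 0, j = 0, matches = 0;
        while (i < a.length && j < b.length) {
            // Early exit: we already have enough matches, or there are not
            // enough values left in either array to ever reach the threshold.
            if (matches >= required) {
                return true;
            }
            if (matches + Math.min(a.length - i, b.length - j) < required) {
                return false;
            }
            // Classic merge step over two sorted arrays.
            if (a[i] == b[j]) {
                matches++;
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        return matches >= required;
    }
}
```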
Eventually, if you optimize in the right places, finding overlapping vectors in a set of a hundred documents is super fast and can take less than a millisecond with a real-life dataset.
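
Putting it together, a hypothetical pass over a result set could look like the following, keeping the first article of every near-duplicate group. The class, method, and parameter names are mine, and isNearDuplicate is the sketch above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical driver over a result set of up to ~100 articles: given their
// signatures in ranking order, return the indices to keep, dropping anything
// that near-duplicates an earlier article. Uses isNearDuplicate from above.
public class ResultSetDeduper {

    public static List<Integer> keepIndices(List<long[]> signatures, double threshold) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < signatures.size(); i++) {
            boolean duplicate = false;
            for (int k : kept) {
                if (SignatureComparator.isNearDuplicate(
                        signatures.get(k), signatures.get(i), threshold)) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(i);
            }
        }
        return kept;
    }
}
```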

Optimizing further, we use caching. Many of the document sets we are deduping in a session are going to be evaluated again in the near future. Hashing and caching the session set gives an extra boost in response time.
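
One way to read that, as a sketch rather than our actual implementation: key the cache by an order-independent hash of the article IDs in the result set and store the dedup decision, with an LRU bound picked arbitrarily here.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the session cache idea: key on an order-independent hash of the
// article IDs in the result set, store the dedup decision, and bound the
// cache with a simple LRU. MAX_ENTRIES and the key scheme are my assumptions.
public class DedupCache {

    private static final int MAX_ENTRIES = 10_000;  // assumed

    private final Map<Integer, List<Integer>> cache =
            new LinkedHashMap<Integer, List<Integer>>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Integer, List<Integer>> eldest) {
                    return size() > MAX_ENTRIES;
                }
            };

    // Order-independent key for a result set, built from its article IDs.
    public static int keyFor(Collection<Long> articleIds) {
        List<Long> sorted = new ArrayList<>(articleIds);
        Collections.sort(sorted);
        return sorted.hashCode();
    }

    public synchronized List<Integer> get(int key) {
        return cache.get(key);
    }

    public synchronized void put(int key, List<Integer> keptIndices) {
        cache.put(key, keptIndices);
    }
}
```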


Creative Commons License This work by Eishay Smith is licensed under a Creative Commons Attribution 3.0 Unported License.