Wednesday, September 17, 2008


Take a look at the new site. Great content, and I really liked the way they encourage good content and participation. They definitely hit the right spot with the reputation and voting system. It's also another winning use of OpenID. Obviously, I'm subscribed to the Java feed :-)

Thursday, September 11, 2008

Military History Podcast Rocks

I really liked this one: the Military History Podcast is very informative and interesting. It's not too long or too short, and I found lots of pearls there I didn't know about. Here in the US I don't do any military reserve duty anymore, so I kinda miss the long military discussions I used to have with my colleagues :-)

Anyway, this is a nice compensation. If you are interested in the history that made the world what it is today, and if you have some interest in the military, you'll definitely like this one.

Wednesday, September 10, 2008

Eclipse Birt is cool

I've started to mess around with Eclipse BIRT on my Mac. It is a bit annoying that their download page offers an 'All In One' only for Windows; it gives the perception that the tool is available only on that OS. There are Eclipse projects that don't work on all platforms, like the profiler, but fortunately BIRT is not one of them.
Using a simple Eclipse update you can have it installed, and it works without a problem. The last time I messed with it was a couple of years ago, so I wonder what's new...

Anyway, go get it. Even if you're on a Mac :-)

Monday, September 08, 2008

Wordle is cool!

Bounced into Wordle after hearing about it on the JavaPosse podcast. Really cool, but too bad I can't get the code and make it run on top of our DB.
Here are some words related to the news stuff I'm doing now:

Sunday, September 07, 2008

Detecting documents near duplications in realtime

Crawling the web for news, I found a few interesting things about its topology and mainly about news on the web.
One of the challenges in serving news is dealing with duplications. There are a lot of near duplications of news articles on the web; looking at our data, it's something between 20% and 40% of the articles we harvest. By duplicates I mean articles that have a significant overlap between them, such that a human would notice they are actually the same article that may have gone through some editing.
Both blogs and professional news providers have duplications and we treat them the same in this respect.
The most common reason for article duplication is a press release or an article posted by one of the major news brokers. These articles are often published word for word by news providers with some minor editing: adding images and links, etc.

There are many ways to deal with this problem, and the right choice depends a lot on what kind of service you are writing. At any point in time, we have millions of potential news articles our search can come by.
On the performance side:

  • Latency: we have up to 10 milliseconds to check the result set (about a hundred articles) for duplication when serving the results.
  • Throughput: tens of thousands of transactions per minute per server at peak time.

To make it clear, our problem is not to find all near duplications. We just wish to find near duplications in the articles we serve, but it must be very fast. We might return 100 articles in one set; comparing all of them to each other takes about 10K comparisons.
Some of the conventional methods I know of for detecting near duplications are shingling and matching term frequency vectors. Shingling is great and the most accurate, but it is expensive. You can't take all the articles and compare them to each other, not to mention keeping them all in memory. Creating the shingles takes time, and large documents may have many of them. Vectors may be less accurate for these purposes and have similar caveats.

The system is divided into two stages.

Offline: creating signatures, which are like a set of shingles.
  1. Breaking the article into natural sentences. Doing that, rather than using fixed word N-grams, reduces the number of shingles.
  2. Stemming and lowercasing the content, and removing short words, punctuation, etc.
  3. Taking a numeric value of each shingle using CRC32. We treat it as a unique numeric value of the sentence, allowing for extremely rare collisions.
  4. Sorting the set of shingles into a list and trimming it to a fixed size. We found that in large documents it's enough to use only a subset to determine near duplications, as long as we are consistent with the sample we take. Using the shingles with the minimum values ensures we'll always take the same sample set.
  5. Attaching the short list of numbers to the article.
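The offline steps above could be sketched roughly like this in Java. A minimal sketch, not the production code: the trim size is an assumed value, the sentence split and short-word cutoff are simplifications, and the stemming step is omitted for brevity.

```java
import java.util.TreeSet;
import java.util.zip.CRC32;

// Hypothetical sketch of the offline signature stage described above.
public class ShingleSignature {
    static final int MAX_SHINGLES = 20; // assumed trim size; the post doesn't give one

    // Split into rough sentences, normalize each, CRC32 it, keep the minimum values.
    public static long[] signature(String article) {
        TreeSet<Long> shingles = new TreeSet<>(); // sorted, deduplicated
        for (String sentence : article.split("[.!?]+")) {          // step 1: natural sentences
            StringBuilder normalized = new StringBuilder();
            for (String word : sentence.toLowerCase().split("\\W+")) { // step 2: lowercase,
                if (word.length() > 3) normalized.append(word).append(' '); // drop short words
            }
            if (normalized.length() == 0) continue;
            CRC32 crc = new CRC32();                               // step 3: numeric value
            crc.update(normalized.toString().getBytes());          //         of the sentence
            shingles.add(crc.getValue());
        }
        // step 4: keep only the MAX_SHINGLES smallest values, already sorted
        long[] result = new long[Math.min(MAX_SHINGLES, shingles.size())];
        int i = 0;
        for (long v : shingles) {
            if (i == result.length) break;
            result[i++] = v;
        }
        return result;  // step 5: attach this short list of numbers to the article
    }
}
```

Two articles that share a sentence will share the CRC32 value of that sentence, which is what the realtime stage exploits.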

Realtime: finding duplications in a set of a hundred documents. As you understand by now, I actually compare the shingles.
  1. Grab the sorted integer arrays representing the shingles of the two documents you compare.
  2. Compare the values of the two vectors and advance the pointers as you go. Since the vectors are sorted, you don't need to compare each value in one vector to all values in the other. Comparing sorted vectors this way is very fast!
The threshold of shingle overlap is known in advance, and the result should be boolean, i.e. once we know it's a near duplication, we don't care how close it is. You can do lots of optimizations using this knowledge. For example, if the duplication threshold is 50% similarity and after processing 60% of the shingles they either all matched or all failed to match, there is no point in continuing.
Eventually, if you optimize in the right places, finding overlapping vectors in a set of a hundred documents is super fast and can take less than a millisecond with a real-life dataset.
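A merge-style scan over two sorted signatures, with the early-exit trick described above, might look like the sketch below. The exact threshold semantics (overlap relative to the shorter signature) is my assumption, not a detail from the post.

```java
// Sketch of the realtime comparison: one pass over two sorted shingle arrays,
// bailing out as soon as the boolean answer is decided either way.
public class ShingleCompare {
    // True if the number of shared shingles reaches threshold * (shorter length).
    public static boolean nearDuplicate(long[] a, long[] b, double threshold) {
        int needed = (int) Math.ceil(threshold * Math.min(a.length, b.length));
        int i = 0, j = 0, matches = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { matches++; i++; j++; } // shared shingle
            else if (a[i] < b[j]) i++;                 // advance the smaller pointer:
            else j++;                                  // sorted input makes this O(n+m)
            if (matches >= needed) return true;        // enough overlap, stop early
            int remaining = Math.min(a.length - i, b.length - j);
            if (matches + remaining < needed) return false; // threshold now unreachable
        }
        return matches >= needed;
    }
}
```

For a hundred-document set you'd run this over each pair, which stays cheap because most pairs exit after a handful of comparisons.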

Optimizing further, we use caching. Many of the document sets we are deduping in a session are going to be evaluated again in the near future. Hashing and caching the session set gives an extra boost in response time.
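A minimal sketch of such a cache, assuming the session set is keyed by an order-independent hash of its article ids; the class and method names are hypothetical, and a simple LRU map stands in for whatever eviction policy the real system uses.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical session cache: key a result set by a hash of its article ids
// and reuse the previously computed dedup result for that set.
public class DedupCache<V> {
    private final int capacity;
    private final Map<Long, V> cache;

    public DedupCache(int capacity) {
        this.capacity = capacity;
        // access-ordered LinkedHashMap gives cheap LRU eviction
        this.cache = new LinkedHashMap<Long, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, V> eldest) {
                return size() > DedupCache.this.capacity;
            }
        };
    }

    public V get(long[] articleIds) { return cache.get(key(articleIds)); }

    public void put(long[] articleIds, V result) { cache.put(key(articleIds), result); }

    // Sort a copy so the key doesn't depend on the order articles arrived in.
    private static long key(long[] ids) {
        long[] sorted = ids.clone();
        Arrays.sort(sorted);
        return Arrays.hashCode(sorted);
    }
}
```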

Creative Commons License This work by Eishay Smith is licensed under a Creative Commons Attribution 3.0 Unported License.