Wednesday, April 16, 2008

Java's URL little secret

Lately I have been playing a lot with fetching large amounts of data from a large number of URLs. Sounds like a fun project, and indeed it was. The first phase was to load, compare, classify and sort the URLs from the DB. After everything was written and tested I wanted to start rocking on a nice strong server with a few million URLs and hundreds of threads, but the application crawled!
After blaming the slow network (which wasn't slow at all), deadlocks (none) and the weather, I did a kill -3 to see what the threads were actually doing. Most of them were stuck on:

at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$1.lookupAllHostAddr
...
at java.net.URLStreamHandler.getHostAddress
- locked <0x0ad4e050> (a sun.net.www.protocol.http.Handler)
at java.net.URLStreamHandler.hashCode
at java.net.URL.hashCode
It freaked me out a bit.

I created URL objects out of strings, stored them in hash maps and sets, and did all kinds of comparisons, adding and removing them from the containers. It appears that there is a DNS lookup the first time you either compare the object or get its hash code (i.e. when adding it to a hash map/set). The DNS lookup does not happen when the object is initially created; it is lazy and happens on demand, which makes the problem even harder to locate.
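To make the lazy lookup visible, here is a minimal sketch that times the first insertion of a URL into a HashSet. The class name UrlHashDemo, the helper timeAddMillis and the host example.com are made up for illustration, and the measured time will vary with the machine and DNS setup, so no particular number should be expected.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

public class UrlHashDemo {
    // Hypothetical helper: returns how long (in ms) it takes to add one URL to a HashSet.
    static long timeAddMillis(String spec) {
        URL url;
        try {
            // Constructing the URL only parses the string; no host resolution happens here.
            url = new URL(spec);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + spec, e);
        }
        Set<URL> seen = new HashSet<>();
        long start = System.nanoTime();
        // HashSet.add calls URL.hashCode(), which is where the host may get resolved.
        seen.add(url);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // With a host name, the first add may block on a DNS lookup.
        System.out.println("host name:  " + timeAddMillis("http://example.com/index.html") + " ms");
        // With an IP literal there is no name to look up, so this should return quickly.
        System.out.println("IP literal: " + timeAddMillis("http://127.0.0.1/index.html") + " ms");
    }
}
```

On a machine with a slow or unreachable DNS server the first line can take seconds, which is roughly what hundreds of threads were waiting on in the thread dump above.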

Looking at the javadocs it's rather cryptic:
The hash code is based upon all the URL components relevant for URL comparison. As such, this operation is a blocking operation.

Either way, I wouldn't have guessed that it is going to do IO. And what the heck does "it is a blocking operation" mean?
I wouldn't think anyone would expect hashCode to be synchronic, right?

And even if the javadoc did mention what's going on behind the scenes, one can think of a few flaws in this behavior:
* Exposing the implementation to the "8 Fallacies of Distributed Computing", to which James Gosling, who is listed as an author of the URL class, contributed the eighth fallacy. Especially the fallacies: 'The network is reliable', 'Latency is zero' and 'The network is homogeneous'.
What happens if you create a hash code from one URL object, and after a while the network is gone or the DNS server state changes and you create another URL with the same string as the first URL? Will the two URLs be equal? Should they be?

* One expects computing the hash code or equals to be relatively fast and not IO bound. See Joshua Bloch in Effective Java: "Don't write an equals that relies on unreliable resources"

* Most programmers do not think too much about the consequences of using hash containers (Map/Set) and don't go and check the hashCode methods when doing so. It is not at all apparent that placing a large set of URL objects into a container is going to create a lot of network traffic.

Looking at the source code, it seems that URI does not suffer from these problems. It might be a good idea to use it instead of URL.
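As a rough sketch of that idea, the containers can hold URI objects and a URL can be created only at the point where a connection is actually needed. The class UriDedup, the method dedupe and the example.com addresses are hypothetical names used for illustration.

```java
import java.net.URI;
import java.net.URL;
import java.util.LinkedHashSet;
import java.util.Set;

public class UriDedup {
    // Hypothetical helper: parses each string into a URI and drops duplicates.
    // URI.equals and URI.hashCode work on the parsed components only;
    // the host is compared as text, ignoring case, without resolving it.
    static Set<URI> dedupe(String... specs) {
        Set<URI> unique = new LinkedHashSet<>();
        for (String spec : specs) {
            unique.add(URI.create(spec));
        }
        return unique;
    }

    public static void main(String[] args) throws Exception {
        Set<URI> unique = dedupe(
                "http://example.com/a",
                "http://EXAMPLE.com/a",
                "http://example.com/b");
        System.out.println(unique.size() + " unique URIs");
        for (URI uri : unique) {
            // Convert to URL only when it is time to open a connection.
            URL url = uri.toURL();
            System.out.println(url);
        }
    }
}
```

One trade-off to keep in mind: two different host names that resolve to the same address are distinct URIs, so any merging of such aliases would have to be done explicitly.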

Tuesday, April 15, 2008

'Eclipse IDE at a crossroads' and a word of IDEA

Or is it...

I was quoted not long ago at EclipseCon by Paul Krill (InfoWorld) for an article on the somewhat controversial topic 'Eclipse IDE at a crossroads'.
I'm not sure the quote was very accurate or that all the speakers on the panel talked about the same thing, but at least I learned something about journalism in the process :-)
Reminds me of the German saying "People who like sausage or politics shouldn't watch either being made"

A week or two before the conference I tried out IntelliJ IDEA. Why? Because I don't need to pay for the license (site license) and a few very smart people I know are using it and claiming it's the best. So I installed it, and took at least a week to get comfortable with it.
It appears that either I got so used to Eclipse that I missed all of IDEA's great things, or Eclipse is really better. Either way I happily returned to Eclipse, not looking back.
I would still be happy to be convinced otherwise, but as of now I do think that for my style of work Eclipse is better.

A few things I noticed which made it harder for me to switch:

  • No continuous compilation in IDEA. It's actually a problem when you have a huge project like I have and want to run a small JUnit test from within the IDE debugger (and don't want to wait five minutes to compile the whole project).
  • Many ways to do auto complete, which means none of them is perfect. Eclipse has a very nice auto complete that pops the stuff you probably care about right now to the top of the list.
  • No spell checker. It's actually very important for me as my spelling is horrible and I do wish my javadocs to look decent.

Creative Commons License This work by Eishay Smith is licensed under a Creative Commons Attribution 3.0 Unported License.