Wednesday, April 16, 2008

Java URL's little secret

Lately I've been playing a lot with fetching large amounts of data from a large number of URLs. Sounds like a fun project, and indeed it was. The first phase was to load, compare, classify and sort the URLs from the DB. After everything was written and tested I wanted to start rocking on a nice strong server with a few million URLs and hundreds of threads, but the application crawled!
After blaming the slow network (which wasn't slow at all), deadlocks (there were none) and the weather, I did a kill -3 to see what the threads were actually doing. Most of them were stuck on:

at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$1.lookupAllHostAddr
...
at java.net.URLStreamHandler.getHostAddress
- locked <0x0ad4e050> (a sun.net.www.protocol.http.Handler)
at java.net.URLStreamHandler.hashCode
at java.net.URL.hashCode
It freaked me out a bit.

I created URL objects out of strings, stored them in hash maps and sets, and did all kinds of comparisons, additions and removals from the containers. It turns out that a DNS lookup happens the first time you either compare the object or get its hash code (i.e. when adding it to a hash map/set). The lookup does not happen when the object is created; it is lazy and happens on demand, which makes the problem even harder to locate.
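One way to sidestep the hidden lookup entirely, if all you need is string-level identity, is to key your containers on the URL string instead of the URL object. A minimal sketch (the example.com URLs are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

public class UrlStringKeys {
    public static void main(String[] args) {
        // Keying the set on the raw string: String.hashCode is pure
        // computation, so no DNS lookup can ever be triggered here.
        Set<String> seen = new HashSet<>();
        seen.add("http://example.com/a");
        seen.add("http://example.com/a"); // duplicate, ignored
        seen.add("http://example.com/b");

        System.out.println(seen.size()); // prints 2

        // In contrast, new URL("...").hashCode() may resolve the host
        // the first time it is called, lazily, as described above.
    }
}
```

Note that string keys deliberately give up URL's "same host resolves to same IP" equality; for deduplicating a crawl list, that is usually what you want anyway.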

The hint in the javadocs is rather cryptic:
The hash code is based upon all the URL components relevant for URL comparison. As such, this operation is a blocking operation.

Either way, I wouldn't have guessed that it was going to do I/O. And what on earth does it mean that it "is a blocking operation"?
I wouldn't think anyone expects hashCode to block on the network, right??

And even if the javadoc did spell out what's going on behind the scenes, one can think of a few flaws in this behavior:
* It exposes the implementation to the "8 Fallacies of Distributed Computing". James Gosling, whose name is on the URL class, contributed the eighth fallacy himself. Especially relevant here are the fallacies 'The network is reliable', 'Latency is zero' and 'The network is homogeneous'.
What happens if you compute a hash code from one URL object, then a while later the network goes down or the DNS server's state changes, and you create another URL from the same string as the first? Will the two URLs be equal? Should they be?

* One expects the computation of a hash code or equals to be relatively fast and not I/O bound. See Joshua Bloch in Effective Java: "Don't write an equals that relies on unreliable resources".

* Most programmers do not think much about the consequences of using hash containers (Map/Set) and don't go and check the hashCode methods when doing so. It is not at all apparent that placing a large set of URL objects into such a container is going to generate a lot of network traffic.

Looking at the source code, URI does not suffer from these problems: its equals and hashCode work on the string components alone. It might be a good idea to use it instead of URL.
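A quick sketch of what that looks like in practice (hostnames are illustrative): java.net.URI compares and hashes purely on its parsed string components, so it is safe as a hash key, and you can convert to a URL later only at the point where you actually open a connection.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

public class UriKeys {
    public static void main(String[] args) {
        // URI.equals/hashCode operate on the parsed components only;
        // no network access is involved.
        URI a = URI.create("http://example.com/index.html");
        URI b = URI.create("http://example.com/index.html");

        System.out.println(a.equals(b));                  // true
        System.out.println(a.hashCode() == b.hashCode()); // true

        Set<URI> set = new HashSet<>();
        set.add(a);
        set.add(b); // duplicate, ignored
        System.out.println(set.size()); // prints 1

        // When the URL form is finally needed, e.g. to open a
        // connection, a.toURL() converts back at that point.
    }
}
```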

4 comments:

John Yeary June 27, 2008 at 7:54 AM  

I love it. This is a perfect example of where even experts make mistakes. Since Java is open source (OpenJDK), you should submit a fix if you have one. Thanks for posting your insights.

Eishay Smith June 27, 2008 at 10:21 AM  
This comment has been removed by the author.
Eishay Smith September 29, 2008 at 11:58 AM  

Thanks, but I don't think there is an easy solution. The 'little secret' is not really a bug, but an unexpected and, arguably, contract-breaking behavior. It's like the Java Properties class, which inherits from Hashtable: it's obviously wrong, but you can't roll it back since there might be code out there relying on the flawed behavior.

metatech November 6, 2013 at 7:21 AM  

I had the same problem on a Windows machine, and I ticked the option "Disable Netbios over TCP/IP". Afterwards, the delay in DNS lookup disappeared...

Creative Commons License This work by Eishay Smith is licensed under a Creative Commons Attribution 3.0 Unported License.