Lately I've been playing a lot with fetching large amounts of data from a large number of URLs. Sounds like a fun project, and indeed it was. The first phase was to load, compare, classify and sort the URLs from the DB. After everything was written and tested I wanted to start rocking on a nice strong server with a few million URLs and hundreds of threads, but the application crawled!
After blaming the slow network (which wasn't slow at all), deadlocks (there were none) and the weather, I did a kill -3 to see what the threads were actually doing. Most of them were stuck on:
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
- locked <0x0ad4e050> (a sun.net.www.protocol.http.Handler)

It freaked me out a bit.
I created URL objects out of strings, stored them in hash maps and sets, and did all kinds of comparisons, additions and removals on those containers. It turns out that a DNS lookup happens the first time you compare a URL object or get its hash code (i.e. when adding it to a hash map/set). The lookup does not happen when the object is initially created; it is lazy and happens on demand, which makes the problem even harder to locate.
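A minimal sketch of the trap (example.com stands in for any real host here): constructing the URL is cheap, but the first hash-based container operation can block on name resolution.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

public class UrlHashCost {
    public static void main(String[] args) throws MalformedURLException {
        // Constructing the URL is cheap: no name resolution happens here.
        URL url = new URL("http://example.com/index.html");

        // Adding to a HashSet calls hashCode(), which delegates to the
        // protocol handler and may block on a DNS lookup of the host.
        Set<URL> seen = new HashSet<>();
        seen.add(url);

        // equals() can also resolve the host, so even contains() may do IO.
        boolean dup = seen.contains(new URL("http://example.com/index.html"));
        System.out.println("duplicate? " + dup);
    }
}
```

Multiply that hidden lookup by a few million URLs and hundreds of threads, and you get exactly the kind of crawl described above.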
Looking at the javadoc, it's rather cryptic:
The hash code is based upon all the URL components relevant for URL comparison. As such, this operation is a blocking operation.
In any case, I wouldn't have guessed that it was going to do IO. And what the heck does "a blocking operation" even mean here?
I don't think anyone would expect hashCode to block on network IO, right?
And even if the javadoc did mention what goes on behind the scenes, one can think of a few flaws in this behavior:
* It exposes the implementation to the "Eight Fallacies of Distributed Computing" (James Gosling, who is signed on the URL class, contributed the eighth fallacy himself). Especially the fallacies: 'The network is reliable', 'Latency is zero' and 'The network is homogeneous'.
What happens if you compute the hash code of one URL object, then after a while the network goes down or the DNS server's state changes, and you create another URL from the same string as the first? Will the two URLs be equal? Should they be?
* One expects the computation of a hash code or of equals to be relatively fast and not IO-bound. See Joshua Bloch in Effective Java: "Don't write an equals method that relies on unreliable resources."
* Most programmers do not think too much about the consequences of using hash containers (Map/Set) and don't go and check the hashCode methods when doing so. It is not at all apparent that placing a large set of URL objects into such a container is going to generate a lot of network traffic.
Looking at the source code, URI does not suffer from these problems. It might be a good idea to use it instead of URL.
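As a sketch of that alternative (again with example.com as a stand-in host): URI's hashCode and equals work purely on the string components, so hash containers stay free of network IO, and you only convert to a URL at the moment you actually fetch.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

public class UriDemo {
    public static void main(String[] args) {
        // URI.hashCode() and equals() compare the parsed string components
        // only; no DNS lookup is ever performed.
        URI a = URI.create("http://example.com/index.html");
        URI b = URI.create("http://example.com/index.html");

        Set<URI> set = new HashSet<>();
        set.add(a);
        System.out.println(set.contains(b)); // true, decided without any network IO

        // Convert to a URL only when it's time to actually fetch:
        // InputStream in = a.toURL().openStream();
    }
}
```

The trade-off is that URI equality is purely textual: http://example.com/ and http://EXAMPLE.COM/ hash differently as URIs, while URL might have considered them equal after resolving both, which for sorting and de-duplicating a large list is usually the behavior you want anyway.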