Thursday, March 26, 2009

Benchmarking is tricky

Once again I found a flaw in the benchmarking. Thanks to Ismael Juma's challenge, I fixed the benchmark to be fairer. It seemed like protobuf was taking the lead, but then, once again, I decided to provide a fresh object to serialize for each serializer each time, and JSON came out on top again. Please, everyone, review the fairness of the code.
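The fresh-object change can be sketched roughly like this. All class and method names below are illustrative, not the actual benchmark's API:

```java
// Hedged sketch of the "fresh object per serializer, per iteration" fix:
// every round builds a brand-new object, so no serializer can benefit from
// having already seen the instance. Names here are hypothetical.
import java.nio.charset.StandardCharsets;

public class FreshObjectBenchmark {

    interface Serializer {
        byte[] serialize(String message);
    }

    // Build a brand-new object for every call.
    static String buildFreshMessage(int i) {
        return new String("message-" + i);
    }

    static long run(Serializer s, int iterations) {
        long totalBytes = 0;                    // consume the results so the
        for (int i = 0; i < iterations; i++) {  // JIT can't drop the work
            totalBytes += s.serialize(buildFreshMessage(i)).length;
        }
        return totalBytes;
    }

    public static void main(String[] args) {
        Serializer utf8 = m -> m.getBytes(StandardCharsets.UTF_8);
        long start = System.nanoTime();
        long bytes = run(utf8, 100_000);
        System.out.println(bytes + " bytes in " + (System.nanoTime() - start) + " ns");
    }
}
```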
Thanks also to Chris Pettitt, who pointed me to the -XX:CompileThreshold flag, which should help the JIT get down to business sooner rather than later; it might have helped change the results. The full results are on the java serialization benchmarking project wiki page.

10 comments:

Ismael Juma March 27, 2009 at 12:26 AM  

I haven't had a chance to look at the latest changes, but please do not use -XX:CompileThreshold. It causes the HotSpot JIT to generate worse code as it has less profiling information.

Ophir Radnitz March 27, 2009 at 1:04 AM  

A proper warm-up could remove the need for -XX:CompileThreshold. Something like Japex could help.
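A hand-rolled warm-up (not Japex itself) might look like the sketch below: run the workload untimed first so HotSpot can profile and compile the hot path, then measure only after the JIT has settled.

```java
// Minimal warm-up sketch (assumed structure, not Japex). The workload is a
// stand-in; a real benchmark would call the serializer here.
public class WarmedUpBenchmark {

    static long work(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += (long) i * i;
        return sum;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 20_000; i++) {
            work(1_000);                    // untimed warm-up iterations
        }
        long start = System.nanoTime();     // timed run, only after warm-up
        long result = work(1_000_000);
        System.out.println("result=" + result
                + ", elapsedNs=" + (System.nanoTime() - start));
    }
}
```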

Ismael Juma March 27, 2009 at 2:17 AM  

I looked at the code and the reason for the change when supplying the fresh object is most likely that HotSpot was optimising away the code altogether.

The benchmark should probably be modified to use the result of the serialize call in some way (a simple way is to add the length of the returned array from each call and return that from the method).
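The suggestion above can be sketched as follows; the serializer here is a placeholder, not one of the benchmarked libraries:

```java
// Sketch of the suggested fix: add up the length of each returned byte
// array and return the sum, so the serialize calls have an observable
// result and HotSpot cannot optimise them away as dead code.
import java.nio.charset.StandardCharsets;

public class UseTheResult {

    static byte[] serialize(String message) {
        return message.getBytes(StandardCharsets.UTF_8);
    }

    // Returning totalLength makes the loop's work live code.
    static long benchmark(int iterations) {
        long totalLength = 0;
        for (int i = 0; i < iterations; i++) {
            totalLength += serialize("msg-" + i).length;
        }
        return totalLength;
    }

    public static void main(String[] args) {
        System.out.println("total bytes: " + benchmark(10_000));
    }
}
```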

By the way, I still find it misleading to say that JSON came out on top when its message size is close to 50% bigger. In many situations, size is the more important factor.

Furthermore, it's also worth mentioning that the current benchmark doesn't exercise certain things that would favour one or another library. For example, Protocol Buffers has a very efficient mechanism for encoding ints, but it's a bit wasteful when encoding repeated parameters that are not primitives (it used to be wasteful when encoding any type of repeated parameter, but it now supports packed repeated primitives in SVN).
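The int-encoding mechanism mentioned above is Protocol Buffers' base-128 "varint". A rough sketch of the idea (not protobuf's actual implementation) is:

```java
// Base-128 varint sketch: seven payload bits per byte, with the high bit
// set on every byte except the last, so small values take a single byte.
public class Varint {

    static byte[] encode(long value) {
        byte[] buf = new byte[10];                        // max 10 bytes for a long
        int pos = 0;
        while ((value & ~0x7FL) != 0) {
            buf[pos++] = (byte) ((value & 0x7F) | 0x80);  // more bytes follow
            value >>>= 7;
        }
        buf[pos++] = (byte) value;                        // final byte, high bit clear
        byte[] out = new byte[pos];
        System.arraycopy(buf, 0, out, 0, pos);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(encode(1).length);      // small value: 1 byte
        System.out.println(encode(300).length);    // 300 needs 2 bytes
    }
}
```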

Ismael

Eishay Smith March 27, 2009 at 9:27 AM  

I totally agree with you about declaring one serialization method to be "on top" of another. Every serializer has its pros and cons. Protobuf's great backward and forward compatibility, its generated code, and its service stubs are of great value. SBinary is a pleasant surprise, but it will not give you compatibility. You don't always need it all, and people almost always look at the wrong attributes while ignoring their real bottlenecks.

About the benchmarking code, if you wish I can give you SVN read access and you can try to fix the benchmarks.
I myself am using protobuf in my latest project and would like it to "win"; maybe later in the game we should use another dataset that plays to its strengths. The point is not to sway people toward one or the other; it is mostly to show that in benchmarking, as in statistics, it all "depends".

Ismael Juma March 27, 2009 at 9:37 AM  

Yes, agreed.

SVN access would be nice. I'd like to create a couple of benchmarks with different characteristics. As you say, it would help people understand (and think about their requirements) instead of just proclaiming one to be faster/better than another.

I probably won't have time for this for another couple of weeks though, so no rush.

Eishay Smith March 27, 2009 at 9:42 AM  

For svn access, send me an email with your gmail account to eishay [at] gmail.com

cowtowncoder March 31, 2009 at 1:25 PM  

Wrt using fresh object: I doubt dead code elimination was being used.
It seldom is, because of all potential side effects. Creating new objects shouldn't hurt so maybe it's a good change, and good from "just to make sure" angle.

Regarding benefits, yes, there's always something or other that favours one case over others. For example, FI almost certainly suffers from the tiny size of the messages, as well as from "unique" element names (there is no repetition it can squeeze out -- PB can do it, since it requires and uses an external schema to define the logical name binding).

Also: binary formats would fare better with bigger messages, most likely (including PB, Thrift, FI).
Relevance of passing numeric data is also heavily use-case dependent -- the best solution is to offer multiple alternative scenarios.
There are so many variables that it just has to be a compromise: but there the key is that what is tested is a valid use case for someone (like author).

Regarding size difference, json vs PB -- I don't think it's true that size matters in most cases. It does, sometimes, but mostly for larger messages, or in network-bandwidth constrained environments.

For me, the results indicate that the top contenders differ very little speed-wise, so performance is not a big factor in choosing. Especially considering how many message reads/writes can be done per second per core -- that's huge compared to the actual business logic. Databases can't push records as fast as they can be serialized and passed around.

I think this benchmark is good in that it gives a useful data point, and gives some boundaries -- a rough order-of-magnitude idea between choices.
Exact numbers vary, for sure. It should also help temper unrealistic "my format is 10-100x faster than that standard format" expectations (sometimes fuelled by uncritical claims): it is very hard to get that kind of improvement. Even 2x faster is a respectable achievement, given how well most implementations are optimized these days.

Ismael Juma March 31, 2009 at 1:38 PM  

"Wrt using fresh object: I doubt dead code elimination was being used.
It seldom is, because of all potential side effects. Creating new objects shouldn't hurt so maybe it's a good change, and good from "just to make sure" angle."

The point is that when Eishay provided a fresh object each time, there _was_ a big change. So, it was more than "just to make sure". What I suggested (to use the result of the method) was indeed "just to make sure".

"Relevance of passing numeric data is also heavily use case dependant -- best solution is to offer multiple alternative scenarios."

No doubt.

"Regarding size difference, json vs PB -- I don't think it's true that size matters in most cases."

No-one said "most", just "many".

"It does, sometimes, but mostly for larger messages, or in network-bandwidth constrained environments."

From my experience, the latter is not uncommon when talking about environments that care about tests like these.

"I think this benchmark is good in that it gives a useful data point"

Yes, as long as it's phrased that way.

Eishay Smith March 31, 2009 at 2:11 PM  

>>I think this benchmark is good in that it gives a useful data point
>Yes, as long as it's phrased that way.
I tried to phrase it that way. I deliberately never announced a "winner" or sorted the result list in any order. I know that numbers are something people latch onto, but like any statistics they can be very misleading. I say statistics, but this benchmarking is not even that: we picked a random data structure with a random size and random data types. I hope that in the future we'll expand the benchmark to more data types/sizes etc and maybe have a table somewhat explaining features of each library. If someone wishes to start it, please go ahead.

Ismael Juma March 31, 2009 at 2:15 PM  

"I tried to phrase it that way."

Sorry, Eishay, I was just making a general point. It wasn't directed at you. :)

You've been very responsive and have consistently improved the benchmark code to be as fair as possible.

"I hope that in the future we'll expand the benchmark to more data types/sizes etc and maybe have a table somewhat explaining features of each library."

Indeed, as I said earlier, I hope to contribute too.

Creative Commons License This work by Eishay Smith is licensed under a Creative Commons Attribution 3.0 Unported License.