Benchmarking is tricky
Once again I found a flaw in the benchmark. Thanks to Ismael Juma's challenge, I fixed the benchmark to be more fair. It seemed like protobuf was taking the lead, but then I decided to provide a fresh object to serialize for each serializer each time, and json came out on top again. Everyone, please review the code for fairness.
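To make the "fresh object each time" idea concrete, here is a minimal sketch in Java. It is not the project's actual benchmark code: the Message class, the timeSerialization method and the toy JSON serializer are all hypothetical stand-ins, and keeping object creation outside the timed region is an assumption of this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;
import java.util.function.Supplier;

final class FreshObjectBenchmark {

    // Hypothetical payload; the real benchmark uses its own data structure.
    static final class Message {
        final int id;
        final String text;

        Message(int id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    /**
     * Returns the total nanoseconds spent inside the serializer.
     * A new input object is built before every timed call, so no call
     * reuses an instance that an earlier call has already seen.
     */
    static long timeSerialization(Supplier<Message> factory,
                                  Function<Message, byte[]> serializer,
                                  int iterations) {
        long totalNanos = 0;
        for (int i = 0; i < iterations; i++) {
            Message fresh = factory.get(); // creation is outside the timed region
            long start = System.nanoTime();
            byte[] bytes = serializer.apply(fresh);
            totalNanos += System.nanoTime() - start;
            if (bytes.length == 0) {
                throw new IllegalStateException("serializer produced no output");
            }
        }
        return totalNanos;
    }

    public static void main(String[] args) {
        Supplier<Message> factory = () -> new Message(42, "hello");
        // Stand-in serializer, not one of the libraries under test.
        Function<Message, byte[]> toyJson = m ->
                ("{\"id\":" + m.id + ",\"text\":\"" + m.text + "\"}")
                        .getBytes(StandardCharsets.UTF_8);
        System.out.println("total ns: " + timeSerialization(factory, toyJson, 10_000));
    }
}
```

Passing a factory rather than an instance means every serializer is measured against the same kind of input without any of them sharing state.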
Thanks also to Chris Pettitt, who pointed me to the -XX:CompileThreshold flag, which helps the JIT get to work sooner rather than later; it might have helped change the results. The full results are in the java benchmarking serialization project wiki page.
[Chart: horizontal bar chart of the results for xstream (xpp with conv), xstream (stax), stax/woodstox, scala, binaryxml/FI, xstream (stax with conv), sbinary, protobuf, stax/aalto, json (jackson), java (externalizable)]







10 comments:
I haven't had a chance to look at the latest changes, but please do not use -XX:CompileThreshold. It causes the HotSpot JIT to generate worse code as it has less profiling information.
A proper warm-up could replace the need for -XX:CompileThreshold. Something like Japex could help.
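As a rough illustration of a warm-up phase (a hand-rolled sketch, not how Japex does it): run the task unmeasured for a while so the JIT can compile it with full profiling information, then time a separate batch. The class name, method name and run counts below are arbitrary placeholders.

```java
final class WarmUpSketch {

    /** Runs the task unmeasured first, then returns average nanoseconds per measured run. */
    static double averageNanos(Runnable task, int warmUpRuns, int measuredRuns) {
        for (int i = 0; i < warmUpRuns; i++) {
            task.run(); // results of these runs are deliberately not timed
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / (double) measuredRuns;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        // Stand-in workload; a real run would call a serializer here.
        Runnable task = () -> {
            sb.setLength(0);
            sb.append("id=").append(42);
        };
        System.out.println("avg ns per call: " + averageNanos(task, 20_000, 100_000));
    }
}
```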
I looked at the code and the reason for the change when supplying the fresh object is most likely that HotSpot was optimising away the code altogether.
The benchmark should probably be modified to use the result of the serialize call in some way (a simple way is to add the length of the returned array from each call and return that from the method).
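A minimal sketch of that suggestion, with a hypothetical serializeMany method and a stand-in serializer (String.getBytes) in place of the real libraries:

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

final class ConsumeResult {

    /**
     * Calls the serializer repeatedly and returns the summed length of every
     * returned array, so the output of each call feeds into the return value.
     */
    static long serializeMany(Function<String, byte[]> serializer, String input, int iterations) {
        long totalBytes = 0;
        for (int i = 0; i < iterations; i++) {
            totalBytes += serializer.apply(input).length;
        }
        return totalBytes;
    }

    public static void main(String[] args) {
        Function<String, byte[]> utf8 = s -> s.getBytes(StandardCharsets.UTF_8);
        // The caller prints the total, which keeps the value in use.
        System.out.println("bytes written: " + serializeMany(utf8, "abc", 1_000));
    }
}
```

Because the total depends on every call, the compiler cannot treat the serialize calls as unused work.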
By the way, I still find it misleading to say that json came out on top when its message size is close to 50% bigger. Size is the more important factor in many situations.
Furthermore, it's also worth mentioning that the current benchmark doesn't exercise certain things that would favour one or another library. For example, Protocol Buffers has a very efficient mechanism for encoding ints, but it's a bit wasteful when encoding repeated parameters that are not primitives (it used to be wasteful when encoding any type of repeated parameter, but it now supports packed repeated primitives in SVN).
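For readers unfamiliar with the int encoding mentioned here: Protocol Buffers writes integers as base-128 varints, seven bits per byte with the high bit marking "more bytes follow", so small values take a single byte. The sketch below is a hand-written illustration of that scheme, not the library's own code; the class and method names are made up.

```java
import java.io.ByteArrayOutputStream;

final class VarintSketch {

    /** Encodes the value as a base-128 varint, least significant group first. */
    static byte[] encodeVarint(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // low 7 bits plus continuation bit
            value >>>= 7;                              // unsigned shift to the next group
        }
        out.write((int) value);                        // final group, continuation bit clear
        return out.toByteArray();
    }

    public static void main(String[] args) {
        long[] samples = {1, 127, 128, 300, Integer.MAX_VALUE};
        for (long v : samples) {
            System.out.println(v + " -> " + encodeVarint(v).length + " byte(s)");
        }
    }
}
```

The value 300, for example, encodes to the two bytes 0xAC 0x02, whereas a fixed-width int would always take four.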
Ismael
I totally agree with you about putting one serialization method "on top" of another. Every serializer has its pros and cons. Protobuf's great backward and forward compatibility, generated code, and service stubs are of great value. SBinary is a pleasant surprise, but will not give you compatibility. You don't always need it all, and all too often people look at the wrong attributes and ignore their real bottlenecks.
About the benchmarking code, if you wish I can give you SVN read access and you can try to fix the benchmarks.
I myself am using protobuf in my latest project and would like it to "win"; maybe later in the game we should use another dataset that plays to its strengths. The point is not to sway people to use one or the other; it is mostly to teach that in benchmarking, as in statistics, it all "depends".
Yes, agreed.
SVN access would be nice. I'd like to create a couple of benchmarks with different characteristics. As you say, it would help people understand (and think about their requirements) instead of just proclaiming one to be faster/better than another.
I probably won't have time for this for another couple of weeks though, so no rush.
For svn access, send me an email with your gmail account to eishay [at] gmail.com
Wrt using fresh object: I doubt dead code elimination was being used.
It seldom is, because of all potential side effects. Creating new objects shouldn't hurt so maybe it's a good change, and good from "just to make sure" angle.
Regarding benefits, yes, there's always something or other that benefits one case over others. For example, FI almost certainly suffers from tiny size of messages; as well as from 'unique' element names (no repetition it can squeeze -- PB can do it, since it requires and uses external schema to define logical name binding).
Also: binary formats would fare better with bigger messages, most likely (including PB, Thrift, FI).
Relevance of passing numeric data is also heavily use case dependant -- best solution is to offer multiple alternative scenarios.
There are so many variables that it just has to be a compromise; the key is that what is tested is a valid use case for someone (such as the author).
Regarding size difference, json vs PB -- I don't think it's true that size matters in most cases. It does, sometimes, but mostly for larger messages, or in network-bandwidth constrained environments.
For me results indicate that speed-wise top contenders have very little difference, so that performance is not a big factor in choosing. Especially considering how many message read/writes can be done per second per core -- that's huge, compared to actual business logic. Databases can't push records as fast as they can be serialized and passed around.
I think this benchmark is good in that it gives a useful data point, and gives some boundaries -- a rough order-of-magnitude idea between choices.
Exact numbers vary for sure. It should also help tamp down unrealistic "my format is 10-100x faster than that standard format" expectations (sometimes fuelled by uncritical claims): it is very hard to get that kind of improvement. Even 2x faster is a respectable achievement, given how well most implementations are optimized these days.
"Wrt using fresh object: I doubt dead code elimination was being used.
It seldom is, because of all potential side effects. Creating new objects shouldn't hurt so maybe it's a good change, and good from "just to make sure" angle."
The point is that when Eishay provided a fresh object each time, there _was_ a big change. So, it was more than "just to make sure". What I suggested (to use the result of the method) was indeed "just to make sure".
"Relevance of passing numeric data is also heavily use case dependant -- best solution is to offer multiple alternative scenarios."
No doubt.
"Regarding size difference, json vs PB -- I don't think it's true that size matters in most cases."
No-one said "most", just "many".
"It does, sometimes, but mostly for larger messages, or in network-bandwidth constrained environments."
From my experience, the latter is not uncommon when talking about environments that care about tests like these.
"I think this benchmark is good in that it gives a useful data point"
Yes, as long as it's phrased that way.
>>I think this benchmark is good in that it gives a useful data point
>Yes, as long as it's phrased that way.
I tried to phrase it that way. I deliberately never announced a "winner" or sorted the result list in any order. I know that numbers are what jumps out at people, but like any statistics they can be very misleading. I say statistics, but this benchmark is not even that: we picked a random data structure with random size and data types. I hope that in the future we'll expand the benchmark to more data types, sizes, etc., and maybe have a table explaining the features of each library. If someone wishes to start it, please go ahead.
"I tried to phrase it that way."
Sorry, Eishay, I was just making a general point. It wasn't directed at you. :)
You've been very responsive and have consistently improved the benchmark code to be as fair as possible.
"I hope that in the future we'll expand the benchmark to more data types/sizes etc and maybe have a table somewhat explaining features of each library."
Indeed, as I said earlier, I hope to contribute too.