Thursday, March 26, 2009

Benchmarking is tricky

Once again I found a flaw in the benchmarking. Thanks to Ismael Juma's challenge, I fixed the benchmark to be more fair. It seemed like protobuf was taking the lead, but then I decided to provide a fresh object to serialize for each serializer each time, and json came out on top again. Please, everyone, review the fairness of the code.
Thanks also to Chris Pettitt, who pointed me to the -XX:CompileThreshold flag, which helps the JIT get down to business sooner rather than later; it may have helped change the results. The full results are on the wiki page of the Java serialization benchmarking project.
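
To make the two fairness points concrete, here is a minimal sketch of a timing loop that hands every timed call a freshly created object and runs a warm-up phase first, so the JIT has a chance to compile the hot path. It uses plain java.io serialization as a stand-in, and the names FairBench, Media and averageNanos are made up for illustration; this is not the project's actual code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Supplier;

// Hypothetical harness for illustration only, not the benchmark project's code.
class FairBench {

    // Small stand-in payload; the real benchmark uses a richer object.
    static class Media implements Serializable {
        private static final long serialVersionUID = 1L;
        final String uri;
        final int duration;

        Media(String uri, int duration) {
            this.uri = uri;
            this.duration = duration;
        }
    }

    // Accumulates result sizes so the JIT cannot discard the work as dead code.
    static long sink;

    // Serializes one object with plain Java serialization and returns the bytes.
    static byte[] serialize(Object o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(o);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the average serialization time in nanoseconds per call.
    static double averageNanos(Supplier<Object> factory, int warmup, int iterations) {
        // Warm-up: same code path, results ignored for timing purposes.
        for (int i = 0; i < warmup; i++) {
            sink += serialize(factory.get()).length;
        }
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            Object fresh = factory.get();          // fresh object, created outside the timed region
            long start = System.nanoTime();
            byte[] result = serialize(fresh);
            total += System.nanoTime() - start;
            sink += result.length;
        }
        return (double) total / iterations;
    }

    public static void main(String[] args) {
        double avg = averageNanos(() -> new Media("http://example.com/a.mp3", 1234), 10_000, 10_000);
        System.out.printf("java serialization: %.1f ns/op%n", avg);
    }
}
```

It can be run with something like `java -XX:CompileThreshold=1000 FairBench` to let the JIT kick in earlier; the value 1000 is arbitrary and only there to show where the flag goes.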

Monday, March 23, 2009

Protocol Buffers pitfalls

There are a few rough edges to protobuf and Java.
One I noticed today is that if you have an enum field and do not define a default value, the protobuf auto-generated code sets a default for you (the first enum value). I expected it to be null, and this mistake cost me a nasty bug.
Conclusion: you must set a default for protobuf enums as in:

  enum Player {
    UNKNOWN = 0;
    MP3 = 1;
    VIDEO = 2;
  }
  optional Player player = 10 [default = UNKNOWN];
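
On the Java side, the benefit of an explicit UNKNOWN default can be sketched in plain Java. The snippet below does not use protobuf-generated classes: the nested Player enum only mirrors the definition above, and describe is a hypothetical helper. The point is just that the calling code can tell "not set" apart from a real value, which it cannot do when an unset field silently reads as the first real value.

```java
// Plain-Java illustration; PlayerDefaults and describe are hypothetical names.
class PlayerDefaults {

    // Stand-in for the generated enum, mirroring the .proto definition.
    enum Player { UNKNOWN, MP3, VIDEO }

    // Treats UNKNOWN as "the sender did not set the field".
    static String describe(Player player) {
        switch (player) {
            case MP3:
                return "audio player";
            case VIDEO:
                return "video player";
            default:
                return "player not set";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(Player.UNKNOWN));
        System.out.println(describe(Player.MP3));
    }
}
```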
The other one I saw is with a repeated string field, like:
repeated string person = 9;

If you set no value in the list, then on serialization it blows up with an NPE when it tries to check the encoding of the string.
Conclusion: Avoid. If you must then embed it in another object which you can then use as a list item.

Sunday, March 22, 2009

Moved new benchmarking discussion to project wiki

There are many more interesting benchmarking results. Please check them out in the project wiki. Special thanks to David Bernard for the latest updates.
It's probably best that further updates and discussions be held on the project wiki and Google group. I had my first opportunity to use the Google Charts API and found it to be rather slick. Check out the wiki to see the full results.

Friday, March 20, 2009

Listen to JavaPosse Podcast, now also at LinkedIn

Just pushed the java posse podcast to the Java Posse LinkedIn group.

You can listen to the podcast from the page. Thanks to Armin and Scott for embedding the mp3 player!

Tuesday, March 17, 2009

More on benchmarking java serialization tools

The serialization benchmarking discussed in previous posts is getting more interesting. Thanks to all who looked at the code, contributed, made suggestions and pointed out bugs. Three major contributions are from cowtowncoder, who fixed the stax code, Chris Pettitt, who added the json code, and David Bernard, who added xstream and Java externalizable. Most of the code is in the Google Code svn repository.
The charts are scaled and some are chopped. So if you’re interested in exact numbers, here they are:

Library, Object create (ns), Serialization (ns), Deserialization (ns), Serialized size (bytes)
java , 113.23390, 17305.80500, 72637.29300, 845
xstream default , 116.40035, 119932.61000, 171796.68850, 931
json , 112.58555, 3324.76450, 5318.12600, 310
stax , 113.05025, 6172.06000, 9566.96200, 406
java (externalizable) , 99.76580, 6250.40100, 18970.58100, 315
thrift , 174.72665, 4635.35750, 5133.24450, 314
scala , 66.10890, 27047.10850, 155413.44000, 1473
protobuf , 250.37140, 3849.69050, 2416.94800, 217
xstream with conv , 115.22810, 13492.50250, 47056.58750, 321
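
For a quick comparison, the serialization and deserialization columns can be summed into a round-trip time per library and sorted. The sketch below does exactly that with the numbers from the table; the class name RoundTrip and its methods are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper that ranks the libraries by serialize + deserialize time.
class RoundTrip {
    final String library;
    final double serializeNs;
    final double deserializeNs;

    RoundTrip(String library, double serializeNs, double deserializeNs) {
        this.library = library;
        this.serializeNs = serializeNs;
        this.deserializeNs = deserializeNs;
    }

    double totalNs() {
        return serializeNs + deserializeNs;
    }

    // Rows copied from the table above, sorted by ascending round-trip time.
    static List<RoundTrip> ranked() {
        List<RoundTrip> rows = new ArrayList<>();
        rows.add(new RoundTrip("java", 17305.80500, 72637.29300));
        rows.add(new RoundTrip("xstream default", 119932.61000, 171796.68850));
        rows.add(new RoundTrip("json", 3324.76450, 5318.12600));
        rows.add(new RoundTrip("stax", 6172.06000, 9566.96200));
        rows.add(new RoundTrip("java (externalizable)", 6250.40100, 18970.58100));
        rows.add(new RoundTrip("thrift", 4635.35750, 5133.24450));
        rows.add(new RoundTrip("scala", 27047.10850, 155413.44000));
        rows.add(new RoundTrip("protobuf", 3849.69050, 2416.94800));
        rows.add(new RoundTrip("xstream with conv", 13492.50250, 47056.58750));
        rows.sort(Comparator.comparingDouble(RoundTrip::totalNs));
        return rows;
    }

    public static void main(String[] args) {
        for (RoundTrip r : ranked()) {
            System.out.printf("%-22s %12.1f ns%n", r.library, r.totalNs());
        }
    }
}
```

With the numbers in this table, protobuf has the shortest round trip, followed by json and thrift, and default xstream has the longest.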

Serialize size (bytes), less is better.
May vary a lot depending on the number of repetitions in lists, the use of number compacting in protobuf, strings vs numerics and more. An interesting point is that Scala and Java hold the names of the classes in the serialized form, i.e. longer class names = larger serialized form. In Scala it's worse, since the Scala compiler creates more implicit classes than Java.
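
The class-name effect is easy to see with plain Java serialization. The sketch below serializes two classes that differ only in the length of their names and compares the sizes; the class names are invented for the demonstration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Illustration only: two payloads with identical fields but different name lengths.
class ClassNameSize {

    static class P implements Serializable {
        private static final long serialVersionUID = 1L;
        int value = 42;
    }

    static class PlayerWithAVeryLongDescriptiveName implements Serializable {
        private static final long serialVersionUID = 1L;
        int value = 42;
    }

    // Returns the number of bytes produced by default Java serialization.
    static int serializedSize(Serializable o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(o);
            }
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int small = serializedSize(new P());
        int large = serializedSize(new PlayerWithAVeryLongDescriptiveName());
        System.out.println("short name: " + small + " bytes");
        System.out.println("long name:  " + large + " bytes");
        System.out.println("difference: " + (large - small) + " bytes");
    }
}
```

Since the fields are identical, the size difference comes from the class name written into the stream's class descriptor.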

Deserialization in nanoseconds. The most expensive operation. Note that the xstream and Scala lines got trimmed.

Serialization (nanoseconds), way faster than deserialization.

Object creation, not so meaningful since on average it takes about 100 ns to create an object. The surprise comes from protobuf, which takes a very long time to create an object. It's the only point in this set of benchmarks where it didn't perform as well as thrift. Scala (and, to a lesser extent, Java), on the other hand, is fast. It seems like a good language for handling in-memory data structures, but when it comes to serialization you might want to check the alternatives.

Wednesday, March 04, 2009

Thrift vs Protocol Buffers in Python

I've read Justin's post about thrift and protocol buffers and verified the results. I also found it hard to understand why protobuf is considerably slower than thrift.
In the example Justin did not add the line

option optimize_for = SPEED;
but it appears that it does not have any effect on performance. A bit strange since it definitely appears in the protobuf python docs.
Anyway, as stated in the Java protobuf/thrift post, it seems that at least in Java protobuf performs better than thrift, and there the "optimize_for" option gives a great performance improvement.

The test without speed optimization:
5000 total records (0.577s)

get_thrift (0.031s)
get_pb (0.364s)

ser_thrift (0.277s) 555313 bytes
ser_pb (1.764s) 415308 bytes
ser_json (0.023s) 718640 bytes
ser_cjson (0.028s) 718640 bytes
ser_yaml (6.903s) 623640 bytes

ser_thrift_compressed (0.329s) 287575 bytes
ser_pb_compressed (1.758s) 284423 bytes
ser_json_compressed (0.067s) 292871 bytes
ser_cjson_compressed (0.075s) 292871 bytes
ser_yaml_compressed (6.949s) 291236 bytes

serde_thrift (0.725s)
serde_pb (3.156s)
serde_json (0.055s)
serde_cjson (0.045s)
serde_yaml (20.339s)
And with speed optimization:
5000 total records (0.577s)

get_thrift (0.031s)
get_pb (0.364s)

ser_thrift (0.275s) 555133 bytes
ser_pb (1.752s) 415166 bytes
ser_json (0.023s) 718462 bytes
ser_cjson (0.028s) 718462 bytes
ser_yaml (6.925s) 623462 bytes

ser_thrift_compressed (0.330s) 287673 bytes
ser_pb_compressed (1.767s) 284419 bytes
ser_json_compressed (0.067s) 293012 bytes
ser_cjson_compressed (0.078s) 293012 bytes
ser_yaml_compressed (7.038s) 290980 bytes

serde_thrift (0.723s)
serde_pb (3.125s)
serde_json (0.056s)
serde_cjson (0.046s)
serde_yaml (20.318s)
As noted before, there is no noticeable difference. It would be interesting to run the same test in Java.
Anyway, the conclusion is that the language, and probably the data structure, count when deciding which serialization method to pick, and results in one language do not necessarily carry over to another.

Tuesday, March 03, 2009

Protobuf String serialization

One of the reasons I've heard for choosing XML over protobuf is human readability. Actually, protobuf has a human-readable string representation, perhaps more readable than XML. The cost, of course, is time and space, but when testing at small scale, where the storage is a relational DB, I use the text representation for debugging purposes.
The memory cost is heavy, about 500% more than the binary representation and 30% more than Java serialization (which tells us something about Java serialization).

Note that the numbers may vary according to the object size and the data types it uses (numbers/chars). The one I tested with is mostly double, uint32 and sfixed64, which are represented more compactly in binary than in text, so the 500% is understandable.
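
To get a feel for why numeric fields inflate in text form, here is a rough plain-Java sketch that writes a double, a long and an int once as fixed-width binary and once as "name: value" lines. It does not use protobuf and does not reproduce its wire or text format; the field names and values are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Illustration only: fixed-width binary vs a "name: value" text rendering.
class TextVsBinary {

    // Binary size: 8 bytes for the double, 8 for the long, 4 for the int.
    static int binarySize(double price, long timestamp, int duration) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeDouble(price);
                out.writeLong(timestamp);
                out.writeInt(duration);
            }
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Text size: every value is spelled out in digits, plus its field name.
    static int textSize(double price, long timestamp, int duration) {
        String text = "price: " + price + "\n"
                + "timestamp: " + timestamp + "\n"
                + "duration: " + duration + "\n";
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        double price = 1234.56789;
        long timestamp = 1237939200000L;
        int duration = 320;
        System.out.println("binary: " + binarySize(price, timestamp, duration) + " bytes");
        System.out.println("text:   " + textSize(price, timestamp, duration) + " bytes");
    }
}
```

The exact ratio depends on the values and on the field names, so it should not be read as a reproduction of the 500% figure, only as an illustration of the direction of the effect.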

Creative Commons License This work by Eishay Smith is licensed under a Creative Commons Attribution 3.0 Unported License.