Monday, December 15, 2008

Scala ends the Groovy fad


Looks like Groovy is going to lose its steam.

Two reasons: tooling and governance.

Tooling is a critical element as the code base and the number of involved engineers increase. Refactoring a single code base that involves a dynamic language and dozens or hundreds of engineers may lead to hairy situations. When you change a public library API in Java, your tool will let you know where the affected code is, update the references for you, and tell you if a contract is getting broken. Tests should locate problems created by refactoring, but:

  • Groovy engineers (and not only them) are sometimes too cool to do 100% code test coverage.
  • Even 100% test coverage cannot guarantee that there are no bugs; it can only confirm that the test code didn't find any.
A typical refactoring problem with Groovy might be changing Java code like:

public int getProductID();

to:

public ProductID getProductID();

class ProductID {
  int id;
  VendorID vendorID;
}

So if you have a method in your Groovy shopping cart that gets a productId and compares it to a local product id, the change will not break the code. That is, no method names were changed, and comparing an int with an object will simply return false. If the method is rarely used and not tested, or tested only in a way that is supposed to return false, then you won't find the bug before your users do, and they always find these things.
Prime large scale products cannot afford this.

Most of the claims I've heard for dynamic typing are about cleaner and smaller code. Yes, you may have strong typing in Groovy if you wish, but then you're back to ugly Java code. With the amazingly comfortable type inference in Scala, you enjoy both worlds.
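For example (a hypothetical snippet, not taken from any real code base):

// Explicit, Java-style typing still works when you want it:
val ids: List[Int] = List(1, 2, 3)

// With inference the code is shorter but just as statically typed:
val names = List("Bill Gates", "Steve Jobs") // inferred as List[String]
val lengths = names.map(_.length)            // inferred as List[Int]
// names.map(_ - 1) would be a compile-time error, not a runtime surprise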

The second reason is governance. Both Groovy and Scala are still in their youth, where the Cathedral model is the only way for them to react fast and efficiently enough to feedback from the market. Unfortunately, Groovy is JSR'd. On its face it looks good; hey, what's wrong with democracy? Well, it appears that it's not the best way to run an engineering project, and in some cases it locks the product into infinite political games and the narrow interests of the few in charge.

Sunday, December 14, 2008

Scala prez by Twitter

Nice presentation, learned a few things from it. One of them is that Scala is supported by JavaRebel, which is in use @ LinkedIn too.

Why Scala?

Thursday, December 11, 2008

No closures in Java 7

Ricky's technical blog: Java Just Died (no closures in Java 7), referring to a tweet.
As Neal Gafter wrote in his blog post Is the Java Language Dying?: "...we should place Java on life support and move our development to new languages such as Scala...". Yeah, I know I abuse the quote, but it's not too bad.
Actually, I think Java is going to be around for a long while, like Cobol; there is just too much good tooling around it.
Fortunately, switching to Scala will not be too hard, since it is statically typed and thus stable in a large code base and tooling friendly. Heaven knows how much grief you can have refactoring a very large code base in dynamically typed languages like Groovy, Ruby or Python.

Had to add this one: "I'm not dead yet!"

Thursday, December 04, 2008

Great Dijkstra quotes

Taken from Dijkstra's interview video:

  • Computer science is no more about computers than astronomy is about telescopes.
  • The competent programmer is fully aware of the limited size of his own skull. He therefore approaches his task in full humility and avoids clever tricks like the plague.
  • We should not introduce errors through sloppiness but systematically keep them out.
  • Program testing can convincingly show the presence of bugs but it is hopelessly inadequate to show their absence.
  • Elegance is not a dispensable luxury but a factor that decides between success and failure (is it about Scala?).
It's a nice video, which also explains well why Alan Kay said that arrogance in CS is measured in nano-Dijkstras.

Wednesday, December 03, 2008

Scala only Spring RPC/Remoting Service: works

I knew it should work, but didn't see any reference that it does. So I tried it out and, of course, it does.
Since the service object will be represented by an AopProxy, which is basically a Java proxy, you must have the service implement a Java interface (proxies can be created only for interfaces). Scala does not have the notion of interfaces like Java, but the language creators invested a lot in compatibility with Java. Compiling a plain Scala trait spits out a POJI (Plain Old Java Interface), which solves the issue. And even though declaring checked exceptions is not native to Scala, they enabled the special throws annotation for that, which helps in case your service must declare some kind of remote exception.

Here is a sample interface that you would create for your Spring service:

trait ScalaService {
  @throws(classOf[MyRemoteException])
  def bar(info: MyInfo)
}
And this is what you'll get by running javap on it:

Compiled from "ScalaService.scala"
public interface ScalaService {
  public abstract void bar(MyInfo) throws MyRemoteException;
}

Monday, December 01, 2008

Is object creation in Scala really faster than in Java?

The argument is that it should not be the case, since Scala compiles to Java class files and runs on the same JVM/JIT; hence the two should actually be identical.

In my other post I claimed that for the benchmark I wrote, Scala object creation is faster. Actually, it took about 323 nanoseconds to create the Java object set and only 221 nanoseconds to create the Scala object set, i.e., it took about 46% more time to create the objects in Java than it would in Scala. That's pretty significant.

The code is out there so I urge you to check it out and prove me wrong.

Here is the code for creating Scala objects (from a Java Class):

public MediaContent create(){
  Media media = new Media("http://javaone.com/keynote.mpg",
      "Javaone Keynote", 0, 0, "video/mpg4", 1234567, 0,
      123, null, Player.JAVA());
  media.addPerson("Bill Gates");
  media.addPerson("Steve Jobs");
  Image image1 = new Image("A", "Javaone Keynote", 0, 0, Size.LARGE());
  Image image2 = new Image("B", "Javaone Keynote", 0, 0, Size.LARGE());
  MediaContent content = new MediaContent(media);
  content.addImage(image1);
  content.addImage(image2);
  return content;
}
Here is the code that creates the matching Java object set:
public MediaContent create(){
  Media media = new Media(null, "video/mpg4", Player.JAVA,
      "Javaone Keynote", "http://javaone.com/keynote.mpg",
      1234567, 123, 0, 0, 0);
  media.addToPerson("Bill Gates");
  media.addToPerson("Steve Jobs");
  Image image1 = new Image(0, "Javaone Keynote", "A", 0, Size.LARGE);
  Image image2 = new Image(0, "Javaone Keynote", "B", 0, Size.SMALL);
  MediaContent content = new MediaContent(media);
  content.addImage(image1);
  content.addImage(image2);
  return content;
}

The code, as you see, is almost the same. There is a minor difference in parameter order in the constructors, which should not matter a bit, and the reference to a Scala enum Size.LARGE(), which is not a real Java enum. The hidden nugget in the story is the usage of lists in Scala and Java. In Scala, lists are immutable, so you throw away the old list once you add a new element to it. As Itay mentioned, the object immutability might make the implementation faster.
You can also compare the Scala class set and the Java class set, and check out the benchmark runner. As expected from code I write for fun after midnight, it is fully documented for those of you who can read Java and Scala.

In Java, the way I handled lists is like this:

private List<Image> _images;

public void addImage(Image image){
  if(_images == null)
    _images = new ArrayList<Image>();
  _images.add(image);
}
And in Scala:

var _images: List[Image] = Nil

def addImage(image: Image){
  _images = image :: _images
}

Sunday, November 30, 2008

Inconsistencies when moving from Java to Scala

Writing the Scala serialization benchmark, I made Java to Scala calls. It is very simple, just like using yet another Java library. You only need to add the Scala library jar to your classpath and you're ready to go.
Alas, there were two small but nasty quirks:

  • Scala's List is not a Java Collections List. Actually, I could not create a Scala List from Java since it's abstract. Scala uses lists heavily and its List is very powerful. Still, it sucks that you can't easily create a Scala List in Java and set it on a Scala object. Obviously there are easy ways around this problem, one of which is sketched after this list.
  • The second is that Scala's enum is not a Java enum. This has many implications when trying to use a library written in Scala.
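For example, one easy way around the List problem is a tiny Scala helper that the Java code can call (the helper below is a sketch of mine, not part of any library):

object ScalaListFactory {
  // Callable from Java as: ScalaListFactory.toScalaList(javaList)
  def toScalaList[A](javaList: java.util.List[A]): List[A] = {
    var result: List[A] = Nil
    val it = javaList.iterator
    while (it.hasNext) result = it.next :: result
    result.reverse // keep the original order
  }
}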

Scala is faster than Java until you hit Object Serialization

Adding Scala to the serialization benchmarking parade along with Java, Stax, Thrift and Protobuf.
Scala is actually closer to Java; actually, as far as the serialization engine goes, it IS Java, since it compiles to Java classes. Since you can use Scala in a Java environment like yet another jar file, it is good to check out the serialization cost, especially if you're using RMI, remote Spring or another protocol based on Java serialization.
The surprising part of the Scala comparison is that Scala is actually faster at creating objects. To be fair, I've created the exact same objects in Java and Scala, and created all the Scala objects from Java code! Creating Scala objects from Scala code might be even faster.
In the chart below, size is the size of the object's serialized byte array, and time is measured in nanoseconds.
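The size numbers can presumably be produced with plain Java serialization along these lines (a sketch, not the benchmark's actual code):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Java-serialize any object and return the size of the resulting byte array.
def serializedSize(obj: AnyRef): Int = {
  val baos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(baos)
  oos.writeObject(obj)
  oos.close()
  baos.toByteArray.length
}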



Does anyone have an explanation for this?
Here are my assumptions:

  • Since for each class in Scala the compiler creates two Java classes, the encoding in the serialized form needs to contain twice the metadata.
  • Scala's Enumeration does not translate to a Java enumeration. I assume that an enum object's serialized representation in Java is more compact than a regular object's; Scala loses that.
  • As a new language, Scala could do a better job on performance and still leverage the JIT, but they didn't really care too much about serialization. Actually, there is a good reason: if one cares about serialization performance, one should pick Protobuf or Thrift.
By the way, it's really fun writing Scala code. It's much smaller and nicer.

Friday, November 28, 2008

Avoid large scale Java serialization

Oh, it makes so much sense reading Ted's post Don't Serialize Java Objects... about large scale Java serialization and performance. A good reinforcement of an earlier conclusion.

What This Cost In Space And Time

First, the Java serialization space overhead. On a toy example of this object, serialization to a byte array used 953 bytes. Properly writing out the instance variables consumed 296 bytes. In production, doing it the right way shrunk a 1,600-record SequenceFile from 1.4GB to 825MB.

Time savings were great, too. In the same toy example, it took my JVM 7.2 milliseconds to serialize the object and 1.7 milliseconds to unserialize. Doing it with stream I/O only took 76,000 nanoseconds to serialize, 58,000 nanoseconds to unserialize.

Wednesday, November 26, 2008

Using Spring RPC for Protobuf transport?

I tried playing around with some code to make Spring RPC be protobuf's transport and have it coexist with other Spring RPC services.
To give credit to Spring, they make it very easy to extend their framework.
Well, if you want to use the Spring RPC transparent way of remoting services, then you must override getProxyForService of RemoteExporter to route Protobuf service calls (i.e., mimic the Protobuf service stub). Alas, you can't do it, since the generated Protobuf service does not have an interface with the method signatures, so you can't make a proxy out of it. And anyway, the methods a Protobuf service generates are not intended to seem POJO-like at all, with their RpcCallback and RpcController arguments. It sucks a bit if you wish to integrate Protobuf into an existing environment.
One can always wrap the generated service so it will look like an innocent POJO, but on the other hand there is a good point to the Protobuf way.
Using Spring RPC, it is way too easy to forget you are calling a remote service. You make your business objects Serializable and just move them around; having most Java data containers serializable just makes it easier. The next thing you know, your API has objects serialized back and forth without justification. So Spring RPC is a big gun to shoot yourself in the foot with when you forget about the eight Fallacies of Distributed Computing.
My conclusion (for now) is that if you have to use Spring RPC for transport and have Protobuf objects floating around, you'd better use a Protobuf Java serialization wrapper.

Tuesday, November 25, 2008

Protobuf vs Spring RPC

Did some RPC tests over the wire and got some interesting results. I hope that someone can do something similar and verify.
I ran three RPC client/server combinations (see links to source code):

  1. Protobuf as the protocol over simple TCP/IP client/server Java sockets.
  2. Protobuf as the protocol over HTTP, where the client uses the Apache HTTP client (reusing connections) and the server part is a servlet in a WAR on Jetty v6. The server side (protowar) implementation is very basic and meant only for benchmarking.
  3. Spring RPC, where the WAR container is Jetty v6. Not posting the source code for this one, but it's basically the same as in the Protobuf example, using Java serialization and Spring.
I did a warmup for each client/service combination to get the JIT going, and took the minimum of many iterations to try and dodge the GC. All the tests ran on the same single (localhost) machine, a MacBook Pro with 4GB of RAM and an Intel Core 2 Duo. Surely it's not a fast machine and you'll get much better results on a decent server; still, I would expect the relationships to stay the same. The results:
  1. protobuf on plain socket: 0.228 milli/roundtrip
  2. protobuf on HTTP: 2.08 milli/roundtrip
  3. Spring/RPC: 106.8 milli/roundtrip
Note that the chart below is in logarithmic scale!
I.e., you lose over an order of magnitude by using HTTP. But say you would like to keep it around and use the benefits of web containers, HTTP VIPs etc.; then a milli (less on a real server) might be worth it. But jumping to Spring RPC, that will cost you. Sure, you get tons of benefits there, just make sure you know the tradeoff.

Thursday, November 20, 2008

Some perspective

On the same machine, creating a plain object with a default constructor, not setting any fields or invoking methods, dodging the GC, in a single thread, and subtracting any other code, takes about 7 nanoseconds per object (tested over 100k iterations).
Of course, in real life you have more stuff involved in object life cycle.

Tuesday, November 18, 2008

Thrift vs Protobuf object creation patterns, or: the builder pattern

The Protobuf object creation API is beautiful. Whoever designed it did a good job. Not that the Thrift one is bad; it's actually simple, straightforward and much more flexible, but this flexibility is a loaded gun one can shoot one's own foot with. It's a bit strange, since it seems that Thrift is heavily inspired by Protobuf, yet it did not adopt a few key elements from it.

The protobuf API successfully confronted conflicting needs:

  • Make business objects immutable (thread safety etc.; Scala and Erlang took it to the extreme).
  • Make it flexible to create a business object. For example, if I have ten fields in the object, all optional, I want to allow some of the fields not to be set. Alas! I want it to be an immutable object. Do I need factory methods/constructors for all the permutations?!
  • Don't have factory methods/constructors with many parameters. There will be numerous bugs, which the compiler won't catch, from developers mixing up values; for example, when there are seven integers in a row and the product and member id order gets mixed up.
The Protobuf solution is a nice builder pattern. The builder is a write-only interface, like a Java bean with only setters. Once one has finished setting all the values, an immutable object is built (see sample code). It seems to cover all the above requirements.
The only caveat is performance, though it's probably very marginal. The time to create a sample business object is: Protobuf 0.00085 milli > Thrift 0.00051 milli > POJO 0.00032 milli.
The builder pattern does have its cost, but consider that it's less than a microsecond, and the number of Protobuf objects created in a transaction is probably a few orders of magnitude smaller than the number of POJOs.

Protobuf

MediaContent content = MediaContent.newBuilder()
    .setMedia(
        Media.newBuilder()
            .setUri("http://javaone.com/keynote.mpg")
            .setFormat("video/mpg4")
            .setTitle("Javaone Keynote")
            .setDuration(1234567)
            .setBitrate(123)
            .addPerson("Bill Gates")
            .addPerson("Steve Jobs")
            .setPlayer(Player.JAVA)
            .build())
    .addImage(
        Image.newBuilder()
            .setUri("http://javaone.com/keynote_large.jpg")
            .setSize(Size.LARGE)
            .setTitle("Javaone Keynote")
            .build())
    .addImage(
        Image.newBuilder()
            .setUri("http://javaone.com/keynote_thumbnail.jpg")
            .setSize(Size.SMALL)
            .setTitle("Javaone Keynote")
            .build())
    .build();
Thrift

Media media = new Media();
media.setUri("http://javaone.com/keynote.mpg");
media.setFormat("video/mpg4");
media.setTitle("Javaone Keynote");
media.setDuration(1234567);
media.setBitrate(123);
media.addToPerson("Bill Gates");
media.addToPerson("Steve Jobs");
media.setPlayer(Player.JAVA);
Image image1 = new Image();
image1.setUri("http://javaone.com/keynote_large.jpg");
image1.setSize(Size.LARGE);
image1.setTitle("Javaone Keynote");
Image image2 = new Image("http://javaone.com/keynote_thumbnail.jpg",
"Javaone Keynote", -1, -1, Size.SMALL);
MediaContent content = new MediaContent();
content.setMedia(media);
content.addToImage(image1);
content.addToImage(image2);

Protobuf with option optimize for SPEED

Jon Skeet pointed out to me that I failed to include:
option optimize_for = SPEED;
Thanks Jon!

I added it and it does make a lot of difference. I wonder why it's not the default. It appears that without the flag, Protobuf is using Java reflection, which is very expensive. With the speed optimization, Protobuf is faster than Thrift, though not by far.
The serialization speed differences between them are probably not meaningful for transactions that take a few milliseconds to perform.




Monday, November 17, 2008

Java, StAX, Protobuf and Thrift

Another option for serializing objects is the XML format. It has a lot of advantages, but it does not perform well. In some cases this performance aspect is only a very small part of the transaction, but since there are so many SOAP protocols floating around, in some cases they should be reconsidered. The fastest Java XML library that I know of is StAX. I've created an XML format matching the Thrift and Protobuf schemas, limiting the tag sizes to only two chars. It makes the XML not too readable, but limits its total size.
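For a flavor of what writing such terse XML with StAX looks like, here is a sketch; the two-char tag names below are made up, not necessarily the ones used in the benchmark:

import java.io.ByteArrayOutputStream
import javax.xml.stream.XMLOutputFactory

object StaxSketch {
  def write(): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val w = XMLOutputFactory.newInstance.createXMLStreamWriter(out)
    w.writeStartDocument()
    w.writeStartElement("mc") // MediaContent
    w.writeStartElement("md") // Media
    w.writeStartElement("ur") // uri, kept to two chars to limit the payload
    w.writeCharacters("http://javaone.com/keynote.mpg")
    w.writeEndElement()
    w.writeEndElement()
    w.writeEndElement()
    w.writeEndDocument()
    w.close()
    out.toByteArray
  }
}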
Here are the performance charts comparing StAX with plain Java serialization, Thrift and Protobuf.




Here is the Thrift object description:

namespace java serializers.thrift
typedef i32 int
typedef i64 long
enum Size {
SMALL = 1,
LARGE = 2,
}

enum Player {
JAVA = 0,
FLASH = 1,
}

/**
* Some comment...
*/
struct Image {
1: string uri, //url to the images
2: optional string title,
3: optional int width,
4: optional int height,
5: optional Size size,
}

struct Media {
1: string uri, //url to the thumbnail
2: optional string title,
3: optional int width,
4: optional int height,
5: optional string format,
6: optional long duration,
7: optional long size,
8: optional int bitrate,
9: optional list<string> person,
10: optional Player player,
11: optional string copyright,
}

struct MediaContent {
1: optional list<Image> image,
2: optional Media media,
}
Protobuf:
// See README.txt for information and build instructions.

package serializers.protobuf;

option java_package = "serializers.protobuf";
option java_outer_classname = "MediaContentHolder";

message Image {
required string uri = 1; //url to the thumbnail
optional string title = 2; //used in the html ALT
optional int32 width = 3; // of the image
optional int32 height = 4; // of the image
enum Size {
SMALL = 0;
LARGE = 1;
}
optional Size size = 5; // of the image (in relative terms, provided by cnbc for example)
}

message Media {
required string uri = 1; //uri to the video, may not be an actual URL
optional string title = 2; //used in the html ALT
optional int32 width = 3; // of the video
optional int32 height = 4; // of the video
optional string format = 5; //avi, jpg, youtube, cnbc, audio/mpeg formats ...
optional int64 duration = 6; //time in milliseconds
optional int64 size = 7; //file size
optional int32 bitrate = 8; //video
repeated string person = 9; //name of a person featured in the video
enum Player {
JAVA = 0;
FLASH = 1;
}
optional Player player = 10; //in case of a player specific media
optional string copyright = 11;//media copyright
}

message MediaContent {
repeated Image image = 1;
optional Media media = 2;
}

Sunday, November 16, 2008

Serialization: Protobuf vs Thrift vs Java

Here is a new project on the Google Code site named thrift-protobuf-compare that tries to compare Protobuf, Thrift and Java POJO serialization.
The results turned out to be very interesting.

The method is: create three sets of objects and run the tests on them using the respective tools.
In each test the code first runs 200k iterations of the test without taking time (letting the JIT warm up), asks the system to invoke the GC, and then runs another 200k iterations with time measurement, roughly as sketched below.
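A minimal sketch of that warmup-then-measure loop (the names are mine, not the project's actual runner):

object BenchmarkLoop {
  val Iterations = 200000

  // Returns the total nanoseconds the timed run took.
  def measure(action: () => Unit): Long = {
    var i = 0
    while (i < Iterations) { action(); i += 1 } // warmup, let the JIT kick in
    System.gc()                                 // ask the system to collect
    val start = System.nanoTime
    i = 0
    while (i < Iterations) { action(); i += 1 } // the measured run
    System.nanoTime - start
  }
}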

Note that the project does not do a full comparison yet, and there are a few ways to do each of the actions. Moreover, there is much more than the numbers (features like versioning, object merging and the RPC mechanism). More investigation will follow.

The numeric results are on the project page; here are the graphs:
Milliseconds to create an object, smaller is better. The Protobuf result is not a mistake! It was created with the builder pattern.

Milliseconds to serialize an object to a byte array, smaller is better.
Milliseconds to deserialize an object from a byte array, smaller is better.
Size of the byte array of a serialized object, smaller is better.

Results, as you see, are mixed, and maybe with a different implementation they might look different. But it seems that for now, Thrift performs much better in terms of CPU speed, and the Protobuf buffer size is about 30% smaller than Thrift's.

Please look at the code and post suggestions to add/change test cases.

Friday, November 14, 2008

Installing Thrift on Mac OSX 10.5


There are a few dependencies in the way of installing Thrift. I'll list the ones I found and the way to install them. You may already have some of the dependencies, in which case skip a step or two.

  • X11. Doesn't sound like something Thrift would care about. It is actually a dependency of Python, and you need that for Boost, and you need that for Thrift :-) Download it, double click to install and reboot.
  • Fink.
  • Boost. From the command line, type 'fink install boost1.33'
  • Download and unzip Thrift.
  • Run the following from the command line:
$ cd {Thrift dir}
$ cp /usr/X11/share/aclocal/pkg.m4 aclocal/
$ ./bootstrap.sh
$ ./configure --with-boost=/sw/
$ make
$ sudo make install
That's it!
Now run the Thrift tutorial script and make sure you read the file:
$ cd tutorial/
$ ./tutorial.thrift

Thursday, November 13, 2008

protobuf java serialization pains


The implementation in the last post is actually pretty ugly, since it's not generic and one needs to create a new serialization wrapper for every Protobuf object.
The natural solution would be to use generics and get it over with, but it seems that Protobuf makes this difficult.
The key problem with the generics approach is that the method parseFrom, which deserializes the generated Protobuf object, is not declared in an interface or in the GeneratedMessage class it inherits from. This means that one has to have the class at hand to do the deserialization, but that's the whole point! I want a general serializer for any Protobuf object.

So here is my solution. It looks a bit hacky with the reflection and byte writing, but it works fine.

import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.lang.reflect.Method;
import com.google.protobuf.Message;

/**
 * Manually serializes Protobuf objects.
 * The serialized form starts with an int holding the length of the class
 * name, followed by the class name bytes; then an int holding the size of
 * the serialized Protobuf object, followed by the Protobuf bytes themselves.
 */
class ProtobufSerializer<T extends Message> implements Externalizable{
  /**
   * Object to serialize
   */
  private transient T _proto;
  private transient String _className;

  public ProtobufSerializer(){}

  public ProtobufSerializer(T proto){
    _proto = proto;
    if(null != _proto) _className = _proto.getClass().getName();
  }

  public T get(){ return _proto; }

  /**
   * If the first int is zero, the object is null
   */
  public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException{
    int size = in.readInt();
    if(0 == size) return;
    byte[] array = new byte[size];
    in.readFully(array, 0, size);
    _className = new String(array);
    size = in.readInt();
    array = new byte[size];
    in.readFully(array, 0, size);
    try{
      Class<?> clazz = getClass().getClassLoader().loadClass(_className);
      // parseFrom is static on each generated class, hence the reflection
      Method parseMethod = clazz.getMethod("parseFrom", array.getClass());
      _proto = (T)parseMethod.invoke(clazz, array);
    }
    catch (Exception e){
      throw new IOException("could not load class " + _className);
    }
  }

  /**
   * If the object is null then the int zero is written to the stream
   */
  public void writeExternal(ObjectOutput out) throws IOException{
    if(null == _proto){
      out.writeInt(0);
      return;
    }
    out.writeInt(_className.getBytes().length);
    out.write(_className.getBytes());
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    _proto.writeTo(baos);
    baos.close();
    byte[] array = baos.toByteArray();
    out.writeInt(array.length);
    out.write(array);
  }
}
And here is how to use it:

private ProtobufSerializer<NewsMediaContent> _mediaHolder;

public NewsMediaContent getNewsMediaContent(){
  return null == _mediaHolder ? null : _mediaHolder.get();
}

public void setNewsMediaContent(NewsMediaContent media){
  _mediaHolder = new ProtobufSerializer<NewsMediaContent>(media);
}
By the way, I noticed that Hadoop has a similar issue with Thrift.

Wednesday, November 12, 2008

Protobuf not Serializable?!

I'm starting to use Google's Protobuf for serialization and deserialization of objects. It looks great, though I'm still going to check out Facebook's Thrift, just to make sure I'm not missing something.
The code is very easy to use and much better than any of the XML to object mappings, or for that matter, XML as a data structure.
Still, there is one big caveat in Protobuf: the generated objects do not implement Serializable. I hope there is a good reason for that, though I didn't find one. Without knowing any better, it looks very simple to have the generated objects implement the interface.

With a large legacy codebase it is not feasible to change the whole business logic transport layer to use Protobuf in one swift move. Therefore you're stuck if you have Protobuf objects floating around and you need to Java-serialize them.

If you can embed the Protobuf object in another one, then it's easy (though unpleasant) to solve the problem. I solved it by having an Externalizable class that wraps the Protobuf object and deals with the Java serialization for it.

For example, assuming NewsMediaContent is the protobuf object:

public class NewsMediaContentSerializer implements Externalizable{
  private NewsMediaContent _media;

  public NewsMediaContentSerializer(){}

  public NewsMediaContentSerializer(NewsMediaContent media){ _media = media; }

  public NewsMediaContent getMedia(){ return _media; }

  public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException{
    int size = in.readInt();
    if(0 == size) return;
    byte[] array = new byte[size];
    in.readFully(array);
    _media = NewsMediaContent.parseFrom(array);
  }

  public void writeExternal(ObjectOutput out) throws IOException {
    if(null == _media){ out.writeInt(0); return; }
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    _media.writeTo(baos);
    baos.close();
    byte[] array = baos.toByteArray();
    out.writeInt(array.length);
    out.write(array);
  }
}


And use it like this:

public NewsMediaContent getNewsMediaContent(){
  return null == _mediaHolder ? null : _mediaHolder.getMedia();
}

public void setNewsMediaContent(NewsMediaContent media){
  _mediaHolder = new NewsMediaContentSerializer(media);
}

Monday, November 10, 2008

PMD


Lately I came across PMD. It looks like a nice tool one should add to the toolbox, next to FindBugs and Clover. It seems many of the things PMD will do for you Eclipse already does, and most rules are not so useful. Still, I did find a few things that made it worthwhile to use. The killer feature, I think, is the duplicated code finder. It's very good, and it did find a few surprises over a very large code base. I did need about 4GB of heap to use it.

Sunday, November 09, 2008

Scala (!)


In the last couple of days I went to a few sessions at the Silicon Valley Code Camp. The sessions were nice, but the one that impressed me the most was about Scala. I had heard about the language a few times before, and the JavaPosse are constantly talking about it.
So I went to David Pollak's talk about Scala and the Lift web framework.

The more I learned about the language, the more I was impressed. Until now I thought that Groovy is the next big thing on the JVM, and since LinkedIn is using Groovy I got a bit of a closer view of it. I must admit, I was not too impressed. Scala, on the other hand, definitely looks like the way to go. It is fully interoperable with Java legacy code and provides type safety, the lack of which is a big drawback of Groovy, especially when working with huge amounts of code that need to be refactored once in a while. Twitter moving its core server code to Scala, and good NetBeans and Eclipse plugins, are yet more good indicators.

Now I need to look for a nail to test this new hammer :-)

Monday, November 03, 2008

CrossOver Chromium / Chrome on Mac - better wait for the real thing

Just installed CrossOver on my Mac. Actually, I have nothing to do with it; it was just to check it out. The only Windows-only application I wanted to check out is Chrome. As expected, the emulation works fine, though the UI is choppy. It definitely feels like an emulation and is no match for the slickness of Firefox.
I guess that if you have to have it then it's better than nothing, but for day to day usage CrossOver would be my last resort.

Here are some snapshots of how it looks on the Mac.

Monday, October 20, 2008

Twitter as a platform


[From Wikipedia]

Twitter is a free social networking (http://en.wikipedia.org/wiki/Social_networking) and micro-blogging service...
The Wikipedia definition is getting less accurate over time. It is becoming clearer that Twitter is actually a "real-time short messaging service", or in other words a public pub/sub message broker. Obviously, one of Twitter's usages is social networking and micro-blogging, the same as one of http/html's many usages is facilitating Facebook communication. The message broker concept is rather old, though this time the producer and/or the consumer are typically human.

Twitter kills the RSS tools.

Since I've started using Twitter, my typical use pattern has been changing. At first I didn't know what the big deal was with sending short messages about what I'm doing right now, and why anyone should care. Actually, I still don't get it. From using it for microblogging ("What are you doing?"), it is starting to be a replacement for my RSS reader. Recently, sites that recognized this trend have been appearing like mushrooms after rain. It is easier to filter streams of tweeted news articles, sort items by timeline, and follow both high and low throughput sources. Twitter streams are sorted by time and everything there is streamlined together. Messages I personally post are mostly retweets of interesting tweets or items from other sources (like podcasts), and I use them mainly as bookmarks for myself.

Most of the news sources I follow already have a Twitter feed. Twitter allows the news source to actually interact with the consumer; it is no longer one-way communication. Soon, I assume, we'll see retweets posted as talkbacks to articles on the article's main site. I expect that there will be more companies providing services similar to Twitter, the protocol will become semi-standard (similar to the RSS history), and it will be a major article syndication channel. With clients like twhirl, which already supports more than one provider, the message broker itself will be less relevant. It won't be a surprise if some of the IT giants, like Google, end up running a similar service.

Businesses?

Like Yammer for businesses, there are and will be companies that try to provide added value on the server side. It does work, but the true innovation will come from the clients. Email, for example, didn't change much on the server side in the last decade as a protocol and server software. Even gmail, a true innovation, is actually a client that happens to run as a web app. The same, I assume, will happen with Twitter: the interesting applications will be the ones that produce and consume messages.

Wednesday, September 17, 2008

stackoverflow

Take a look at the new http://stackoverflow.com site. Great content, and I really liked the way they encourage good content and usage. They definitely hit the right spot with the reputation and voting system. Also, another winning OpenID usage. Obviously, I'm subscribed to the Java feed :-)

Thursday, September 11, 2008

Military History Podcast Rocks


I really liked this one; the Military History Podcast is very informative and interesting. It's not too long or too short, and I found lots of pearls in it I didn't know about. Here in the US I don't do any military reserve duty anymore; I kinda miss the long military discussions I used to have with my colleagues :-)

Anyway, this is nice compensation. If you are interested in the history that made the world what it is today, and have some interest in the military, you'll definitely like this one.

Wednesday, September 10, 2008

Eclipse Birt is cool

I've started to mess around with Eclipse BIRT on my Mac. It is a bit annoying that on their download page they offer an 'All In One' just for Windows; it gives the perception that the tool is available only on that OS. There are projects in Eclipse that do not work on all platforms, like the profiler; fortunately BIRT is not one of them.
Using a simple Eclipse update you can have it installed, and it works without a problem. Last time I messed with it was a couple of years ago; I wonder what's new...

Anyway, go get it. Even if you're on a Mac :-)

Monday, September 08, 2008

Wordle is cool !

Bounced into Wordle after hearing about it on the JavaPosse podcast. Really cool, but too bad I can't get the code and make it run off our DB.
Here are some words related to the news stuff I'm doing now:

Sunday, September 07, 2008

Detecting documents near duplications in realtime

Crawling the web for news, I found a few interesting things about its topology, and mainly about news on the web.
One of the pains in serving news is dealing with duplications. There are a lot of near duplications of news articles on the web. Looking at our data, it's something between 20% and 40% of the articles we harvest. By duplicates I mean articles that have a significant overlap between them, such that a human will notice they are actually the same article, possibly having gone through some editing.
Both blogs and professional news providers have duplications, and we treat them the same in this respect.
The most common reason for article duplication is a press release or an article posted by one of the major news brokers. These articles are often published word for word by news providers with some minor editing, adding images and links, etc.

There are many ways to deal with this problem, and it depends a lot on what kind of service you are writing. At any point in time, we have millions of potential news articles our search can come by.
On the performance side:

  • Latency: we have up to 10 milliseconds to check the result set (about a hundred articles) for duplication when serving the results.
  • Throughput: tens of thousands of transactions per minute per server at peak time.

To make it clear, our problem is not to find all near duplications. We just wish to find near duplications in articles we serve, but it must be very fast. We might return 100 articles in one set; comparing all of them to each other will take about 10K comparisons.
Some of the conventional methods I know of for solving near duplications use shingling and matching term frequency vectors. Shingling is great and most accurate, but it is expensive. You can't take all the articles and compare them, not to mention keeping them all in memory. Creating the shingles takes time, and in large documents there may be many of them. Vectors might be less accurate for these purposes and have similar caveats.

The system is divided into two stages.

Offline: creating signatures, which are like a set of shingles (see the sketch after this list).
  1. Breaking the article into natural sentences. Doing that, rather than using fixed word N-grams, reduces the number of shingles.
  2. Stemming the content and lowercasing it. Removing short words, punctuation etc.
  3. Taking a numeric value out of the shingle using CRC32. We treat it as a unique numeric value of the sentence, allowing extremely rare errors.
  4. Sorting the set of shingles into a list and trimming it to a fixed size. We found that in large documents it's enough to use only a subset to determine near duplications, as long as we are consistent with the sample we take. Using the shingles with the minimum values ensures we'll always take the same sample set.
  5. Attaching the short list of numbers to the article.
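Here is a sketch of the offline signature step (the sentence splitting and normalization are deliberately naive, and the trim size is made up):

import java.util.zip.CRC32

// Turn an article into a fixed-size, sorted array of sentence checksums.
def signature(article: String, maxShingles: Int): Array[Long] = {
  val sentences = article.split("[.!?]+")
  val crcs = sentences.map { s =>
    // Naive stand-in for stemming: lowercase and strip punctuation.
    val normalized = s.toLowerCase.replaceAll("[^a-z0-9 ]", "").trim
    val crc = new CRC32
    crc.update(normalized.getBytes)
    crc.getValue
  }
  // Keeping the minimum values makes the sample consistent across documents.
  crcs.sorted.take(maxShingles)
}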

Realtime: finding duplications in a set of a hundred documents. As you understand by now, I actually compare the shingles.
  1. Grab the sorted integer arrays representing the shingles of the two documents you compare.
  2. Compare the values of the two vectors, advancing the pointers as you go. Since the vectors are sorted, you don't need to compare each value in one vector to all values in the other. Comparing sorted vectors this way is very fast!
The threshold of shingle overlap is known in advance and the result should be boolean, i.e., once we know it's a near duplication, we don't care how close it is. You can do lots of optimizations using this knowledge. For example, if the duplication threshold is 50% similarity and you have processed 60% of the terms and they all matched, or none did, there is no sense continuing; see the sketch below.
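A sketch of that comparison, including the early exits (the threshold handling here is my own interpretation):

// a and b are the sorted shingle arrays of two documents.
def isNearDuplicate(a: Array[Long], b: Array[Long], threshold: Double): Boolean = {
  val needed = math.ceil(math.min(a.length, b.length) * threshold).toInt
  var i = 0; var j = 0; var matches = 0
  while (i < a.length && j < b.length) {
    if (matches >= needed) return true // enough overlap already
    // Not enough values left to ever reach the threshold:
    if (matches + math.min(a.length - i, b.length - j) < needed) return false
    if (a(i) == b(j)) { matches += 1; i += 1; j += 1 }
    else if (a(i) < b(j)) i += 1
    else j += 1
  }
  matches >= needed
}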
Eventually, if you optimize in the right places, finding overlapping vectors in a set of a hundred documents is super fast and can take less than a millisecond with a real life dataset.

Optimizing further, we use caching. Many of the document sets we are deduping in a session are going to be evaluated again in the near future. Hashing and caching the session set gives an extra boost in response time.

Monday, August 25, 2008

The Dirtbag Diaries


As always, the Dirtbag Diaries podcast is delicious. The last episode is a bit different but as awesome as the rest; Fitz definitely has good taste in music and he shared a great selection. I mostly liked Ken Christianson's work.

Waiting for the next season!

Sunday, June 29, 2008

Look Ma! I'm on LinkedIn

No, it's not about having a profile, and not even about working here. I've just started a second career in modeling. Through some flawed process which failed to pick only the better looking engineers at LinkedIn, I got my pictures up on the Work at LinkedIn site. I just hope it won't discourage new employees from joining.

A few words about the next one: I wear ties only when I bike; at work I prefer to attend meetings without shoes or ties.

And by the way, check out our water bottles in the LinkedIn store :-)

NullPointerException at Javac compiler


Got an NPE from the javac compiler. It's very annoying, since it's a compiler bug and it's very hard to get to the class that freaked it out (no appropriate message). The error is:

java.lang.NullPointerException
at com.sun.tools.javac.code.Types$IsSameTypeFcn.visitClassType
at com.sun.tools.javac.code.Type$ClassType.accept
at com.sun.tools.javac.code.Types$IsSameTypeFcn.isSameType
at com.sun.tools.javac.code.Types$IsSameTypeFcn.visitClassType
at com.sun.tools.javac.code.Type$ClassType.accept
.....

The line that prompted the javac error is:
List<LinkedList<DateScoreSortable<BaseNewsArticleView>>> lists = getMyLists(...);

I know it looks strange, but that's what I needed. Actually, I had this code in a test class, but I do use this funky structure in the business logic. Anyway, it seems to trigger an edge case in the Sun compiler. The Eclipse JDT incremental Java compiler did not have any problem with the code; it compiled and ran it without problems.
I really don't have time now to submit the bug and a full sample; maybe later.
By the way, it happened on my Mac with java version "1.5.0_13". To solve the problem I used the same data structure but without the generics, i.e.:
List lists = getMyLists(...);

It worked!

Tuesday, June 24, 2008

LinkedIn is 99% Java but 100% Mac

That's my second blog post on the LinkedIn engineering blog. It's mainly about the JavaPosse and the amazing Mac setup engineers have at LinkedIn.

Go ahead and read it there.

Creative Commons License This work by Eishay Smith is licensed under a Creative Commons Attribution 3.0 Unported License.