Tuesday, December 29, 2009

Attaching a Java debugger to the Scala REPL

Originally posted on the kaChing Eng Blog.

I use the Scala REPL to play around with Java libraries and check their runtime behavior. One of the things I use it for is checking how Voldemort's client behaves in different setups. For one of those checks I wanted to trace the client threads with an IDE debugger.
To attach a debugger to the Scala REPL, all you need to do is export the debugger options in JAVA_OPTS:

export JAVA_OPTS='-Xdebug -Xrunjdwp:transport=dt_socket,address=1773,server=y,suspend=n'
Run your Scala REPL and attach your debugger to port 1773. Done.

(*) Tested with Scala 2.7

Wednesday, December 09, 2009

Baking availability SLA into the code

This is a copy of the post I made on the kaChing engineering blog.

Availability and Partition Tolerance are essential for many distributed systems. A simple (though not comprehensive) way to measure both is using response time SLAs between services, as implied by Jeff Darcy's observation:

Lynch (referring to the 2002 SIGACT paper) also makes the point that unbounded delay is indistinguishable from failure. Time is therefore an essential component of these definitions (a point made even more explicitly in the Dynamo paper).
At kaChing we think that SLAs should be baked into the code, so that the developer has to think about them while creating the service contract. For that reason we created a @ResponseTime annotation for our internal services:
@Retention(RUNTIME)
@Target({ TYPE })
public @interface ResponseTime {
  long warn();
  long error();
  TimeUnit unit() default MILLISECONDS;
}
A typical service query for online requests is annotated with
@ResponseTime(warn = 100, error = 200)
The times depend on the service's constraints, such as access to resources. A ping query, for example, has
@ResponseTime(warn = 2, error = 4)
and an offline analytics call may take hours
@ResponseTime(warn = 10, error = 15, unit = HOURS)
Nevertheless, every request should have an SLA and the developer must think of it when writing the implementation.
Once we have this tool, the SLAs of the subsequent services (B, C & D) that a service (A) needs to call can be statically checked to verify that no path in the tree of subsequent service calls exceeds the root service's (A) SLA. In other words, for service queries A, B, C & D: if A calls B, B calls C, and in parallel A calls D, then we should have SLA(A) > max(SLA(B) + SLA(C), SLA(D)).
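To make the rule concrete, here is a minimal sketch of the path check in plain Java. It is not kaChing's code: the class and method names are made up for illustration, SLAs are plain millisecond values rather than @ResponseTime annotations, and it assumes that all the calls a query makes run in parallel while a chain of nested calls adds up.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: checks that a root query's SLA covers its slowest call path. */
final class SlaPathCheck {

    /** A node in the call tree: its own SLA (ms) and the queries it calls in parallel. */
    static final class Query {
        final String name;
        final long slaMillis;
        final List<Query> calls;

        Query(String name, long slaMillis, Query... calls) {
            this.name = name;
            this.slaMillis = slaMillis;
            this.calls = Arrays.asList(calls);
        }
    }

    /** Slowest chain below q: max over parallel callees of (callee SLA + its own slowest chain). */
    static long slowestPathBelow(Query q) {
        long worst = 0;
        for (Query callee : q.calls) {
            worst = Math.max(worst, callee.slaMillis + slowestPathBelow(callee));
        }
        return worst;
    }

    /** The rule from the post: SLA(root) must be greater than every path below it. */
    static boolean rootSlaHolds(Query root) {
        return root.slaMillis > slowestPathBelow(root);
    }

    public static void main(String[] args) {
        Query c = new Query("C", 50);
        Query b = new Query("B", 100, c);   // B calls C
        Query d = new Query("D", 120);
        Query a = new Query("A", 200, b, d); // A calls B and D in parallel

        System.out.println(slowestPathBelow(a)); // 150 = max(100 + 50, 120)
        System.out.println(rootSlaHolds(a));     // true, since 200 > 150
    }
}
```

A real check would build the tree from the annotations and the actual call graph; the arithmetic stays the same.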
kaChing's service query container samples the time it takes for every call using perf4j and logs times and SLA violations to local disk. If the query's time exceeds the warning threshold it is logged accordingly, but if the error threshold is exceeded the container will try to terminate the query. Terminating the query is a bit harsh, but since our client timeout uses the SLA as well, chances are that the client has already given up on the query and either retried (service queries are idempotent) or aborted. Another reason to shoot down a runaway query that exceeds the SLA error time is that it may be locking or consuming resources needed by other queries and slowing the whole system down.
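The warn/error behavior can be sketched with nothing but java.util.concurrent. This is only an illustration under my own assumptions: the names SlaEnforcer and Outcome are hypothetical, the timing uses System.nanoTime instead of perf4j, and "terminate" here simply means interrupting the worker thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch of a container enforcing warn/error response times. */
final class SlaEnforcer {

    enum Outcome { OK, WARN, TERMINATED }

    /** Runs the query, waits at most errorMillis, and classifies the result. */
    static <T> Outcome run(Callable<T> query, long warnMillis, long errorMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        long start = System.nanoTime();
        Future<T> pending = pool.submit(query);
        try {
            pending.get(errorMillis, TimeUnit.MILLISECONDS);
            long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            return elapsed > warnMillis ? Outcome.WARN : Outcome.OK;
        } catch (TimeoutException e) {
            // Error threshold exceeded: try to shoot the runaway query down.
            pending.cancel(true);
            return Outcome.TERMINATED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("query failed", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A fast query: normally well inside both thresholds.
        System.out.println(run(() -> "pong", 1000, 2000));
        // A runaway query: sleeps far past the 100 ms error threshold.
        System.out.println(run(() -> { Thread.sleep(2000); return "late"; }, 50, 100)); // TERMINATED
    }
}
```

Interrupting only works if the query cooperates (blocking calls, interrupt checks), which is one reason the post says the container "will try" to terminate.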
The perf4j messages are also piped through a log4j socket appender (with async buffers on each side of the pipe) to a central hub. The hub computes statistics on the aggregated times, loads the SLAs of the queries and checks that the cluster is not violating them. The central hub can then send a daily report on SLAs and real-time alerts pinpointing a slower than expected service. Keeping the reports and comparing them to historical ones helps us see improvement or regression in every part of the site.

This monitoring technique is only a small portion of the automated system operation a small startup must have in order to stay flexible. Stay tuned for more monitoring automation posts.

Saturday, October 10, 2009

Looking for a passionate Systems Engineer !

Not a typical post, but I might as well use this platform to promote it. Come join the best startup ever !

kaChing is an investing talent marketplace where individual investors gain access to the best investing talent on the Web. An SEC registered investment adviser, kaChing enables customers to mirror automatically the trades of the "kaChing Geniuses" who have an Investing IQ of 140 or greater. The company's investors include Marc Andreessen, Jeff Jordan, CEO of Open Table (OPEN) and former president of PayPal, and retired partners from Benchmark Capital and Kleiner Perkins Caufield & Byers.

Join a stellar engineering team whose team members worked on massive backend infrastructure, Gmail's high performance frontend, the Native Scala Compiler with Prof. Odersky and more. We are test-driven and have a continuous integration process that lets us achieve a 5 minute build and release cycle.

Job description:

  • In charge of the backend of a revolutionary web site doing large scale high availability deployments and configuration management using open source software.
  • Keep the system on a fully automated release cycle, with continuous testing and monitoring.
  • Build tools for monitoring, security, systems management and deployment if the available ones are not enough.
  • Help engineers architect the next levels of the system and make sure it stays agile and easily absorbs high capacity.
Qualifications
  • Extremely smart and self motivated
  • Passion for test-driven development and automated functional QA
  • Live and breathe performance
  • Thinks out of the conventional box of IT management
  • Experience with databases, preferably MySQL
  • Good knowledge about monitoring and deployment tools
  • Good knowledge about network and distributed computing
  • Expert in at least one scripting language and know at least one more
  • Ability to work independently
  • A BS in computer science or related fields
I'll soon add a link to the official page.

Wednesday, October 07, 2009

Speeding up with Voldemort and Protobuf

In order to supply some of the analytics behind kaChing we have some nice number crunching processes working over piles of financial data. The end results are displayed on the site (e.g. the find investors page).

This post is about one such service, which for the sake of this post I'll refer to as the SuperCruncher. In the first iteration we grabbed the data straight from the DB into the SuperCruncher. Needless to say, relational databases do not handle this kind of stress nicely, and a pattern of fast iteration over the whole DB kills it very quickly. Results of the first iteration: 25 hours of computing.

Since we couldn't manage to squeeze more hours into the day, we ran a second iteration.
Voldemort & Protobuf to the rescue!
Though we have piles of data to compute, some of the data does not change, yet we need to read it each time. We shoved that data into a Protobuf data structure, which made the binary size considerably smaller, and pushed the protobuf into Voldemort. We used the Voldemort ProtoBufSerializer, which provides an extremely simple way to store and use protobufs.
So essentially Voldemort in this case is used as a persistent cache to store normalized data. When reading the data, the SuperCruncher first checks if it exists in the naturally sharded Voldemort cluster and then gets the delta from the DB.
Using Voldemort and Protobuf the I/O problem vanished, the SuperCruncher became four times faster (!) and bounced the performance ball back to the CPU. Cutting down CPU time is usually easier than cutting I/O, and indeed we managed to make the running time considerably faster in later iterations (hint: Joda Time & GC).
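The "check the cache first, fall back to the DB" read path looks roughly like the sketch below. It is deliberately not written against the Voldemort client API: a HashMap stands in for the Voldemort store, a Function stands in for the DB query, and the class name is hypothetical. It also simplifies "the delta" to "whatever is missing from the cache".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical read-through cache; the Map is a stand-in for a Voldemort store. */
final class ReadThroughCache<K, V> {
    private final Map<K, V> store = new HashMap<K, V>();
    private final Function<K, V> slowSource; // stand-in for the relational DB

    ReadThroughCache(Function<K, V> slowSource) {
        this.slowSource = slowSource;
    }

    /** Returns the cached value if present; otherwise loads it once and remembers it. */
    V get(K key) {
        V cached = store.get(key);
        if (cached != null) {
            return cached;
        }
        V loaded = slowSource.apply(key);
        if (loaded != null) {
            store.put(key, loaded);
        }
        return loaded;
    }

    public static void main(String[] args) {
        final int[] dbReads = {0};
        ReadThroughCache<String, Integer> cache =
            new ReadThroughCache<String, Integer>(symbol -> { dbReads[0]++; return symbol.length(); });

        cache.get("GOOG");
        cache.get("GOOG");
        System.out.println(dbReads[0]); // 1: the second read never touched the "DB"
    }
}
```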

Shameless promo: kaChing is hiring!
We are looking for a world class Systems Engineer.

Our release cycle is 5 minutes long, we release from the branch and have 100% of tests passing at all times, therefore we do not need a typical sysops person. We are looking for an excellent engineer to run, architect and automate the system.

Saturday, October 03, 2009

Scala @ Silicon Valley Code Camp

Had a great day at the Silicon Valley Code Camp. Gave a couple of talks about Scala and met lots of interesting people. Here are the slides for the talks I gave.

Absorbing Scala Into Java Ecosystem

A First Look at Scala on Google App Engine

Friday, October 02, 2009

kaChing on Yahoo! Homepage


println("Hello kaChing!")

Sunday, September 20, 2009

Off to a new adventure

Had a great couple of years at LinkedIn and now I'm off to a new adventure in the valley of silicon. This time it's a great web startup called kaChing, which I'm sure you'll hear more about in the next few months.

So I might be a bit quiet for a while working with great technology but if you happen to be around come and meet me at the Silicon Valley Code Camp!
I'll go back now to work on the Voldemort & Protobuf integration.

Monday, August 10, 2009

Protobuf JSON Serializer is great, but a bit slow

Ran a small benchmark with the fresh new protobuf-java-format. The ProtobufJsonSerializer code is nice and trivial, and it greatly increases the portability of protobuf and its ability to communicate with services that can't use the native protobuf parsing libraries.
Alas, it comes at a price. This new (still early) version of the library is very slow compared to other serialization libraries. See the full benchmarking wiki page for more info.

                Obj create  Serialization  Deserialization    Total Time  Size
protobuf         421.14250     3738.75000       2471.25000    6631.14250   217
json (jackson)   235.45000     3196.50000       4934.25000    8366.20000   304
JsonMarshaller   237.79000    18107.00000      24715.75000   43060.54000   298
protobuf-json    431.94000    19933.25000     102577.75000  122942.94000   389





EasyMock IArgumentMatcher with Scala Howto

It's pretty trivial if you're Scala savvy, but for learners it might be a bit tricky. First, you should probably get familiar with the Java way of using IArgumentMatcher. To use it you need a static method that calls EasyMock.reportMatcher with your implementation of IArgumentMatcher. To achieve that in Scala you must create the method in an object, which is the workaround for not having static methods in Scala. Then implement the matcher (pattern matching is a lot of help here) and remember that in Scala AnyRef is equivalent to Java's Object.

import org.easymock.{EasyMock, IArgumentMatcher}

object TestMyClazz {
  /**
   * for EZMock argument matcher
   */
  def eqMyClazz(obj: MyClazz) : MyClazz = {
    EasyMock.reportMatcher(new MyClazzMatcher(obj))
    null
  }

  class MyClazzMatcher(obj: MyClazz) extends IArgumentMatcher {
    def matches(actual: AnyRef): Boolean = actual match {
      case actual: MyClazz => actual == obj // replace with the checks you need
      case null => obj == null // is null ok?
      case _ => false
    }
    def appendTo(buffer: StringBuffer) : Unit = buffer append "some text..."
  }
}
Done, now you just need to call eqMyClazz() with your expected value.
someObj.useMyClazz(eqMyClazz(expectedObjectOfMyClazz))

Speaking at the QCon San Francisco '09 "Absorbing Scala"

A funny thing happened to me:
I checked out the new Google search engine on their sandbox URL.
One of the things I searched for was my own name (very modest of me, I know...). I was a bit surprised to see that I'm going to talk at QCon San Francisco 2009 about "Absorbing Scala". I remember discussing it a long while ago, but it kind of trailed off and I forgot about it. They did get my current title wrong, not that it matters too much.

Anyway, it's yet another practical use of Google search :-)

Hope there'll be more Scala nuts in the crowd to quiet down the other language zealots (by the way, I do love Groovy).

See you there!

Scala & EasyMock take II

Now that I'm using Scala's Manifest more, here is what my EasyMock testing code looks like. Instead of

val mocked1 = EasyMock.createMock(classOf[MyClazz])
val mocked2 = EasyMock.createMock(classOf[MyOtherClazz])
it's
val mocked1 = mock[MyClazz]
val mocked2 = mock[MyOtherClazz]
And
def mock[A](implicit m: Manifest[A]) = EasyMock.createMock(m.erasure).asInstanceOf[A]
Running the test using
run(mocked1, mocked2){
//code using mocks here
}
And
def run(mocks : AnyRef*)(block : => Unit){
replay(mocks : _*)
block
verify(mocks : _*)
reset(mocks : _*)
}

Thursday, July 16, 2009

Scala and EasyMock

Writing tests is an integral part of coding. If you're writing unit tests without using a mock library like EasyMock or jMock, you'd better check one of them out.

I haven't pushed EasyMock to its extremes with Scala, but so far I haven't seen any problems. Using a curried, partially applied function as follows looks like a nice pattern for EasyMock. The useMock utility method takes care of the EasyMock plumbing and creates a nice separation between the "expected" block and the "executor" one:

  /**
* 1. gets a mockable class
* 2. creates a mock object out of it
* 3. executes the "expected" block
* 4. replays the mock
* 5. runs the execution code
* 6. verify the mock
*/
private def useMock[S](mockable : Class[S])(run : S => Unit)(expected : S => Unit ) {
val mock = createMock(mockable)
expected(mock)
replay(mock.asInstanceOf[Object])
run(mock)
verify(mock.asInstanceOf[Object])
}
Using it goes like this:
val mockRunner = useMock (classOf[ToMock]) { mock =>
  val toTest = new ToTest(mock)
  toTest.doSomethingWithToMock() // should call method1() and method2() on the mock
} _
// change some conditions
mockRunner { mock =>
  mock.method1()
  mock.method2()
}

Wednesday, July 15, 2009

Speaking at the SiliconValley CodeCamp '09

I'll be giving two talks at the SiliconValley CodeCamp 2009 (October 3rd & 4th, 2009).
Naturally the talks are Scala and Google App Engine centric. Please post suggestions and comments. Here are the abstracts:

A First Look at Scala on Google App Engine
GAE is a great Scala environment, especially since its coding patterns are pushing the programmer to do functional oriented programming.

In this technical talk we will discuss using Scala In Google App Engine, we’ll go through using Scala along with:

  • GAE’s basic services
  • Java Data Objects (JDO) – GAE’s interface to the BigTable based Datastore. How to use Scala syntax for JDO annotation and class declaration
  • Google Web Toolkit (GWT) – interfacing with GWT Java only garden
  • ANT – losing the IDE dependency, compiling/running your Scala app from the command line without Eclipse’s plugins
  • Discuss what you cannot do with Scala

Absorbing Scala into Java Ecosystem
Scala runs on the JVM and can use and be used by Java code almost transparently. Its Java-level speed and focus on concurrency position it well for demanding server-side applications.
This session is for those who consider using Scala in their existing Java projects. We’ll discuss how to smoothly integrate Scala into an existing Java build, testing, development and runtime systems.

In this session we will talk about how to deal with the learning curve, IDE integrations and the peopleware aspects of introducing Scala to your organization.

The session will include examples and anecdotes from the LinkedIn teams who currently use Scala in production.

Tuesday, July 14, 2009

Scala @ LinkedIn

I should actually do a full post on Scala at LinkedIn, but today I only wish to report that we had a very nice Scala BASE meeting at LinkedIn. We actually had a much better place today (a full building to ourselves). See you again in two months !


Monday, July 13, 2009

Social goes vertical

Since I found comments on a few of my posts on other social sites, I'm starting to use DISQUS for comments. I like the way social activity goes vertical across different sites as a large set of mashups, with every solution excelling in a very focused part of the overall social network we live in. No doubt Google is about to focus on this trend with Wave and Friend Connect.

Saturday, July 11, 2009

Scala on Google App Engine playing it nice with Java GWT and JDO

A few days ago I gave a short review at the SF Bay Area Google App Engine Developers meetup about Scala on Google App Engine. If you would like to dive into it, here are some of the details, with examples taken from the code of the newspipes app available on GitHub.
Scala plays very nicely with GAE, with the same pros and cons a Java application would have. There is one thing Scala can't do, which is to take part in any GWT related code. This has nothing to do with GAE: GWT is designed to work only with Java code. If you wish to use GWT for the front end and still use Scala for the back end, no problem. You write your GWT code in Java, write the GWT service interface in Java so the GWT compiler will be happy, and implement the service in Scala. The service implementation is hooked up with GWT in the web.xml as the HTTP endpoint.

The next step is using JDO. If your persistent class is used by GWT then you must use Java. Otherwise you may use Scala which, as is typical for Scala, makes the code much smaller. For example:

import javax.jdo.annotations.{Extension, Persistent, IdGeneratorStrategy, PrimaryKey, PersistenceCapable, IdentityType}

@PersistenceCapable{val identityType = IdentityType.APPLICATION}
class SearchKeyword(
@PrimaryKey
@Persistent{val valueStrategy = IdGeneratorStrategy.IDENTITY}
@Extension{val vendorName="datanucleus", val key="gae.encoded-pk", val value="true"}
var key: String,
@Persistent var value: String,
@Persistent var count: Int)
Now we need to compile it all. Using the Eclipse plugins might be nice, but you should have a proper build file if you want to do more than "Hello World". This build runs tests, deploys to the production and test environments and handles Java, Scala, GWT and JDO compilation. On top of that, I use IntelliJ IDEA, which has the best Scala support at the moment (competition is great!), and you really don't want to be tied to one IDE.
To do the Scala part, first run the scalac compiler, targeting the same directory the Java code is compiled to a step later:
  <target name="compile" depends="copyjars" description="Compiles Java source and copies other source files to the WAR.">
<mkdir dir="${dstdir}" />
<copy todir="${dstdir}">
<fileset dir="${srcdir}">
<exclude name="${javafiles}" />
<exclude name="${scalafiles}" />
</fileset>
</copy>
<scalac
destdir="${dstdir}"
scalacdebugging="yes">
<src path="${srcdir}"/>
<classpath refid="project.classpath"/>
</scalac>
<javac srcdir="${srcdir}"
destdir="${dstdir}"
classpathref="project.classpath"
source="1.5"
target="1.5"
nowarn="true"
debuglevel="lines,vars,source"
debug="on" />
</target>
Then comes the JDO enhancement part which takes the compiled source code and does its stuff. Note that it does not care at this point where the classes came from (Java or Scala) since it works on the *.class files the compilers placed in ${dstdir} (war/WEB-INF/classes).
  <target name="datanucleusenhance" depends="compile"
description="Performs JDO enhancement on compiled data classes.">
<enhance_war war="war" />
</target>
That's about it.

Thursday, July 02, 2009

Microbenchmarking Scala vs Java

Following Nick's post I decided to check the numbers and investigate a possible improvement using the Scala 2.8 @specialized annotation.
Final numbers are at the end of the post.
I made a few changes:
First, I thought it would be better if both Scala and Java sorted the same size of array ;-) In Nick's example Java sorted 10,000 elements and Scala sorted 100,000.
The other was to do a few iterations in the same JVM run to let the JIT kick in. So now the Scala code looks like this:

package quicksort
import java.lang.Long.MAX_VALUE

object QuicksortScala {

def quicksort(xs: Array[Int]) {

def swap(i: Int, j: Int) {
val t = xs(i); xs(i) = xs(j); xs(j) = t
}

def sort1(l: Int, r: Int) {
val pivot = xs((l + r) / 2)
var i = l;
var j = r
while (i <= j) {
while (xs(i) < pivot) i += 1
while (xs(j) > pivot) j -= 1
if (i <= j) {
swap(i, j)
i += 1
j -= 1
}
}
if (l < j) sort1(l, j)
if (j < r) sort1(i, r)
}
sort1(0, xs.length - 1)
}

def main(args : Array[String]) {
var time = MAX_VALUE
for(i <- 0 to 100) (time = Math.min(time, doSort))
println("Scala time = " + time)
}

def doSort() = {
var a : Array[Int] = new Array[Int](10000000)
var i : Int = 0
for (e <- a) {
a(i) = i*3/2+1;
if (i%3==0) a(i) = -a(i);
i = i+1
}
val t1 = System.currentTimeMillis();
quicksort (a)
val t2 = System.currentTimeMillis();
t2 - t1
}
}
And here is the Java code:
package quicksort;

public class QuicksortJava {

public void swap(int[] a, int i, int j) {
int temp = a[i];
a[i] = a[j];
a[j] = temp;
}

public void quicksort(int[] a, int L, int R) {
int m = a[(L + R) / 2];
int i = L;
int j = R;
while (i <= j) {
while (a[i] < m)
i++;
while (a[j] > m)
j--;
if (i <= j) {
swap(a, i, j);
i++;
j--;
}
}
if (L < j)
quicksort(a, L, j);
if (R > i)
quicksort(a, i, R);
}

public void quicksort(int[] a) {
quicksort(a, 0, a.length - 1);
}

public static void main(String[] args) {
QuicksortJava sorter = new QuicksortJava();
long time = Long.MAX_VALUE;
for(int i = 0; i < 100; i++)
time = Math.min(time, sorter.doSort());

System.out.println("java time = " + time);
}

private long doSort(){
// Sample data
int[] a = new int[10000000];
for (int i = 0; i < a.length; i++) {
a[i] = i * 3 / 2 + 1;
if (i % 3 == 0)
a[i] = -a[i];
}

long t1 = System.currentTimeMillis();
quicksort(a);
long t2 = System.currentTimeMillis();
return t2 - t1;
}
}
Decompiling the Scala code shows that there is no benefit of using the @specialized annotation since the Scala compiler compiles all the ints to primitives. Here is the Scala class decompiled code:
package quicksort;
import java.rmi.RemoteException;
import scala.*;
import scala.runtime.*;
public final class QuicksortScala$ implements ScalaObject{
public QuicksortScala$(){}
private final void sort1$1(int l, int r, int ai[]) {
do{
int pivot = ai[(l + r) / 2];
int i = l;
int j = r;
do{
if(i > j) break;
for(; ai[i] < pivot; i++);
for(; ai[j] > pivot; j--);
if(i <= j) {
swap$1(i, j, ai);
i++;
j--;
}
} while(true);
if(l < j) sort1$1(l, j, ai);
if(j < r)l = i;
else return;
} while(true);
}
private final void swap$1(int i, int j, int ai[]){
int t = ai[i];
ai[i] = ai[j];
ai[j] = t;
}
public long doSort(){
ObjectRef a$1 = new ObjectRef(new int[0x989680]);
IntRef i$1 = new IntRef(0);
(new BoxedIntArray((int[])a$1.elem)).foreach(new anonfun.doSort._cls1(a$1, i$1));
long t1 = System.currentTimeMillis();
quicksort((int[])a$1.elem);
long t2 = System.currentTimeMillis();
return t2 - t1;
}
public void main(String args[]){
LongRef time$1 = new LongRef(0xffffffffL);
Predef$.MODULE$.intWrapper(0).to(100).foreach(new anonfun.main._cls1(time$1));
Predef$.MODULE$.println((new StringBuilder()).append("Scala time = ").append(BoxesRunTime.boxToLong(time$1.elem)).toString());
}
public void quicksort(int xs$1[]){
sort1$1(0, xs$1.length - 1, xs$1);
}
public int $tag() throws RemoteException {
return scala.ScalaObject.class.$tag(this);
}
public static final QuicksortScala$ MODULE$ = this;
static { new QuicksortScala$(); }
}
I tried to compile and run the code in Scala 2.8 anyway and the results were exactly the same as when running it under Scala 2.7.5.
The numbers:
Scala: 1355ms
Java: 1265ms

==> Java was 7% faster than Scala (a smaller gap than the 33% in the previous benchmark).

UPDATE
Ismael found a bug in the Scala Code, I updated Nick as well. The line
      if (j < r) sort1(i, r)
should have been
      if (i < r) sort1(i, r)
Changing the line made Java and Scala equal in performance.
==> The Scala code above is no slower or faster than Java (as it should be in this specific case).

Saturday, June 20, 2009

Google App Engine Data Store Api is definitely beta

I was very surprised to see that when querying the data store service for a key that does not exist, the service actually throws an exception instead of returning null or indicating the absence of a value in another way. This definitely needs a code review from Josh Bloch, who wrote in Effective Java: "Exceptions ... should never be used for ordinary control flow".
In Scala it actually makes things very ugly. Instead of doing

ds.get(queryKey) match {
case null => {
logger.info("did not got it: " + queryKey.toString)
...
}
case entity: Entity if (null != entity) => {
logger.info("got it: " + entity.toString)
...
}
}
I must go with something like
try {
val entity = ds.get(queryKey)
logger.info("got it: " + entity.toString)
...
}
catch {
case ex: EntityNotFoundException => {
logger.info("did not got it: " + queryKey.toString)
...
}
}
This may totally disrupt a nice chain of pattern matching.
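One way to contain the damage is a tiny adapter that turns the exception back into an "absent" value at the edge of the API. The sketch below shows the idea in Java with Optional (Scala's Option plays the same role and pattern-matches nicely). It does not use the real App Engine classes: Store and NotFoundException are hypothetical stand-ins for the datastore service and EntityNotFoundException.

```java
import java.util.Optional;

/** Hypothetical adapter: converts "throws when missing" into "returns empty". */
final class LookupAdapter {

    /** Stand-in for EntityNotFoundException. */
    static final class NotFoundException extends Exception {
        NotFoundException(String message) {
            super(message);
        }
    }

    /** Stand-in for a datastore whose get throws when the key is absent. */
    interface Store<K, V> {
        V get(K key) throws NotFoundException;
    }

    /** Calls the store and maps the "not found" exception to an empty Optional. */
    static <K, V> Optional<V> find(Store<K, V> store, K key) {
        try {
            return Optional.ofNullable(store.get(key));
        } catch (NotFoundException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Store<String, Integer> store = key -> {
            if (key.equals("known")) {
                return 1;
            }
            throw new NotFoundException(key);
        };
        System.out.println(find(store, "known"));   // Optional[1]
        System.out.println(find(store, "missing")); // Optional.empty
    }
}
```

The try/catch still exists, but it lives in one place instead of breaking up every lookup.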
Keeping an eye on Issue 1961

Sunday, June 14, 2009

No Scala for GWT

I'm working through my buzzword compliant application, which naturally includes Cloud computing (Google App Engine), GWT, Scala and the rest of the social Web2.0 hyped BS. Wow, that was a keyword loaded sentence ;-)

GWT is a nasty exception to my experience so far that you can replace any Java code with Scala. When you try to implement a GWT EntryPoint in Scala you get this GWT compilation error:

Checking rule <generate-with class='com.google.gwt.user.rebind.ui.ImageBundleGenerator'>
Checking if all subconditions are true <all>
<when-assignable class='com.google.gwt.user.client.ui.ImageBundle'>
[ERROR] Unable to find type 'com.newspipes.client.Newspipes'
[ERROR] Hint: Check that the type name 'com.newspipes.client.Newspipes' is really what you meant
[ERROR] Hint: Check that your classpath includes all required source roots
The error is a bit confusing since the class is in the classpath and you can see the compiled *.class under WEB-INF/classes. The GWT compiler compiles Java source directly to JavaScript, and it checks that the sources it compiles are *.java files. So instead of "Unable to find type..." it actually means "Unable to find Java source file of type...".

Tuesday, June 09, 2009

Unexpected repeated execution in Scala

The following Scala Easter egg is one of the most dangerous "features" of the language.

I know there might be flaming involved, but rest assured it's not my intention. I like Scala a lot and I'm happily using it in production code. This post is a result of the following discussions: Seq repeated execution unexpected behavior and Strange behavior of map function depending on first argument

The following code is using a method that returns a Seq[Int] containing random numbers and acts on them (prints the first one).

import scala.util.Random

object SeqTest {
  def main(args: Array[String]) {
    val randomInts : Seq[Int] = mkRandomInts()
    println(randomInts.first)
    println(randomInts.first)

    val randomIntsList = randomInts.toList
    println(randomIntsList.first)
    println(randomIntsList.first)
  }

  def mkRandomInts() = {
    val randInts = for {
      i <- 1 to 3
      val rand = i + (new Random).nextInt
    } yield rand
    randInts
  }
}
As you can see from the output below, the sequence is reevaluated each time it is accessed, returning a different random number. Once the code transforms the sequence into a List the results are stable.
-1867060800
312920158
-133186413
-133186413
Decompiling the code shows that mkRandomInts returns a scala.RandomAccessSeq.Projection which is a result of the Range we created using
i <- 1 to 3
As Jorge explained it: the "gotcha" is that, because of the way Scala's collections work, Range's laziness is "contagious" when you use functional for-comprehensions (for ... yield ...) and a Range is the first thing in the comprehension.
I'm not sure the Scaladoc explains that so well.

Martin and others explained that this is done to conserve memory, since we don’t really want to have a list of size 100000 when doing (0 to 100000). This is all nice and true, but it is not a good reason to have repeated execution each time the code accesses a data structure derived from iterating over a range.

The example above is not about clean/nice/efficient code; it's about the principle of least surprise (POLS). This behavior will cause (and in my case did cause) very hard to track bugs. It is also undocumented behavior, in spite of it existing in a very common pattern equivalent to Java’s
for (int i = 0; i < 100000; i++)
The implicit side effects of such optimization may be unacceptable. For example, the code executed in the for loop might change state in DB or file system or be CPU intensive. In the latter case, it would be very hard to understand why the application is so slow when repeatedly accessing elements of a sequence.

If Seq had force() on it then a defensive programmer would know to call it in such cases, but it does not (only RandomAccessSeq.Projection has it).
For example, every Java InputStream has close(), and Java programmers are accustomed to closing any stream they use in a finally block "just in case", even if for some of them (e.g. StringBufferInputStream) close does nothing. But if we educate programmers to force a Seq anywhere they see one, we create more boilerplate mess, and we don’t want that in Scala :-)

Having such a case forces the programmer to know about it and be ready for it => more potential bugs => an even steeper learning curve.

To conclude, the problem is twofold:
First, calling a chunk of code that (without explicit instruction) gets executed only when its derived output is accessed.
Second, even if lazy is cool and expected, repeated execution is not. We have lazy data structures all around, but they usually cache the data once they fetch it or, in the case of lazy iterators (like in JDBC), you need to explicitly recreate them.
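For readers who live on the Java side, the same trap can be reproduced with a few lines of plain Java. This is only an analogy, not what the Scala library does internally: lazyView is a hypothetical helper that builds a List recomputing its elements on every access (like the projection), and copying it into an ArrayList plays the role of toList/force.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

/** Hypothetical demo: a list view that re-runs its computation on every access. */
final class LazyViewDemo {

    /** Builds a read-only list whose element i is recomputed each time get(i) is called. */
    static <T> List<T> lazyView(final int size, final IntFunction<T> compute) {
        return new AbstractList<T>() {
            @Override
            public T get(int index) {
                if (index < 0 || index >= size) {
                    throw new IndexOutOfBoundsException("index: " + index);
                }
                return compute.apply(index);
            }

            @Override
            public int size() {
                return size;
            }
        };
    }

    public static void main(String[] args) {
        final int[] evaluations = {0};
        List<Integer> view = lazyView(3, i -> { evaluations[0]++; return i + 1; });

        view.get(0);
        view.get(0);
        System.out.println(evaluations[0]); // 2: the same element was computed twice

        List<Integer> strict = new ArrayList<Integer>(view); // materialize once
        int afterCopy = evaluations[0];
        strict.get(0);
        strict.get(0);
        System.out.println(evaluations[0] == afterCopy); // true: no further computation
    }
}
```

If the computation had a side effect (a DB write, a random number), every access to the view would repeat it, which is exactly the surprise described above.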

Tuesday, May 19, 2009

Eight Scala or Scala related talks in the coming JavaOne

There will be eight Scala or Scala related talks in the coming ScalaOne!
Oops, JavaOne


//Could be done nicer but (_ + _) wanted to be included
(0 /: (JavaOneTalks map {case ScalaRelated => 1; case _ => 0}))(_ + _)
  1. Actor-Based Concurrency in Scala
  2. The Feel of Scala
  3. Lift: The Best Way to Create Rich Internet Applications with Scala
  4. State: You're Doing It Wrong -- Alternative Concurrency Paradigms on the JVM™ Machine
  5. Performance Comparisons of Dynamic Languages on the Java™ Virtual Machine
  6. Alternative Languages on the JVM™ Machine
  7. Toward a Renaissance VM
  8. Script Bowl 2009: A Scripting Languages Shootout
res0: Int = 8
The weekend after JavaOne there is also the Scala Lift Off on Saturday, 6 June 2009, San Francisco

And not too far ahead there is the Silicon Valley Code Camp (10/3-10/4) with at least three Scala talks (I guess there will be more).

Monday, May 11, 2009

Scala and XML - Part 1

As you know, Scala has a great syntax for XML. Here are some lessons from working with it, starting with yet another Scala XML sample which creates HTML.

import scala.xml.XML._
import scala.xml.NodeBuffer
import scala.xml.dtd.{DocType, PublicID}
object ScalaHtml {
val words = List("one", "two", "three")
val url = "http://www.scala-lang.org/"
def main(args : Array[String]) : Unit = {
val page =
<html>
<a href={url}>Scala</a>
<h1>My words</h1>
<ul>{listOfWords(words)}</ul>
</html>;
save("save.html", page)

val doctype = DocType("html",
PublicID("-//W3C//DTD XHTML 1.0 Strict//EN",
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"),
Nil)
saveFull("save.full.html", page, false, doctype)
}
def listOfWords(words: List[String]) = {
val result = new NodeBuffer
for(word <- words) {result &+ (<li>{word}</li>)}
result
}
}

The output is
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<a href="http://www.scala-lang.org/">Scala</a>
<h1>My words</h1>
<ul><li>one</li><li>two</li><li>three</li></ul>
</html>
Where the first line (with the DOCTYPE) will be generated when using the second save option.
Notes:
[1] When using a variable as an attribute value (see the <a> tag) you must not wrap it with double quotes. The underlying code is doing something like this:
scala.xml.MetaData $md = new UnprefixedAttribute("href", url(), $md);
Object _tmp1 = null;
NodeBuffer $buf = new NodeBuffer();
$buf.$amp$plus(new Elem(null, "a", $md, Predef$.MODULE$.$scope(), $buf));

[2] You can't create a partial node, i.e. you can't create the opening tag of a node in one command, then do some other stuff and close it at the end. Scala node creation is atomic and immutable; if you wish to embed on-the-fly data then it must be simple text or Node* (for example a single Node or NodeBuffer). You can't have any statements in the curly brackets, though you may have a call to a method that does that in its own block.
Addendum [3] An emphasis on the string as a value in the curly brackets. Some (like myself) placed an Int or another primitive there, expecting something like string concatenation. Well, it won't work and you'll get this error:
error: overloaded method constructor UnprefixedAttribute with alternatives    [scalac](String,Option[Seq[scala.xml.Node]],scala.xml.MetaData)scala.xml.UnprefixedAttribute  (String,String,scala.xml.MetaData)scala.xml.UnprefixedAttribute  (String,Seq[scala.xml.Node],scala.xml.MetaData)scala.xml.UnprefixedAttribute cannot be applied to (java.lang.String,Int,scala.xml.MetaData)
[scalac] Cluster #{summary.id}
[scalac] ^

Saturday, May 09, 2009

Iterating over Map with Scala

Here is a summary of the discussion on iterating over a Scala Map, started by Ikai on the Scala mailing list.
Here is the Map we wish to iterate over.

scala> val m1 = Map[Int, Int](1->2, 2->3, 4->50)                
m1: scala.collection.immutable.Map[Int,Int] = Map(1 -> 2, 2 -> 3, 4 -> 50)
The options are:
Accessing the Tuple2 object structure.
scala> m1 foreach ( (t2) => println (t2._1 + "-->" + t2._2))
1-->2
2-->3
4-->50
Using case on the Tuple2
scala> m1 foreach {case (key, value) => println (key + "-->" + value)}
1-->2
2-->3
4-->50
Using a for loop
scala> for ((key, value) <- m1) println (key + "-->" + value) 
1-->2
2-->3
4-->50
Advanced: using unapply to expand the Tuple and use it in the match:
scala> object -> { def unapply[A,B](x : (A,B)) : Option[(A,B)] = Some(x) }
defined module $minus$greater

scala> m1 foreach {case k->v => println (k + "-->" + v)}
1-->2
2-->3
4-->50
I don't know why the next one does not work, any idea?
scala> m1 foreach (println (_._1 + "-->" + _._2))           
:6: error: missing parameter type for expanded function ((x$1, x$2) => x$1._1.$plus("-->").$plus(x$2._2))
m1 foreach (println (_._1 + "-->" + _._2))
^
If you know of other options, please add.

Monday, May 04, 2009

Beware of Scala’s type inference!

Scala's type inference can be unpleasant, creating problems that do not exist in Java (no inference) or in dynamically typed languages.

To be clear, Scala's type inference is awesome! Yet with great power comes great responsibility, and the library writer, and to some extent the consumer too, should know which rough edges to beware of.

Examine this library method, which returns a Map:

class MyLib {
  def getMap = Map("a"->"b")
}
And the client code that reads the map:
class MyClient {
  def useLib {
    val lib = new MyLib
    val map = lib.getMap
    useMap(map)
  }
  def useMap(map: Map[String, String]) = println("useMap " + map)
}
Looks nice and definitely works. After a while the library switches to a mutable HashMap for some imperative reason:
import scala.collection.mutable.HashMap
class MyLib {
  def getMap = {
    val map = new HashMap[String, String]()
    map += ("a"->"b")
    map += ("c"->"d")
    map
  }
}
Looks harmless enough, but when we run the client (still compiled against the old library) we get
java.lang.NoSuchMethodError: MyLib.getMap()Lscala/collection/immutable/Map;
MyClient.useLib(MyClient.scala)
What happened? The client used the Map trait, but its code was compiled with a dependency on the underlying implementation. Here is the library's Java representation using javap:
public class MyLib extends java.lang.Object implements scala.ScalaObject{
public test1.MyLib();
public scala.collection.mutable.HashMap getMap();
public int $tag() throws java.rmi.RemoteException;
}
While decompiling the client with JAD gives us
import java.rmi.RemoteException;
import scala.*;
import scala.collection.Map;
public class MyClient implements ScalaObject{
public MyClient(){}
public void useMap(Map map){
Predef$.MODULE$.println((new StringBuilder()).append("useMap ").append(map).toString());
}
public void useLib() {
MyLib lib = new MyLib();
scala.collection.immutable.Map map = lib.getMap();
useMap(map);
}
public int $tag() throws RemoteException{
return scala.ScalaObject.class.$tag(this);
}
}
So although the useMap method uses scala.collection.Map, the compiler references scala.collection.immutable.Map in useLib. It seems that it could use scala.collection.Map and thereby dodge some of the problem.
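The descriptor in the NoSuchMethodError above, getMap()Lscala/collection/immutable/Map;, hints at the underlying rule: the JVM links a call by method name plus parameter and return types, so for an already compiled client a changed return type is simply a different method. Here is a minimal Java sketch of that rule. It uses java.util.Map and java.util.HashMap as stand-ins for the Scala collection types, and DescriptorDemo with its method names is made up for illustration.

```java
import java.lang.invoke.MethodType;
import java.util.HashMap;
import java.util.Map;

final class DescriptorDemo {
    // Builds the JVM descriptor of a no-argument method with the given return type.
    static String descriptorOf(Class<?> returnType) {
        return MethodType.methodType(returnType).toMethodDescriptorString();
    }

    // A call site compiled against one descriptor only links against that exact descriptor.
    static boolean stillLinks(Class<?> compiledAgainst, Class<?> nowDeclared) {
        return descriptorOf(compiledAgainst).equals(descriptorOf(nowDeclared));
    }

    public static void main(String[] args) {
        System.out.println(descriptorOf(Map.class));              // ()Ljava/util/Map;
        System.out.println(descriptorOf(HashMap.class));          // ()Ljava/util/HashMap;
        System.out.println(stillLinks(Map.class, HashMap.class)); // false
    }
}
```

Declaring the return type explicitly keeps the descriptor stable no matter which implementation the method body builds.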
Now suppose we change the library to declare the return type explicitly:
import scala.collection.mutable.HashMap
import scala.collection.Map
class MyLib {
  def getMap: Map[String, String] = {
    val map = new HashMap[String, String]()
    map += ("a"->"b")
    map += ("c"->"d")
    map
  }
}
This provides the interface (using javap):
public class test1.MyLib extends java.lang.Object implements scala.ScalaObject{
public test1.MyLib();
public scala.collection.Map getMap();
public int $tag() throws java.rmi.RemoteException;
}
This gives the library writer the flexibility to change the Map implementation without breaking the client's code. Another solution could be a defensive client:
import scala.collection.Map
class MyClient {
  def useLib {
    val lib = new MyLib
    val map : Map[String, String] = lib.getMap
    useMap(map)
  }
  def useMap(map: Map[String, String]) = println("useMap " + map)
}
It looks ugly but it would not break the client when the library changes the map implementation.

Coding conventions should mandate explicit return types for external APIs. Obviously, it's also good practice for readability and self-documenting code. This issue is especially relevant for large projects that are broken up into binary-dependent modules.

Addendum

In view of the comments I need to add a clarification:
Java will also fail at build time (and runtime). To be clearer: in Java, changing the source of one class does not change the compiled output of other classes (though it could break their compilation). I don't know much about dynamic languages, but I believe the case is similar there (please correct me if I'm wrong). In Scala, on the other hand, if you rely on implicit (inferred) types all around, then changing the source code of a library class does change the compilation output of the consumer class, and this is the dangerous part.

In many projects there are binary dependencies between libraries and modules. Actually, most projects depend on some sort of external library and link to its jar. Compiled versions of the modules are kept in a repository, and unless you change their code they will not be rebuilt. Let's say I wish to upgrade a module from v1.0 to v2.0, and it is used by a client at v1.0. I build it with all the others and run some tests, and if all goes well I assume that the module is backward compatible and add it to the binary repository. Now the folks who build the product put all the jar files together, and they break at runtime!

Now, this is not a big deal in itself. The bigger deal is that the module is not really backward compatible, and you may not have all the sources handy to rebuild the clients from scratch, as in the unfortunate case of using a library that is not open source. Even if you do have the sources, it's a pain.

Saturday, May 02, 2009

Scala and Ant JUnit Batchtest

When absorbing Scala into an existing large project at work, you want it to play nicely with the build and tooling system (and the engineers).
Many of the common testing tools out there are based on Ant, JUnit and some test manager; at LinkedIn we use Hudson (a great tool, by the way). Writing your tests in JUnit and Scala is not a problem, and you might want to check out some of the Scala-specific testing libraries, which play nicely with JUnit.
A small problem you may encounter is with the batchtest tag of the JUnit task. We used it like this:

<attribute name="file-pattern" default="**/Test*.java, **/Test*.scala"/>
<junit fork="..." forkmode="..." dir="...">
...
<batchtest fork="..." todir="...">
...
<fileset dir="@{test-src-dir}" includes="${test.package.path}@{file-pattern}"/>
</batchtest>
</junit>
It does not work for Scala sources (only for Java) because the tag "generates a test class name for each resource that ends in .java or .class.". Actually, it might be just as well, since unlike Java, Scala does not force class names to match the names of the files they are declared in. The solution was to match against the classes in the build path, since both Scala and Java compile into *.class files in the same build directory. Note that we added the exclude pattern *$*.class, which rules out inner classes.

<attribute name="file-pattern" default="**/Test*.class"/>
<attribute name="excludes" default="**/*$*.class"/>
<junit fork="..." forkmode="..." dir="...">
...
<batchtest fork="..." todir="...">
...
<fileset dir="@{test-build-dir}" includes="${test.package.path}@{file-pattern}" excludes="@{excludes}"/>
</batchtest>
</junit>

Friday, May 01, 2009

Scala & Java interoperability: statics

Scala and Java interoperability is great. In most cases it's seamless, and it's a great way to introduce Scala into an existing code base. Actually, it has great benefits as I'm slowly absorbing Scala into one of the modules at LinkedIn. Artifacts can inherit from and call each other in the same compilation unit in a transparent way. Well... almost.

There is a small surprising factor when you get to statics and it has two parts.
Part 1: Scala <- Java
A Java class that extends another Java class containing static artifacts can use those statics, since they are inherited from the super class. A classic example is JUnit's TestCase, from the most popular Java test framework. Since I am trying to use the existing JUnit-based test framework, my Scala test classes extend TestCase, and as in the Java tests I wish to use the assert* methods in Assert. But here comes the surprise: Scala will not recognize those.
Consider the following example. Here is a simple Java class with two methods, one of them is static:

package test1.java;
public class SuperTest {
  protected String superMethod() {return "super";}
  public static String superStatic() {return "super static";}
}
And a Scala class extending it:
package test1
import test1.java.SuperTest
class Test extends SuperTest {
  def useSuper = println(superMethod)
  def useSuperStatic = println(superStatic)
}
Which gives us
.../src/test1/Test.scala:7: error: not found: value superStatic
def useSuperStatic = println(superStatic)
Note that scalac had no problem with superMethod; it's only superStatic that it could not find. Adding this import solves the problem:
import test1.java.SuperTest._


Part 2: Scala -> Java
There are no statics in Scala's syntax. The closest thing to a static is a Scala object, which is a singleton. Actually, Scala compiles the object's artifacts to statics, so from Java's point of view a Scala object is a final class (i.e. it can't be extended) with static members and methods. For example
package test1
object Test {
  def scalaStatic = "scala static"
}
gives us
javap bin/test1/Test
Compiled from "Test.scala"
public final class test1.Test extends java.lang.Object{
public static final java.lang.String scalaStatic();
public static final int $tag() throws java.rmi.RemoteException;
}
javap bin/test1/Test$
Compiled from "Test.scala"
public final class test1.Test$ extends java.lang.Object implements scala.ScalaObject{
public static final test1.Test$ MODULE$;
public static {};
public test1.Test$();
public java.lang.String scalaStatic();
public int $tag() throws java.rmi.RemoteException;
}
So we can do the following from Java
package test1.java;
import test1.Test;
public class JavaTest {
  public String usingScalaStatic() {return Test.scalaStatic();}
}
It works fine and is a good showcase of how to integrate nicely between the languages. But now assume that someone decides to add a Scala class named Test alongside the Test object
package test1
class Test {
  def hi = println("hi")
}

object Test {
  def scalaStatic = "scala static"
}
Scala is very happy about it, our Test class has a companion object. But for some reason Java breaks!
src/test1/java/JavaTest.java:5: cannot find symbol
symbol : method scalaStatic()
location: class test1.Test
public String usingScalaStatic() {return Test.scalaStatic();}
But we didn't change the object, we only added the class. As we learned in the previous post, the companion class messes things up a bit.
javap bin/test1/Test$ bin/test1/Test 
Compiled from "Test.scala"
public final class test1.Test$ extends java.lang.Object implements scala.ScalaObject{
public static final test1.Test$ MODULE$;
public static {};
public test1.Test$();
public java.lang.String scalaStatic();
public int $tag() throws java.rmi.RemoteException;
}

Compiled from "Test.scala"
public class test1.Test extends java.lang.Object implements scala.ScalaObject{
public test1.Test();
public void hi();
public int $tag() throws java.rmi.RemoteException;
}
So now we no longer have a static scalaStatic method. Scala code does not care about it, but it does matter when we want to integrate with Java. A solution could be something like
package test1.java;
import test1.Test$;
public class JavaTest {
  public String usingScalaStatic() {return (new Test$()).scalaStatic();}
}
But it's ugly and, even worse, we create another instance of the object Test, which is supposed to be a singleton! I wonder why the constructor is not private. Another option, maybe better but no less ugly, is
public class JavaTest {
  public String usingScalaStatic() {return Test$.MODULE$.scalaStatic();}
}
This works as well, since MODULE$ is the member that keeps the static reference to the singleton. Needless to say, this is a nasty side effect. Hopefully it will be fixed in Scala 2.8 along with the main bug.
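For readers more at home in Java, the two javap listings of the plain object can be read as roughly the following hand-written Java. This is only an approximation of the shape scalac 2.7 generates, not its actual output. TestModule and TestForwarder are hypothetical stand-ins for Test$ and for the forwarder class Test, and INSTANCE stands in for MODULE$.

```java
// Stand-in for Test$: the class that holds the object's real (instance) methods.
final class TestModule {
    // Stand-in for MODULE$: the one instance that Scala code always uses.
    public static final TestModule INSTANCE = new TestModule();

    // Public constructor, as in the javap listing; nothing stops Java from calling it.
    public TestModule() {}

    public String scalaStatic() {
        return "scala static";
    }
}

// Stand-in for the forwarder class Test, which exists only while there is no companion class.
final class TestForwarder {
    private TestForwarder() {}

    // Static forwarder: this is what lets Java write Test.scalaStatic().
    public static String scalaStatic() {
        return TestModule.INSTANCE.scalaStatic();
    }
}

final class ForwarderDemo {
    public static void main(String[] args) {
        System.out.println(TestForwarder.scalaStatic());             // via the static forwarder
        System.out.println(TestModule.INSTANCE.scalaStatic());       // via the singleton field
        System.out.println(new TestModule() == TestModule.INSTANCE); // false: a second instance
    }
}
```

Once a companion class takes over the name Test, the forwarder is gone and only the route through the singleton field remains.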

Scala's main issue

From Chris Gioran: For those arriving here from googling up this issue, I would like to point out that it has been fixed @2.8.0
Now main methods defined in companion objects are generated static and will work as expected. The ticket has also been closed as fixed.

I love Scala, I really do, and I think it's the next big thing. My main issue with the language is its unclear and sometimes misleading compiler errors.

My working practice when learning a new technology, and I think it's a common one, is to read a bit about it and then try it out. If something goes wrong, I google the error message, read a bit more, and master it as I go. It means error messages are actually as important as (and maybe more important than) the documentation, especially for beginners.

Since Scala is a relatively young and very rich language, there is not enough writing on the web about its rough spots. In that light it is even more important to provide clear documentation about these spots. To be sure, Scala has its share of rough spots, and I don't mean that other languages have fewer of them. To paraphrase Robert Glass: "Scala is a very bad language, but all the others are so much worse".

Now to the point. I was trying to write this very simple Scala app with a main method:
package test1
import test1.java.SuperTest

class Test extends SuperTest {
  def useSuper {
    print(superMethod())
  }
}

object Test {
  def main(args: Array[String]): Unit = {
    val t = new Test
    print(t.useSuper)
  }
}
I tried to run its main method but got
java.lang.NoSuchMethodException: test1.Test.main([Ljava.lang.String;)
Ok, I said, I know that the companion object is compiled to Test$ and the main method is defined in the object, not in the class. So I tried to run Test$ but then I got
main() method in test1.Test$ is not declared static
I checked the class and found that it's true:
$ javap build/test1/Test$
Compiled from "Test.scala"
public final class test1.Test$ extends java.lang.Object implements scala.ScalaObject{
public static final test1.Test$ MODULE$;
public static {};
public test1.Test$();
public void main(java.lang.String[]);
public int $tag()       throws java.rmi.RemoteException;
}
Wow! What's going on here? All the code examples I found claim that that's the way to go. After banging my head a bit I removed the class definition and left the object as a plain object (not a companion object anymore), and it worked! It means that there is an undocumented side effect to having a companion (as all of us married folks know). After figuring out the solution I googled a bit more and found that it's a bug that might be solved in Scala 2.8.0. But since it has been open for more than 16 months now, and the compiler knows it's an edge case, I think this kind of issue deserves a special compiler warning.
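The "not declared static" message reflects the launcher rule of the time: the entry point has to be a public static void main(String[]). Below is a small Java sketch of such a check using reflection. MainCheck and its two nested classes are hypothetical, and the real launcher is not necessarily implemented this way.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

final class MainCheck {
    // Shaped like an object without a companion class: main is static.
    static final class StaticMain {
        public static void main(String[] args) {}
    }

    // Shaped like Test$ above: main exists but is an instance method.
    static final class InstanceMain {
        public void main(String[] args) {}
    }

    // True only if the type has a public static void main(String[]).
    static boolean hasPublicStaticMain(Class<?> type) {
        try {
            Method main = type.getMethod("main", String[].class);
            return Modifier.isStatic(main.getModifiers()) && main.getReturnType() == void.class;
        } catch (NoSuchMethodException e) {
            return false; // the NoSuchMethodException case: no public main at all
        }
    }

    public static void main(String[] args) {
        System.out.println(hasPublicStaticMain(StaticMain.class));   // true
        System.out.println(hasPublicStaticMain(InstanceMain.class)); // false
        System.out.println(hasPublicStaticMain(Object.class));       // false
    }
}
```

The two false cases correspond to the two errors above: running Test finds no main at all, and running Test$ finds one that is not static.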

Thursday, April 09, 2009

resolving WstxUnexpectedCharException

I just ran into this exception when parsing news articles from the web:

Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: 
Illegal character ((CTRL-CHAR, code 19))
at [row,col {unknown-source}]: [1186,417]
at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace
at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary
at com.ctc.wstx.sr.BasicStreamReader.finishToken
at com.ctc.wstx.sr.BasicStreamReader.next
at org.codehaus.stax2.ri.Stax2EventReaderImpl.peek
The problem appeared to be a control character in one of the non-English articles. To solve it, simply remove the control chars from the text using:
str.replaceAll("\\p{Cntrl}", "")
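Note that \p{Cntrl} also matches tab, line feed and carriage return, so the one-liner above flattens line breaks too. A minimal Java sketch of both variants follows. XmlTextCleaner and its method names are made up for illustration, and the second pattern is derived from the XML 1.0 character range, which is an assumption about what the parser accepts rather than something checked against Woodstox.

```java
final class XmlTextCleaner {
    // Same as the one-liner above: drops every ASCII control character (0x00-0x1F and 0x7F).
    static String stripAllControlChars(String text) {
        return text.replaceAll("\\p{Cntrl}", "");
    }

    // Drops only the control characters XML 1.0 does not allow; keeps tab, LF and CR.
    static String stripXmlIllegalControlChars(String text) {
        return text.replaceAll("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]", "");
    }

    public static void main(String[] args) {
        // (char) 19 is the character code reported in the exception above.
        String article = "first line\n" + "bad" + (char) 19 + "char";
        System.out.println(stripAllControlChars(article));        // line break removed as well
        System.out.println(stripXmlIllegalControlChars(article)); // line break kept
    }
}
```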

Wednesday, April 08, 2009

Protocol Buffers forward + backward compatibility demo using Scala and Voldemort

After the long benchmarking session, which is still not over, I came to understand that engineers cling too much to numbers, making them the first and only impression when evaluating a library.
So here is a small demo of one of the nicer protobuf features: forward and backward compatibility.
[flame warning]show me how you do it with json[end flame warning].
Ok, first check out the protobuf-object-competability-example project:

git clone git://github.com/eishay/protobuf-object-competability-example.git

Open the file protobuf/user.proto and make it look like this:
package test;

option java_package = "test";
option java_outer_classname = "UserPBO";

option optimize_for = SPEED;

message User {
  required uint32 id = 1;
  optional string name = 2;
  repeated string email = 3;
}
Now, let's compile it
protoc --java_out=src protobuf/user.proto
ant compile
Good! We're ready. Run the Voldemort server and a Scala interactive client
bin/voldemort-server.sh . &
bin/voldemort-scala-shell.sh
Now we're starting the actual demo. Let us call this session the sign in service
Welcome to Scala version 2.7.3.final (Java HotSpot(TM) Client VM, Java 1.5.0_16).
Type in expressions to have them evaluated.
Type :help for more information.

scala> import voldemort._
import voldemort._

scala> import test.UserPBO._
import test.UserPBO._

//create a new protobuf object from scratch
scala> val user = User.newBuilder.setId(1).setName("Joe Smith").addEmail("joe@gmail.com").addEmail("joe2@yahoo.com").build
user: test.UserPBO.User =
id: 1
name: "Joe Smith"
email: "joe@gmail.com"
email: "joe2@yahoo.com"

//create a new voldemort client
scala> val vclient = new VClient[String, User] ("proto-store", "tcp://localhost:6666")
[2009-04-08 23:49:14,179] INFO Client /127.0.0.1:59028 connected. (voldemort.server.socket.SocketServer)
Established connection to proto-store via tcp://localhost:6666
vclient: voldemort.VClient[String,test.UserPBO.User] =
store name : proto-store
bootstrap url : tcp://localhost:6666
key serializer: StringSerializer
val serializer: ProtoBufSerializer

//push the user object to voldemort
scala> vclient put (user.getId.toString, user)
Now let's open another session in a new terminal. Don't close the sign in service session yet!
But before opening the new session, we just got a notice that the User object can no longer store the list of emails and from now on it stores a new membership object!
So our new protobuf object looks like this:
package test;

option java_package = "test";
option java_outer_classname = "UserPBO";

option optimize_for = SPEED;

//new membership class!
message Membership {
  enum Type {
    REGULAR = 0;
    PRO = 1;
  }
  required bool active = 1;
  optional Type type = 2 [default = REGULAR];
}

//wow, where are the email list??
message User {
  required uint32 id = 1;
  optional string name = 2;
  optional Membership membership = 4;
}
Don't forget to compile using protoc/ant. OK, now it's time to open the new session, which we call the membership service. Remember, the sign in service has the old definition of the user object and the membership service has the new one.
Welcome to Scala version 2.7.3.final (Java HotSpot(TM) Client VM, Java 1.5.0_16).
Type in expressions to have them evaluated.
Type :help for more information.

scala> import voldemort._
import voldemort._

scala> import test.UserPBO._
import test.UserPBO._

//create the client
scala> val vclient = new VClient[String, User] ("proto-store", "tcp://localhost:6666")
Established connection to proto-store via tcp://localhost:6666
vclient: voldemort.VClient[String,test.UserPBO.User] =
store name : proto-store
bootstrap url : tcp://localhost:6666
key serializer: StringSerializer
val serializer: ProtoBufSerializer

//get the user we created in the sign in service
//note that it doesn't recognize the email list, but it still keeps it around
scala> val user = vclient get "1"
version(0:1)
user: test.UserPBO.User =
id: 1
name: "Joe Smith"
3: "joe@gmail.com"
3: "joe2@yahoo.com"

//append to the user object the membership details
scala> val newUser = User.newBuilder(user).setMembership(Membership.newBuilder.setActive(true).setType(Membership.Type.PRO).build).build
newUser: test.UserPBO.User =
id: 1
name: "Joe Smith"
membership {
active: true
type: PRO
}
3: "joe@gmail.com"
3: "joe2@yahoo.com"

//push the result back into voldemort
scala> vclient put (newUser.getId.toString, newUser)
Now let's go back to our sign in service and do this:

scala> //gets the user back from voldemort.
//It can still recognize all the good members it is used to
//as for the new ones, it can't recognize them, but it does not care
val user2 = vclient get "1"
version(0:2)
user2: test.UserPBO.User =
id: 1
name: "Joe Smith"
email: "joe@gmail.com"
email: "joe2@yahoo.com"
4: "\b\001\020\001"


Conclusion: with protobuf, when you change an object you only need to update the services that use the changed members. All other services, even if they use that object, need not care about the change.
Disclaimer: This post does not intend to be a full protobuf tutorial, it focuses on a single protobuf feature and omits the rest.

Thursday, March 26, 2009

Benchmarking is tricky

Once again I found a flaw in the benchmarking. Thanks to Ismael Juma's challenge, I fixed the benchmark to be more fair. It seemed like protobuf was taking the lead, but then I decided to provide a fresh object to serialize for each serializer each time, and json came out on top again. Please, everyone, review the fairness of the code.
Thanks also to Chris Pettitt, who pointed me to the -XX:CompileThreshold flag that helps the JIT get into business sooner rather than later; it might have contributed to the change in the results. The full results are in the java benchmarking serialization project wiki page.
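Both fixes can be sketched in a few lines of Java: run the code under test for a while before measuring, so the JIT has a chance to compile it, and hand each measured run a freshly created object that is built outside the timed region. MiniBenchmark is a hypothetical helper, not the code used in the benchmark project, and the run counts are arbitrary.

```java
import java.util.Objects;
import java.util.function.Function;
import java.util.function.Supplier;

final class MiniBenchmark {
    // Accumulates results so the measured work is not trivially unused.
    static long sink;

    // Average time in nanoseconds of work.apply(...) over measuredRuns fresh inputs.
    static <T, R> double averageNanos(Supplier<T> freshInput, Function<T, R> work,
                                      int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            sink += Objects.hashCode(work.apply(freshInput.get())); // warm-up, not timed
        }
        long total = 0;
        for (int i = 0; i < measuredRuns; i++) {
            T input = freshInput.get();          // fresh object, created outside the timed region
            long start = System.nanoTime();
            R result = work.apply(input);
            total += System.nanoTime() - start;
            sink += Objects.hashCode(result);
        }
        return (double) total / measuredRuns;
    }

    public static void main(String[] args) {
        double nanos = MiniBenchmark.<StringBuilder, String>averageNanos(
                () -> new StringBuilder("user-").append(System.nanoTime()),
                StringBuilder::toString,
                20_000, 20_000);
        System.out.println("average ns per call: " + nanos);
    }
}
```

Timing a single call with System.nanoTime is coarse, so a sketch like this is only good for spotting large differences.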

Monday, March 23, 2009

Protocol Buffers pitfalls

There are a few rough edges to protobuf and Java.
One I noticed today is that if you have an enum field and do not define a default value, the protobuf auto-generated code sets a default for you (the first enum value). I expected it to be null, and this mistake cost me a nasty bug.
Conclusion: you must set a default for protobuf enums, as in:

enum Player {
  UNKNOWN = 0;
  MP3 = 1;
  VIDEO = 2;
}
optional Player player = 10 [default = UNKNOWN];
The other one I saw is with a repeated string field like:
repeated string person = 9;

If you set no value in the list then on serialization it blows up with an NPE when it tries to check the encoding of the string.
Conclusion: Avoid. If you must then embed it in another object which you can then use as a list item.

Sunday, March 22, 2009

Moved new benchmarking discussion to project wiki

There are many more interesting benchmarking results. Please check them out in the project wiki. Special thanks to David Bernard for the last updates.
It's probably best that further updates and discussions be held on the project wiki and google group. I had my first opportunity to use the Google Charts API and found it to be rather slick. Check out the wiki to see the full results.

Friday, March 20, 2009

Listen to JavaPosse Podcast, now also at LinkedIn

Just pushed the java posse podcast to the Java Posse LinkedIn group.

You can listen to the podcast from the page. Thanks to Armin and Scott for embedding the mp3 player!

Tuesday, March 17, 2009

More on benchmarking java serialization tools

The serialization benchmarking discussed in previous posts is getting more interesting. Thanks to all who looked at the code, contributed, made suggestions and pointed out bugs. Three major contributions are from cowtowncoder, who fixed the stax code, Chris Pettitt, who added the json code, and David Bernard, for the xstream and java externalizable code. Most of the code is in the google code svn repository.
The charts are scaled and some are chopped. So if you’re interested in exact numbers, here they are:

Library, Object create (ns), Serialization (ns), Deserialization (ns), Serialized size (bytes)
java, 113.23390, 17305.80500, 72637.29300, 845
xstream default, 116.40035, 119932.61000, 171796.68850, 931
json, 112.58555, 3324.76450, 5318.12600, 310
stax, 113.05025, 6172.06000, 9566.96200, 406
java (externalizable), 99.76580, 6250.40100, 18970.58100, 315
thrift, 174.72665, 4635.35750, 5133.24450, 314
scala, 66.10890, 27047.10850, 155413.44000, 1473
protobuf, 250.37140, 3849.69050, 2416.94800, 217
xstream with conv, 115.22810, 13492.50250, 47056.58750, 321

Serialized size (bytes), less is better.
May vary a lot depending on the number of repetitions in lists, usage of number compacting in protobuf, strings vs numerics and more. An interesting point is Scala and Java, which hold the names of the classes in the serialized form, i.e. longer class names = larger serialized form. In Scala it's worse, since the Scala compiler creates more implicit classes than Java.
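The class-name effect is easy to reproduce with plain Java serialization. Here is a minimal sketch, in which the two nested classes are hypothetical and differ only in the length of their names:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class SerializedSizeDemo {
    static final class U implements Serializable {
        private static final long serialVersionUID = 1L;
        int id = 1;
    }

    // Same single field as U; only the class name is longer.
    static final class UserWithAVeryLongAndDescriptiveClassName implements Serializable {
        private static final long serialVersionUID = 1L;
        int id = 1;
    }

    // Number of bytes Java serialization produces for a single object.
    static int serializedSize(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        System.out.println("short name: " + serializedSize(new U()) + " bytes");
        System.out.println("long name:  "
                + serializedSize(new UserWithAVeryLongAndDescriptiveClassName()) + " bytes");
    }
}
```

The object with the longer class name serializes to more bytes even though the payload is identical, because the class descriptor written to the stream includes the fully qualified name.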

Deserialization in nanoseconds. The most expensive operation. Note that the xstream and Scala lines got trimmed.

Serialization (nanoseconds), way faster than deserialization.

Object creation, not so meaningful since it takes on average 100 nanoseconds to create an object. The surprise comes from protobuf, which takes a very long time to create an object. It's the only point in this set of benchmarks where it didn't perform as well as thrift. Scala (and to a lesser extent Java), on the other hand, is fast. It seems like a good language for handling in-memory data structures, but when it comes to serialization you might want to check the alternatives.

Wednesday, March 04, 2009

Thrift vs Protocol Buffers in Python

I've read Justin's post about thrift and protocol buffers and verified the results. I also find it hard to understand why protobuf is considerably slower than thrift.
In the example Justin did not add the line

option optimize_for = SPEED;
but it appears that it does not have any effect on performance. A bit strange, since it definitely appears in the protobuf python docs.
Anyway, as stated in the java protobuf/thrift post, it seems that at least in java protobuf performs better than thrift, and there is a great performance improvement with the "optimize_for" option.

The test without speed optimization:
5000 total records (0.577s)

get_thrift (0.031s)
get_pb (0.364s)

ser_thrift (0.277s) 555313 bytes
ser_pb (1.764s) 415308 bytes
ser_json (0.023s) 718640 bytes
ser_cjson (0.028s) 718640 bytes
ser_yaml (6.903s) 623640 bytes

ser_thrift_compressed (0.329s) 287575 bytes
ser_pb_compressed (1.758s) 284423 bytes
ser_json_compressed (0.067s) 292871 bytes
ser_cjson_compressed (0.075s) 292871 bytes
ser_yaml_compressed (6.949s) 291236 bytes

serde_thrift (0.725s)
serde_pb (3.156s)
serde_json (0.055s)
serde_cjson (0.045s)
serde_yaml (20.339s)
And with speed optimization:
5000 total records (0.577s)

get_thrift (0.031s)
get_pb (0.364s)

ser_thrift (0.275s) 555133 bytes
ser_pb (1.752s) 415166 bytes
ser_json (0.023s) 718462 bytes
ser_cjson (0.028s) 718462 bytes
ser_yaml (6.925s) 623462 bytes

ser_thrift_compressed (0.330s) 287673 bytes
ser_pb_compressed (1.767s) 284419 bytes
ser_json_compressed (0.067s) 293012 bytes
ser_cjson_compressed (0.078s) 293012 bytes
ser_yaml_compressed (7.038s) 290980 bytes

serde_thrift (0.723s)
serde_pb (3.125s)
serde_json (0.056s)
serde_cjson (0.046s)
serde_yaml (20.318s)
As noted before, there is no noticeable difference. It would be interesting to run the same test in java.
Anyway, the conclusion is that the language, and probably the data structure, counts when deciding which serialization method to pick, and results in one language do not necessarily carry over to the next.

Tuesday, March 03, 2009

Protobuf String serialization

One of the reasons I hear for choosing XML over protobuf is human readability. Actually, protobuf has a human-readable string representation, perhaps more readable than XML. The cost, of course, is time and space, but when testing at low scale where the storage is a relational DB, I use the text representation for debugging purposes.
The memory cost is heavy: about 500% more than the binary representation and 30% more than java serialization (which tells us something about java serialization).

Note that the numbers may vary according to the object size and the data types it uses (numbers/chars). The one I tested with is mostly double, uint32 and sfixed64, which are better represented in binary than in text, so the 500% is more understandable.
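A rough feel for why numeric fields inflate in text form: a double or an sfixed64 value always takes 8 bytes in binary, while its decimal text can take more than twice that, before field names are even counted. Below is a small Java sketch of the arithmetic. It does not use the protobuf library, and TextVsBinary is a made-up name.

```java
import java.nio.charset.StandardCharsets;

final class TextVsBinary {
    // Bytes needed to write a double as decimal ASCII text.
    static int textBytes(double value) {
        return Double.toString(value).getBytes(StandardCharsets.US_ASCII).length;
    }

    // Bytes needed to write a 64-bit integer as decimal ASCII text.
    static int textBytes(long value) {
        return Long.toString(value).getBytes(StandardCharsets.US_ASCII).length;
    }

    public static void main(String[] args) {
        // A fixed-width binary double or 64-bit integer is always 8 bytes.
        System.out.println("double as text: " + textBytes(Math.PI) + " bytes, binary: " + Double.BYTES);
        System.out.println("long as text:   " + textBytes(Long.MAX_VALUE) + " bytes, binary: " + Long.BYTES);
    }
}
```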

Thursday, February 19, 2009

Tweeting from the Scala interpreter

Verifying ricky_clarkson's post. Nice!


.oO(esmith@esmith-md jtwitter) wget
http://www.winterwell.com/software/jtwitter/jtwitter.jar
--17:04:20-- http://www.winterwell.com/software/jtwitter/jtwitter.jar
=> `jtwitter.jar'
Resolving www.winterwell.com... done.
Connecting to www.winterwell.com[91.197.32.158]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 197,281 [application/java-archive]

100%[==========>] 197,281 175.94K/s ETA 00:00

17:04:22 (175.94 KB/s) - `jtwitter.jar' saved [197281/197281]

.oO(esmith@esmith-md jtwitter) scala -classpath jtwitter.jar
Welcome to Scala version 2.7.3.final (Java HotSpot(TM)
Client VM, Java 1.5.0_16).
Type in expressions to have them evaluated.
Type :help for more information.

scala> import scala.collection.jcl.Conversions._
import scala.collection.jcl.Conversions._

scala> import winterwell.jtwitter.Twitter
import winterwell.jtwitter.Twitter

scala> val twitter = new Twitter("eishay", "*****")
twitter: winterwell.jtwitter.Twitter =
winterwell.jtwitter.Twitter@d5cac4

scala> twitter.getFriendsTimeline() take 10 map (s => s.getUser +
":" + s) foreach println
Slashdot:Security Researcher Kaminsky Pushes DNS Patching
http://tinyurl.com/cffent
Joe Nuxoll:Hey - everyone going to the Java Posse Roundup 2009,
please follow the twitter account @JPR09
WebGuild:Why Sun's New Cloud CTO Targeting Migration of Legacy
Apps First: By Chris Preimesberger Sun's new cloud computi..
http://tinyurl.com/c7xhv9
WebGuild:Mobile Ads Stick: According to a study released today
by eMarketer, and reported on paidContent.org, mobile ads ..
http://tinyurl.com/ary96n
WebGuild:Webstock 09 : Russ Weakley: Works at the Australian
Museum. Had an idea for the museum web site 4 years ago and
.. http://tinyurl.com/demodm
WebGuild:Cloud Computing Is a Tool, Not a Strategy: This week
I'm listening in on HP talk to some of its customers about
.. http://tinyurl.com/d42jsa
popurls:Leaked Photo of the Next-Generation Mac Mini?
http://pop-go.com/15r
ZDNet Blogs:The message from GSMA Barcelona: fragmentation -
http://tinyurl.com/cymzj4
linkedin_news:"Facebook retains terms of service after users
voice concerns" by usa today [Facebook 100% ] http://tinyurl.com/cvqtnj
Technology Geek:Lifehacker - Best Live CDs? [Hive Five Call For Contenders] http://bit.ly/nElqh

scala> twitter.setStatus("Tweeting from the Scala interpreter")
res7: winterwell.jtwitter.Twitter.Status = Tweeting from the Scala interpreter

This work by Eishay Smith is licensed under a Creative Commons Attribution 3.0 Unported License.