Tuesday, December 29, 2009

Attaching a Java debugger to the Scala REPL

Originally posted on the kaChing Eng Blog.

I'm using the Scala REPL to play around with Java libraries and check their runtime behavior. One of the things I use it for is to check how Voldemort's client behaves in different setups. For one of the checks I wanted to trace the client threads with an IDE debugger.
To attach a debugger to the Scala REPL, all you need to do is export the debugger flags in JAVA_OPTS:

export JAVA_OPTS='-Xdebug -Xrunjdwp:transport=dt_socket,address=1773,server=y,suspend=n'
Run your Scala REPL and attach your debugger to port 1773. Done.
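If you don't have an IDE handy, you can also verify that the agent is listening by attaching the command-line jdb debugger (the hostname and port here simply mirror the JAVA_OPTS above):

jdb -connect com.sun.jdi.SocketAttach:hostname=localhost,port=1773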

(*) Tested with Scala 2.7

Wednesday, December 09, 2009

Baking availability SLA into the code

This is a copy of the post I made on the kaChing engineering blog.

Availability and Partition Tolerance are essential for many distributed systems. A simple (though not comprehensive) way to measure both is to use response-time SLAs between services, as implied by Jeff Darcy's observation:

Lynch (referring to the 2002 SIGACT paper) also makes the point that unbounded delay is indistinguishable from failure. Time is therefore an essential component of these definitions (a point made even more explicitly in the Dynamo paper).
At kaChing we think that SLAs are something that should be baked into the code, so the developer has to think about them while creating the service contract. For that reason we created a @ResponseTime annotation for our internal services:
import static java.lang.annotation.ElementType.TYPE;
import static java.lang.annotation.RetentionPolicy.RUNTIME;
import static java.util.concurrent.TimeUnit.MILLISECONDS;
import java.lang.annotation.Retention;
import java.lang.annotation.Target;
import java.util.concurrent.TimeUnit;

@Retention(RUNTIME)
@Target({ TYPE })
public @interface ResponseTime {
  long warn();   // threshold that triggers a warning log entry
  long error();  // threshold at which the container may terminate the query
  TimeUnit unit() default MILLISECONDS;
}
A typical service query for online requests is annotated with
@ResponseTime(warn = 100, error = 200)
where the thresholds depend on the service's constraints, such as access to external resources. A ping query, for example, has a
@ResponseTime(warn = 2, error = 4)
and an offline analytics call may take hours:
@ResponseTime(warn = 10, error = 15, unit = HOURS)
Nevertheless, every request should have an SLA, and the developer must think about it when writing the implementation.
Once we have this tool, the SLAs of the downstream services (B, C & D) that a service (A) needs to call can be statically computed to verify that no path in the tree of downstream calls exceeds the root service's (A) SLA. In other words, for service queries A, B, C & D: if A calls B, B calls C, and in parallel A calls D, then we should have SLA(A) > max(SLA(B) + SLA(C), SLA(D)).
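A minimal sketch of that static check (the class and names here are hypothetical, not kaChing's implementation; in practice the budgets would come from each service's @ResponseTime annotation):
import java.util.Arrays;
import java.util.List;

public class SlaTreeCheck {

  static class ServiceNode {
    final String name;
    final long errorMillis;        // the service's error() threshold, in millis
    final List<ServiceNode> calls; // downstream services this one calls

    ServiceNode(String name, long errorMillis, ServiceNode... calls) {
      this.name = name;
      this.errorMillis = errorMillis;
      this.calls = Arrays.asList(calls);
    }

    // Worst-case latency of the longest path of downstream calls.
    long worstDownstreamPath() {
      long deepest = 0;
      for (ServiceNode callee : calls) {
        deepest = Math.max(deepest, callee.errorMillis + callee.worstDownstreamPath());
      }
      return deepest;
    }
  }

  public static void main(String[] args) {
    ServiceNode c = new ServiceNode("C", 50);
    ServiceNode b = new ServiceNode("B", 60, c);
    ServiceNode d = new ServiceNode("D", 80);
    ServiceNode a = new ServiceNode("A", 200, b, d);
    // SLA(A) must exceed max(SLA(B) + SLA(C), SLA(D)) = max(110, 80) = 110.
    System.out.println("A's SLA holds: " + (a.errorMillis > a.worstDownstreamPath()));
  }
}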
kaChing's service query container samples the time it takes for every call using perf4j and logs the times and SLA violations to local disk. If a query's time exceeds the warning threshold it is logged accordingly, but if the error threshold is broken the container will try to terminate the query. Terminating the query is a bit harsh, but since our client timeouts are derived from the same SLA, chances are that the client has already given up on the query and either retried (service queries are idempotent) or aborted. Another reason to shoot down a runaway query that exceeds the SLA error time is that it may be locking or consuming resources needed by other queries, slowing the whole system down.
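A rough sketch of what the container-side timing might look like, assuming a perf4j StopWatch wrapped around each query (the TimedQueryRunner and its names are hypothetical, not kaChing's actual container, and the termination step is reduced here to a log line):
import java.util.concurrent.Callable;
import org.perf4j.StopWatch;
import org.perf4j.log4j.Log4JStopWatch;

public class TimedQueryRunner {

  // Runs a query while timing it; thresholds come from the query class's
  // @ResponseTime annotation (so the annotation must have RUNTIME retention).
  public static <T> T run(Object query, Callable<T> work) throws Exception {
    ResponseTime sla = query.getClass().getAnnotation(ResponseTime.class);
    StopWatch watch = new Log4JStopWatch(query.getClass().getSimpleName());
    try {
      return work.call();
    } finally {
      watch.stop(); // perf4j logs the elapsed time through log4j
      long elapsed = watch.getElapsedTime(); // milliseconds
      if (elapsed > sla.unit().toMillis(sla.error())) {
        // the real container goes further and tries to terminate the query
        System.err.println(watch.getTag() + " broke its SLA error threshold");
      } else if (elapsed > sla.unit().toMillis(sla.warn())) {
        System.err.println(watch.getTag() + " exceeded its SLA warn threshold");
      }
    }
  }
}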
The perf4j messages are also piped through a log4j socket appender (with async buffers on each side of the pipe) to a central hub. The hub computes statistics on the aggregated times, loads the SLAs of the queries and checks that the cluster is not violating them. The central hub can then send a daily SLA report and real-time alerts pinpointing a slower-than-expected service. Keeping the reports and comparing them to historical ones helps us see improvements and regressions in every part of the site.
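The sending side of that pipe could be wired up along these lines with the log4j 1.x API (a sketch only: the hub's hostname and port are placeholders, and kaChing's actual setup is presumably file-based configuration):
import org.apache.log4j.AsyncAppender;
import org.apache.log4j.Logger;
import org.apache.log4j.net.SocketAppender;

public class PerfLogShipping {
  public static void configure() {
    // Ship perf4j timing messages to the central hub, buffered asynchronously
    // so a slow network hop never blocks the service's own threads.
    SocketAppender socket = new SocketAppender("hub.example.com", 4560);
    AsyncAppender async = new AsyncAppender();
    async.addAppender(socket);
    // org.perf4j.TimingLogger is perf4j's default logger name.
    Logger.getLogger("org.perf4j.TimingLogger").addAppender(async);
  }
}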

This monitoring technique is only a small part of the automated systems operations a small startup must have in order to stay flexible. Stay tuned for more monitoring automation posts.

This work by Eishay Smith is licensed under a Creative Commons Attribution 3.0 Unported License.