Tuesday, June 09, 2009

Unexpected repeated execution in Scala

The following Scala Easter egg is the one of the most dangerous "features" of the language.

I know there might be flaming involved, but be sure its not my intention. I like Scala a lot and I'm happily using it in production code. This post is a result of the following discussions: Seq repeated execution unexpected behavior and Strange behavior of map function depending on first argument

The following code is using a method that returns a Seq[Int] containing random numbers and acts on them (prints the first one).

object SeqTest {
def main(args: Array[String]) {
val randomInts : Seq[Int] = mkRandomInts()
println(randomInts.first)
println(randomInts.first)

val randomIntsList = randomInts.toList
println(randomIntsList.first)
println(randomIntsList.first)

def mkRandomInts() = {
val randInts = for {
i <- 1 to 3
val rand = i + (new Random).nextInt
} yield rand
randInts
}
}
As you can see from the output blow the sequence is reevaluated each time it is accessed, returning different random number. Once the code transforms the sequence into a List the results are stable.
-1867060800
312920158
-133186413
-133186413
Decompiling the code shows that mkRandomInts returns a scala.RandomAccessSeq.Projection which is a result of the Range we created using
i <- 1 to 3
As Jorge’s explained it: the "gotcha" is that, because of the way Scala's collections work, Range's laziness is "contagious" when you use functional for-comprehensions (for ... yield ...) and a Range is the first thing in the comprehension.
I'm not sure the Scaladoc explains that so well.

Martin and others explained that it comes to conserve memory since we don’t really want to have a list in size of 100000 when doing (0 to 100000). This is all nice and true, but it is not a good reason to have a repeated execution each time the code access a data structure derived from iteration on a range.

The example above is not about clean/nice/efficient code its about the principle of least surprise (POLS). This behavior will cause (in my case did cause) very hard to track bugs. It is also an undocumented behavior, in spite of it existing in a very common pattern equivalent to java’s
for (int i = 0; i < 100000; i++)
The implicit side effects of such optimization may be unacceptable. For example, the code executed in the for loop might change state in DB or file system or be CPU intensive. In the latter case, it would be very hard to understand why the application is so slow when repeatedly accessing elements of a sequence.

If Seq would have force() on it then a protective act would know to call it in such cases, but it does not have it (only RandomAccessSeq.Projection got it).
For example, every java InputStream has close() and java programmers are accustomed to close any stream they use in a finally "just in case", even if in some of them (e.g. StringBufferInputStream) close does nothing. But if we'll educate programmers to force a Seq anywhere we see it we create more boilerplate mess and we don’t want it in Scala :-)

Having such case forces the programmer to know about it and be ready => more potential bugs that will happen => learning curve is even higher.

To conclude, the problem is two fold:
First calling a chunk of code that (without explicit instruction) gets executed only when its derived output is accessed.
Second even if lazy is cool and expected, repeated execution is not. We have lazy data structures all around but they usually cache the data once they fetch it or in the case of lazy iterators (like in jdbc), you need to explicitly recreate them.

5 comments:

Jorge Ortiz June 9, 2009 at 10:46 PM  

In Scala 2.8, all collections will have a "force" method that returns a strict version of the collection.

To avoid this problem in the future, never use (a to b) with "for ... yield ..." comprehension, but only with "for ... { ... }" comprehensions.

Scala allows for three kinds of evaluation:

List.range(a, b) - strict evaluation
Stream.range(a, b) - lazy evaluation, evaluated once and cached
(a to b) - lazy evaluation, evaluated each time it is used

You can choose which is most appropriate for your situation.

martin June 10, 2009 at 2:56 AM  

I think the problem here is that the 'a to b' form is so much more convenient, so people use it without keeping in mind that this is a by-name collection.

I am a bit doubtful about making `a to b' lazy by caching its results. The problem here is that we might get space leaks and also that lazy evaluation and side effects do not mix very well.

However, one might consider giving 'a to b' a special status, somewhere between a strict collection and a view. We could say that 'a to b' itself is call by name, and that operations like filter, take, takeWhile also yield by-name collections. However, map and flatMap could yield strict collections. That means that as long as you manipulate ranges, they are lazy, but as soon as you use them to build up another collection, this collection becomes strict.

Eishay Smith June 10, 2009 at 6:18 AM  

Another rather funny possible scenario that might happen when the for loop logic mutates resources and return a summary of the action for logging. For example create files, write stuff into them, and return the list of file names. Now, in development we'll print the list of files to the log log.debug(fileNames) but in production we'll turn the logging off.
The developer will see that files are not being created as expected and will turn the log on to see what's wrong and it suddenly will work again.

It reminds me my car that stops making those noises when I'm showing it to the mechanic.

By the end of the day we'll have a very frustrated engineer.

Robert M. June 10, 2009 at 6:57 AM  

I'm not at all sure why Range is non-strict at all — could someone explain it to me? Sure, there's no allocation of objects going on when one is created beyond the Range object itself, but all its elements are nonetheless known and in principle "present". It's perfectly possible to create a constant-space strict implementation of Range which would extend, say, LinearSequence instead of a View class. And in the (IMO) rare case where you wanted a lazy range, you could explicitly create one from the strict object.

I guess I'm just not seeing how Range being lazy provides much value in the face of the confusion it seems to cause.

Daniel June 10, 2009 at 7:59 AM  

Alas, if "a to b".map or .flatMap yielded worked any differently, it would break some of my code. I depend on it being a non-strict, non-cached projection.

I wouldn't mind having a Projection.range(a, b) at all to replace it, as long as something did.

Creative Commons License This work by Eishay Smith is licensed under a Creative Commons Attribution 3.0 Unported License.