Monday, August 20, 2012

Getting to know CanBuildFrom (without a PhD)

Recently I needed to write a pretty simple function: given a sequence of (name, value) pairs, return a sequence of (name, some collection of all the values having that name) pairs. Here is a simple implementation:

def combineValues(pairs: Seq[(String, String)]): Seq[(String, Seq[String])] = {
    val result = LinkedHashMap[String, List[String]]()

    for ((name, value) ← pairs)
        result += name → (value :: result.getOrElse(name, Nil))

    result.toList
}

Then I realized that I not only needed the result as a Seq[(String, Seq[String])], but also as a Seq[(String, Array[AnyRef])], for compatibility with a Java API. I could of course transform one into the other, but I wondered whether there might be a way to directly return the desired collection type (without duplicating the function!).

Let’s think a little bit about what this entails: the revised combineValues needs to create new collections of a type specified by the caller. So first the type needs to be specified, which means that the function needs to take so-called type parameters: one for the resulting collection type, and one for the item type:

import language.higherKinds // so that Scala 2.10 doesn't warn
def combineValues[U, T[_]](pairs: Seq[(String, String)]): Seq[(String, T[U])]

The type parameters are specified with the funny [U, T[_]] declaration before the usual function parameters. You have probably seen this syntax on generic classes. But functions can have type parameters too!

Also, the function now says it returns values of type T[U]. The idea here is that we will return collections of type T where T stands for things like Seq, Array and Set containing elements of type U, where U stands for String or AnyRef.

Do you see what we’ve done here? We have changed the signature to be more abstract: instead of specifying beforehand that we are dealing with Seq or Array or String, we use “variables” (T and U). In case you are wondering, there is nothing magic about the names T and U, we could have used Coll and Elem instead. It’s a matter of taste.

A couple of remarks on the syntax:

  • It’s not possible to just specify [T[U]] as a single type parameter: the two parameters must be separate.
  • Because T is a type which itself takes a parameter, we must say so with T[_] (T[Whatever] works too).

Here is how you call the function with explicit type parameters (sometimes, type parameters can be inferred by the compiler, but not here):

combineValues[String, List](seq)

Now the function needs to make use of these type parameters to create the new collections. How do we do that?

Naively, let’s try creating new empty collections of type T[U] using a constructor:

new T[U] // incorrect Scala

or, using a companion object’s apply:

T[U]() // incorrect Scala

Unfortunately, neither even compiles. How would something like this work anyway? You could imagine that the runtime type information is available to combineValues and then use reflection to create new instances. But:

  • Scala erases parametrized types during compilation, and type parameters are not automatically available at runtime (although there are ways to obtain them).
  • The collection type parameter could be a trait, like Seq, in which case you would have to find a reasonable concrete class implementing that trait. How would you go about making that determination?
  • This would result in instantiating empty instances of the resulting collection. How would you add elements to that collection? Array and List in particular aren’t Growable collections. So you would have to find a constructor taking all the items of the collection instead. That’s possible, but not guaranteed to work at runtime for all types.
  • This might not be very efficient due to the use of reflection.

So besides the fact that it does not work out of the box, this approach is not as simple as it first looked. We are missing something, and that something is a factory for the collection to create. And Scala already has a trait for that kind of factories in the Scala standard library: it is called CanBuildFrom.

This trait is a bit of a funny thing: it is part of many Scala collections function signatures, and for that reason has been the target of criticism, with the result that Scala 2.8’s Scaladoc hides it from signatures by default. Some have argued that it’s not necessary to understand it to use Scala collections, and I agree with that (in general, Scala collections just work even if you don’t have any idea that CanBuildFrom exists).

This said, CanBuildFrom itself is a pretty simple thing conceptually: if you have an instance of CanBuildFrom for a given collection type, you can call apply on it to get a Builder for that collection. Once you have the builder, you just add elements to it, and finally obtain the resulting collection. The trait itself is parametrized, to specify a source collection type, an element type, and a resulting collection type: CanBuildFrom[-From, -Elem, +To].

Here, what we want is a CanBuildFrom[T[U], U, T[U]]. (The first type parameter, the From collection, does not seem to be consequential here.) In short it’s a factory able to return a Builder for a collection T[U].

A version of our function receiving a CanBuildFrom looks like this:

def combineValues[U, T[_]](pairs: Seq[(String, String)],
        cbf: CanBuildFrom[T[U], U, T[U]]): Seq[(String, T[U])] = {
    val result = LinkedHashMap[String, Builder[String, T[U]]]()

    for ((name, value) ← pairs)
        result.getOrElseUpdate(name, cbf()) += value

    result map { case (k, v) ⇒ k → v.result } toList
}

Note where the factory is called: cbf(), and where the resulting collection is obtained: v.result.

But how does the caller even obtain a CanBuildFrom? You have not only to pass T and U (the type parameters) but also somehow figure out where to locate that factory.

Of course, and this is why the Scala collections and CanBuildFrom are designed this way in the first place, there is a twist: implicit parameters. The idea is that instead of finding and passing the factory explicitly to the function, the compiler does that for you. So we change the signature as follows:

def combineValues[U, T[_]](pairs: Seq[(String, String)])
        (implicit cbf: CanBuildFrom[T[U], U, T[U]]): Seq[(String, T[U])]

Notice how we changed cbf from a regular parameter to an implicit parameter with the implicit keyword, and moved it to a second parameter list (yes, Scala supports more than one parameter list).

When the compiler sees a call to such a function, it does something called an implicit search. This is the process whereby it locates an appropriate value for the implicit parameter, based on the function scope and parameters (see more on implicit search). Importantly, it looks at the type of the implicit parameter. Here this means that the value must be of type CanBuildFrom, with the proper type parameters too.

The Scala collections have such factories already, and they are nicely made available to implicit search when you use collection types such as Seq and Array. So the compiler will find them without trouble.

So say we call the function like this:

combineValues[AnyRef, Array](seq)
combineValues[String, List](seq)

The compiler will respectively search for the following CanBuildFrom instances, and pass them to the function:

CanBuildFrom[Array[AnyRef], AnyRef, Array[AnyRef]]
CanBuildFrom[List[String], String, List[String]]

And it will just work!

The result? We have a achieved more than what we were looking for: we have a function able to return a result made of any existing collection available in the standard library, or even outside of the standard library, for that matter. The resulting collection can be anything you want as long as there is a CanBuildFrom for the resulting type. Most notably, Array is a plain Java array, and it works because Scala provides a CanBuildFrom for it. It’s a powerful approach.

From the caller side, things remain easy: you call your function as usual except that you specify the type of the result you want.

Is there magic here? No: it is all done thanks to the clever design of implicit parameters and their use within the standard Scala collections. And the great thing is that the mechanics of it are available to anybody, not just to the compiler or Scala standard library maintainers.

What to think of the type parameters and the implicit CanBuildFrom factory? They certainly lead to more abstract programming, but when put in simple terms, like “here is a function which can return a result based on any collection type you want, as long as the compiler finds a factory for that type”, I think it sounds quite reasonable.

2 comments:

Erem said...

Thanks for this article. It massively improved my understanding of CanBuildFrom. Given its role now it makes a lot of sense to me that it's type parameters are identical to Function2.

Anonymous said...

Thanks for the post. One remark though:

The parametrized version of combineValues does not typecheck as is. Individual pairs have to be of type (String, U) and not (String, String).

Otherwise, `value' is of type String, which cannot be assigned to U.

The other option is to stick to String, and only parametrize on the type constructor T, but then you always get back a Seq[(String, T[String])].