Scala 2.13 Collections Rework

Tuesday 28 February 2017

Stefan Zeiger

BLOG

In October of 2015 Martin Odersky asked for strawman proposals for a new collections library design for Scala 2.13, which eventually led to the project that we are currently working on, based on his latest proposal. This was not the first redesign for the Scala collections. The current design was first implemented in Scala 2.8 along with the required improvements to type inference in the Scala compiler. It can generally be considered a success, providing powerful and flexible abstractions that bring together immutable and mutable collections, both sequential and parallel, with a high amount of shared interfaces and implementations. However, it does exhibit some symptoms of second-system syndrome that have been problematic in practice.

Goals

Before looking at details of these problems and possible solutions in the new design, let’s start with the broader goals for the new design:

Simplify the API for users: This includes doing common operations without CanBuildFrom, pruning back the inheritance tree and a better separation of immutable and mutable collection operations.
Simplify the API for implementors: Implementing a new collection type is far from trivial at the moment. Setting up the implicits to get the desired CanBuildFrom instance in all cases (both through static implicits lookup and at run-time) is tricky. Inheriting lots of method implementations by default appears like a benefit at first but you still need to manually check (and possibly override) all methods whose default implementation has unsatisfactory performance for a specific collection.
Design for better performance: The best algorithms are not very useful if you ignore lower-level performance concerns such as specialization for primitive types or method dispatch in the JVM. For example, replacing Scala collections in Slick’s AST by a custom collection implementation improved the performance of Slick’s query compiler by 25% (from reducing overhead and implementing efficient operations alone, without any use of specialization). While we do not expect the same improvements for a more generic collection library, it does show that there is room for improvement.
Provide source compatibility with 2.12 where it makes sense but allow breaking changes where required: We expect the majority of collection usage to be compatible between 2.12 and 2.13 and the majority of the remaining incompatibilities to be automatically fixable with ScalaFix.

Traversable and Iterable

Leaving the “generic” abstractions (which encompass both sequential and parallel collections) and the implementation traits (like TraversableLike and IterableLike) aside, we have Traversable at the root of the current collections hierarchy. Its only real subtype that is used by collection implementations is Iterable. The difference is that Traversable only provides internal iteration, i.e. a foreach method, whereas Iterable gives you the more powerful external iteration via an Iterator. The Traversable abstraction has not carried its weight in the current library and will likely not resurface in the new design. Everything we want to do can be expressed with Iterable.

We are also looking at other opportunities to remove collection traits. While each one of them is there for a good reason, their interactions and the sheer number create a huge amount of complexity. For example, this is the class declaration of the standard List class:

sealed abstract class List[+A] extends AbstractSeq[A]
                               with LinearSeq[A]
                               with Product
                               with GenericTraversableTemplate[A, List]
                               with LinearSeqOptimized[A, List[A]]
                               with scala.Serializable {
  ...
}

If you traverse all these supertypes you find no less than 37 (!) linear supertypes in total.

CanBuildFrom

One of the most powerful but also most controversial features of the current collections library is CanBuildFrom:

   def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That

This is the standard map method as declared in TraversableLike. It is so inscrutable to beginners that simplified use case signatures were added to the API documentation in order to better convey the intended meaning:

   def map[B](f: A => B): $Coll[B]

While use case signatures help you when you look at the scaladocs, you still have to deal with the real definitions in IDE code-completion and in compiler errors.

Here is the same method as defined in IterablePolyTransforms in the new design:

  def map[B](f: A => B): C[B]

The method has the expected signature, there is no compile-time or run-time overhead to find the right CanBuildFrom, and the type constructor C is refined to a concrete collection type in almost every use case, so you see the expected definitions in scaladocs, code-completion and error messages.

Of course, CanBuildFrom was added in Scala 2.8 for a good reason. It allows a single definition of a standard method like map to work for regular, unconstrained collections and for constrained collections that require an implicit evidence for their element type. For example, we can naturally define class BitSet extends Set[Int], but what does it mean to call map on a BitSet?

val s: BitSet = ...
val c1 = s.map(i => i+1)
val c2 = s.map(i => i.toString)

The current Scala collections design makes it possible to not only build a BitSet for c1 and a HashSet for c2 at run-time but also compute these types statically at compile-time. It gets slightly more complicated when you take code such as

val s: BitSet = ...
val c1 = (s: Set[Int]).map(i => i+1)

You wouldn’t expect c1 to be of type BitSet but is it still backed by a BitSet at run-time or do you get a default Set implementation? In this case it is the latter but there are inconsistencies between static and dynamic lookup of Builder types in the current library that can lead to unexpected types or implementations.

As shown above, the new design does not have an implicit evidence parameter for map, so you are assured that you always get the same implementation no matter if you statically see your BitSet as a BitSet or a Set[Int]. This “same implementation” in the case of BitSet means Set though. A collection implementation is free to pick any collection type to build (between what its supertype promises and what it implements itself, of course) as long as it is unconstrained. In order to call BitSet.map and get another constrained BitSet out of it, we need to overload the map method:

class BitSet extends Set[Int] {
  // inherited:
  //def map[B](f: Int => B): Set[B]

  def map(f: Int => Int): BitSet
}

Thanks to an improvement to type inference it is possible to call the overloaded method with a lambda without explicit type annotations in Scala 2.12:

val s: BitSet = ...

// Scala 2.11 and earlier would have required:
val c1 = s.map((i: Int) => i+1)

// This works in 2.12+:
val c2 = s.map(i => i+1)

While BitSet is a rather esoteric collection type, the same principle of implicit constraints on element types T applies to other collection types as well:

BitSet: T <:< Int (“must be an Int”)
All Map types: T <:< (_, _) (“must be a Tuple2”)
All sorted collections (like TreeSet): Ordering[T] (“must have an Ordering”)
String (not a collection type per se but it gets many collection methods as extension methods): T <:< Char (“must be a Char”)
Array (same situation as String): ClassTag[T] (“must have a ClassTag”)

This covers most uses of CanBuildFrom in a much simpler way. The most important use cases that cannot be supported directly by the new design are collection.breakOut (for building a different collection type directly without an extra conversion step) and to (for converting to a different collection type for which a CanBuildFrom is available). The new design provides alternative ways to achieve the same effects.

FromIterable

The to method still exists in the new design but it is only a decorator around FromIterable, the basic abstraction that is implemented by every unconstrained collection type’s companion object:

trait IterableOps[+A] extends Any {
  def to[C[X] <: Iterable[X]](fi: FromIterable[C]): C[A @uncheckedVariance] =
    fi.fromIterable(coll)
  ...
}

trait FromIterable[+C[X] <: Iterable[X]] {
  def fromIterable[B](it: Iterable[B]): C[B]
}

Note that the fi parameter is not implicit (like the CanBuildFrom in the current library), so you now call to with a value and have the type inferred instead of calling it with a type and having the value filled in by implicit lookup:

val s: Set[Int] = ...
val v = s.to(Vector) // this used to be to[Vector]

This has the advantage that we can overload to to cover maps (and maybe also other constrained collection types) which is not possible in the current design (where to only works for unary type constructors).

Views

Views in the current collections library are one of the lesser-used features yet they add a lot of complexity to the implementation. The new design separates immutable views from mutable views (the latter have not yet been implemented), with only two types of immutable views: indexed and non-indexed. This is a huge simplification over the current design where a view type is specific to its underlying collection type.

Conceptually a View is now a reified operation over an Iterator. Like a Java 8 Stream it has terminal operations which run the operation represented by the current View on a fresh Iterator (e.g. foreach and foldLeft) and intermediate operations which create a new View (e.g. filter and flatMap). This provides clear semantics even when used together with mutable collections: Composing views with intermediate operations is always independent of concurrent modifications to the underlying collection. Only when you call a terminal operation will the current state of the collection be used (by calling iterator on it).

Views are also the recommended replacement for collection.breakOut. For example,

val s: Seq[Int] = ...
val set: Set[String] = s.map(_.toString)(collection.breakOut)

can be expressed with the same performance characteristics as:

val s: Seq[Int] = ...
val set = s.view.map(_.toString).to(Set)

If you combine multiple operations they are executed lazily:

val s: Seq[Int] = ...

// First build a filtered Seq, then a mapped Seq, then a Set
val set1 = s.filter(_ > 0).map(_.toString).to(Set)

// Build a Set directly by evaluating filter and map together:
val set2 = s.view.filter(_ > 0).map(_.toString).to(Set)

Laziness

Aside from views the current collections library also supports lazy collections but they do not get a first class treatment. While we do have Stream, it is only lazy in the tail but still strict in the head element and the implementation needs to special-case pretty much everything because the basic abstraction for building new collections (which is used by all of the default implementations that strict collection types can safely inherit) is CanBuildFrom / Builder which uses strict, push-based collection building.

By combining FromIterable with the new View design we can turn this around to provide pull-based building which can be used by strict and lazy collections alike. For example, this is the default implementation of map in IterablePolyTransforms which does not need to be overridden in LazyList to provide the expected laziness:

  def map[B](f: A => B): C[B] = fromIterable(View.Map(coll, f))

Any operation that is available on View can get a default implementation like this based on rebuilding the current collection type from a View operation. Naturally a LazyList can be “built” from an Iterable by getting an Iterator and only pulling elements one by one as they are needed, so we automatically get a lazy implementation of map.

Language Integration

You may have wondered about the asymmetry in the default types that are in scope through Predef or the scala package object:

scala> classOf[Set[_]]
res1: Class[Set[_]] = interface scala.collection.immutable.Set

scala> classOf[Map[_, _]]
res2: Class[Map[_, _]] = interface scala.collection.immutable.Map

scala> classOf[Seq[_]]
res3: Class[Seq[_]] = interface scala.collection.Seq

Unlike all other collection types, Seq is not the immutable version but the generic one that encompasses both, mutable and immutable sequences. The reason for this is that the Scala specification represents varargs as type scala.Seq (and it should not have to rely on any type outside the top-level scala package). Since varargs in Java (and on the JVM) are really mutable arrays, the default Seq has to allow mutable collections.

In practice though, this kind of array can be treated as quasi-immutable, so we plan to add a new ImmutableArray wrapper in the new design which can be used for varargs, thereby removing the need for a non-immutable default Seq type.

Outlook

We are currently working on the new design as part of SCP-007 which is funded by Scala Center. At this stage all work is happening in the collection-strawman repository with the goal of refining and completing it to the point where we are confident that it can and should become a part of Scala 2.13. The core project team consists of Julien Richard-Foy (from Scala Center), Martin Odersky, Rex Kerr and myself (as a member of the Scala team at Lightbend who maintain the Scala compiler).

If you’d like to get involved, now is the time to weigh in on the discussions that are happening on the pull requests and issues. We also have a chat room and a Scala Contributors discourse discussion thread.

A few topics still need further exploration:

Currently there is no support for specialization of collections. It would be nice to allow this in the new design if we can do it without too much of an impact on the majority of non-specialized collections.
We need a story for parallel collections. They will be moved into a separate module in Scala 2.13 as part of the ongoing modularization of the standard library but it is not clear yet how closely they will be integrated into the new design.
The scala-java8-compat module provides better integration of collections with Java Streams. Some basic parts like the specialized Stepper types (which unify Java Iterators, Scala Iterators and Java Spliterators) may find their way into the standard library.
Apart from ScalaFix we should explore other migration options, for example using IntelliJ IDEA’s refactoring API or providing compatibility libraries that allow you to cross-build most sources against the old and new collection libraries.

The current roadmap calls for the basic design to be completed in Q1 2017 so we can focus on migration options in Q2 and come to a decision whether to adopt the new design in Scala 2.13. If all goes according to plan, the remaining implementation work should be finished by the end of the year in time for Scala 2.13.0-RC1.