Saturday 12 March 2011

Troubles with Parallelism

A couple of weeks ago I attended the Java Posse Round-up, during which, I spent a couple of hours with Joel Neely working with the new parallel collections in Scala.  As a task we looked at the parallel string splitting that was defined by Guy Steele during his talk at the Strange Loop conference.  The goal was to demonstrate how algebraic properties of an algorithm can be used to partition and combine a data set so that it can be parallelised.

The algorithm uses a divide and conquer approach where the combination of the partial results is associative, i.e. the grouping of the operations to recombine the results doesn't matter as long as the overall order is maintained (there are more technical descriptions).  If we take the following string as an example:

"Here is a sesquipedalian string of words"

And it is partitioned arbitrarily resulting in:

"Here is a ", "sesquipeda", "lian strin", "g of words"

We can represent the individual parts as objects that look like the following:

Segment("Here", List("is", "a"), "")
Segment("lian", List(), "strin")
Segment("g", List("of"), "words")

A segment consists of a leading and trailing part of a word and a list of all of the words that are enclosed by spaces within the partition.  A chunk is a piece of text that contains no white space separators.  These individual parts can be recombined using a polymorphic function, lets call it "assoc".  The function is defined as follows:

case class Segment(prefix: String,
                   words: List[String],
                   trailer: String) extend WordState {
  override def assoc(other: WordState) = {
    other match {
      case c:Chunk => 
        Segment(prefix, words, trailer + c.part)
      case s:Segment => 
                words ::: maybeWord(trailer + s.prefix) ::: s.words, 

And for the Chunk:

case class Chunk(part: String) extends WordState {
  override def assoc(other: WordState) = {
    other match {
      case c:Chunk => Chunk(part + c.part)
      case s:Segment => Segment(part + s.prefix, s.words, s.trailer)

Where "maybeWord" is defined as:

def maybeWord(s:String) = if (s.isEmpty) List.empty[String] else List(s)

One of the nice features of this algorithm is that it is now possible to process the input sequentially using a fold right operation:

val wordStates =
wordStates.foldRight(Chunk.empty)((x, y) => x.assoc(y)).toList()

Or in parallel using Scala 2.9's parallel arrays:

val wordStates =
wordStates.aggregate(Chunk.empty)(assoc, assoc).toList()

The ".par" method on array constructs a ParallelArray from the input array.  From that point on the functions applied to the collection are executed in parallel using the Fork/Join framework.  Once we had this all in place, we wanted to benchmark it.  I created a very simple imperative implementation to serve as a baseline and a simple performance test harness.  I ran a number of iterations using a copy of "Alice in Wonderland" as an input data set.  I had access to a Core 2 Duo (2 Cores) laptop and a Nehalem EX (8 Cores + HT) workstation on which I ran the benchmarks.  The results were very surprising:

Core 2 Duo laptop, 2 Cores @ 2.40 GHz
Parallel - opsPerSecond: 6.89
Sequential - opsPerSecond: 49.81
Brute Force - opsPerSecond: 1388.89

Intel Xeon Workstation, 8 Cores +HT @ 2.40 GHz
Parallel - opsPerSecond: 120.48
Sequential - opsPerSecond: 73.26
Brute Force - opsPerSecond: 1886.79
On the 2 core laptop the parallel version runs quite a bit slower, with 16 cores thrown at it, it just scrapes up to twice as fast as the sequential version.  However in both cases it was absolutely dwarfed by the performance of the simple imperative version.  Once I started digging using the profiler the reasons for this became obvious.  The problem is Scala's immutable lists and the concatenation operator ':::'.  In Scala the list on the left hand is copied, this is O(n) operation depending on the size of the list.  So when the algorithm is parallelised the complexity increases from O(n) for the sequential version to O(n^2) for the parallel one.  It's worth noting that in Fortress (which the initial algorithm was presented in) all list mutation operations are O(log n).

Sigh, it's back to Com. Sci. 101.  An O(n^2) algorithm run on multiple cores is still O(n^2).  The number of cores just serves a factor, so doesn't alter the complexity.  My concerns with the current trend in software that is obsessed with parallelism and multi-core, is that the basics are being forgotten,  a scientific/experimental approach to building faster software is not being applied and there is no demonstration of mechanical sympathy.  The reason the simple imperative version is so much quicker than the sequential function version (both are O(n)) is that it is far more memory friendly i.e. no allocation/garbage reducing the amount data that needs to be pushed back and forth through the various CPU cache levels (and an old school decrement and compare trick).  The functional version generates quite a bit of garbage and losses on memory efficiency.

Tuesday 1 March 2011

Speaking at LJC about concurrency

I'm speaking at the next LJC event on concurrency (  I'll be doing the same talk that I gave with Martin Thompson at QCon.  While it is the same material, the talk tends to take a different turn each time I give it.