Saturday, 12 March 2011

Troubles with Parallelism

A couple of weeks ago I attended the Java Posse Round-up, during which, I spent a couple of hours with Joel Neely working with the new parallel collections in Scala.  As a task we looked at the parallel string splitting that was defined by Guy Steele during his talk at the Strange Loop conference.  The goal was to demonstrate how algebraic properties of an algorithm can be used to partition and combine a data set so that it can be parallelised.

The algorithm uses a divide and conquer approach where the combination of the partial results is associative, i.e. the grouping of the operations to recombine the results doesn't matter as long as the overall order is maintained (there are more technical descriptions).  If we take the following string as an example:

"Here is a sesquipedalian string of words"

And it is partitioned arbitrarily resulting in:

"Here is a ", "sesquipeda", "lian strin", "g of words"

We can represent the individual parts as objects that look like the following:

Segment("Here", List("is", "a"), "")
Chunk("sesquipeda")
Segment("lian", List(), "strin")
Segment("g", List("of"), "words")

A segment consists of a leading and trailing part of a word and a list of all of the words that are enclosed by spaces within the partition.  A chunk is a piece of text that contains no white space separators.  These individual parts can be recombined using a polymorphic function, lets call it "assoc".  The function is defined as follows:

case class Segment(prefix: String,
words: List[String],
trailer: String) extend WordState {
override def assoc(other: WordState) = {
other match {
case c:Chunk =>
Segment(prefix, words, trailer + c.part)
case s:Segment =>
Segment(prefix,
words ::: maybeWord(trailer + s.prefix) ::: s.words,
s.trailer)
}
}
}

And for the Chunk:

case class Chunk(part: String) extends WordState {
override def assoc(other: WordState) = {
other match {
case c:Chunk => Chunk(part + c.part)
case s:Segment => Segment(part + s.prefix, s.words, s.trailer)
}
}
}

Where "maybeWord" is defined as:

def maybeWord(s:String) = if (s.isEmpty) List.empty[String] else List(s)

One of the nice features of this algorithm is that it is now possible to process the input sequentially using a fold right operation:

val wordStates = s.map(processChar).toArray
wordStates.foldRight(Chunk.empty)((x, y) => x.assoc(y)).toList()

Or in parallel using Scala 2.9's parallel arrays:

val wordStates = s.toCharArray.par.map(processChar)
wordStates.aggregate(Chunk.empty)(assoc, assoc).toList()

The ".par" method on array constructs a ParallelArray from the input array.  From that point on the functions applied to the collection are executed in parallel using the Fork/Join framework.  Once we had this all in place, we wanted to benchmark it.  I created a very simple imperative implementation to serve as a baseline and a simple performance test harness.  I ran a number of iterations using a copy of "Alice in Wonderland" as an input data set.  I had access to a Core 2 Duo (2 Cores) laptop and a Nehalem EX (8 Cores + HT) workstation on which I ran the benchmarks.  The results were very surprising:

Core 2 Duo laptop, 2 Cores @ 2.40 GHz
==
Parallel - opsPerSecond: 6.89
Sequential - opsPerSecond: 49.81
Brute Force - opsPerSecond: 1388.89

Intel Xeon Workstation, 8 Cores +HT @ 2.40 GHz
==
Parallel - opsPerSecond: 120.48
Sequential - opsPerSecond: 73.26
Brute Force - opsPerSecond: 1886.79
On the 2 core laptop the parallel version runs quite a bit slower, with 16 cores thrown at it, it just scrapes up to twice as fast as the sequential version.  However in both cases it was absolutely dwarfed by the performance of the simple imperative version.  Once I started digging using the profiler the reasons for this became obvious.  The problem is Scala's immutable lists and the concatenation operator ':::'.  In Scala the list on the left hand is copied, this is O(n) operation depending on the size of the list.  So when the algorithm is parallelised the complexity increases from O(n) for the sequential version to O(n^2) for the parallel one.  It's worth noting that in Fortress (which the initial algorithm was presented in) all list mutation operations are O(log n).

Sigh, it's back to Com. Sci. 101.  An O(n^2) algorithm run on multiple cores is still O(n^2).  The number of cores just serves a factor, so doesn't alter the complexity.  My concerns with the current trend in software that is obsessed with parallelism and multi-core, is that the basics are being forgotten,  a scientific/experimental approach to building faster software is not being applied and there is no demonstration of mechanical sympathy.  The reason the simple imperative version is so much quicker than the sequential function version (both are O(n)) is that it is far more memory friendly i.e. no allocation/garbage reducing the amount data that needs to be pushed back and forth through the various CPU cache levels (and an old school decrement and compare trick).  The functional version generates quite a bit of garbage and losses on memory efficiency.

Unknown said...

Are Fortress lists persistent or mutable? If persistent, how do they work exactly?

There seem to be some folklore on persistent catenable (in better-than-linear-time) lists, for example:

http://www.eecs.usma.edu/webs/people/okasaki/focs95/index.html

I do not quite see how your conclusion follows. Is not the problem simply that modern programmers rush to use data structures before bothering to check their complexity guarantees?

And of course functional data structures are not automatically "better" than imperative ones - you get persistence, but you pay in complexity. For applications that do not need persistence it is a net loss.

The cost is not usually *linear* complexity - most of the time the trade-off is between constant time for an imperative DS and logarithmic time for its functional counterpart.

Michael Barker said...

> Are Fortress lists persistent or mutable? If persistent, how do they work exactly?

The "PureList" implementation in Fortress used in the example is immutable and uses Finger Trees implementation. All mutation operations are O(log n), which will make the overall algorithm O(n log n).

> Is not the problem simply that modern programmers rush to use data structures before bothering to check their complexity guarantees?

I think the problem is worse than that. I feel that algorithmic complexity is being forgotten altogether, or sacrificed in the name of parallelism.

Another problem is the lack of appreciation for other costs, such as memory allocation, garbage etc. If the sequential version of the algorithm that uses the fold right operation (O(1) concatenation operation) could be scaled in a linear fashion across the 16 cores I had available, it would still only have ~2/3 of the performance of the imperative version. That seems to be awfully high price to pay for a multi-threaded solution.

The final issue I have is that the argument being presented is that the functional/persistent collection/implicit multi-threaded approach is the only approach to take and imperative techniques are fundamentally broken. To quote the talk: "DO loops are so 1950s!", "JavaTM -style iterators are so last millennium!". However very few people are publishing real results from the application of these techniques thereby demonstrating that the approach is better.

While I think that there is some value in the approach, probably for a good number of problems, I believe (like Fred Brookes) there is no silver bullet. My argument is that developers should treat all the possible approaches, be it imperative, functional, whatever, as tools that work well for certain jobs and use empirical evidence when deciding which one to apply to a given problem.

thermz said...

Sorry for the stupid question but.. why the scala "concat" operator will increase complexity from N to N^2 in parallel processor?

I really don't get it

Michael Barker said...

Yeah, I didn't really explain my reasoning particularly well. I'm in the process of drafting another blog that goes into more detail on how the complexity is calculated. Hopefully that will better explain it.

Unknown said...

On list concatenation: let us consider concatenating 4 lists.

a ++ b ++ c ++ d

a ++ b takes L(a) + L(b)
(a ++ b) ++ c takes L(a) + L(b) + L(c)
(a ++ b ++ c) ++ d takes L(a) + L(b) + L(c) + L(d)

Then the operation takes 3L(a) + 3L(b) + 2L(c) + L(d). We could have one-element lists. The worst-case bound is then O(n^2) where n is the sum of the lengths of all lists.

That being said about regular lists, I am now reading Okasaki's _Purely Functional Data Structures_. It describes a CatenableList collection that supports head, tail, cons and append in O(1) amortized time.

Moreover, the join part of the algorithm in the article does not require persistence. You can easily devise and use a mutable implementation of catenable lists with O(1) worst-case bounds for all operations.

Unknown said...

Correction:

Sorry for my complexity explanation being confusing and
incorrect. To clarify, we are considering a simple list append
operation similar to the following:

append [] b = b
append (x:xs) b = x : append xs b

The cost C of appending two lists a and b is proportional to the
length L of the first list, roughly:

C(a ++ b) = L(a)

Now consider:

C(a ++ b) = L(a)
C(a ++ b ++ c) = C(a ++ b) + L(a ++ b)
C(a ++ b ++ c ++ d) = C(a ++ b ++ c) + L(a ++ b ++ c)

Then

C(a ++ b ++ c ++ d)
= L(a) + L(a ++ b) + L(a ++ b ++ c)
= 3L(a) + 2L(b) + L(c)

Suppose that we have a list of length N partitioned into
one-element lists that are joined with append. Then

C = (n-1) + (n-2) + ... + 1 = n * (n - 1) / 2 ~ O(n^2)

Unknown said...

@Mike, I strongly disagree with your approach and conclusions.

Approach: you use your *very poor* (in terms algorithmic complexity) parallel and sequential functional implementations to benchmark with a good O(n) sequential imperative implementation. First, much better functional implementations can be devised. Second, parallel does not imply functional and vice versa - this algorithm can use imperative join and still exploit parallelism to run faster on multicores.

Conclusions: from the fact that a quadratic algorithm runs slower than a linear one, you conclude that functional implementations are inferior because of constant-factor slowdowns, such as generating too much garbage. How valid is that conclusion?

And how do you justify this:

If the sequential version of the algorithm that uses the fold right operation (O(1) concatenation operation) could be scaled in a linear fashion across the 16 cores I had available, it would still only have ~2/3 of the performance of the imperative version

You have a string of length N, and K cores. Split the string in K pieces, use the sequential, imperative algorithm on each in parallel: this takes O(N/K). Then sequentially perform K joins at the cost of O(1) each (joins also can be parallelized but we don't need it). We get O(N/K + K), no? Which means for very large N and K we gain a lot for this algorithm compared to using O(N) sequential one.

Michael Barker said...

First off I made a mistake with my calculation of the complexity. The Scala implementation is O(n log n) and the Fortress one is O(n). I'm writing a blog post the details the math, so I won't explain the working here.

Have you watched the Guy Steele presentation? It's fairly important as it sets the context for the original post.

There was strong proposition made that:

- Iterators and accumulators (common imperative constructs) are bad and shouldn't be used.
- A functional style using algebraic operations should be adopted such that compiler and library implementations have "wiggle room" to run the code in parallel if required.

I have 2 arguments against this approach:

1. Algorithm complexity when using Fork/Join is often a hairy business and simple operations, e.g. list concatenation, can increase complexity in unexpected ways.

2. Memory efficiency is incredibly important (perhaps even more important that being parallel) not necessarily in terms of complexity. I need to spend more time explaining this (perhaps another post).

> Approach: you use your *very poor* (in terms algorithmic complexity) parallel and sequential functional implementations

I agree with with assertion that the parallel implementation is very poor complexity-wise. It was a straight forward implementation of the algorithm presented by Guy Steele using the default collections supplied by my language implementation (the approach suggested, i.e. let the implementation do the optimisation).

I disagree about the sequential implementation, it's O(n).

> First, much better functional implementations can be devised.

I agree, I working on one the uses mutable intermediate results and significantly reduces the memory allocation cost in a couple of other ways, but the results aren't encouraging. Will post more results soon.

> Second, parallel does not imply functional and vice versa - this algorithm can use imperative join and still exploit parallelism to run faster on multicores.

True, but this was not the approach I'm arguing against. There's common argument (one I keep hearing from very smart people) that we must go parallel and use functional constructs so that language implementations can do all of the hard work.

> Conclusions: from the fact that a quadratic algorithm runs slower than a linear one, you conclude that functional implementations are inferior because of constant-factor slowdowns, such as generating too much garbage. How valid is that conclusion?

I didn't really present the conclusion well. I saw an algorithm presented as example of using functional constructs to allow a simple task to be parallelised. However, it was trivial to write a imperative single threaded implementation that was faster and way more economical. I saw 2 reasons for this, hidden complexity (actually only in the Scala implementation, not in the Fortress one) and inefficient use of memory.

I've been hearing the argument for the necessity to use functional constructs and parallelism for some time and this was the first example I've seen where the approach has been applied to a more general task. However when I compared it too an imperative approach the argument didn't stand up.

Michael Barker said...

> If the sequential version of the algorithm that uses the fold right operation (O(1) concatenation operation) could be scaled in a linear fashion across the 16 cores I had available, it would still only have ~2/3 of the performance of the imperative version.

The sequential implementation ran at 73 ops/sec, if that scaled linearly across 16 cores it would run at 1168 ops/sec. The imperative version ran at 1886 ops/sec. 1168/1886 * 100% = 62%, that's fairly close to 2/3.

> You have a string of length N, and K cores. Split the string in K pieces, use the sequential, imperative algorithm on each in parallel: this takes O(N/K). Then sequentially perform K joins at the cost of O(1) each (joins also can be parallelized but we don't need it). We get O(N/K + K), no? Which means for very large N and K we gain a lot for this algorithm compared to using O(N) sequential one.

Maybe, most algorithm performance is dominated by the numbers of cache misses that occur. So if the algorithm is able to balance the cache miss costs across the cores then it'll probably work, but that's harder than it sounds. The code and test framework is up on Github (https://github.com/mikeb01/jpr11-dojo), feel free to knock up and implementation and post the results. I have an algorithm in my head that should be able to do this reasonably efficiently for most cases, but has a horrible O(n^2) for the worst (very rare) case.