Comments on Bad Concurrency: Troubles with Parallelism

> If the sequential version of the algorithm th...

2011-04-05T21:35:45.480+01:00

> If the sequential version of the algorithm that uses the fold right operation (O(1) concatenation operation) could be scaled in a linear fashion across the 16 cores I had available, it would still only have ~2/3 of the performance of the imperative version.

The sequential implementation ran at 73 ops/sec. If that scaled linearly across 16 cores, it would run at 73 * 16 = 1168 ops/sec. The imperative version ran at 1886 ops/sec. 1168/1886 * 100% = 62%, which is fairly close to 2/3.

> You have a string of length N, and K cores. Split the string in K pieces, use the sequential, imperative algorithm on each in parallel: this takes O(N/K). Then sequentially perform K joins at the cost of O(1) each (joins also can be parallelized but we don't need it). We get O(N/K + K), no? Which means for very large N and K we gain a lot for this algorithm compared to using O(N) sequential one.

Maybe. The performance of most algorithms is dominated by the number of cache misses that occur. So if the algorithm is able to balance the cache-miss costs across the cores then it'll probably work, but that's harder than it sounds. The code and test framework are up on GitHub (https://github.com/mikeb01/jpr11-dojo); feel free to knock up an implementation and post the results. I have an algorithm in my head that should be able to do this reasonably efficiently for most cases, but it is a horrible O(n^2) in the worst (very rare) case.

First off I made a mistake with my calculation of ...

2011-04-05T21:35:18.497+01:00

First off, I made a mistake with my calculation of the complexity. The Scala implementation is O(n log n) and the Fortress one is O(n). I'm writing a blog post that details the math, so I won't explain the working here.

Have you watched the Guy Steele presentation? It's fairly important as it sets the context for the original post.

There was a strong proposition made that:

- Iterators and accumulators (common imperative constructs) are bad and shouldn't be used.
- A functional style using algebraic operations should be adopted such that compiler and library implementations have "wiggle room" to run the code in parallel if required.

I have 2 arguments against this approach:

1. Algorithm complexity when using Fork/Join is often a hairy business and simple operations, e.g. list concatenation, can increase complexity in unexpected ways.

2. Memory efficiency is incredibly important (perhaps even more important than being parallel), and not necessarily in terms of complexity. I need to spend more time explaining this (perhaps another post).

> Approach: you use your *very poor* (in terms of algorithmic complexity) parallel and sequential functional implementations

I agree with the assertion that the parallel implementation is very poor complexity-wise. It was a straightforward implementation of the algorithm presented by Guy Steele using the default collections supplied by my language implementation (the approach suggested, i.e. let the implementation do the optimisation).

I disagree about the sequential implementation; it's O(n).

> First, much better functional implementations can be devised.

I agree. I'm working on one that uses mutable intermediate results and significantly reduces the memory allocation cost in a couple of other ways, but the results aren't encouraging. Will post more results soon.

> Second, parallel does not imply functional and vice versa - this algorithm can use imperative join and still exploit parallelism to run faster on multicores.

True, but this was not the approach I'm arguing against. There's a common argument (one I keep hearing from very smart people) that we must go parallel and use functional constructs so that language implementations can do all of the hard work.

> Conclusions: from the fact that a quadratic algorithm runs slower than a linear one, you conclude that functional implementations are inferior because of constant-factor slowdowns, such as generating too much garbage. How valid is that conclusion?

I didn't really present the conclusion well. I saw an algorithm presented as an example of using functional constructs to allow a simple task to be parallelised. However, it was trivial to write an imperative single-threaded implementation that was faster and way more economical. I saw 2 reasons for this: hidden complexity (actually only in the Scala implementation, not in the Fortress one) and inefficient use of memory.

I've been hearing the argument for the necessity to use functional constructs and parallelism for some time, and this was the first example I've seen where the approach has been applied to a more general task. However, when I compared it to an imperative approach, the argument didn't stand up.

@Mike, I strongly disagree with your approach and ...

2011-04-05T16:08:44.685+01:00

@Mike, I strongly disagree with your approach and conclusions.

Approach: you use your *very poor* (in terms of algorithmic complexity) parallel and sequential functional implementations to benchmark against a good O(n) sequential imperative implementation. First, much better functional implementations can be devised. Second, parallel does not imply functional and vice versa - this algorithm can use imperative join and still exploit parallelism to run faster on multicores.

Conclusions: from the fact that a quadratic algorithm runs slower than a linear one, you conclude that functional implementations are inferior because of constant-factor slowdowns, such as generating too much garbage. How valid is that conclusion?

And how do you justify this:

> If the sequential version of the algorithm that uses the fold right operation (O(1) concatenation operation) could be scaled in a linear fashion across the 16 cores I had available, it would still only have ~2/3 of the performance of the imperative version.

You have a string of length N, and K cores. Split the string in K pieces, use the sequential, imperative algorithm on each in parallel: this takes O(N/K). Then sequentially perform K joins at the cost of O(1) each (joins also can be parallelized but we don't need it). We get O(N/K + K), no? Which means for very large N and K we gain a lot for this algorithm compared to using O(N) sequential one.
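The split-then-join scheme described above can be sketched roughly as follows. This is an illustration only, assuming the task is splitting a string into words on single spaces. All names (`Part`, `scanPiece`, `joinParts`, `splitWords`) are hypothetical, the per-piece pass is written functionally rather than imperatively, an ordinary `map` stands where a parallel map would go, and the join uses plain list append, so the joins are not O(1) in this form.

```haskell
import Data.List (foldl')

-- Partial result for one piece of the input (hypothetical type).
data Part
  = Chunk String                   -- no space seen yet: a fragment of one word
  | Segment String [String] String -- left fragment, complete words, right fragment
  deriving (Eq, Show)

-- Keep a word only if it is non-empty.
nonEmpty :: String -> [String]
nonEmpty w = [w | not (null w)]

-- Sequential pass over one piece, linear in the length of the piece.
scanPiece :: String -> Part
scanPiece s = case break (== ' ') s of
  (l, [])       -> Chunk l
  (l, _ : rest) -> let (ws, r) = go rest in Segment l ws r
  where
    go t = case break (== ' ') t of
      (w, [])       -> ([], w)
      (w, _ : more) -> let (ws, r) = go more in (nonEmpty w ++ ws, r)

-- Join two adjacent partial results. With plain lists the (++) on the
-- word lists is linear; an O(1) join needs a catenable structure.
joinParts :: Part -> Part -> Part
joinParts (Chunk a) (Chunk b)        = Chunk (a ++ b)
joinParts (Chunk a) (Segment l ws r) = Segment (a ++ l) ws r
joinParts (Segment l ws r) (Chunk b) = Segment l ws (r ++ b)
joinParts (Segment l1 ws1 r1) (Segment l2 ws2 r2) =
  Segment l1 (ws1 ++ nonEmpty (r1 ++ l2) ++ ws2) r2

-- Cut the input into at most k pieces of (nearly) equal size.
pieces :: Int -> String -> [String]
pieces k s = go s
  where
    k'   = max 1 k
    size = max 1 ((length s + k' - 1) `div` k')
    go [] = []
    go t  = let (p, rest) = splitAt size t in p : go rest

-- Scan each piece (the step that would run in parallel), then join in order.
splitWords :: Int -> String -> [String]
splitWords k = finish . foldl' joinParts (Chunk "") . map scanPiece . pieces k
  where
    finish (Chunk a)        = nonEmpty a
    finish (Segment l ws r) = nonEmpty l ++ ws ++ nonEmpty r
```

In GHCi, `splitWords 3 "the quick brown fox"` gives `["the","quick","brown","fox"]`, even though the cuts fall in the middle of "quick" and "brown".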

Correction: Sorry for my complexity explanation b...

2011-04-05T15:52:12.924+01:00

Correction:

Sorry for my complexity explanation being confusing and
incorrect. To clarify, we are considering a simple list append
operation similar to the following:

append [] b = b
append (x:xs) b = x : append xs b

The cost C of appending two lists a and b is proportional to the
length L of the first list, roughly:

C(a ++ b) = L(a)

Now consider:

C(a ++ b) = L(a)
C(a ++ b ++ c) = C(a ++ b) + L(a ++ b)
C(a ++ b ++ c ++ d) = C(a ++ b ++ c) + L(a ++ b ++ c)

Then

C(a ++ b ++ c ++ d)
= L(a) + L(a ++ b) + L(a ++ b ++ c)
= 3L(a) + 2L(b) + L(c)

Suppose that we have a list of length n partitioned into
one-element lists that are joined with left-nested appends. Then

C = (n-1) + (n-2) + ... + 1 = n * (n - 1) / 2 = O(n^2)
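
The recurrence above can be checked mechanically. The sketch below is only an illustration: `leftNestedCost` and `rightNestedCost` are hypothetical names, and the cost model (one step per element of the left operand) is the one assumed in this comment. It tracks list lengths rather than the lists themselves.

```haskell
-- Cost model from the comment: appending a to b costs L(a) steps.

-- | Total cost of ((l1 ++ l2) ++ l3) ++ ..., given the lengths of the lists.
leftNestedCost :: [Int] -> Int
leftNestedCost []       = 0
leftNestedCost (l : ls) = go l ls
  where
    -- acc is the length of the list built so far; every append re-walks it.
    go _   []       = 0
    go acc (x : xs) = acc + go (acc + x) xs

-- | Total cost of l1 ++ (l2 ++ (l3 ++ ...)): every list except the last
-- is the left operand exactly once, so it is walked exactly once.
rightNestedCost :: [Int] -> Int
rightNestedCost [] = 0
rightNestedCost ls = sum (init ls)
```

In GHCi, `leftNestedCost (replicate 4 1)` gives 6, matching n * (n - 1) / 2 for n = 4, while `rightNestedCost (replicate 4 1)` gives 3. The quadratic bound therefore depends on the appends being nested to the left.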

On list concatenation: let us consider concatenati...

2011-04-05T15:41:35.425+01:00

On list concatenation: let us consider concatenating 4 lists.

a ++ b ++ c ++ d

a ++ b takes L(a) + L(b)
(a ++ b) ++ c takes L(a) + L(b) + L(c)
(a ++ b ++ c) ++ d takes L(a) + L(b) + L(c) + L(d)

Then the operation takes 3L(a) + 3L(b) + 2L(c) + L(d). We could have one-element lists. The worst-case bound is then O(n^2) where n is the sum of the lengths of all lists.

That being said about regular lists, I am now reading Okasaki's _Purely Functional Data Structures_. It describes a CatenableList collection that supports head, tail, cons and append in O(1) amortized time.

Moreover, the join part of the algorithm in the article does not require persistence. You can easily devise and use a mutable implementation of catenable lists with O(1) worst-case bounds for all operations.
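
As a minimal illustration of constant-time append, here is a difference list. This is not Okasaki's CatenableList (which also supports efficient head and tail, and is more involved); it is a simpler stand-in that only makes append cheap. All names below are defined locally for this sketch.

```haskell
-- A difference list: a list represented as "prepend me to whatever follows".
newtype DList a = DList ([a] -> [a])

fromList :: [a] -> DList a
fromList xs = DList (xs ++)

singleton :: a -> DList a
singleton x = DList (x :)

-- | O(1): just function composition, no traversal of either operand.
append :: DList a -> DList a -> DList a
append (DList f) (DList g) = DList (f . g)

-- | Materialise the result once, at the end. The underlying (++) calls
-- end up right-nested however the appends were grouped.
toList :: DList a -> [a]
toList (DList f) = f []
```

For example, `toList (foldl append (fromList []) (map singleton "abcd"))` gives `"abcd"`, with each `append` costing a constant amount of work even though the fold nests to the left.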

Yeah, I didn't really explain my reasoning par...

2011-03-28T16:21:35.227+01:00

Yeah, I didn't really explain my reasoning particularly well. I'm in the process of drafting another blog that goes into more detail on how the complexity is calculated. Hopefully that will better explain it.

Sorry for the stupid question but.. why the scala ...

2011-03-28T09:34:50.173+01:00

Sorry for the stupid question, but why does the Scala "concat" operator increase the complexity from N to N^2 on a parallel processor?

I really don't get it.

> Are Fortress lists persistent or mutable? If ...

2011-03-14T17:50:58.024+00:00

> Are Fortress lists persistent or mutable? If persistent, how do they work exactly?

The "PureList" implementation in Fortress used in the example is immutable and is built on finger trees. All update operations are O(log n), which will make the overall algorithm O(n log n).
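
For anyone who wants to experiment with this kind of structure outside Fortress: Haskell's `Data.Sequence` (from the `containers` package) is also built on finger trees, and its documentation gives O(log(min(n1, n2))) for concatenation. The sketch below uses it purely as a stand-in, not as a model of the Fortress `PureList`; `joinChunks` is a hypothetical helper.

```haskell
import qualified Data.Sequence as Seq
import Data.Sequence (Seq, (><))

-- | Concatenate chunks left to right. Each (><) is logarithmic in the
-- smaller operand, so left nesting does not re-walk the accumulated prefix
-- the way a left-nested cons-list append does.
joinChunks :: [Seq a] -> Seq a
joinChunks = foldl (><) Seq.empty
```

For example, `joinChunks (map Seq.fromList ["ab", "cd", "e"])` yields a sequence of length 5 containing the characters of "abcde" in order.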

> Is not the problem simply that modern programmers rush to use data structures before bothering to check their complexity guarantees?

I think the problem is worse than that. I feel that algorithmic complexity is being forgotten altogether, or sacrificed in the name of parallelism.

Another problem is the lack of appreciation for other costs, such as memory allocation, garbage etc. If the sequential version of the algorithm that uses the fold right operation (O(1) concatenation operation) could be scaled in a linear fashion across the 16 cores I had available, it would still only have ~2/3 of the performance of the imperative version. That seems to be an awfully high price to pay for a multi-threaded solution.

The final issue I have is that the argument being presented is that the functional/persistent collection/implicit multi-threaded approach is the only approach to take and imperative techniques are fundamentally broken. To quote the talk: "DO loops are so 1950s!", "Java™-style iterators are so last millennium!". However, very few people are publishing real results from the application of these techniques, thereby demonstrating that the approach is better.

While I think that there is some value in the approach, probably for a good number of problems, I believe (like Fred Brooks) there is no silver bullet. My argument is that developers should treat all the possible approaches, be it imperative, functional, whatever, as tools that work well for certain jobs and use empirical evidence when deciding which one to apply to a given problem.

Are Fortress lists persistent or mutable? If persi...

2011-03-13T22:58:31.965+00:00

Are Fortress lists persistent or mutable? If persistent, how do they work exactly?

There seems to be some folklore on persistent catenable (in better-than-linear-time) lists, for example:

http://www.eecs.usma.edu/webs/people/okasaki/focs95/index.html

I do not quite see how your conclusion follows. Is not the problem simply that modern programmers rush to use data structures before bothering to check their complexity guarantees?

And of course functional data structures are not automatically "better" than imperative ones - you get persistence, but you pay in complexity. For applications that do not need persistence it is a net loss.

The cost is not usually *linear* complexity - most of the time the trade-off is between constant time for an imperative DS and logarithmic time for its functional counterpart.