Firstly an apology. On my previous blog, I mentioned that a string splitting algorithm implemented in Scala had a complexity of $O(n^2)$. One commenter mentioned that they did not understand how I came to that calculation. I though I should revisit my guess work and actually do a more thorough analysis. What I found was interesting, I had overstated the complexity, in reality it was $O(n.log_{2}(n))$. I've included my working here.

Complexity of the Scala Implementation

In the Scala implementation the list concatenation operation is $O(n)$. I've simplified the model such that the factors applied are different, but that shouldn't have a bearing on the complexity. Fork/Join algorithms work on a divide and conquer model, in its simplest form looks much like a binary tree. To calculate the complexity of the reduction part of a Fork/Join algorithm we need to sum the the cost of all of operations to reduce the dataset to a single result. If we start with the base set of partial results, for the sake of an example assume there are 8, then the first step it to reduce them to 4. Then second step takes the 4 partial results to create 2. The third and final step takes the 2 results to create the final solution.

So if we have a dataset of size $n$ and were are using Scala's default list implementation, the cost to perform the reduction is:

$$1\frac{n}{2} + 2\frac{n}{4} + 4\frac{n}{8} + ... + \frac{2^k}{2}.\frac{n}{2^k}$$

where $k = \log_{2}(n)$. At step k, $\frac{n}{2^k}$ represents the number of operations, $\frac{2^k}{2}$ is the cost of each operation. We can eliminate $2^k$ and express the sum using sigma notation:

$$\sum_{k = 1}^{\log_{2}(n)} \frac{n}{2}$$

Applying the sum we get:

$$\frac{n}{2}.\log_{2}(n)$$

This gives a complexity of $O(n.\log_{2}(n))$. It is not nearly as bad as the $O(n^{2})$ I mentioned the previous blog. It is still worth avoiding as the benefit of applying a fixed number of multiple cores (i.e applying a constant factor to the cost) will be outweighed by the non-linear increase in cost as the dataset increases.

Complexity of the Fortress Implementation

However, the Fortress version presented by Guy Steele doesn't have $O(n)$ complexity for each of the reductions. It uses a PureList based on finger trees which has $O(\log_{2}(n))$ mutation operations.

The cost of the computation at each step breaks down differently, summing the cost of the computation looks like the following:

$$log_{2}(1)\frac{n}{2} + log_{2}(2)\frac{n}{4} + log_{2}(4)\frac{n}{8} + ... log_{2}(2^{k - 1})\frac{n}{2^k}$$

where $k = \log_{2}(n)$. At step k, $\frac{n}{2^k}$ represents the number of operations, $log_{2}(\frac{2^k}{2})$ is the cost of each operation, give the sum of:

$$T(n) = \sum_{k = 1}^{\log_{2}(n)} log_{2}(2^{k - 1})\frac{n}{2^k}$$

Fortunately this simplifies:

The pink link is the upper bound on the complexity $nS$ and the blue line is the sum of $T(n)$. So, contra to my statement in the previous post using an $O(log_{2}(n))$ operation to perform the reduction part of a Fork/Join algorithm won't introduce any complexity issues.

I learnt quite a lot in working through the complexity calculations, the most important of which, is not to jump to making a statement about the complexity of an algorithm. Often it's not a simple as you may think and it requires a little bit of rigour.

As to the original problem, I'm continuing to experiment with different parallel implementations to see how they perform on a couple of different multi-core systems. The fact remains that the simple imperative version is still much faster than the parallel implementation. I am working a couple of different approaching using mutable intermediate results.

For those who are interested, here is the math. I would like to say I did this all myself, but had a lot of help from the Internet elves at http://math.stackexchange.com/.

$T_n=nS_i(x)$ for $i=\lfloor \log_2(n)\rfloor$ and $x=\frac12$, where, for every $i$ and $x$,

Fortunately this simplifies:

$$T(n) = \sum_{k = 1}^{\log_{2}(n)} (k - 1)\frac{n}{2^k}$$

The following series $\displaystyle S = \sum_{k >= 1}^{\infty} \frac{k - 1}{2^k}$ converges to 1. So as $ n \rightarrow \infty$, $ T(n) \rightarrow nS$. There is some math (see below) to show that this translates into $O(n)$ complexity, however I find that it is easier to represent visually.The pink link is the upper bound on the complexity $nS$ and the blue line is the sum of $T(n)$. So, contra to my statement in the previous post using an $O(log_{2}(n))$ operation to perform the reduction part of a Fork/Join algorithm won't introduce any complexity issues.

I learnt quite a lot in working through the complexity calculations, the most important of which, is not to jump to making a statement about the complexity of an algorithm. Often it's not a simple as you may think and it requires a little bit of rigour.

As to the original problem, I'm continuing to experiment with different parallel implementations to see how they perform on a couple of different multi-core systems. The fact remains that the simple imperative version is still much faster than the parallel implementation. I am working a couple of different approaching using mutable intermediate results.

For those who are interested, here is the math. I would like to say I did this all myself, but had a lot of help from the Internet elves at http://math.stackexchange.com/.

$T_n=nS_i(x)$ for $i=\lfloor \log_2(n)\rfloor$ and $x=\frac12$, where, for every $i$ and $x$,

$$S_i(x)=\sum_{k=1}^i(k-1)x^k=\sum_{k=0}^{i-1}kx^{k+1}=x^2U'_i(x),\quad U_i(x)=\sum_{k=0}^{i-1}x^{k}.$$

The function $x\mapsto U_i(x)$ is the sum of a geometric series, hence, for every $x\ne1$,

$$U_i(x)=\frac{1-x^i}{1-x},\qquad U'_i(x)=\frac1{(1-x)^2}(1+ix^i-x^i-ix^{i-1}).$$

Using $x=\frac12$, this yields

$$T_n=n(1-(i+1)2^{-i}).$$

Since $i=\lfloor \log_2(n)\rfloor$, one gets

$$n-2(\log_2(n)+1)<T_n<n-\log_2(n),$$

hence $T_n=n-O(\log_2(n))$. The sequence of general term $(T_n-n)/\log_2(n)$ has the interval $[-2,-1]$ as limit set.