Saturday, 31 December 2011

Concurrent Sequencing

A few weeks ago one of the users of the Disruptor posted some worrying benchmarks:

ThreePublisherToOneProcessorSequencedThroughputTest
  run 0: BlockingQueue=645,161 Disruptor=1,772 ops/sec
  run 1: BlockingQueue=1,250,000 Disruptor=20,000,000 ops/sec
  run 2: BlockingQueue=1,250,000 Disruptor=56 ops/sec

It appears under heavy contention with fewer available cores than busy threads the Disruptor can perform terribly. After a bit of investigation I managed to isolate the problem. One of the most complex parts of the Disruptor is the multi-threaded claim strategy. It is the only place in the Disruptor where - out of necessity - we break the single-writer principal.

The approach that we used was very simple. Each thread claims a slot in the ring buffer using AtomicLong.incrementAndGet(). This ensures that each claim will return a unique sequential value. The complexity arrives when the multiple threads try to publish their sequence. We require that all events placed in the ring buffer must be made available to event processors in a strictly sequential order. To ensure this behaviour we have a method called serialisePublishing(). Our simple implementation would have the thread that is publishing busy spin until the last published sequence (the cursor) is one less the value being published.

This works because each sequence published is unique and strictly ascending. For example if one thread wants to publish the value 8, it will spin until the cursor value reaches 7. Because no other thread will be trying publish value 8 it can make progress and ensuring the sequential publishing behaviour in the process. However, this busy spin loop causes problems when there are more threads than cores. The threads that need wait for the prior sequences to be published can starve out the thread that should be updating the cursor. This leads to the unpredictable results shown above.

We need a better solution. In an ideal world there would be a Java API that would compile down to the Intel MONITOR/MWAIT instructions, unfortunately they're limited to Ring 0, so require a little kernel assistance to be useful. Another instruction (but unavailable in Java) would be the Intel PAUSE instruction that could used in the middle of the spin loop. One of the problems with busy loops on modern processors is in order keep the pipeline full the CPU may speculatively execute the condition at the top of the loop, causing an unnecessarily high number of instructions to fill the CPU pipeline. This can starve other logical threads of CPU resources. The PAUSE instruction on hyper-threaded Intel processors can improve this situation.

Java has neither of those, so we need to go back and address the shape of the serialisePublishing method. For the redesign I drew some inspiration from Cliff Click's non-blocking hash map. There 2 aspects of his design that are very interesting:

  • No locks
  • No CAS retries

While the first is obvious, the second is trickier. Any one familiar CAS operations will have seen the traditional loop until success approach to handling concurrent updates. For example the incrementAndGet method inside the AtomicLong in Oracle's JVM uses a similar loop. It could look something like1:

While there are no locks here it is not necessarily wait-free. It is theoretically possible, if a number of threads are trying to increment the value, for one (or more) of the threads to get stuck unable to make progress if other threads are constantly winning the race to the atomic operation. For an algorithm to be wait free all calls must complete within a fixed number of operations. One way to get closer to a true wait free algorithm is to design the algorithm such that a failure of a CAS operation is a signal to exit rather than to retry the operation. Cliff Click's approach was to model the algorithm using a state machine, where all states are valid and a transition between states is typically a CAS operation. E.g. image a state machine with 3 states {A, B, C} and 2 transitions {A->B, B->C}. If a instance of the state machine is in state A and 2 threads try to apply the transition A->B only one will succeed. For the thread that fails to apply its CAS operation retrying the operation makes no sense. The instance has already transitioned to state B. In fact the failure of CAS operation is an indication that the instance is already in the desired state. The thread can exit if B is the desired state of the action or try to apply the B->C transition if that is what's required.

How does this apply to our concurrent sequencing problem? We could allow threads to continue to make progress while waiting for other threads to catch by maintaining a list of sequences that are pending publication. If a thread tries to publish a sequence that is greater than 1 higher than current cursor (i.e. it would need to wait for another thread to publish its sequence) it could place that sequence into the pending list and return. The thread that is currently running behind would publish its own sequence, then check the pending list and publish those sequences before exiting.

To represent this as a state machine we would have 3 states {unpublished, pending, published} and 2 transitions {unpublished->pending, pending->published}. In recognition of the fact that computing resources are finite, we have a guard condition on the unpublished->pending transition. I.e. a limit on number of sequences we allow in the pending state. Because each sequence is unique, the transition unpublished->pending does not require a CAS operation. The pending list is represented as an AtomicLongArray and the transition is a simple AtomicLongArray.set() where the index is the sequence modulo the size of the pending list2. The final transition pending->published is where the CAS operation comes in. The thread will first try to publish its own sequence number. If that passes then the thread will try to publish the next value from the pending list. If the CAS fails the thread leaves the method. The failure means that the value is already published or will be by some other thread.

Running the multi-publisher performance test on my 2-Core laptop (where at least 4 threads would normally be required):

ThreePublisherToOneProcessorSequencedThroughputTest
  run 0: BlockingQueue=5,832,264 Disruptor=6,399,590 ops/sec
  run 1: BlockingQueue=5,521,506 Disruptor=6,470,816 ops/sec
  run 2: BlockingQueue=5,373,743 Disruptor=6,931,928 ops/sec

Sanity restored.

This update will be included in the 2.8 release of the Disruptor as the default implementation of the MultiThreadedClaimStrategy. The old implementation will still be available as MultiThreadedLowContentionClaimStrategy. Those who have plenty of cores where the publishers aren't often contented may find the old implementation faster, which it should be as it is simpler and requires fewer memory barriers. I'm going to continue to revise and work on this code. While improved, it is not truly wait free. It is possible for one of the threads to get stuck doing all of the publishing.

1 The AtomicLong.getAndIncrement() does use a slightly different loop structure by the semantics are the same.
2 Actually it's a mask of the sequence and the pending list size minus 1. This is equivalent when the size of the pending list is a power of 2.

Friday, 23 December 2011

Blog Rename and Video Links

I've decided to rename my blog.  I plan to focus my blog efforts more on concurrency than any other particular subject and I thought it would be fun to pay homage to the bad science blog by Ben Goldacre.

It will probably kill my traffic for a while, until I get the few blog aggregators that carry my blog to update their feeds.  So welcome to my rebranded blog, I'll start posting again in the next few weeks.  In the mean time here are some links to videos from the various conferences that I spoke at over the past month or 2.


  • LJC @Playfish - Beginner's Guide to Concurrency (the trial run)
  • JAX London - Beginner's Guide to Concurrency
  • Devoxx - A tools in action session on the Disruptor (this was a experiment that didn't work very well, an attempt at live coding)

Saturday, 3 September 2011

Upcoming Speaking Events

I've had a few speaking events confirmed for the end of this year:




Trisha and I are hoping to preview our talk to the LJC sometime in October.

Thursday, 18 August 2011

ALL of the hotspot flags

Ever wanted to know what flags could be set and their default values for the Hotspot VM?  Try this:

java -XX:+PrintFlagsFinal


Wednesday, 20 July 2011

Now Offically an OpenJDK Contributor

Being someone who's interested in Open Source and very focused on performance, I've been looking at the Java & the JVM and wondering about what I would most like to see added.  IMHO one of the fundamental missing pieces is support for complex stack allocated types.  This is not a new idea.  C# calls them structs. In C, C++ and Go stack & heap allocation is orthogonal to the definition of the type.

An implementation for Java is described by John Rose (some time ago) on his blog.  While this is presented as a feature that would benefit alternative languages on the JVM, I think this could bring something significant to Java as well (not that I actually know what such a language feature would look like, I'm not a language designer).  It would improve the efficiency of a some very useful design patterns that tend to be avoided (by me anyway) due to the cost of heap allocation.  E.g. a number of classes in LMAX's code base consist of aggregations of 64 bit integers.

public class OrderInstruction {
    public OrderInstruction(long accountId,
                            long instrumentId, 
                            long clientOrderId,
                            long stopPrice,
                            long price,
                            long quantity) {
        // ...
    }
}

While very efficient, it can lead to a number of errors as it is very easy to screw up an assignment wouldn't prevent assigning the accountId to the price.  However, with support for tuples/value types/structs then it would be possible to do the following (I've borrowed from the C# syntax):

public struct AccountId {
    public AccountId(long id) {
        // ...
    }
}

public struct InstrumentId {
    public InstrumentId(long id) {
        // ...
    }
}

public class OrderInstruction {
    public OrderInstruction(AccountId accountId,
                            InstrumentId instrumentId, 
                            ClientOrderId clientOrderId,
                            Price stopPrice,
                            Price price,
                            Quantity quantity) {
        // ...
    }
}

Given any reasonable implementation the structs Account, InstrumentId, etc, would compile down to a long (and be no less efficient).  It would enforce type safety preventing accidental incorrect assignment.  Another benefit is that is allows seamless extension of the types, with far less impact.  E.g. if we need to extend the AccountId to include a realm field, no change to the OrderInstruction would be required.

public struct AccountId {
    public AccountId(int realm, long id) {
        // ...
    }
}

This pattern is known as Tiny Types.  Sometimes it's the small things that matter.

The other nice feature would support for more complex and suitably efficient numeric types, e.g:

public struct Rational {
    public Rational(long numerator, long denominator) {
        // ....
    }
}

// This would a favourite at LMAX
public struct Decimal {
    public Decimal(long value, int exponent) {
        // ...
    }
}

We could even steal some features from Google Go:

public struct Slice {
    public Slice(byte[] array, int offset, int length) {
        // ....
    }
}

Pairs, Tuples and Option types could also be implemented efficiently too.  Instances of these types would be passed and assigned by copy.  The memory for these types would be either allocated on the stack or inline within a object when laid out on the heap. This could have (I hope) a significant improvement in spatial cache locality and reduce the pressure on the garbage collector.

I think this feature is so important that I have started implementing it myself as part of the MLVM project.  My first patches have now been accepted into MLVM.  I'm not sure if or when this will end up in the JDK proper or if the Java language will support it, but it has to start somewhere.

Wednesday, 22 June 2011

Disruptor, Now Open Source!

The Disruptor, the design pattern at the core of the financial exchange I'm helping build at LMAX has been open sourced.  If you've seen the presentation that Martin Thompson and I gave at QCon or the one I did for the LJC @Skillsmatter, this is the ring-buffer based code that we've been banging on about for a while.

What is the Disruptor?

At its simplest, it's an alternative to a queue, but can also can be thought of as a simple actor-like concurrency framework.

Why?

Performance, it's about 8-9 times higher throughput and 3 orders of magnitude lower latency than Java's ArrayBlockingQueue.  There's a technical article on the Google code site with more details of the implementation and comprehensive performance test results.

Tuesday, 31 May 2011

Another LJC Lightning Talk

I'm doing a lightning talk (5 minutes) for the LJC tomorrow evening.  I'm going to give a quick run down of JSR 334/Project Coin as part of our involvement in the JCP.

Tuesday, 10 May 2011

Yay for the LJC!

Congratulations to the London Java Community (disclaimer: I'm an associate member); they have just been elected to the JCP SE/EE Executive Committee.  This is great opportunity for the community to be more involved with the JCP.

Saturday, 9 April 2011

More Complexity and Fork/Join

Firstly an apology.  On my previous blog, I mentioned that a string splitting algorithm implemented in Scala had a complexity of $O(n^2)$.  One commenter mentioned that they did not understand how I came to that calculation.  I though I should revisit my guess work and actually do a more thorough analysis.  What I found was interesting, I had overstated the complexity, in reality it was $O(n.log_{2}(n))$.  I've included my working here.

Complexity of the Scala Implementation

In the Scala implementation the list concatenation operation is $O(n)$.  I've simplified the model such that the factors applied are different, but that shouldn't have a bearing on the complexity.  Fork/Join algorithms work on a divide and conquer model, in its simplest form looks much like a binary tree.  To calculate the complexity of the reduction part of a Fork/Join algorithm we need to sum the the cost of all of operations to reduce the dataset to a single result.  If we start with the base set of partial results, for the sake of an example assume there are 8, then the first step it to reduce them to 4.  Then second step takes the 4 partial results to create 2.  The third and final step takes the 2 results to create the final solution.


So if we have a dataset of size $n$ and were are using Scala's default list implementation, the cost to perform the reduction is:
$$1\frac{n}{2} + 2\frac{n}{4} + 4\frac{n}{8} + ... + \frac{2^k}{2}.\frac{n}{2^k}$$
where $k = \log_{2}(n)$.  At step k, $\frac{n}{2^k}$ represents the number of operations, $\frac{2^k}{2}$ is the cost of each operation. We can eliminate $2^k$ and express the sum using sigma notation:
$$\sum_{k = 1}^{\log_{2}(n)} \frac{n}{2}$$
Applying the sum we get:
$$\frac{n}{2}.\log_{2}(n)$$
This gives a complexity of $O(n.\log_{2}(n))$.  It is not nearly as bad as the $O(n^{2})$ I mentioned the previous blog.  It is still worth avoiding as the benefit of applying a fixed number of multiple cores (i.e applying a constant factor to the cost) will be outweighed by the non-linear increase in cost as the dataset increases. 

Complexity of the Fortress Implementation

However, the Fortress version presented by Guy Steele doesn't have $O(n)$ complexity for each of the reductions.  It uses a PureList based on finger trees which has $O(\log_{2}(n))$ mutation operations.


The cost of the computation at each step breaks down differently, summing the cost of the computation looks like the following:
$$log_{2}(1)\frac{n}{2} + log_{2}(2)\frac{n}{4} + log_{2}(4)\frac{n}{8} + ... log_{2}(2^{k - 1})\frac{n}{2^k}$$
where $k = \log_{2}(n)$.  At step k, $\frac{n}{2^k}$ represents the number of operations, $log_{2}(\frac{2^k}{2})$ is the cost of each operation, give the sum of:
$$T(n) = \sum_{k = 1}^{\log_{2}(n)} log_{2}(2^{k - 1})\frac{n}{2^k}$$
Fortunately this simplifies:
$$T(n) = \sum_{k = 1}^{\log_{2}(n)} (k - 1)\frac{n}{2^k}$$
The following series $\displaystyle S = \sum_{k >= 1}^{\infty} \frac{k - 1}{2^k}$ converges to 1.  So as $ n \rightarrow \infty$, $ T(n) \rightarrow nS$.  There is some math (see below) to show that this translates into $O(n)$ complexity, however I find that it is easier to represent visually.


The pink link is the upper bound on the complexity $nS$ and the blue line is the sum of $T(n)$.  So, contra to my statement in the previous post using an $O(log_{2}(n))$ operation to perform the reduction part of a Fork/Join algorithm won't introduce any complexity issues.

I learnt quite a lot in working through the complexity calculations, the most important of which, is not to jump to making a statement about the complexity of an algorithm.  Often it's not a simple as you may think and it requires a little bit of rigour.

As to the original problem, I'm continuing to experiment with different parallel implementations to see how they perform on a couple of different multi-core systems.  The fact remains that the simple imperative version is still much faster than the parallel implementation.  I am working a couple of different approaching using mutable intermediate results.

For those who are interested, here is the math.  I would like to say I did this all myself, but had a lot of help from the Internet elves at http://math.stackexchange.com/.


$T_n=nS_i(x)$ for $i=\lfloor \log_2(n)\rfloor$ and $x=\frac12$, where, for every $i$ and $x$,

$$S_i(x)=\sum_{k=1}^i(k-1)x^k=\sum_{k=0}^{i-1}kx^{k+1}=x^2U'_i(x),\quad U_i(x)=\sum_{k=0}^{i-1}x^{k}.$$
The function $x\mapsto U_i(x)$ is the sum of a geometric series, hence, for every $x\ne1$,
$$U_i(x)=\frac{1-x^i}{1-x},\qquad U'_i(x)=\frac1{(1-x)^2}(1+ix^i-x^i-ix^{i-1}).$$
Using $x=\frac12$, this yields
$$T_n=n(1-(i+1)2^{-i}).$$
Since $i=\lfloor \log_2(n)\rfloor$, one gets
$$n-2(\log_2(n)+1)<T_n<n-\log_2(n),$$
hence $T_n=n-O(\log_2(n))$. The sequence of general term $(T_n-n)/\log_2(n)$ has the interval $[-2,-1]$ as limit set.




Saturday, 12 March 2011

Troubles with Parallelism

A couple of weeks ago I attended the Java Posse Round-up, during which, I spent a couple of hours with Joel Neely working with the new parallel collections in Scala.  As a task we looked at the parallel string splitting that was defined by Guy Steele during his talk at the Strange Loop conference.  The goal was to demonstrate how algebraic properties of an algorithm can be used to partition and combine a data set so that it can be parallelised.

The algorithm uses a divide and conquer approach where the combination of the partial results is associative, i.e. the grouping of the operations to recombine the results doesn't matter as long as the overall order is maintained (there are more technical descriptions).  If we take the following string as an example:

"Here is a sesquipedalian string of words"

And it is partitioned arbitrarily resulting in:

"Here is a ", "sesquipeda", "lian strin", "g of words"

We can represent the individual parts as objects that look like the following:

Segment("Here", List("is", "a"), "")
Chunk("sesquipeda")
Segment("lian", List(), "strin")
Segment("g", List("of"), "words")

A segment consists of a leading and trailing part of a word and a list of all of the words that are enclosed by spaces within the partition.  A chunk is a piece of text that contains no white space separators.  These individual parts can be recombined using a polymorphic function, lets call it "assoc".  The function is defined as follows:

case class Segment(prefix: String,
                   words: List[String],
                   trailer: String) extend WordState {
  override def assoc(other: WordState) = {
    other match {
      case c:Chunk => 
        Segment(prefix, words, trailer + c.part)
      case s:Segment => 
        Segment(prefix, 
                words ::: maybeWord(trailer + s.prefix) ::: s.words, 
                s.trailer)
        }
    }
}

And for the Chunk:

case class Chunk(part: String) extends WordState {
  override def assoc(other: WordState) = {
    other match {
      case c:Chunk => Chunk(part + c.part)
      case s:Segment => Segment(part + s.prefix, s.words, s.trailer)
    }
  }
}

Where "maybeWord" is defined as:

def maybeWord(s:String) = if (s.isEmpty) List.empty[String] else List(s)

One of the nice features of this algorithm is that it is now possible to process the input sequentially using a fold right operation:

val wordStates = s.map(processChar).toArray
wordStates.foldRight(Chunk.empty)((x, y) => x.assoc(y)).toList()

Or in parallel using Scala 2.9's parallel arrays:

val wordStates = s.toCharArray.par.map(processChar)
wordStates.aggregate(Chunk.empty)(assoc, assoc).toList()

The ".par" method on array constructs a ParallelArray from the input array.  From that point on the functions applied to the collection are executed in parallel using the Fork/Join framework.  Once we had this all in place, we wanted to benchmark it.  I created a very simple imperative implementation to serve as a baseline and a simple performance test harness.  I ran a number of iterations using a copy of "Alice in Wonderland" as an input data set.  I had access to a Core 2 Duo (2 Cores) laptop and a Nehalem EX (8 Cores + HT) workstation on which I ran the benchmarks.  The results were very surprising:

Core 2 Duo laptop, 2 Cores @ 2.40 GHz
==
Parallel - opsPerSecond: 6.89
Sequential - opsPerSecond: 49.81
Brute Force - opsPerSecond: 1388.89

Intel Xeon Workstation, 8 Cores +HT @ 2.40 GHz
==
Parallel - opsPerSecond: 120.48
Sequential - opsPerSecond: 73.26
Brute Force - opsPerSecond: 1886.79
On the 2 core laptop the parallel version runs quite a bit slower, with 16 cores thrown at it, it just scrapes up to twice as fast as the sequential version.  However in both cases it was absolutely dwarfed by the performance of the simple imperative version.  Once I started digging using the profiler the reasons for this became obvious.  The problem is Scala's immutable lists and the concatenation operator ':::'.  In Scala the list on the left hand is copied, this is O(n) operation depending on the size of the list.  So when the algorithm is parallelised the complexity increases from O(n) for the sequential version to O(n^2) for the parallel one.  It's worth noting that in Fortress (which the initial algorithm was presented in) all list mutation operations are O(log n).

Sigh, it's back to Com. Sci. 101.  An O(n^2) algorithm run on multiple cores is still O(n^2).  The number of cores just serves a factor, so doesn't alter the complexity.  My concerns with the current trend in software that is obsessed with parallelism and multi-core, is that the basics are being forgotten,  a scientific/experimental approach to building faster software is not being applied and there is no demonstration of mechanical sympathy.  The reason the simple imperative version is so much quicker than the sequential function version (both are O(n)) is that it is far more memory friendly i.e. no allocation/garbage reducing the amount data that needs to be pushed back and forth through the various CPU cache levels (and an old school decrement and compare trick).  The functional version generates quite a bit of garbage and losses on memory efficiency.




Tuesday, 1 March 2011

Speaking at LJC about concurrency

I'm speaking at the next LJC event on concurrency (http://www.meetup.com/Londonjavacommunity/events/16701804/).  I'll be doing the same talk that I gave with Martin Thompson at QCon.  While it is the same material, the talk tends to take a different turn each time I give it.

Saturday, 5 February 2011

A Non-Blocking ConcurrentHash(Map|Trie)

Introducing ConcurrentHashTrie

While spending some time looking at Clojure's concurrency model, I did a little bit of research into their persistent collection implementation that uses Bagwell's Hash Array Mapped Tries.  After a little bit of thought it occurred to me that you could apply the persistent collection model to an implementation of ConcurrentMap.  The simple solution would be to maintain a reference to a persistent map within a single AtomicReference and apply a compareAndSet (CAS) operation each time an update to the map was performed.  In Clojure terms, a ref and swap! operation.  However a typical use of ConcurrentMap is for a simple cache in a web application scenario.  With most web applications running with fairly significant numbers of threads (e.g. 100's) it's conceivable that the level of contention on a single compare and set operation would cause issues.  Non-blocking concurrent operations like compare and set work really well at avoiding expensive user/kernel space transitions, but tend to break down when the number of mutator threads exceeds the number of available cores*.  In fact thread contention on any concurrent access control structure is a killer for performance.  The java.util.ConcurrentHashMap uses lock-striping as a mechanism to reduce contention on writes.  If absolutely consistency across the ConcurrentMap was sacrificed (more information below), then a form of "CAS striping" could be applied to the ConcurrentHashTrie to reduce contention.  This is implemented by replacing the AtomicReference with an AtomicReferenceArray and using a hash function to index into the array.

The Implementation

There are a number of implementations of the Bagwell Trie in JVM languages, however I couldn't find an implementation in plain old Java, so I wrote one myself.  For the persistent collection it follows a fairly standard polymorphic tree design.

The core type of Node has 3 implementations.  The key one is the BitMappedNode, which handles the vast majority of branch nodes and implements the funky hash code partition and population count operations that make up the Bagwell Trie structure.  LeafNode holds the actual key/value pairs and ListNode is there to handle full collisions.  The key mutator methods of ConcurrentMap: put, putIfAbsent, remove and replace are implemented using CAS operations on the rootTable, which is an AtomicReferenceArray.

As mentioned earlier, the ConcurrentHashTrie is not consistent for operations that occur across the entire structure.  While at first look this seems terrible, in practical terms it is not really an issue.  The 2 main operations this impacts are size and iteration.  The size operation in the ConcurrentHashTrie is not O(1), but is O(n) where n is the size of the rootTable (or the number of stripes).  Because it has to iterate across all of the nodes in the rootTable (it doesn't need to traverse down the trees) and add all of their sizes it possible for the size of a Node to change after it has been read but before the result has been calculated.  Anyone that has worked a concurrent structure before has probably found that the size value is basically useless.  It is only ever indicative, as the size could change right after the method call returns, so locking the entire structure while calculating the size doesn't bring any benefit.  Iteration follows the same pattern.  It is possible that a node could have changed after being iterated over or just before being reached.  However, it doesn't really matter as it is not really any different to the map changing just before iteration started or just after it completes (as long as iteration doesn't break part way through).  Note that Cliff Click's NonBlockingHashMap exhibits similar behaviour during iteration and size operations.

Performance, The Good, The Bad and the... well mostly Bad

Cliff Click kindly included a performance testing tool with his high scale library which I've shamelessly ripped off and used to benchmark my implementation.  Apparently he borrowed some code from Doug Lea to implement it.  I changed a sum total of 1 line.  Writing benchmarks, especially for concurrent code, is very tough (probably harder than writing the collection itself), so borrowing from the experts gives me some confidence that the numbers I produce will be useful.

So onto the results:




Do'h!!

Quite a bit slower than the java.util.ConcurrentHashMap.  I didn't even bother comparing to the NonBlockingHashMap from the high scale library, the numbers would be too embarrassing.

The ConcurrentHashTrie2 is an alternative implementation that I experimented with.  I suspected the polymorphic behaviour of the tree nodes was causing a significant performance hit due to a high number of v-table calls.  The alternate implementation avoids the v-table overhead by packing all three behaviours into a single class (in a slight dirty fashion).  I stole a couple of bits from the level variable to store a discriminator value.  The Node class switches on the discriminator to determine the appropriate behaviour.  As the results show, it didn't help much.  I ran a number of other permutations and the results were largely the same.

Conclusion

So a question remains, is the approach fundamentally flawed or is my code just crap?  I suspect the major cost is caused by heavy amount of memory allocation, copying and churn through the CPU cache caused by the path-copy nature of mutation operations.  Also the read path isn't particularly quick.  With a reasonably full map, reads will likely need to travels a couple of levels down the tree.  With the traditional map, its a single hash and index operation.

* I really need a proper citation for this.  However imagine 16 cores and a 100 threads trying to apply a compare and set operation to the same reference.  Unlike a lock which will park the threads that fail to acquire the lock, non-blocking algorithms require the thread that fails to apply its change to discard its result and recompute its result, effectively spinning until is succeeds.  With more CPU bound threads than cores, its possible that the system will end up thrashing.