Bad Concurrency

Tuesday, 14 February 2012

Speaking at QCon London

Along with Martin Thompson, I'm speaking about non-blocking concurrency at QCon London in March.

Friday, 10 February 2012

Slides From Recent Presentations

My slides from JAX London - Beginner's Guide to Hardcore Concurrency:

Beginners guide-concurrency

View more presentations from Michael Barker

Video: LJC@Playfish, JAX London

My slides from Devoxx - Disruptor Tools In Action:

Disruptor tools in action

View more presentations from Michael Barker

Video: Devoxx (Payment required)

Sunday, 8 January 2012

Building a CPU Topology on MacOS X

Within a project I've been working on I've had the need to simulate the capabilities of Linux's /proc/cpuinfo on Mac OS. Specifically I needed to build a topology of the CPUs on a given system. I.e. I need to map the operating system's processors to hardware threads, then build a picture of which cores and sockets those threads reside. For example my Mac looks something like:

CPU0 (Thread 0) ---+
                   |---> Core 0 ---+
CPU1 (Thread 1) ---+               |
                                   | ----> Socket 0
CPU2 (Thread 2) ---+               |
                   |---> Core 1 ---+
CPU3 (Thread 3) ---+

While this sounds very simple, it's actually fraught with a number of little niggles. Not only did it require getting down and dirty with a bit of C++ and X86 Assembly, it also required writing a MacOS kernel extension.

The first step was to understand what information was available from the CPU. Intel exposes an instruction called CPUID. The is the primary mechanism for getting information about the CPU. There is a raft of information available from listing of the CPU features available (e.g. hyperthreading) to sizes of the various levels of cache and the associated cache lines. To access the CPUID instruction we need a little bit of inline assembler.

The code shows how to get the vendor string from the CPU. On my Mac I get the following:

// Output:
Vendor String: GenuineIntel

For those unfamiliar with Intel inline assembly, the Intel CPU defines a number of registers. The ones used for the CPUID instruction are EAX, EBX, ECX, and EDX (referenced as RAX, RBX, etc if using 64 bit instructions via the REX extension). These used for both input and output. An inline asm segment consists of 3 parts. The first part is the instruction to be executed. In this case the "cpuid" instruction. The second line defines the output parameters. The snippet "=a" (data[0]) means store the result in the EAX register in the variable data[0]. The "=a" refers to the 2nd letter of the register designation. The 3rd and final section are the input parameters. The CPUID instruction takes 2 parameters, one in EAX and one in ECX.

The particular CPUID reference that provides information needed to building the topology is 0xB (11) - the extended topology enumeration leaf. The data returned from this instruction is:

     0                   1                   2                   3   
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
EAX | Shift |                     Reserved                          |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
EBX |   No. Process at this level   |            Reserved           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
ECX |   Level No.   |  Level type   |            Reserved           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
EDX |                           x2APIC ID                           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The extended topology enumeration leaf is one of the CPUID indexes that also makes use of ECX as an input parameter. This indicates the level of the CPU that you wish to work at. I started at level 0 and worked my way up. The 'Level type' describes whether the current level is a CPU thread (1) or a physical core (2) or invalid (0). All values greater than 2 are reserved, presumably for later use. The 2 other useful values are the x2APIC ID and Shift. The x2APIC ID is a unique identifier for the smallest logic unit in the CPU. The Shift value is used to group x2APIC ID values into units at the next logical level. This is done by shifting the x2APIC ID right by the value specified by Shift. For example on my using the following code on my workstation (2 sockets, 8 cores, 16 threads):

Outputs the following:

Shift: 1, Count: 2, Level Id: 0, Level Type: 1, x2APIC: 33, Group: 16
Shift: 5, Count: 8, Level Id: 1, Level Type: 2, x2APIC: 33, Group: 1

This indicates that the hardware thread indicated by APIC 33 has the core id of 16 and socket id of 1. The socket and core ids aren't really references, but values that indicate threads have have the same id share that unit. This gets me most of the way there, I can group all of my threads into a hierarchy of cores and sockets. However there is one small niggle remaining. I need this code to run all of the CPUs on my system. On Linux this is reasonably simple, you count the number of CPUs then iterate through all of them and use the pthread_setaffinity_np(...) to specify specify which CPU to run the CPUID instructions on.

However, on Mac OS X actual thread affinity is not supported; just a mechanism logically grouping or separating threads, powerful in its own right, but not what I'm after here. This is where the Kernel module comes in. The XNU Kernel defines a funky little method called mp_rendevous(...). This method takes in a number function pointers. The key one (action_func), is run once on each of the CPUs. So to get the topology information for all of the CPUs we can use it like so:

Because the method mp_rendevous() is defined in the kernel the code above needs to be packaged as a kernel extension. Even then, getting access to the method is interesting. It's not defined in a header that can be easily included, however it is available at link time when compiling a kernel module. Therefore in order compile correctly, it's necessary to provide your own forward declaration of the method. The same is true of the cpu_number(). Calling into the kernel method from user space requires use of the IOKit framework, but I'll leave the details of that as an exercise for the reader.

Saturday, 31 December 2011

Concurrent Sequencing

A few weeks ago one of the users of the Disruptor posted some worrying benchmarks:

ThreePublisherToOneProcessorSequencedThroughputTest
  run 0: BlockingQueue=645,161 Disruptor=1,772 ops/sec
  run 1: BlockingQueue=1,250,000 Disruptor=20,000,000 ops/sec
  run 2: BlockingQueue=1,250,000 Disruptor=56 ops/sec

It appears under heavy contention with fewer available cores than busy threads the Disruptor can perform terribly. After a bit of investigation I managed to isolate the problem. One of the most complex parts of the Disruptor is the multi-threaded claim strategy. It is the only place in the Disruptor where - out of necessity - we break the single-writer principal.

The approach that we used was very simple. Each thread claims a slot in the ring buffer using AtomicLong.incrementAndGet(). This ensures that each claim will return a unique sequential value. The complexity arrives when the multiple threads try to publish their sequence. We require that all events placed in the ring buffer must be made available to event processors in a strictly sequential order. To ensure this behaviour we have a method called serialisePublishing(). Our simple implementation would have the thread that is publishing busy spin until the last published sequence (the cursor) is one less the value being published.

This works because each sequence published is unique and strictly ascending. For example if one thread wants to publish the value 8, it will spin until the cursor value reaches 7. Because no other thread will be trying publish value 8 it can make progress and ensuring the sequential publishing behaviour in the process. However, this busy spin loop causes problems when there are more threads than cores. The threads that need wait for the prior sequences to be published can starve out the thread that should be updating the cursor. This leads to the unpredictable results shown above.

We need a better solution. In an ideal world there would be a Java API that would compile down to the Intel MONITOR/MWAIT instructions, unfortunately they're limited to Ring 0, so require a little kernel assistance to be useful. Another instruction (but unavailable in Java) would be the Intel PAUSE instruction that could used in the middle of the spin loop. One of the problems with busy loops on modern processors is in order keep the pipeline full the CPU may speculatively execute the condition at the top of the loop, causing an unnecessarily high number of instructions to fill the CPU pipeline. This can starve other logical threads of CPU resources. The PAUSE instruction on hyper-threaded Intel processors can improve this situation.

Java has neither of those, so we need to go back and address the shape of the serialisePublishing method. For the redesign I drew some inspiration from Cliff Click's non-blocking hash map. There 2 aspects of his design that are very interesting:

No locks
No CAS retries

While the first is obvious, the second is trickier. Any one familiar CAS operations will have seen the traditional loop until success approach to handling concurrent updates. For example the incrementAndGet method inside the AtomicLong in Oracle's JVM uses a similar loop. It could look something like¹:

While there are no locks here it is not necessarily wait-free. It is theoretically possible, if a number of threads are trying to increment the value, for one (or more) of the threads to get stuck unable to make progress if other threads are constantly winning the race to the atomic operation. For an algorithm to be wait free all calls must complete within a fixed number of operations. One way to get closer to a true wait free algorithm is to design the algorithm such that a failure of a CAS operation is a signal to exit rather than to retry the operation. Cliff Click's approach was to model the algorithm using a state machine, where all states are valid and a transition between states is typically a CAS operation. E.g. image a state machine with 3 states {A, B, C} and 2 transitions {A->B, B->C}. If a instance of the state machine is in state A and 2 threads try to apply the transition A->B only one will succeed. For the thread that fails to apply its CAS operation retrying the operation makes no sense. The instance has already transitioned to state B. In fact the failure of CAS operation is an indication that the instance is already in the desired state. The thread can exit if B is the desired state of the action or try to apply the B->C transition if that is what's required.

How does this apply to our concurrent sequencing problem? We could allow threads to continue to make progress while waiting for other threads to catch by maintaining a list of sequences that are pending publication. If a thread tries to publish a sequence that is greater than 1 higher than current cursor (i.e. it would need to wait for another thread to publish its sequence) it could place that sequence into the pending list and return. The thread that is currently running behind would publish its own sequence, then check the pending list and publish those sequences before exiting.

To represent this as a state machine we would have 3 states {unpublished, pending, published} and 2 transitions {unpublished->pending, pending->published}. In recognition of the fact that computing resources are finite, we have a guard condition on the unpublished->pending transition. I.e. a limit on number of sequences we allow in the pending state. Because each sequence is unique, the transition unpublished->pending does not require a CAS operation. The pending list is represented as an AtomicLongArray and the transition is a simple AtomicLongArray.set() where the index is the sequence modulo the size of the pending list². The final transition pending->published is where the CAS operation comes in. The thread will first try to publish its own sequence number. If that passes then the thread will try to publish the next value from the pending list. If the CAS fails the thread leaves the method. The failure means that the value is already published or will be by some other thread.

Running the multi-publisher performance test on my 2-Core laptop (where at least 4 threads would normally be required):

ThreePublisherToOneProcessorSequencedThroughputTest
  run 0: BlockingQueue=5,832,264 Disruptor=6,399,590 ops/sec
  run 1: BlockingQueue=5,521,506 Disruptor=6,470,816 ops/sec
  run 2: BlockingQueue=5,373,743 Disruptor=6,931,928 ops/sec

Sanity restored.

This update will be included in the 2.8 release of the Disruptor as the default implementation of the MultiThreadedClaimStrategy. The old implementation will still be available as MultiThreadedLowContentionClaimStrategy. Those who have plenty of cores where the publishers aren't often contented may find the old implementation faster, which it should be as it is simpler and requires fewer memory barriers. I'm going to continue to revise and work on this code. While improved, it is not truly wait free. It is possible for one of the threads to get stuck doing all of the publishing.

¹ The AtomicLong.getAndIncrement() does use a slightly different loop structure by the semantics are the same.
² Actually it's a mask of the sequence and the pending list size minus 1. This is equivalent when the size of the pending list is a power of 2.

Friday, 23 December 2011

Blog Rename and Video Links

I've decided to rename my blog. I plan to focus my blog efforts more on concurrency than any other particular subject and I thought it would be fun to pay homage to the bad science blog by Ben Goldacre.

It will probably kill my traffic for a while, until I get the few blog aggregators that carry my blog to update their feeds. So welcome to my rebranded blog, I'll start posting again in the next few weeks. In the mean time here are some links to videos from the various conferences that I spoke at over the past month or 2.

LJC @Playfish - Beginner's Guide to Concurrency (the trial run)
JAX London - Beginner's Guide to Concurrency
Devoxx - A tools in action session on the Disruptor (this was a experiment that didn't work very well, an attempt at live coding)

Saturday, 3 September 2011

Upcoming Speaking Events

I've had a few speaking events confirmed for the end of this year:

Trisha Gee and I are speaking at JAX London on November 2nd.
I'm presenting the disruptor in a half hour "Tools in Action" slot at Devoxx 2011 on November 14th.
I'm speaking on Technology and Performance Folklore at StackOverflow Dev Days on November 15th.

Trisha and I are hoping to preview our talk to the LJC sometime in October.

Thursday, 18 August 2011

ALL of the hotspot flags

Ever wanted to know what flags could be set and their default values for the Hotspot VM? Try this:

java -XX:+PrintFlagsFinal