Bad Concurrency<br />
Misadventures in Concurrent and Parallel programming, plus random comments on software performance and various OSS contributions. Michael Barker.<br />
<br />
<h2>
I Heard a Rumour (2020-04-11)</h2>
Where Aeron catches up on the goss.<br />
<br />
<h3>
The Need For Naming</h3>
<br />
A few months ago a <a href="https://github.com/real-logic/aeron/pull/749">pull request</a> appeared on <a href="https://github.com/real-logic/aeron">Aeron's Github</a> site that added the ability to ask Aeron to resolve or re-resolve host names to IP addresses. In cloud environments, especially under Kubernetes, it is not uncommon for a failed node to restart with the same host name but a different IP address. Unfortunately this could make life difficult for Aeron, as it would resolve an IP address up front and stick with it for the lifetime of the media driver. This is particularly bad when we consider nodes that are part of Aeron Cluster, where we expect nodes to come and go from the cluster over time.<br />
<br />
It became very clear that we needed a plan that would allow Aeron to use logical names instead of IP addresses as endpoint identifiers and to re-resolve those addresses appropriately. We didn't end up using the supplied pull request and instead came up with an alternative solution that was a better fit with some of Aeron's longer term goals (I say we, it was mostly Todd Montgomery - I just did the C port).<br />
<div>
<br /></div>
As DNS can often be a source of odd network latency issues, we didn't want a name resolution solution that was entirely reliant on default system name resolution. So we have also included a mechanism for resolving names that works entirely within Aeron.<br />
<br />
<h3>
Re-Resolution</h3>
<div>
<br /></div>
The first thing we needed to tackle was re-resolving IP addresses when peer nodes went away and came back with a different address. Fortunately we already have indicators within the existing protocol that allow the media driver to detect when nodes have died. Aeron continually sends <a href="https://github.com/real-logic/aeron/wiki/Transport-Protocol-Specification#data-frame">data frames</a> or <a href="https://github.com/real-logic/aeron/wiki/Transport-Protocol-Specification#lifetime--heartbeats">heartbeats</a> (sender to receiver) and <a href="https://github.com/real-logic/aeron/wiki/Transport-Protocol-Specification#status-message-sm-header">status messages</a> (receiver to sender) during normal running. We can use the absence of these messages as a mechanism to detect that a node (identified by name rather than IP address) needs to be re-resolved.<br />
<br />
Periodically the sender and receiver will scan their endpoints to see if any have been missing regular updates and, if those endpoints were identified by a name, trigger a re-resolution. The simple solution here would be to re-resolve the name to an address in place, e.g. using <span style="font-family: 'courier new', courier, monospace;">getaddrinfo</span>. However, one of the reasons that Aeron is incredibly fast (did I mention that already?) is that it takes a very <a href="https://github.com/real-logic/aeron/wiki/Design-Principles">principle-based</a> approach to its design. One of the principles is "Non-blocking IO in the message path". This is to avoid any unusual stalls caused by the processing of blocking IO operations. The call to resolve a host name can block for extended periods of time (BTW, if you are ever using an app on an otherwise fast machine and it stalls for weird periods of time, it is worth asking the question, <a href="https://isitdns.com/">is it DNS</a> causing the problem). Therefore we want to offload name resolution from our sender and receiver threads (the message path) onto the conductor, where we can perform the slower blocking operations.<br />
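<br />
As a rough sketch of that offload pattern (the class and method names here are hypothetical, not Aeron's internals): the message-path thread only enqueues a request, and the conductor performs the blocking lookup on its duty cycle.<br />
<pre>
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch only: keep blocking DNS lookups off the message path.
public class ResolutionOffload
{
    private final Queue&lt;String&gt; requests = new ConcurrentLinkedQueue&lt;&gt;();

    // Called from the sender/receiver (message path): cheap and non-blocking.
    public void requestReResolution(final String name)
    {
        requests.offer(name);
    }

    // Called from the conductor duty cycle: allowed to block.
    public void doWork()
    {
        String name;
        while ((name = requests.poll()) != null)
        {
            try
            {
                final InetAddress address = InetAddress.getByName(name); // may block
                // apply the new address to the endpoint here
            }
            catch (final UnknownHostException ignore)
            {
                // keep the old address and try again on a later scan
            }
        }
    }
}
</pre>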
<br />
<h3>
Name Resolvers</h3>
<br />
It was apparent very early on that we could make the resolution of names an abstract concept. Using DNS and host names is the most obvious solution, but it would be interesting to allow for names to come from other sources. E.g. we could name individual media drivers and use those names in our channel configuration. This allows a couple of neat behaviours. All of the configuration for naming can be self-contained within Aeron itself, independent of DNS (which may require configuration of a separate system), and we could also allow names to resolve to more than just IP addresses, e.g. host and port pairs or maybe direct to MAC addresses* in the future.<br />
<br />
* Bonus points if you can figure out why this might be useful. <br />
<br />
To support this, both the Java and C media drivers have the concept of a name resolver, with 2 production implementations: default (host name based) and driver, where the media drivers are responsible for managing the list of names. With driver-based name resolution we need a mechanism to communicate the names between the instances of the media driver across the network.<br />
<br />
<h3>
Enter the Gossip Protocol</h3>
<br />
To allow driver names to propagate across the network, Aeron supports a gossip-style protocol, where we have an additional frame type (resolution frame) that contains mappings of names to addresses. Currently, only IPv4 and IPv6 addresses are supported, but there is scope for adding others later.<br />
<br />
To make this work, for each media driver we specify 3 things: the <a href="https://javadoc.io/static/io.aeron/aeron-all/1.27.0/io/aeron/driver/MediaDriver.Context.html#resolverName--">name for the media driver</a> (this will default to the host name when not specified), a <a href="https://javadoc.io/static/io.aeron/aeron-all/1.27.0/io/aeron/driver/MediaDriver.Context.html#resolverBootstrapNeighbor--">bootstrap neighbour</a> to send initial name resolutions to, and a <a href="https://javadoc.io/static/io.aeron/aeron-all/1.27.0/io/aeron/driver/MediaDriver.Context.html#resolverInterface--">resolver interface</a>. The most important option is the resolver interface, as specifying this is what enables driver name resolution. It also determines which network interface to use to send and receive resolution frames and is the address reported to the neighbours for self-resolutions. This can also be a wildcard address (e.g. 0.0.0.0), in which case the neighbours will use the source address of the received resolution frames to identify that node.<br />
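<br />
Wiring these three options up in the Java driver looks roughly like the sketch below. The setters are the ones linked above; the names and port are made-up example values.<br />
<pre>
import io.aeron.driver.MediaDriver;

public class ResolverConfigExample
{
    public static void main(final String[] args)
    {
        // "node-a" and "node-b:8050" are illustrative values only.
        final MediaDriver.Context ctx = new MediaDriver.Context()
            .resolverName("node-a")                    // defaults to the host name if unset
            .resolverInterface("0.0.0.0:8050")         // setting this enables driver name resolution
            .resolverBootstrapNeighbor("node-b:8050"); // first neighbour to gossip with

        try (MediaDriver driver = MediaDriver.launch(ctx))
        {
            Thread.currentThread().join(); // run until the process is stopped
        }
        catch (final InterruptedException e)
        {
            Thread.currentThread().interrupt();
        }
    }
}
</pre>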
<br />
On start, each of the nodes will have an empty set of neighbour nodes and a bootstrap neighbour. Every 1s the driver name resolver will send out a self resolution, i.e. tell all the nodes that it knows about what its own name and address are. This will be sent (via UDP) to all of its known neighbour nodes and the bootstrap node (if not already in the neighbour list). Because the neighbour list is initially empty, messages will only be sent to the bootstrap neighbour on the first pass. The bootstrap neighbour can be specified using a host name and the driver name resolver will ensure that it is re-resolved periodically in case it too has died and come back with a different IP address.<br />
<br />
As a result of this, the driver name resolvers will start to receive resolution frames. The name/address entries from these frames will be added to a cache and the neighbour list. If the resolution frame has come through as a notification of a self resolution, we update a last activity timestamp for that node.<br />
<br />
Every 2s, the media driver will send its cache of name/address pairs to all of its neighbours, so eventually all of the nodes will know about all of the others as the name/address entries are shared around the cluster. At the higher layer, when the conductor tries to resolve a name supplied as an address on a channel URI, it will call the driver name resolver first, which can resolve the name from its cache, handing off to the default resolver if the name is not found.<br />
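<br />
Putting the timings together, the resolver's duty cycle looks something like the sketch below. This paraphrases the behaviour described above with hypothetical helper names; it is not the actual driver code, and the eviction timeout value is illustrative.<br />
<pre>
// Sketch of the driver name resolver's duty cycle (hypothetical names).
class ResolverDutyCycle
{
    long selfResolutionDeadlineMs = 0;
    long cacheBroadcastDeadlineMs = 0;
    final long timeoutMs = 10_000; // illustrative eviction timeout

    void doWork(final long nowMs)
    {
        if (nowMs >= selfResolutionDeadlineMs)
        {
            // Every 1s: tell the neighbours (and bootstrap neighbour) our own name/address.
            sendSelfResolution();
            selfResolutionDeadlineMs = nowMs + 1_000;
        }

        if (nowMs >= cacheBroadcastDeadlineMs)
        {
            // Every 2s: share the whole cache so entries spread around the cluster.
            sendCacheEntries();
            cacheBroadcastDeadlineMs = nowMs + 2_000;
        }

        // Drop cache/neighbour entries whose self resolutions have stopped arriving.
        evictEntriesOlderThan(nowMs - timeoutMs);
    }

    void sendSelfResolution() {}                          // hypothetical
    void sendCacheEntries() {}                            // hypothetical
    void evictEntriesOlderThan(final long deadlineMs) {}  // hypothetical
}
</pre>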
<br />
Periodically the cache and the neighbour list will be checked to see if we are still receiving self resolutions for a particular node. If the last activity timestamp hasn't been updated recently enough then the entries are evicted from the cache and neighbour list, under the assumption that the neighbour has died.<br />
<br />
All of this happens on the conductor thread so that it will not impact the performance of the sender and the receiver. This is primarily designed for small clusters of nodes, as all nodes will be gossiping to all other nodes once the resolutions have propagated across the network. It is not designed for large scale system-wide name resolution. However, it is a very new feature and we expect it to evolve over time as users experiment with it.<br />
<br />
<h3>
Write your own</h3>
<br />
With a lot of the algorithms within Aeron it is often not possible to pick a single implementation that suits everyone, so we offer the ability to provide your own (e.g. flow control, congestion control). Name resolution fits into that model as well. There is an <a href="https://github.com/real-logic/aeron/blob/master/aeron-driver/src/main/java/io/aeron/driver/NameResolver.java">interface for the Java driver</a> and a <a href="https://github.com/real-logic/aeron/blob/master/aeron-driver/src/main/c/aeron_name_resolver.h#L54">function pointer based struct in the C driver</a> that can be implemented by a user. So if there is a custom name resolution strategy that you would prefer to use, it can be plugged in quite easily.<br />
<br />
If you look carefully, you will notice that there is a 2-phase approach to resolving a name: there is a lookup method and a resolve method. The lookup method takes a name and returns a host name and UDP port pair, e.g. 'example.com:2020', whereas the resolve method takes in the host name portion of that pair and returns an internet address. The additional parameter name argument is so the resolver can distinguish between an endpoint and a control address.<br />
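<br />
A minimal sketch of a custom Java resolver follows. It assumes the interface shape described above (name, URI parameter name, re-resolution flag) and that the default host name resolver is reachable via a shared instance; check the linked source for the exact signatures before copying this.<br />
<pre>
import io.aeron.driver.DefaultNameResolver;
import io.aeron.driver.NameResolver;

import java.net.InetAddress;
import java.util.HashMap;
import java.util.Map;

// Sketch: resolve a fixed table of driver names, falling back to host names.
public class StaticTableNameResolver implements NameResolver
{
    private final Map&lt;String, String&gt; table = new HashMap&lt;&gt;();

    public StaticTableNameResolver()
    {
        table.put("node-b", "10.0.0.7"); // illustrative entry
    }

    public String lookup(final String name, final String uriParamName, final boolean isReLookup)
    {
        return DefaultNameResolver.INSTANCE.lookup(name, uriParamName, isReLookup);
    }

    public InetAddress resolve(final String name, final String uriParamName, final boolean isReResolution)
    {
        final String mapped = table.get(name);
        return DefaultNameResolver.INSTANCE.resolve(
            mapped != null ? mapped : name, uriParamName, isReResolution);
    }
}
</pre>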
<br />
<h3>
Conclusion</h3>
<br />
While perhaps not a ground-breaking feature, it is a useful one. It manages to provide the convenience of name-based resolution without compromising on the latency goals of Aeron. It is supported in both the Java (1.26.0) and C (1.27.0) media drivers. Feedback is always welcome; check out <a href="https://github.com/real-logic/aeron/wiki/Driver-Name-Resolution">the wiki</a> for more information.
<br />
<h2>
Flow Control in Aeron (2020-03-19)</h2>
One of my more recent projects has led me to become more involved in the <a href="https://github.com/real-logic/aeron/">Aeron</a> project. If you are unaware of Aeron, then head over to the <a href="https://github.com/real-logic/aeron/">Github site</a> and check it out. At its core is a reliable messaging system that works over UDP, Multicast UDP and IPC. It also contains an archiving feature for recording and replay and (still under active development) an implementation of the Raft protocol for clustering. Did I mention that it was <a href="https://github.com/benalexau/rpc-bench/blob/master/results/20161024/README.md">fast too</a>?<br />
<br />
I've spent the last few weeks buried in the various strategies that Aeron has for flow control, specifically modifying the existing flow control strategies and adding more flexible configuration on a per channel basis. Before I jump into that it would be useful to cover a little background first.<br />
<br />
<h3>
What is flow control?</h3>
<br />
Within a distributed system, the purpose of flow control is to limit the rate of a sender so that it does not overrun its associated receiver. UDP does not come with any form of flow control, therefore it is easy to create a sender that will outpace the receiver, leading to message loss. There are a number of <a href="https://en.wikipedia.org/wiki/Flow_control_(data)">different forms of flow control</a>, but I'm going to focus on the <a href="https://en.wikipedia.org/wiki/Sliding_window_protocol">sliding window flow control protocol</a> used by TCP and Aeron. The sliding window protocol requires that the sender maintain a buffer of data (referred to as a window). The size of this window will typically be communicated from the receiver to the sender as part of the protocol. With a bi-directional protocol like TCP the size of the window is communicated in each TCP segment header. This is the amount of data that the sender can transmit to the receiver before having to wait until an acknowledgement is received. If the application thread on the receiver side is busy and does not read the data from the socket while the sender continues to transmit, the window size value will decrease until it reaches 0, at which time the sender must stop and wait for an acknowledgement with a non-zero window size before sending again. There is a lot more networking theory around sizing the flow control window in order to get full utilisation of the network, but I will leave that as an exercise for the reader.<br />
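<br />
As a toy model of the core check (illustrative only, not the real TCP or Aeron code):<br />
<pre>
// Toy sliding-window model: the sender may transmit only while it stays
// within the window advertised by the receiver.
public class SlidingWindow
{
    private long ackedPosition = 0;        // highest position acknowledged by the receiver
    private long windowSize = 128 * 1024;  // advertised by the receiver
    private long senderPosition = 0;       // bytes sent so far

    public boolean trySend(final int length)
    {
        if (senderPosition + length > ackedPosition + windowSize)
        {
            return false; // window exhausted: wait for an acknowledgement
        }
        senderPosition += length;
        return true;
    }

    public void onAck(final long position, final long newWindowSize)
    {
        ackedPosition = Math.max(ackedPosition, position);
        windowSize = newWindowSize;
    }
}
</pre>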
<br />
With Aeron and UDP unicast it is very similar to TCP, however Aeron is a unidirectional protocol where the receivers send <a href="https://github.com/real-logic/aeron/wiki/Transport-Protocol-Specification#status-messages">status messages</a> to indicate to the sender that they are ready to receive data and how much. The status message indicates where the subscriber is up to using the consumption term id and consumption term offset for a specific channel/stream/session triple. The receiver window value is the amount of data that can be sent from that position before the sender needs to stop and wait for a new status message indicating that the receiver is able to consume more data. The size of the receiver window is at most ¼ of the term size and at least the size of the MTU (maximum transmission unit).<br />
<br />
However, one of the neat features of Aeron is that it supports multicast (and multi-destination-cast, for which the same rules will apply), where there are multiple receivers for the same publication. In this situation how do we determine what values should be used for the flow control window? This is a question that has no one right answer, so Aeron provides a number of configuration options and it is also possible to plug in your own strategy.<br />
<br />
In fact Aeron is the only tool that supports UDP multicast messaging with dynamic flow control (that we're aware of).<br />
<br />
<h3>
Max Flow Control</h3>
<br />
The simplest and fastest form of multicast flow control is a strategy where we take the maximum position of all of the receivers and use that value to derive the limit that the sender can use for publication. This means any receivers that are not keeping up with the fastest one may fall behind and experience packet loss.<br />
<br />
<h3>
Min Flow Control</h3>
<br />
This is the inverse of the max flow control strategy: instead we take the minimum position of all of the available receivers. This will prevent slower nodes (as long as they are still sending status messages) from falling behind. However, this strategy does run the risk that the slower nodes can hold up the rest of the receivers by causing back pressure that slows the publisher. Because this strategy needs to track all of the individual receivers and their positions, it also must handle the case where a node has disappeared altogether, e.g. it has been shut down or crashed. This is handled via a timeout (default 2s, but configurable). If status messages for a receiver have not been seen within that period of time, that receiver is ejected from the flow control strategy and the publisher is allowed to move forward.<br />
<br />
<h3>
Tagged Flow Control (previously known as Preferred Flow Control)</h3>
<br />
Tagged flow control is a strategy that attempts to mitigate some of the shortcomings of the min flow control strategy. It works by using a min flow control approach, but only for the subset of receivers that are tagged to be included in the flow control group. The min flow control strategy is a special case of this strategy where all receivers are considered to be in the group.<br />
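<br />
To make the differences between the three strategies concrete, here is a toy sketch of how each one derives the publisher's limit from a set of receiver positions. This is illustrative logic only, not Aeron's implementation; the real driver tracks positions and group membership via status messages.<br />
<pre>
// Toy derivation of the sender limit under each strategy (not Aeron's code).
public class FlowControlSketch
{
    public static long maxLimit(final long[] positions, final long window)
    {
        long max = Long.MIN_VALUE;
        for (final long p : positions)
        {
            max = Math.max(max, p);
        }
        return max + window; // slow receivers may be overrun and suffer loss
    }

    public static long minLimit(final long[] positions, final long window)
    {
        long min = Long.MAX_VALUE;
        for (final long p : positions)
        {
            min = Math.min(min, p);
        }
        return min + window; // the slowest receiver back-pressures the publisher
    }

    // Tagged: min over group members only; receivers outside the group are ignored.
    public static long taggedLimit(final long[] positions, final boolean[] inGroup, final long window)
    {
        long min = Long.MAX_VALUE;
        for (int i = 0; i &lt; positions.length; i++)
        {
            if (inGroup[i])
            {
                min = Math.min(min, positions[i]);
            }
        }
        return min + window;
    }
}
</pre>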
<br />
<h3>
Configuring Flow Control</h3>
<br />
One of the new features that came with Aeron 1.26.0 was the ability to control the flow control strategy directly from the <a href="https://github.com/real-logic/aeron/wiki/Channel-Configuration#flow-control-configuration">channel URI</a>, allowing for fine-grained control over each publication and subscription. Defaults can also be specified on the media driver context. On the publication side the channel can be specified as:<br />
<br />
<script src="https://gist.github.com/mikeb01/ec346fec9d03d2980426ed23eb89bb8a.js"></script>
The min and max flow control settings for the publication are the simplest, but the tagged one starts to get a little bit interesting. The <span style="font-family: 'courier new', courier, monospace;">,g:1001</span> specifies that the group tag is 1001 and any receiver that wants to be involved in flow control for this publication will need to specify that group tag. The subscription channel URI shows how to ensure that the receiver sends the appropriate group tag so that it will be included in the publisher's flow control group.<br />
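<br />
For anyone reading this without the embedded gist, the URIs look roughly like the following. The fc and gtag parameter names are those described in the channel configuration wiki linked above; the endpoint address is illustrative.<br />
<pre>
publication (max):    aeron:udp?endpoint=224.20.30.39:24326|fc=max
publication (min):    aeron:udp?endpoint=224.20.30.39:24326|fc=min
publication (tagged): aeron:udp?endpoint=224.20.30.39:24326|fc=tagged,g:1001
subscription (tag):   aeron:udp?endpoint=224.20.30.39:24326|gtag=1001
</pre>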
<br />
The tagged flow control strategy is really useful for situations where a channel has a number of different types of subscribers with different reliability requirements. A good example is where there is a flow of events that needs to go to a gateway service to be sent out to users, perhaps via HTTP, and also needs to go to a couple of archiving services to store the data redundantly in a database. It may be possible for the gateway nodes to easily deal with message loss, either by reporting an error to the user or re-requesting the data. However, it may not be possible for the archiving service nodes to do so. In this case the publication would specify the tagged flow control strategy and the subscriptions on the archiving services would use the <span style="font-family: 'courier new', courier, monospace;">gtag</span> parameter to ensure that they are included in the flow control group. The gateway services could leave the <span style="font-family: 'courier new', courier, monospace;">gtag</span> value unset and not impact the flow control on the publisher.<br />
<br />
While being able to include just the important subscribers in a flow control group so that they aren't overrun by the publisher is useful, there would still be an issue. If both of our archiving services happened to be down, eventually their receivers would be timed out and removed from the group. Wouldn't it be great if we could require that a group contain a certain number of tagged receivers before the publication can report that it is connected? That way we could ensure that our archiving service nodes were up before we started publishing data.<br />
<br />
<h3>
Flow Control Based Connectivity</h3>
<br />
It turns out that this is also now possible with the release of 1.26.0. For both the tagged flow control and the min flow control strategies we can specify a group minimum size that must be met before a publication can be considered connected. This is independent of the requirement that there needs to be at least one connected subscriber, therefore the default value for this group minimum size is 0. Like the strategy and the flow control group, the group minimum size can be specified on the channel URI.<br />
<br />
<script src="https://gist.github.com/mikeb01/f9b8e7414cca45d7ad24b77c4934427b.js"></script>
In both of these cases the group minimum size is set to 3. For the min flow control strategy we would need at least 3 connected receivers; for the tagged flow control strategy we would need at least 3 connected receivers with tag 1001, and any receivers without the tag are disregarded.<br />
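<br />
Again, roughly what those URIs look like. If I recall the syntax correctly the group minimum size follows a slash in the g parameter; check the wiki for the authoritative form.<br />
<pre>
min, at least 3 receivers:          aeron:udp?endpoint=224.20.30.39:24326|fc=min,g:/3
tagged, at least 3 with tag 1001:   aeron:udp?endpoint=224.20.30.39:24326|fc=tagged,g:1001/3
</pre>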
<br />
<h3>
Time Outs</h3>
<br />
One last new feature available in the channel URI configuration is the ability to specify the length of the timeout for the min and tagged flow control strategies. As mentioned earlier, this will default to 2s, but it can be set to any value. Some care should be taken in specifying this value: if it is too short then receivers may frequently time out during normal running. Status messages are emitted at least once every 200 ms (more if necessary), so any value shorter than that would not be useful. Too long and a failed receiver could result in a significant back pressure stall on the publisher. Setting this for the min and tagged flow control strategies:<br />
<br />
<script src="https://gist.github.com/mikeb01/5ea9604b4fb8e724798c4e7d23feb971.js"></script>
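For feed readers without the embedded gist, the timeout rides alongside the strategy in the fc parameter, roughly as follows; treat the exact syntax as indicative and defer to the gist and wiki.<br />
<pre>
min with 5s timeout:    aeron:udp?endpoint=224.20.30.39:24326|fc=min,t:5s
tagged with 5s timeout: aeron:udp?endpoint=224.20.30.39:24326|fc=tagged,g:1001,t:5s
</pre>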
<br />
<h3>
Summary</h3>
<br />
As mentioned earlier, the idea of using flow control to provide dynamic back pressure for a multicast messaging bus is a unique and powerful feature of Aeron. Being able to configure these settings on a per publication basis provides an extra level of flexibility to help our users build the systems that they need. Any questions? Come over to the <a href="https://gitter.im/real-logic/Aeron">Gitter channel</a> and chat to us.
<br />
<h2>
Resurrecting my blog (2020-03-19)</h2>
If you have been a long time follower of my blog you will have noticed that it has been a really long time since I posted any new content. Since my last post six years ago a lot has changed. I no longer work at LMAX and have started out as an independent consultant. If you are part of an organisation looking for assistance in building software, especially if you are facing challenges around performance (throughput, latency, scalability, or efficiency), then I might be able to help. My services and contact details are available at <a href="http://www.ephemeris.tech/">http://ephemeris.tech</a>.<br />
<br />
I am hoping to start posting again about some of the work I've been doing more recently. A good portion of it will be open source, so there should be plenty to share.
<br />
<h2>
The "Uncanny Valley" of L3 Cache Contention (2014-12-03)</h2>
While preparing for my talk at QCon SF 2014, I wanted to investigate a theory around how micro-benchmarks are not a useful reflection of how software may behave when run as part of a larger application, specifically due to contention in the last-level cache (L3* in current Intel CPUs).<br>
<br>
<a href="http://bad-concurrency.blogspot.com/2014/12/the-uncanny-valley-of-l3-cache.html#more">Read more »</a>
<br>
<h2>
Speaking in October and November (2014-09-23)</h2>
I'll be giving some talks over the next few months:<br />
<br />
<b>8 Oct:</b> <a href="http://www.meetup.com/auckland-software-craftsmanship/">Auckland Software Craftsmanship - 6 Years of test automation</a>.<br />
<b>16 Oct:</b> <a href="http://www.meetup.com/auckland-jug/events/208859552/">Auckland JVM Group - Stuff I learned about performance</a>.<br />
<b>5 Nov:</b> <a href="http://qconsf.com/presentation/stuff-i-learned-about-performance">QCon San Francisco 2014 - Stuff I learned about performance</a>.
<br />
<h2>
YOW! 2013 Video (2014-04-23)</h2>
My <a href="http://yow.eventer.com/yow-2013-1080/disruptor-3-0-details-and-advanced-patterns-by-michael-barker-1381">talk on the Disruptor at YOW! 2013</a> is now available at the <a href="http://yow.eventer.com/yow-2013-1080">YOW! Eventer</a> site.
<br />
<h2>
Linux Alternatives and Oracle Java (2014-01-29)</h2>
If, like me, you prefer to run the Oracle version of Java on your Linux machine as the default JDK, you will often find that the Linux distro has other ideas. Fedora, for example, ships a number of Java based applications as part of the distribution which include a dependency on the OpenJDK. When the distro installs OpenJDK it will generally be set up as the default for executing the various Java binaries (e.g. 'java', 'javac'). However, the team at Redhat built a system called alternatives, which maintains a set of symbolic links that allows the user to switch between multiple implementations of a package that support the same functionality. I've managed to understand enough about the alternatives package that I can now easily switch between the Oracle JDK and the OpenJDK.<br>
<br>
<a href="http://bad-concurrency.blogspot.com/2014/01/linux-alternatives-and-oracle-java.html#more">Read more »</a>
<br>
<h2>
An Alternative Multi-Producer Approach (2013-12-15)</h2>
Recently on InfoQ, Aliaksei Papou posted an <a href="http://www.infoq.com/articles/High-Performance-Java-Inter-Thread-Communications#anch104956">article</a> on some of his experiments with high performance interchange of messages between threads. There were a number of examples within the article, but I am going to focus on the multi-producer case. One of the optimisations that the article showed was that if you know the number of producers at initialisation time, you can build a structure that significantly reduces contention. The existing <a href="https://github.com/LMAX-Exchange/disruptor/blob/master/src/main/java/com/lmax/disruptor/MultiProducerSequencer.java"><span style="font-family: Courier New, Courier, monospace;">MultiProducerSequencer</span></a> does not have this constraint, which is essential for a large number of use cases. However, I wanted to see what we could achieve if I applied this approach to the Disruptor.<br>
<br>
<a href="http://bad-concurrency.blogspot.com/2013/12/an-alternative-multi-producer-approach.html#more">Read more »</a>
<br>
<h2>
Fonts on Fedora Core 19 (2013-10-10)</h2>
I recently upgraded my personal workstation from Ubuntu 12.04 to Fedora Core 19. I'm very happy with the change; I prefer the standard Gnome 3 to Canonical's Unity desktop, probably something to do with my workstation not being a tablet. However, one of the annoyances with Fedora Core has always been its font rendering - it was never as nice as a default Ubuntu install. After a couple of days of digging I've managed to figure out the magic incantation required to get all of my fonts looking good, including in Google Chrome.<br>
<br>
<a href="http://bad-concurrency.blogspot.com/2013/10/fonts-on-fedora-core-19.html#more">Read more »</a>
<br>
<h2>
Release of Disruptor 3.0.0 (2013-04-11)</h2>
I've decided that I'm a bit bored of putting a beta tag on various versions of the Disruptor, so I've decided to send forth Disruptor 3.0.0 into the world. The big challenges of this release were to clean up the code, come up with a better algorithm for handling multiple producers and, if I was lucky, make it even faster. I went down a couple of dark alleys initially with this release, but have come up again for air with a version that is less different from the 2.x version, but still brings some nice benefits. I had wanted to implement a few more functional tests and improve the documentation, but I could be waiting forever to get those aspects to a level where I was 100% satisfied.<br>
<a href="http://bad-concurrency.blogspot.com/2013/04/release-of-disruptor-300.html#more">Read more »</a>
<h2>
Speaking at Tech Mesh (2012-11-07)</h2>
<p>I'm happy to announce that I will be speaking at <a href="http://techmeshconf.com/techmesh-london-2012/presentation/The%20Disruptor,%20Now,%20Next%20and%20the%20Future">Tech Mesh</a> in December. I'll be speaking about the Disruptor from two perspectives: firstly looking briefly back at some of the history and motivations behind the Disruptor, then spending some time explaining the challenges of building high performance concurrent systems (like the Disruptor) and delving into how the JVM and hardware could change to support the development of these systems.</p>
<h2>
Talk from JAX London (2012-10-19)</h2>
<p>Last week I gave a talk on non-blocking concurrency at <a href="http://jaxlondon.com/">JAX London</a>. Here are the slides:</p>
<p><iframe src="http://www.slideshare.net/slideshow/embed_code/14806182" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe></p>
<p>The video is also available</p>
<p><iframe width="560" height="315" src="//www.youtube.com/embed/VBnLW9mKMh4" frameborder="0" allowfullscreen></iframe></p>
<h2>
Arithmetic Overflow and Intrinsics (2012-08-30)</h2>
<p>During a recent <a href="http://www.meetup.com/Londonjavacommunity/messages/40589162/">conversation</a> on the <a href="http://www.meetup.com/Londonjavacommunity/">LJC</a> mailing list around a <a href="http://cr.openjdk.java.net/~rriggs/CR6708398/webrev/">proposal</a> for adding a library to the JDK that would add support for handling integer overflow, a question arose: would the JVM be able to optimise this code to make efficient use of the hardware support for overflow detection if this functionality was implemented as a library? I made the comment that this problem is probably one best solved using intrinsics, but in the course of writing an explanation I thought it would be better explained in a blog post, so here goes...</p>
<h2>What is an Intrinsic?</h2>
<p>From the JVM's perspective an intrinsic is an identifiable code pattern (typically a method) where the JVM understands the intent, such that it can be compiled to more optimal machine specific assembly. This is really useful when you have multiple target platforms and a subset of those targets contain instructions that may not be available on the others. E.g. Intel's X86 instruction set is quite rich when compared to a RISC-type processor, such as ARM.</p>
<h2>An Example using POPCNT</h2>
<p>One of the simplest examples of an intrinsic is the <code>Integer.bitCount()</code> method and its optimisation into Intel's <code>POPCNT</code> instruction (available on Nehalem and later), partially because it can be disabled and the effects of it not being applied are easy to observe. Let's start with some simple code that calls the <code>Integer.bitCount()</code> method:</p>
<script src="https://gist.github.com/3524702.js?file=PopCntTest.java"></script>
<p>The implementation of the <code>Integer.bitCount()</code> is a reasonably complex combination of arithmetic and bit shifting in order to calculate the number of bits set to 1 within a given int<sup>1</sup>.</p>
<script src="https://gist.github.com/3524824.js?file=Integer.java"></script>
<p>If we run the <code>PopCntTest</code> class and print out the assembler generated by hotspot, we can see that this will result in quite a large number of instructions that need to be issued to the CPU. Running the class using the following command line (<code>-XX:-UsePopCountInstruction</code> disables the intrinsic):</p>
<p><code>java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:-UsePopCountInstruction PopCntTest</code>.</p>
<p>Generates the following assembly code:</p>
<script src="https://gist.github.com/3524926.js?file=popcnt.asm"></script>
<p>Now, let's look at what happens when we allow the intrinsic optimisation, this time the command line is:</p>
<p><code>java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly PopCntTest</code></p>
<script src="https://gist.github.com/3525050.js?file=intrinsic.asm"></script>
<p>In this case the entire body of the <code>Integer.bitCount()</code> method and its associated ~17 assembly instructions are replaced with a single instruction.
</p>
<h2>How it Works</h2>
<p>Inside of Hotspot (other JVMs may work differently) a number of things are happening to make this work. As Hotspot loads classes it builds an abstract syntax tree (AST) representation of the Java byte code. When executing the Java byte code, if the interpreter notices that a particular method has been called a certain number of times<sup>2</sup> (default is 10000) then Hotspot will look to optimise and JIT that method. Before optimising, the method signature will be matched against the set of predefined intrinsics, declared in <a href="http://hg.openjdk.java.net/jdk7u/jdk7u/hotspot/file/6e9aa487055f/src/share/vm/classfile/vmSymbols.hpp">vmSymbols.hpp</a>. If there is a match, Hotspot will replace the nodes in the AST with a set of nodes specific to the intrinsic that was matched. At some later point during the compile pass of the AST, it will see the new nodes and generate the optimised machine specific assembly for that part of the tree and type of node.</p>
<h2>Does it affect performance?</h2>
<p>This can be tested with a simple benchmark. I've used Google's Caliper micro-benchmarking framework with 2 test methods. The first uses the built-in <code>Integer.bitCount()</code>, the second uses my own implementation of the bit count. My implementation copies the Java implementation, but puts it in a method whose signature will not match the JVM defined intrinsic. Code:</p>
<script src="https://gist.github.com/3526133.js?file=PopCntBenchmark.java"></script>
<p>Results:</p>
<pre> benchmark ns linear runtime
IntegerBitCount 0.571 ====
MyBitCount 3.438 ==============================</pre>
<p>Fairly conclusive: the built-in <code>Integer.bitCount()</code> is 6-7 times faster than the one that is not optimised with an intrinsic. Hotspot supports a fair number of intrinsics for a variety of operations in the JDK libraries. Need to reorder the bytes in an int? <code>Integer.reverseBytes()</code> will compile down to a <code>BSWAP</code> instruction on Intel. This is also the mechanism by which <code>AtomicInteger.compareAndSet()</code> becomes a <code>LOCK CMPXCHG</code> instruction. When you combine intrinsics with Hotspot's daddy optimisation (inlining) it can produce some fairly optimal machine code<sup>3</sup>.</p>
<h2>Back to Overflow</h2>
<p>So how does this relate to the overflow conversation that turned up on the LJC list? I think (there may be subtleties that I'm missing) that if overflow checking was implemented as a JDK library function, then it would be straightforward for the Hotspot (and other JVM) teams to implement an optimisation that will allow overflow to be detected and raised by the hardware and run at "close to the metal" speeds. There are a number of good reasons why this approach is preferable. 1) It's easy to implement; adding a library feature is much easier than changing the language or JVM. 2) Optimisation can come later; developers can start testing the functional behaviour of the code. 3) Being library code it is much easier to support the feature in alternative JVMs and create back ports for older JVMs. Those whose primary concern is not performance can get something that is functionally correct, and when performance becomes a priority they simply upgrade to the appropriate JVM.</p>
<ol>
<li>Implementation hails from Hacker's Delight by Henry S. Warren</li>
<li>Actually I'm over simplifying here, with tiered compilation this decision is a little more complex.</li>
<li>Intrinsics and inlining aren't the only optimisations that Hotspot can perform; there is a <a href="https://wikis.oracle.com/display/HotSpotInternals/PerformanceTacticIndex">huge host of others</a> that help to produce fast code.</li>
</ol>
<h2>
Disruptor v3 - Faster Hopefully (2012-07-27)</h2>
<p>I've been working sporadically on the next major revision of the Disruptor, but am still making steady progress. I've merged my experimental branch into the main line and I'm working on ensuring comprehensive test coverage and re-implementing the Disruptor DSL.</p>
<p>As a matter of course I've been running performance tests to ensure that we don't regress performance. While I've not been focusing on performance, just some refactoring and simplification, I got a nice surprise: the new version is over twice as fast for the 1P1C simple test case; approximately 85M ops/sec versus 35M ops/sec. This is on my workstation, which is an Intel(R) Xeon(R) CPU E5620@2.40GHz.</p>
<p>Current released version (2.10.1)</p>
<pre>
[barkerm@snake disruptor-2]$ taskset -c 4-8,12-15 ant througput:single-test
througput:single-test:
[junit] Running
com.lmax.disruptor.OnePublisherToOneProcessorUniCastThroughputTest
[junit] Started Disruptor run 0
[junit] Disruptor=21,141,649 ops/sec
[junit] Started Disruptor run 1
[junit] Disruptor=20,597,322 ops/sec
[junit] Started Disruptor run 2
[junit] Disruptor=33,233,632 ops/sec
[junit] Started Disruptor run 3
[junit] Disruptor=32,883,919 ops/sec
[junit] Started Disruptor run 4
[junit] Disruptor=33,852,403 ops/sec
[junit] Started Disruptor run 5
[junit] Disruptor=32,819,166 ops/sec
[junit] Started Disruptor run 6
</pre>
<p>Current trunk.</p>
<pre>
[barkerm@snake disruptor]$ taskset -c 4-8,12-15 ant througput:single-test
througput:single-test:
[junit] Running
com.lmax.disruptor.OnePublisherToOneProcessorUniCastThroughputTest
[junit] Started Disruptor run 0
[junit] Disruptor=23,288,309 ops/sec
[junit] Started Disruptor run 1
[junit] Disruptor=23,573,785 ops/sec
[junit] Started Disruptor run 2
[junit] Disruptor=86,805,555 ops/sec
[junit] Started Disruptor run 3
[junit] Disruptor=87,183,958 ops/sec
[junit] Started Disruptor run 4
[junit] Disruptor=86,956,521 ops/sec
[junit] Started Disruptor run 5
[junit] Disruptor=87,260,034 ops/sec
[junit] Started Disruptor run 6
[junit] Disruptor=88,261,253 ops/sec
[junit] Started Disruptor run 7
</pre>
<p>There is still more to come. In addition to improvements for the single producer use case, the next version will include a new multiple producer algorithm that will replace the two that we currently have and work better than both of them in all scenarios.</p>
<h2>
Disruptor 2.10 Release (2012-05-13)</h2>
<p>The Disruptor version 2.10 has been <a href="http://code.google.com/p/disruptor/">released</a>. It is available from the Google Code <a href="http://code.google.com/p/disruptor/downloads/list">download page</a> and has been submitted to Maven central.</p>
<h2>Changes</h2>
<ul>
<li>Remove deprecated timeout methods.</li>
<li>Added OSGI metadata to jar file.</li>
<li>Removed PaddedAtomicLong and use Sequence in all places.</li>
<li>Fix various generics warnings.</li>
<li>Change Sequence implementation to work around IBM JDK bug and improve performance by ~10%.</li>
<li>Add a remainingCapacity() call to the Sequencer class.</li>
</ul>
<h2>
Interview on the Distributed Podcast (2012-04-03)</h2>
<p>Last month the guys from the <a href="http://distributedpodcast.com/2012/episode-12-lmax">Distributed Podcast</a> interviewed me about LMAX and the Disruptor. Many thanks to Jonathan and Rinat, I had a great time.</p>
<h2>
Speaking at QCon London (2012-02-14)</h2>
<p>Along with <a href="http://mechanical-sympathy.blogspot.com">Martin Thompson</a>, I'm speaking about <a href="http://qconlondon.com/london-2012/presentation/Lock-free%20Algorithms%20for%20Ultimate%20Performance">non-blocking concurrency</a> at <a href="http://qconlondon.com/">QCon London</a> in March.</p>
<h2>
Slides From Recent Presentations (2012-02-10)</h2>
<p>My slides from JAX London - Beginner's Guide to Hardcore Concurrency:</p>
<div style="width:425px" id="__ss_11508727"> <strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/mikeb01/beginners-guideconcurrency-11508727" title="Beginners guide-concurrency" target="_blank">Beginners guide-concurrency</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/11508727" width="425" height="355" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe> <div style="padding:5px 0 12px"> View more <a href="http://www.slideshare.net/" target="_blank">presentations</a> from <a href="http://www.slideshare.net/mikeb01" target="_blank">Michael Barker</a> </div> </div>
<p>Video: <a href="http://vimeo.com/30781988">LJC@Playfish</a>, <a href="http://www.youtube.com/watch?v=DCdGlxBbKU4">JAX London</a></p>
<p>My slides from Devoxx - Disruptor Tools In Action:</p>
<div style="width:425px" id="__ss_11508714"> <strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/mikeb01/disruptor-tools-in-action" title="Disruptor tools in action" target="_blank">Disruptor tools in action</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/11508714" width="425" height="355" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe> <div style="padding:5px 0 12px"> View more <a href="http://www.slideshare.net/" target="_blank">presentations</a> from <a href="http://www.slideshare.net/mikeb01" target="_blank">Michael Barker</a> </div> </div>
<p>Video: <a href="http://www.parleys.com/#st=5&id=2828&sl=0">Devoxx</a> (payment required)</p>
<h2>
Building a CPU Topology on MacOS X (2012-01-08)</h2>
<p>Within a project I've been working on I've had the need to simulate the capabilities of Linux's /proc/cpuinfo on Mac OS. Specifically I needed to build a topology of the CPUs on a given system, i.e. I needed to map the operating system's processors to hardware threads, then build a picture of the cores and sockets on which those threads reside. For example my Mac looks something like:</p>
<pre>
CPU0 (Thread 0) ---+
|---> Core 0 ---+
CPU1 (Thread 1) ---+ |
| ----> Socket 0
CPU2 (Thread 2) ---+ |
|---> Core 1 ---+
CPU3 (Thread 3) ---+
</pre>
<p>While this sounds very simple, it's actually fraught with a number of little niggles. Not only did it require getting down and dirty with a bit of C++ and X86 Assembly, it also required writing a MacOS kernel extension.</p>
<p>The first step was to understand what information was available from the CPU. Intel exposes an instruction called <a href="http://en.wikipedia.org/wiki/CPUID">CPUID</a>. This is the primary mechanism for getting information about the CPU. There is a raft of information available, from a listing of the CPU features supported (e.g. hyperthreading) to the sizes of the various levels of cache and the associated cache lines. To access the CPUID instruction we need a little bit of inline assembler.</p>
<script src="https://gist.github.com/1579376.js?file=cpuid-0.cpp"></script>
<p>The code shows how to get the vendor string from the CPU. On my Mac I get the following:</p>
<pre>// Output:
Vendor String: GenuineIntel</pre>
<p>For those unfamiliar with Intel inline assembly, the Intel CPU defines a number of registers. The ones used for the CPUID instruction are EAX, EBX, ECX, and EDX (referenced as RAX, RBX, etc. if using 64 bit instructions via the REX extension). These are used for both input and output. An inline asm segment consists of 3 parts. The first part is the instruction to be executed, in this case the <code>"cpuid"</code> instruction. The second part defines the output parameters. The snippet <code>"=a" (data[0])</code> means store the result in the EAX register in the variable <code>data[0]</code>; the <code>"=a"</code> refers to the 2nd letter of the register designation. The 3rd and final section is the input parameters. The CPUID instruction takes 2 parameters, one in EAX and one in ECX.</p>
<p>The particular CPUID leaf that provides the information needed to build the topology is <code>0xB (11)</code> - the extended topology enumeration leaf. The data returned from this instruction is:</p>
<pre>
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
EAX | Shift | Reserved |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
EBX | No. Process at this level | Reserved |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
ECX | Level No. | Level type | Reserved |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
EDX | x2APIC ID |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
</pre>
<p>The extended topology enumeration leaf is one of the CPUID indexes that also makes use of ECX as an input parameter. This indicates the level of the CPU that you wish to work at. I started at level 0 and worked my way up. The 'Level type' describes whether the current level is a CPU thread (1), a physical core (2) or invalid (0). All values greater than 2 are reserved, presumably for later use. The 2 other useful values are the <code>x2APIC ID</code> and <code>Shift</code>. The <code>x2APIC ID</code> is a unique identifier for the smallest logical unit in the CPU. The <code>Shift</code> value is used to group <code>x2APIC ID</code> values into units at the next logical level. This is done by shifting the <code>x2APIC ID</code> right by the value specified by <code>Shift</code>. For example, running the following code on my workstation (2 sockets, 8 cores, 16 threads):</p>
<script src="https://gist.github.com/1591282.js?file=cpuid.c"></script>
<p>Outputs the following:</p>
<pre>
Shift: 1, Count: 2, Level Id: 0, Level Type: 1, x2APIC: 33, Group: 16
Shift: 5, Count: 8, Level Id: 1, Level Type: 2, x2APIC: 33, Group: 1
</pre>
<p>This indicates that the hardware thread with APIC id 33 has the core id of 16 and socket id of 1. The socket and core ids aren't really references, but values indicating that threads which have the same id share that unit. This gets me most of the way there: I can group all of my threads into a hierarchy of cores and sockets. However there is one small niggle remaining. I need this code to run on all of the CPUs on my system. On Linux this is reasonably simple: you count the number of CPUs, then iterate through all of them and use <a href="http://www.kernel.org/doc/man-pages/online/pages/man3/pthread_setaffinity_np.3.html"><code>pthread_setaffinity_np(...)</code></a> to specify which CPU to run the CPUID instructions on.</p>
<p>However, on Mac OS X actual <a href="http://developer.apple.com/library/mac/#releasenotes/Performance/RN-AffinityAPI/_index.html">thread affinity is not supported</a>; there is just a mechanism for logically grouping or separating threads, powerful in its own right, but not what I'm after here. This is where the kernel module comes in. The XNU kernel defines a funky little method called <a href="http://fxr.watson.org/fxr/source/osfmk/i386/mp.c?v=xnu-792#L920"><code>mp_rendezvous(...)</code></a>. This method takes in a number of function pointers. The key one (action_func) is run once on each of the CPUs. So to get the topology information for all of the CPUs we can use it like so:</p>
<script src="https://gist.github.com/1591569.js?file=all_cpus.c"></script>
<p>Because the method <code>mp_rendezvous()</code> is defined in the kernel, the code above needs to be packaged as a <a href="http://developer.apple.com/library/mac/#documentation/Darwin/Conceptual/KernelProgramming/Extend/Extend.html">kernel extension</a>. Even then, getting access to the method is interesting. It's not defined in a header that can be easily included, however it is available at link time when compiling a kernel module. Therefore, in order to compile correctly, it's necessary to provide your own forward declaration of the method. The same is true of <code>cpu_number()</code>. Calling into the kernel method from user space requires use of the <a href="http://developer.apple.com/library/mac/#documentation/devicedrivers/conceptual/IOKitFundamentals/Introduction/Introduction.html">IOKit framework</a>, but I'll leave the details of that as an exercise for the reader.</p>
<h2>
Concurrent Sequencing (2011-12-31)</h2>
<p>A few weeks ago one of the users of the Disruptor posted some worrying benchmarks:</p>
<pre>ThreePublisherToOneProcessorSequencedThroughputTest
run 0: BlockingQueue=645,161 Disruptor=1,772 ops/sec
run 1: BlockingQueue=1,250,000 Disruptor=20,000,000 ops/sec
run 2: BlockingQueue=1,250,000 Disruptor=56 ops/sec</pre>
<p>It appears that under heavy contention, with fewer available cores than busy threads, the Disruptor can perform terribly. After a bit of investigation I managed to isolate the problem. One of the most complex parts of the Disruptor is the <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/MultiThreadedLowContentionClaimStrategy.java">multi-threaded claim strategy</a>. It is the only place in the Disruptor where - out of necessity - we break the <a href="http://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html">single-writer principle</a>.</p>
<p>The approach that we used was very simple. Each thread claims a slot in the ring buffer using AtomicLong.incrementAndGet(). This ensures that each claim will return a unique sequential value. The complexity arrives when multiple threads try to publish their sequences. We require that all events placed in the ring buffer must be made available to event processors in a strictly sequential order. To ensure this behaviour we have a method called serialisePublishing(). Our simple implementation would have the publishing thread busy spin until the last published sequence (the cursor) is one less than the value being published.</p>
<script src="https://gist.github.com/1544133.js?file=serialisePublishing.java">
</script>
<p>This works because each sequence published is unique and strictly ascending. For example, if one thread wants to publish the value 8, it will spin until the cursor value reaches 7. Because no other thread will be trying to publish value 8, it can make progress, ensuring the sequential publishing behaviour in the process. However, this busy spin loop causes problems when there are more threads than cores. The threads that need to wait for the prior sequences to be published can starve out the thread that should be updating the cursor. This leads to the unpredictable results shown above.</p>
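<p>For anyone reading via a feed where the gist above does not render, the busy-spin version has roughly the following shape (reconstructed from the description, not copied from the gist):</p>
<pre>
// Roughly the shape of the busy-spin serialisePublishing described above.
public void serialisePublishing(final Sequence cursor, final long sequence)
{
    final long expected = sequence - 1L;
    while (cursor.get() != expected)
    {
        // busy spin until the previous sequence has been published
    }
    cursor.set(sequence);
}
</pre>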
<p>We need a better solution. In an ideal world there would be a Java API that would compile down to the Intel <a href="http://software.intel.com/en-us/articles/how-to-use-the-monitor-and-mwait-streaming-simd-extensions-3-instructions/">MONITOR/MWAIT</a> instructions; unfortunately they're limited to Ring 0, so require a little <a href="http://blogs.oracle.com/dave/resource/mwait-blog-final.txt">kernel assistance to be useful</a>. Another option (but unavailable in Java) would be the Intel PAUSE instruction, which could be used in the middle of the spin loop. One of the problems with busy loops on modern processors is that, in order to keep the pipeline full, the CPU may speculatively execute the condition at the top of the loop, causing an unnecessarily high number of instructions to fill the CPU pipeline. This can starve other logical threads of CPU resources. The PAUSE instruction on hyper-threaded Intel processors <a href="http://drdobbs.com/parallel/225702843?pgno=1">can improve this situation</a>.</p>
<p>Java has neither of those, so we need to go back and address the shape of the serialisePublishing method. For the redesign I drew some inspiration from <a href="http://video.google.com/videoplay?docid=2139967204534450862">Cliff Click's non-blocking hash map</a>. There are 2 aspects of his design that are very interesting:</p>
<ul>
<li>No locks</li>
<li>No CAS retries</li>
</ul>
<p>While the first is obvious, the second is trickier. Anyone familiar with CAS operations will have seen the traditional loop-until-success approach to handling concurrent updates. For example, the incrementAndGet method inside the AtomicLong in Oracle's JVM uses a similar loop. It could look something like<sup>1</sup>:</p>
<script src="https://gist.github.com/1544126.js?file=atomic.java">
</script>
<p>While there are no locks here, it is not necessarily wait-free. It is <u>theoretically</u> possible, if a number of threads are trying to increment the value, for one (or more) of the threads to get stuck, unable to make progress, if other threads are constantly winning the race to the atomic operation. For an algorithm to be wait-free, all calls must complete within a fixed number of operations. One way to get closer to a true wait-free algorithm is to design the algorithm such that a failure of a CAS operation is a signal to exit rather than to retry the operation. Cliff Click's approach was to model the algorithm using a state machine, where all states are valid and a transition between states is typically a CAS operation. E.g. imagine a state machine with 3 states {A, B, C} and 2 transitions {A->B, B->C}. If an instance of the state machine is in state A and 2 threads try to apply the transition A->B, only one will succeed. For the thread that fails to apply its CAS operation, retrying makes no sense. The instance has already transitioned to state B. In fact the failure of the CAS operation is an indication that the instance is already in the desired state. The thread can exit if B is the desired state of the action, or try to apply the B->C transition if that is what's required.</p>
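<p>A small illustration of that idea (my own example, not Cliff Click's code): the losing thread treats CAS failure as confirmation that the work is already done, rather than as a reason to retry.</p>
<pre>
import java.util.concurrent.atomic.AtomicReference;

// CAS failure as a signal to exit: only one thread applies A->B and the
// loser knows the transition has already happened, so no retry loop is needed.
public class StateMachineExample
{
    enum State { A, B, C }

    private final AtomicReference&lt;State&gt; state = new AtomicReference&lt;&gt;(State.A);

    public void advanceToB()
    {
        if (!state.compareAndSet(State.A, State.B))
        {
            // Another thread already moved the machine to B (or beyond);
            // the desired state has been reached, so simply return.
        }
    }
}
</pre>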
<p>How does this apply to our concurrent sequencing problem? We could allow threads to continue to make progress while waiting for other threads to catch up by maintaining a list of sequences that are pending publication. If a thread tries to publish a sequence that is more than 1 higher than the current cursor (i.e. it would need to wait for another thread to publish its sequence), it could place that sequence into the pending list and return. The thread that is currently running behind would publish its own sequence, then check the pending list and publish those sequences before exiting.</p>
<p>To represent this as a state machine we would have 3 states {unpublished, pending, published} and 2 transitions {unpublished->pending, pending->published}. In recognition of the fact that computing resources are finite, we have a guard condition on the unpublished->pending transition, i.e. a limit on the number of sequences we allow in the pending state. Because each sequence is unique, the transition unpublished->pending does not require a CAS operation. The pending list is represented as an AtomicLongArray and the transition is a simple AtomicLongArray.set() where the index is the sequence modulo the size of the pending list<sup>2</sup>. The final transition pending->published is where the CAS operation comes in. The thread will first try to publish its own sequence number. If that passes, then the thread will try to publish the next value from the pending list. If the CAS fails, the thread leaves the method. The failure means that the value is already published or will be by some other thread.</p>
<script src="https://gist.github.com/1544238.js?file=serialisePublishing2.java">
</script>
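<p>Again, in case the gist doesn't render, the rough shape of the algorithm is sketched below. The names and details are mine for illustration (and the guard that spins when the pending list is full is omitted), so don't treat this as the actual Disruptor source:</p>
<pre>import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

public class PendingPublication {
    private static final int PENDING_SIZE = 1024; // must be a power of 2
    private static final int PENDING_MASK = PENDING_SIZE - 1;

    private final AtomicLong cursor = new AtomicLong(-1);
    private final AtomicLongArray pending = new AtomicLongArray(PENDING_SIZE);

    public void serialisePublishing(long sequence) {
        // unpublished->pending: each sequence is unique, so a plain set
        // into its slot is enough, no CAS required.
        pending.set((int) sequence & PENDING_MASK, sequence);

        // pending->published: publish our own sequence, then drain any
        // successors parked in the pending list. A failed CAS is a signal
        // to exit: the value is already published, or another thread will
        // publish it on our behalf.
        long expected = sequence - 1;
        long next = sequence;
        while (cursor.compareAndSet(expected, next)) {
            expected = next;
            next++;
            if (pending.get((int) next & PENDING_MASK) != next) {
                break; // the next sequence hasn't been parked yet
            }
        }
    }
}</pre>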
<p>Running the multi-publisher performance test on my 2-Core laptop (where at least 4 threads would normally be required):</p>
<pre>ThreePublisherToOneProcessorSequencedThroughputTest
run 0: BlockingQueue=5,832,264 Disruptor=6,399,590 ops/sec
run 1: BlockingQueue=5,521,506 Disruptor=6,470,816 ops/sec
run 2: BlockingQueue=5,373,743 Disruptor=6,931,928 ops/sec</pre>
<p>Sanity restored.</p>
<p>This update will be included in the 2.8 release of the Disruptor as the default implementation of the <a href="http://code.google.com/p/disruptor/source/browse/trunk/code/src/main/com/lmax/disruptor/MultiThreadedClaimStrategy.java">MultiThreadedClaimStrategy</a>. The old implementation will still be available as MultiThreadedLowContentionClaimStrategy. Those who have plenty of cores and whose publishers aren't often contended may find the old implementation faster, as it should be: it is simpler and requires fewer memory barriers. I'm going to continue to revise and work on this code. While improved, it is not truly wait-free: it is possible for one of the threads to get stuck doing all of the publishing.</p>
<p><sup>1</sup> The AtomicLong.getAndIncrement() does use a slightly different loop structure, but the semantics are the same.<br /><sup>2</sup> Actually it's the sequence bitwise-ANDed with the pending list size minus 1. This is equivalent when the size of the pending list is a power of 2.</p>Michael Barkerhttp://www.blogger.com/profile/16232688819921959379noreply@blogger.com6tag:blogger.com,1999:blog-6204005997345471829.post-61300465239979977082011-12-23T07:15:00.004+00:002011-12-23T07:15:59.516+00:00Blog Rename and Video LinksI've decided to rename my blog. I plan to focus my blog efforts more on concurrency than any other particular subject, and I thought it would be fun to pay homage to the <a href="http://www.badscience.net/">Bad Science blog</a> by Ben Goldacre.<br />
<br />
It will probably kill my traffic for a while, until I get the few blog aggregators that carry my blog to update their feeds. So welcome to my rebranded blog; I'll start posting again in the next few weeks. In the meantime, here are some links to videos from the various conferences that I spoke at over the past month or 2.<br />
<br />
<br />
<ul>
<li><a href="http://vimeo.com/30781988">LJC @Playfish</a> - Beginner's Guide to Concurrency (the trial run)</li>
<li><a href="http://www.youtube.com/watch?v=DCdGlxBbKU4">JAX London</a> - Beginner's Guide to Concurrency</li>
<li><a href="http://www.parleys.com/#st=5&id=2828&sl=0">Devoxx</a> - A tools in action session on the Disruptor (this was a experiment that didn't work very well, an attempt at live coding)</li>
</ul>Michael Barkerhttp://www.blogger.com/profile/16232688819921959379noreply@blogger.com0tag:blogger.com,1999:blog-6204005997345471829.post-88435377872765785522011-09-03T12:10:00.001+01:002011-09-03T18:17:49.661+01:00Upcoming Speaking EventsI've had a few speaking events confirmed for the end of this year:<br />
<br />
<br />
<ul>
<li><a href="http://mechanitis.blogspot.com/">Trisha Gee</a> and I are speaking at <a href="http://jaxlondon.com/2011/sessions/?tid=2175#session-20003">JAX London on November 2nd</a>.</li>
<li>I'm presenting the Disruptor in a half-hour "Tools in Action" slot at <a href="http://devoxx.com/display/DV11/The+Disruptor%2C+High+Performance+Inter-Thread+Messaging">Devoxx 2011 on November 14th</a>.</li>
<li>I'm speaking on Technology and Performance Folklore at <a href="http://devdays.stackoverflow.com/london/agenda/">StackOverflow Dev Days on November 15th</a>.</li>
</ul>
<br />
<br />
Trisha and I are hoping to preview our talk to the LJC sometime in October.<br />
<br />Michael Barkerhttp://www.blogger.com/profile/16232688819921959379noreply@blogger.com0tag:blogger.com,1999:blog-6204005997345471829.post-68420158005375410662011-08-18T10:58:00.000+01:002011-08-18T10:58:40.946+01:00ALL of the hotspot flagsEver wanted to know what flags could be set and their default values for the Hotspot VM? Try this:<br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">java -XX:+PrintFlagsFinal</span><br />
<br />
<br />
Michael Barkerhttp://www.blogger.com/profile/16232688819921959379noreply@blogger.com3tag:blogger.com,1999:blog-6204005997345471829.post-66771659439554311112011-07-20T21:19:00.001+01:002020-11-03T07:54:18.669+00:00Now Officially an OpenJDK ContributorBeing someone who's interested in Open Source and very focused on performance, I've been looking at Java & the JVM and wondering about what I would most like to see added. IMHO one of the fundamental missing pieces is support for complex stack-allocated types. This is not a new idea. C# calls them structs. In C, C++ and Go, stack & heap allocation is orthogonal to the definition of the type.<br />
<br />
An implementation for Java is described by John Rose (some time ago) <a href="http://blogs.oracle.com/jrose/entry/tuples_in_the_vm">on his blog</a>. While this is presented as a feature that would benefit alternative languages on the JVM, I think this could bring something significant to Java as well (not that I actually know what such a language feature would look like, I'm not a language designer). It would improve the efficiency of some very useful design patterns that tend to be avoided (by me anyway) due to the cost of heap allocation. E.g. a number of classes in LMAX's code base consist of aggregations of 64-bit integers.<br />
<br />
<pre>public class OrderInstruction {
public OrderInstruction(long accountId,
long instrumentId,
long clientOrderId,
long stopPrice,
long price,
long quantity) {
// ...
}
}</pre><br />
While very efficient, it can lead to a number of errors, as it is very easy to screw up an assignment; nothing would prevent assigning the accountId to the price. However, with support for tuples/value types/structs it would be possible to do the following (I've borrowed from the C# syntax):<br />
<br />
<pre>public struct AccountId {
public AccountId(long id) {
// ...
}
}
public struct InstrumentId {
public InstrumentId(long id) {
// ...
}
}
public class OrderInstruction {
public OrderInstruction(AccountId accountId,
InstrumentId instrumentId,
ClientOrderId clientOrderId,
Price stopPrice,
Price price,
Quantity quantity) {
// ...
}
}</pre><br />
Given any reasonable implementation, the structs AccountId, InstrumentId, etc., would compile down to a long (and be no less efficient). They would enforce type safety, preventing accidental incorrect assignment. Another benefit is that it allows seamless extension of the types, with far less impact. E.g. if we need to extend the AccountId to include a realm field, no change to OrderInstruction would be required.<br />
<br />
<pre>public struct AccountId {
public AccountId(int realm, long id) {
// ...
}
}</pre><br />
This pattern is known as Tiny Types. Sometimes it's the small things that matter.<br />
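For contrast, this is roughly what a Tiny Type has to look like in today's Java: a full class, which carries exactly the per-instance overhead (object header, heap allocation, pointer indirection) that the struct version would avoid.<br />
<br />
<pre>// Today's Java: a Tiny Type is a real object. Every AccountId is a heap
// allocation, and an array of them is an array of pointers rather than
// a packed array of longs.
public final class AccountId {
    private final long id;

    public AccountId(long id) {
        this.id = id;
    }

    public long value() {
        return id;
    }
}</pre>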
<br />
The other nice feature would be support for more complex and suitably efficient numeric types, e.g.:<br />
<br />
<pre>public struct Rational {
public Rational(long numerator, long denominator) {
// ....
}
}
// This would be a favourite at LMAX
public struct Decimal {
public Decimal(long value, int exponent) {
// ...
}
}</pre><br />
We could even steal some features from Google Go:<br />
<br />
<pre>public struct Slice {
public Slice(byte[] array, int offset, int length) {
// ....
}
}</pre><p><br />
Pairs, Tuples and Option types could also be implemented efficiently. Instances of these types would be passed and assigned by copy. The memory for these types would be either allocated on the stack or inline within an object when laid out on the heap. This could bring (I hope) a significant improvement in spatial cache locality and reduce the pressure on the garbage collector.<br />
<br />
I think this feature is so important that I have started implementing it myself as part of the MLVM project. My <a href="http://permalink.gmane.org/gmane.comp.java.openjdk.mlvm.devel/3601">first</a> <a href="http://permalink.gmane.org/gmane.comp.java.openjdk.mlvm.devel/3600">patches</a> have now been accepted into MLVM. I'm not sure if or when this will end up in the JDK proper or if the Java language will support it, but it has to start somewhere.</p>Michael Barkerhttp://www.blogger.com/profile/16232688819921959379noreply@blogger.com10tag:blogger.com,1999:blog-6204005997345471829.post-50469350323528783192011-06-22T16:07:00.001+01:002011-06-23T09:11:02.206+01:00Disruptor, Now Open Source!The <a href="http://code.google.com/p/disruptor/">Disruptor</a>, the design pattern at the core of the financial exchange I'm helping build at LMAX, has been <a href="http://code.google.com/p/disruptor/">open sourced</a>. If you've seen the <a href="http://www.infoq.com/presentations/LMAX">presentation</a> that Martin Thompson and I gave at QCon or the <a href="http://skillsmatter.com/podcast/java-jee/how-to-do-100k-tps-at-less-than-1ms-latency">one</a> I did for the LJC @Skillsmatter, this is the ring-buffer-based code that we've been banging on about for a while.<br />
<br />
<span class="Apple-style-span" style="font-size: large;">What is the Disruptor?</span><br />
<br />
At its simplest, it's an alternative to a queue, but it can also be thought of as a simple actor-like concurrency framework.<br />
<br />
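As a taste of what using it looks like, here's a minimal publish/consume sketch. Note that this uses the shape of the later 3.x DSL API rather than the API as originally released, so treat it as illustrative only:<br />
<br />
<pre>import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class DisruptorExample {
    // A single mutable event type; instances are pre-allocated and
    // reused in place by the ring buffer.
    static class LongEvent {
        long value;
    }

    public static void main(String[] args) {
        // Ring buffer size must be a power of 2.
        Disruptor&lt;LongEvent&gt; disruptor = new Disruptor&lt;&gt;(
                LongEvent::new, 1024, DaemonThreadFactory.INSTANCE);

        // Consumer side: handlers run on their own threads.
        disruptor.handleEventsWith(
                (event, sequence, endOfBatch) -> System.out.println(event.value));

        RingBuffer&lt;LongEvent&gt; ringBuffer = disruptor.start();

        // Producer side: claim a slot, write into it, publish.
        for (long i = 0; i &lt; 10; i++) {
            ringBuffer.publishEvent((event, seq, v) -> event.value = v, i);
        }
    }
}</pre>
<br />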
<span class="Apple-style-span" style="font-size: large;">Why?</span><br />
<br />
Performance: it achieves roughly 8-9 times the throughput and 3 orders of magnitude lower latency than Java's ArrayBlockingQueue. There's a technical article on the Google Code site with more details of the implementation and comprehensive performance test results.Michael Barkerhttp://www.blogger.com/profile/16232688819921959379noreply@blogger.com0