Bad Concurrency: Arithmetic Overflow and Intrinsics

During a recent conversation on the LJC mailing list around a proposal for adding a library to the JDK that would add support for handling integer overflow a question arose. Would the JVM be able to optimise this code to make efficient use of the hardware support for overflow detection if this functionality was implemented as a library. I made the comment that this is problem is probably one best solved using intrinsics, but in the course of writing an explanation I thought it would be better explained in a blog post, so here goes...

What is an Intrinsic?

From the JVM perspective an intrinsic an identifiable code pattern (typically a method) where the JVM understands the intent, such that it can be complied to more optimal machine specific assembly. This is really useful when you have multiple target platforms and a subset of those targets contain instructions that may not be available on the others. E.g. Intel's X86 instruction set is quite rich when compared to a RISC-type processor, such as ARM.

An Example using POPCNT

One of the simplest examples of an intrinsic is the Integer.bitCount() method and the optimisation into Intel's POPCNT instruction (available on Nehalem and later), partially because it can be disabled and the effects of it not being applied are easy to observe. Lets start with some simple code that calls the Integer.bitCount() method:

The implementation of the Integer.bitCount() is a reasonably complex combination of arithmetic and bit shifting in order to calculate the number of bits set to 1 within a given int¹.

If we run the PopCntTest class and print out the assembler generated by hotspot, we can see that this will result in quite a large number of instructions that need to be issued to the CPU. Running the class using following command line (-XX:-UsePopCountInstruction disables the intrinsic):

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:-UsePopCountInstruction PopCntTest.

Generates the following assembly code:

Now, lets look at what happens when the we allow the intrinsic optimisation, this time the command line is:

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly PopCntTest

In this case the entire body of the Integer.bitCount() method and it's associated ~17 assembly instructions are replaced with a single instruction.

How it Works

Inside of Hotspot (other JVMs may work differently) a number of things are happening to make this work. As Hotspot loads classes it builds an abstract syntax tree (AST) representation of the Java byte code. When executing the Java byte code, if the interpreter notices that a particular method has been called a certain number of times² (default is 10000) then Hotspot will look to optimise and JIT that method. Before optimising, the method signature will be matched against the set of predefined intrinsics, declared in vmSymbols.hpp. If there is a match Hotspot will replace the nodes in AST with a set of nodes specific to the intrinsic that was matched. At some later point during the compile pass of the AST, it will see the new nodes and generate the optimised machine specific assembly for that part of the tree and type of node.

Does it affect performance?

This can be tested with a simple benchmark. I've used Google's Calliper micro-benchmarking framework with 2 test methods. The first uses the built-in Integer.bitCount(), the second uses my own implementation of the bit count. My implementation copies the Java implementation, but puts it in a method whose signature will not match the JVM defined intrinsic. Code:

Results:

      benchmark    ns linear runtime
IntegerBitCount 0.571 ====
     MyBitCount 3.438 ==============================

Fairly conclusive, the built-in Integer.bitCount() is 6-7 times faster than the one that is not optimised with an intrinsic. Hotspot supports a fair number of intrinsics for a variety of operations in the JDK libraries. Need to reorder the bytes in a int? Integer.reverseBytes() will compile down to a BWSAP instruction on Intel. This is also the mechanism by which AtomicInteger.compareAndSet() becomes a LOCK CMPXCHG instruction. When you combine intrinsics with Hotspot's daddy optimisation (inlining) it can produce some fairly optimal machine code³.

Back to Overflow

So how does this relate to the overflow conversation that turned up during the conversation on the LJC list? I think (there may be subtleties that I'm missing) that if overflow checking was implemented as a JDK library function, then it would be straight forward for the Hotspot (and other JVM) teams to implement an optimisation that will allow for overflow to be detected and raised by the hardware and be able to run at "close to the metal" speeds. There are a number of good reasons why this approach is preferable. 1) It's easy to implement, adding a library feature is much easier than changing the language or JVM. 2) Optimisation can come later, developers can start testing the functional behaviour of the code. 3) Being library code it is much easier to support the feature in alternative JVMs and create back ports for older JVMs. For those whose primary concern is not performance can get something that is functionally correct. When performance becomes a priority then they simply upgrade to the appropriate JVM.

Implementation hails from Hacker's Delight by Henry S. Warren
Actually I'm over simplifying here, with tiered compilation this decision is a little more complex.
Intrinsics and inlining aren't only optimisations that Hotspot can perform, there is a huge host of others that help to produce fast code.

4 comments:

robann said...: Superb article Mike!

I completely agree that these features should be added as utilities rather than adding complexity to the basic language - especially as this must only be a requirement for a tiny percentage of Java users.; 30 August 2012 at 14:09
Unknown said...: Is there an equivalent intrinsic for getting the number of set bits on a Java Byte?

I can always do Integer.bitCount(someByte) since byte to Int is not a downcast. Is this cast an expensive operation.

What I am really looking for is a Java equivalent for __builtin_popcount(some_uint8_t); 17 July 2013 at 12:07
Michael Barker said...: Hi Rajiv,

There's no Byte.bitCount that I'm aware of. You'll be stuck with:

Integer.bitCount(0xFF & b)

For the time being. Don't forget the mask otherwise negative values will give incorrect results.

As to the performance you'd have to measure it yourself. Probably worth looking at the generated assembler too.; 17 July 2013 at 22:05
Unknown said...: Oops! I missed the mask. Thanks for the article.; 18 July 2013 at 10:53

Thursday, 30 August 2012

Arithmetic Overflow and Intrinsics