Wednesday 3 December 2014

The "Uncanny Valley" of L3 Cache Contention

While preparing for my talk at QCon SF 2014, I wanted to investigate a theory around how micro-benchmarks are not a useful reflection of how software may behave when run as part of a larger application.  Specifically due contention in the last-level cache (L3* in current Intel CPUs).

An Example

Last year while working on a system to store market data, I needed a time series data store.  I'd used LevelDB to solve a similar problem in the past and looked at using it again.  The system I was building had a reasonably high throughput requirement (160K ops/sec).  I built a micro-benchmark around the data I was trying to store in order to verify that that LevelDB would be sufficiently fast.  It managed to clock in at around 400K ops/sec, which suggested that it would be a workable solution.  However, when I merged the code into our main application an ran it within our end to end performance testing environment and it was unable to keep up the aforementioned throughput rate.

My suspicion is when running the code within the micro-benchmark it had the whole L3 cache to itself.  When running as part of a larger application, with a number of other busy threads running concurrently, the shared L3 cache becomes a point of contention significantly impacting the throughput of the data store.  The problem with a contended L3 is that each thread will be constantly replacing the values in that cache pushing out the cache lines of the other thread, meaning there is a much greater probability of the memory access missing cache and needing to access the data from main memory.

CPU Architecture Refresher

If you have a general interest in computing performance and haven't been hiding under a rock for the past 10 years, then you are probably aware that modern CPUs have a multi-layer memory hierarchy; consisting of a series of progressively larger and slower** caches between the processor and main memory.  The diagram below shows the layout of a Intel Nehalem CPU.

As you can see each of the individual cores has a dedicated L1 and L2 cache, but the L3 cache is a shared resource.  If 2 threads are running on 2 separate cores and their working set fits cleanly within L1 or L2 and the memory used by each is independent, then the 2 threads can run in parallel without impeding each other.  So the question remains that if the working set is larger and requires L3 how will that impact the performance of individual threads.

A Performance Test

To see if I could understand the impact I wrote a small performance test.  The test works by holding two integer arrays.  The first holds a set of random values, the second holds indexes into the first array.  The code iterates over the array of indexes, reading values from the first array.  The index array is shuffled resulting in a data access pattern that mixes sequential and random access.  The test measures the how quickly all of the values from the data array are loaded via the index array.

The test runs with a varying number of threads (one to five) with a range of working set sizes, ranging from 16 bytes to 128 MB.  The chart below shows the throughput for varying thread counts against the working set size. 

The throughput of each scenario proceeds in a relatively linear fashion until the size of the working set approaches the size of the L3, which happens to be 8MB on my test machine.  After this point the performance degrades as most of the access stop hitting L3 and have to fetch data from main memory.  What is interesting is the point around 8MB where the throughput of all the cases converge to roughly the same value.  If instead of plotting the throughput we calculated the "speed up" that each of the scenarios had over the baseline single-threaded case, we would see the scalability*** of the code, by working set size.

There is clearly a dramatic drop in the "speed up" or scalability when the working set for each thread matches the size of the L3.  This is what I referred to as the "uncanny valley" in the title of the post.  At this point the code is "perfectly contended" or, to quote the story of Mel, "most pessimum".  This shows very clearly that multiple threads working on completely independent sets of data can contend with each other when running on different cores, but the same socket.  To the point where the performance degrades such that it is equivalent of running the code on a single thread.

There are a couple things that can be taken from this:
  • Beware believing the specific numbers from micro-benchmarks as there is a good chance that when running on a busy system that other threads on the same socket could be reducing the overall performance of the code that was benchmarked.
  • If writing parallel code, working set size is very important.  In fork/join style applications the point where the code stops subdividing should take into account the size of the working set to avoid contending on the L3.
There is one further thing to consider.  The test above is based on a mix of sequential and random access.  If we change the tested so that the index array wasn't shuffled, so that all memory accesses are sequential, does the "uncanny valley" still make an appearance?

Looking at these charts, there is only a small movement in the "speed-up" at 8MB for the 2 Threads case.  However, this could be explained by run-to-run variance, so is not conclusive.  I think that if memory accesses are purely sequential then L3 cache contention is unlikely to have a significant affect on performance and scalability.  So (surprise, surprise) algorithms that are completely (or mostly) sequential with regards to the way they access memory are going to suffer from fewer performance artefacts.  However, there are many algorithms that are random access by their very nature, e.g. the mark phase of a garbage collection system.  In those cases paying attention to working set size and what other threads in the system are doing will be important to ensure that the code runs well.

* Some of the latest Haswell CPUs have an L4 cache.
** Specifically higher latency.
*** Increase in throughput with additional threads.


Bill Broadley said...

I've written similar micro benchmarks and encountered various bottlenecks like your mentioned uncanny valley. How ever when I was careful with allocation, alignment, and stride they disappeared.

Can you provide source so I can see exactly what you are doing?

Michael Barker said...

Unknown said...

In ~2008 I wrote an article on hoisting micro-benchmarks up into some arbitrary application that is instrumented and metered to counter such overstated "positives" in micro benchmarks. Then at measurement points (methods) the micro benchmark is executed via an AOP like callback. I tried explaining this more recently on another thread but I was probably not in the right frame of mind to get the message across the barriers put in.

Dr. Guy Gordon said...

Your first chart is interesting. You write "linear fashion until the size of the working set reaches the size of the L3". More accurately, contention starts when the sum of the working sets reaches the L3 size. As one might expect, 4 Threads start contending for L3 at 2MB, etc.

Good work. Especially in pointing out how sequential access avoids the problem. I've seen huge improvements by refactoring data structures to remove pointers and factor in cache line size.

Bill Broadley said...
This comment has been removed by the author.
Bill Broadley said...

Took me awhile to realize what bothered me about these graphs. So I did a haswell latency chart. Note the y axis is log:

Note a few things about the graph:
* The last level of cache is 20MB, so all the curves decay there.
* the performance at least doubles with each higher level of cache.
* That the 4 thread random at 16kb is about 16 times (2^4th) faster than at 10MB.
* There's no inflection point where all the lines cross, so no uncanny valley
* The graph represents pure random, which should be worse than a mix of sequential and random.

I find the java results quite puzzling, especially not getting faster with higher levels of cache.

I'd be interested to see a pure random latency graph from java, maybe that would help explain what is going on.

I'm also very suspicious of the 4 thread numbers at 128MB. Guess it depends on how many of the accesses are sequential. Generally I'd expect main memory latency around 70ns, and 2 threads hitting 2 memory chanels (common on haswell desktops) would manage a throughput of one cache line every 35ns or so. The graph shows 120M or so, which is a random lookup every 8ns or so.

Michael Barker said...

Hi Bill,

Interesting results, could you post the code for your benchmark?


Bill Broadley said...

Michael Barker said...

I've been digging into the tests again and I think the lack of performance in the smaller datasets is due to the cost of JMH's blackhole to prevent dead code elimination. I'm reworking to the test to avoid its use. When I get some more time I'll repost new results.

I had a look through your test. I might actually steal the approach for ensuring 100% random access instead of the 50% mix I have now. However, I ran your code through the linux perf profiler and it seems to be spending the majority (>95%) of its time on a 'TEST' instruction and not on the 'MOV's. If this is correct then it is possible that your test is not actually measuring memory access time and not encountering the same contention that my test is seeing. Similar profiling of my test shows most of the time spent on 'MOV' instructions. However, I'm not sure how accurate perf is, this could be an artefact of pipe-lining or some other inaccuracy.

Bill Broadley said...

Assuming you mean the latency test it's just:
while (p > 0)
p = a[p];

I've verified with various CPU counters that the expected cache misses and page misses happen in TLB friendly mode and TLB thrashing mode. I suspect the waiting at the test instruction is because that's the first use of P. So the load from a[p] is speculatively executed, but the test forces a wait for the result. The register holding p can't be tested until the previous load finishes.

So spending 95% of the time in the test doesn't surprise me, it's the instruction that forces waiting for the load to finish.

I wrote it this way to completely defeat any attempt at prefetch, or cheating in any way. There's no way to know here the next load is coming from until the current load finishes.

At first I didn't believe all the numbers, especially on more complex dual/quad socket systems. But after careful analysis I believe the numbers.

Michael Barker said...

Hi Bill,

I think your test is measuring something different to mine. My original thesis was that if I have 2 threads each with a working set the same size as the L3, then I would see contention, and performance would not scale with the number of CPUs. E.g. with an 8MB L3 and 2 threads each with a 8MB working set (for a total of 16MB).

I notice your test divides the memory in use (maxmem) by the number of threads. So if you have 1 thread and an 8MB working set then you have a total of 8MB. When that same test is run with 2 threads, then the per-thread working set is 4MB for a total of 8MB which will still fit nicely into an 8MB L3. Hence, you don't see the contention effects.

I haven't dug much further into the results on my machine as I have a cache line size of 64 bytes and the '-z' option appears to be broken, such that the perCacheLine value seems to always be 16 no matter what value I specify. So in all cases it will only be touching every other cache line.

Jimy said...

nice post

Bill Broadley said...
This comment has been removed by the author.
Bill Broadley said...
This comment has been removed by the author.
Bill Broadley said...

Wow, sorry for not replying for 2 years or so, I obviously dropped the ball.

Certainly I'd expect things not to scale well when you call out of cache. So certainly 2 processes needing 8MB each will not benefit much from an 8MB L3. But I think in your graph the dip at 8MB is more about alignment than some inherent problem with cache scalability. It's also very weird you aren't seeing any speedups in L1 or L2. Nor any slow down when you fall out of L3.

I've seen compilers that definitely tweak things based on the working set, or even the size of the loop. No idea if java is that smart these days. It can significantly confuse results. Some use significantly different optimizations if you dynamically allocate memory vs static allocation.

But with pure latency or bandwidth I've observed pretty much the expected performance decreasing from L1 -> main memory with 1 thread up to whatever number of CPUs are available. I have some old runs up to 512 CPUs (old SGI large NUMA systems) and more recently up to 96 or so.

One big surprise to me was that random latency kept scaling till the number of threads was twice or so the number of memory channels. So on a dual socket system with 8 channels going from 8 threads to 16 threads for random access makes a significant improvement in throughput (random accesses per second). Doubling again to 32 doesn't help though.

Although you also have to be very careful with random. Once you fall out of cache it's really easy to thrash the TLB. If that's want you want to benchmark, then it's easy. To seperate that out I use a sliding window, so each cache line is hit exactly once, but only within a fixed range. That way you can tell the difference between pure random (TLB+ram access) and sliding window random (TLB friendly, but random ram).

Huge pages can achieve similar benefits.