I've been working sporadically on the next major revision of the Disruptor, but I'm still making steady progress. I've merged my experimental branch into the main line and am now working on ensuring comprehensive test coverage and re-implementing the Disruptor DSL.
As a matter of course I've been running the performance tests to make sure we don't regress. I haven't actually been focusing on performance, just doing some refactoring and simplification, so I got a nice surprise: the new version is over twice as fast for the simple 1P1C (one publisher, one consumer) test case, approximately 85M ops/sec versus 35M ops/sec. This is on my workstation, which has an Intel(R) Xeon(R) CPU E5620 @ 2.40GHz.
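For context, the 1P1C case wires a single publisher thread to a single event handler through the ring buffer. Below is a minimal sketch of that arrangement written against a 3.x-style DSL; names such as ValueEvent and OnePublisherOneProcessorSketch are illustrative, and constructor signatures vary between Disruptor versions.

import com.lmax.disruptor.EventFactory;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;

import java.util.concurrent.Executors;

public class OnePublisherOneProcessorSketch
{
    // Simple mutable event, pre-allocated in the ring buffer slots.
    public static class ValueEvent
    {
        public long value;
        public static final EventFactory<ValueEvent> FACTORY = ValueEvent::new;
    }

    public static void main(String[] args)
    {
        Disruptor<ValueEvent> disruptor =
            new Disruptor<>(ValueEvent.FACTORY, 1024, Executors.defaultThreadFactory());

        // The "1C" side: a single event processor consuming in sequence order.
        disruptor.handleEventsWith((event, sequence, endOfBatch) ->
        {
            // process event.value here
        });

        RingBuffer<ValueEvent> ringBuffer = disruptor.start();

        // The "1P" side: claim a slot, write into it, then publish it.
        for (long i = 0; i < 1_000_000; i++)
        {
            long seq = ringBuffer.next();
            try
            {
                ringBuffer.get(seq).value = i;
            }
            finally
            {
                ringBuffer.publish(seq);
            }
        }

        disruptor.shutdown();
    }
}

The throughput tests drive many millions of events through exactly this kind of pipeline and report the rate at which the handler consumes them.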
Current released version (2.10.1)
[barkerm@snake disruptor-2]$ taskset -c 4-8,12-15 ant througput:single-test

througput:single-test:
    [junit] Running com.lmax.disruptor.OnePublisherToOneProcessorUniCastThroughputTest
    [junit] Started Disruptor run 0
    [junit] Disruptor=21,141,649 ops/sec
    [junit] Started Disruptor run 1
    [junit] Disruptor=20,597,322 ops/sec
    [junit] Started Disruptor run 2
    [junit] Disruptor=33,233,632 ops/sec
    [junit] Started Disruptor run 3
    [junit] Disruptor=32,883,919 ops/sec
    [junit] Started Disruptor run 4
    [junit] Disruptor=33,852,403 ops/sec
    [junit] Started Disruptor run 5
    [junit] Disruptor=32,819,166 ops/sec
    [junit] Started Disruptor run 6
Current trunk
[barkerm@snake disruptor]$ taskset -c 4-8,12-15 ant througput:single-test

througput:single-test:
    [junit] Running com.lmax.disruptor.OnePublisherToOneProcessorUniCastThroughputTest
    [junit] Started Disruptor run 0
    [junit] Disruptor=23,288,309 ops/sec
    [junit] Started Disruptor run 1
    [junit] Disruptor=23,573,785 ops/sec
    [junit] Started Disruptor run 2
    [junit] Disruptor=86,805,555 ops/sec
    [junit] Started Disruptor run 3
    [junit] Disruptor=87,183,958 ops/sec
    [junit] Started Disruptor run 4
    [junit] Disruptor=86,956,521 ops/sec
    [junit] Started Disruptor run 5
    [junit] Disruptor=87,260,034 ops/sec
    [junit] Started Disruptor run 6
    [junit] Disruptor=88,261,253 ops/sec
    [junit] Started Disruptor run 7
There is still more to come. In addition to the improvements for the single producer use case, the next version will include a new multiple producer algorithm that will replace the two we currently have and work better than both of them in all scenarios.
5 comments:
Awesome. Nice work, Mike!
How have you improved the performance?
Nothing specific; mostly code clean-up, refactoring and simplification. The performance boost was a surprise.
Just to confirm:
1) you are measuring the time it takes for all messages to make it to the consumer as opposed to the time it takes for the producer to get rid of all the messages, correct?
2) you are using hyper-threading, in other words, producer and consumer are pinned to the same core which makes their communication much faster through the L1 cache, correct?
Hi Chris,
1) Yes, the clock only stops once the consumer has seen the last event (see the sketch below).
2) No, I've used taskset to allow a range of CPUs, primarily to avoid running on the core that is handling OS interrupts. The actual cores that the threads run on are chosen by the OS.
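For the curious, the measurement pattern behind point 1 looks roughly like the following. This is an illustrative sketch reusing the ValueEvent from the earlier example, not the actual OnePublisherToOneProcessorUniCastThroughputTest code; the key idea is that the timer only stops once the handler has consumed the final event.

import com.lmax.disruptor.EventHandler;

import java.util.concurrent.CountDownLatch;

// Handler that signals a latch once it has consumed the expected number of
// events, so the timer measures consumption, not just publication.
public class CountingHandler implements EventHandler<ValueEvent>
{
    private final long expectedCount;
    private final CountDownLatch latch;
    private long count = 0;

    public CountingHandler(long expectedCount, CountDownLatch latch)
    {
        this.expectedCount = expectedCount;
        this.latch = latch;
    }

    @Override
    public void onEvent(ValueEvent event, long sequence, boolean endOfBatch)
    {
        if (++count == expectedCount)
        {
            latch.countDown(); // the last event has reached the consumer
        }
    }
}

// In the test body, roughly:
//   long start = System.nanoTime();
//   ... publish ITERATIONS events ...
//   latch.await();  // wait for the consumer to finish, not the producer
//   long opsPerSec = (ITERATIONS * 1_000_000_000L) / (System.nanoTime() - start);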