Tuesday 25 February 2014

Updating the Disruptor sample

Quick update to Disruptor sample

A comment on the previous blog post asked how the Disruptor sample compares to the juc Executor implementations, so I thought it was time to refresh the examples a little and, in particular, add a comparison to a thread pooled implementation (since that was the question).

The code simply sends 10,000 UDP "echo" packets; the service is expected to uppercase the characters. What I want to look at here is latency, while bringing an additional test into the mix.

This is a pretty straightforward test - just looking to identify the 99th percentile latency across a couple of scenarios. Since these are simple applications, I've included a baseline blocking IO implementation for comparison.
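
For reference, the blocking IO baseline boils down to something like the sketch below: a single thread doing a receive, uppercase, send loop on one DatagramSocket. This is a minimal illustration rather than the actual sample code; the class name and buffer size are my own assumptions.

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.nio.charset.StandardCharsets;

public class BlockingEchoServer {

    public static void main(String[] args) throws Exception {
        try (DatagramSocket socket = new DatagramSocket(9999)) {
            byte[] buffer = new byte[512];
            while (true) {
                DatagramPacket request = new DatagramPacket(buffer, buffer.length);
                socket.receive(request);                       // block until a packet arrives

                String payload = new String(request.getData(), 0, request.getLength(),
                                            StandardCharsets.US_ASCII);
                byte[] reply = payload.toUpperCase().getBytes(StandardCharsets.US_ASCII);

                // echo the uppercased payload straight back to the sender
                socket.send(new DatagramPacket(reply, reply.length,
                                               request.getAddress(), request.getPort()));
            }
        }
    }
}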

Total runtime tests

The initial test is a total runtime test. This gives a naive but indicative number for relative performance. The test runs a number of 10K UDP request-response loops.
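
To give an idea of how the numbers are gathered, the client side of the benchmark amounts to roughly the sketch below: run the 10K request-response loops against the server, record each round trip with System.nanoTime(), and report the total runtime and the 99th percentile. The class name, message and buffer size are illustrative assumptions, not the actual harness.

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EchoClientBench {

    public static void main(String[] args) throws Exception {
        InetAddress server = InetAddress.getByName(args[0]);
        int iterations = 10_000;
        long[] latencies = new long[iterations];
        byte[] request = "hello disruptor".getBytes(StandardCharsets.US_ASCII);
        byte[] response = new byte[512];

        try (DatagramSocket socket = new DatagramSocket()) {
            long runStart = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                long start = System.nanoTime();
                socket.send(new DatagramPacket(request, request.length, server, 9999));
                socket.receive(new DatagramPacket(response, response.length));
                latencies[i] = System.nanoTime() - start;      // one round-trip latency
            }
            long totalMillis = (System.nanoTime() - runStart) / 1_000_000;

            Arrays.sort(latencies);
            long p99Micros = latencies[(int) (iterations * 0.99)] / 1_000;
            System.out.printf("total runtime: %d ms, 99th percentile: %d us%n",
                              totalMillis, p99Micros);
        }
    }
}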



A single thread shows that the blocking IO send/receive loop is quicker than the Disruptor implementation, which in turn is quicker than the threadpool. But there's one issue: with more than one thread, the results are almost identical. What's going on here?

My initial assumption was that logging to file was acting as a cap on throughput, so let's remove it.

Latency with logging disabled


As you can see, similar results apply. With the sleeping wait strategy the Disruptor is only slightly slower than a blocking loop, and the threadpool is a bit further away but still close behind. The Disruptor with the busy-spin wait strategy - and the non-blocking IO in a tight loop - gives the best latency. I think this needs a bit more investigation than I have time for, so I would welcome comments.

Key observations

Suffice to say, the usual key points come up:
  • A single thread doing the processing is often the quickest approach. Avoid extra threads until there is a clear reason for them. Not only that, but the single-threaded code is significantly shorter and simpler.
  • The Disruptor is very quick for handoff between threads. In fact, the message passes between three threads nearly as quickly as in the single-threaded code.
  • Disruptor response times are very sensitive to the choice of wait strategy (there's a sketch of this below). Choose the wrong strategy and you will burn 100% CPU across multiple cores while getting worse latency.
  • A threadpool is "quick enough" for many cases. Unless your architecture and design are already very efficient, you're unlikely to notice magical performance gains here; it's probable that other areas of the codebase need optimisation first.
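
To make the wait strategy point concrete, here is a minimal sketch of how the strategy is selected when the Disruptor is built - swapping BusySpinWaitStrategy for SleepingWaitStrategy or BlockingWaitStrategy is a one-line change with a big latency/CPU trade-off. The event type and handler here are illustrative assumptions, not the actual sample code.

import com.lmax.disruptor.BusySpinWaitStrategy;
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.WaitStrategy;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;

public class WaitStrategyExample {

    // Simple mutable event carried through the ring buffer.
    public static class PacketEvent {
        byte[] payload;
    }

    public static void main(String[] args) {
        // Swap this line to compare behaviour: BusySpinWaitStrategy gives the lowest
        // latency but burns a core per waiting consumer; SleepingWaitStrategy or
        // BlockingWaitStrategy trade some latency for much lower CPU usage.
        WaitStrategy waitStrategy = new BusySpinWaitStrategy();

        Disruptor<PacketEvent> disruptor = new Disruptor<>(
                PacketEvent::new,                  // event factory
                1024,                              // ring buffer size (power of two)
                Executors.defaultThreadFactory(),
                ProducerType.SINGLE,               // one thread publishes received packets
                waitStrategy);

        // Consumer: uppercase the payload (a real handler would also send the reply).
        EventHandler<PacketEvent> handler = (event, sequence, endOfBatch) ->
                event.payload = new String(event.payload, StandardCharsets.US_ASCII)
                        .toUpperCase().getBytes(StandardCharsets.US_ASCII);
        disruptor.handleEventsWith(handler);

        disruptor.start();
        RingBuffer<PacketEvent> ringBuffer = disruptor.getRingBuffer();

        // Publish one event to show the claim/publish cycle.
        long seq = ringBuffer.next();
        try {
            ringBuffer.get(seq).payload = "hello".getBytes(StandardCharsets.US_ASCII);
        } finally {
            ringBuffer.publish(seq);
        }

        disruptor.shutdown();                      // drain outstanding events and stop
    }
}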

The test code and bench

I've tried to write the code in a fairly "natural" style for each approach, so the comparisons are not strictly apples to apples - I think they are more indicative of real-world usage. For example, the threadpool approach creates a dedicated outbound socket, whereas the single-threaded and Disruptor code can quite easily use a single outbound UDP socket to transmit all responses. I'd be happy to take pull requests.
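
For completeness, the threadpool flavour has roughly the shape sketched below: a single receive thread hands each packet to a java.util.concurrent ExecutorService, and each task replies on its own outbound socket. The pool size and the per-task socket are assumptions for illustration; the actual sample differs in detail.

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadPoolEchoServer {

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        try (DatagramSocket inbound = new DatagramSocket(9999)) {
            while (true) {
                byte[] buffer = new byte[512];
                DatagramPacket request = new DatagramPacket(buffer, buffer.length);
                inbound.receive(request);            // single receive thread

                pool.execute(() -> {
                    String payload = new String(request.getData(), 0, request.getLength(),
                                                StandardCharsets.US_ASCII);
                    byte[] reply = payload.toUpperCase().getBytes(StandardCharsets.US_ASCII);
                    // dedicated outbound socket per task, as mentioned above
                    try (DatagramSocket outbound = new DatagramSocket()) {
                        outbound.send(new DatagramPacket(reply, reply.length,
                                                         request.getAddress(), request.getPort()));
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
        }
    }
}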

Beware when running the samples that you will need a quiet network to avoid dropped UDP packets.

Running the client -> mvn clean package ; java -jar target/disruptor-1.0-SNAPSHOT.jar <<serveripaddress>> <<numclients>>
Running the server -> mvn exec:java -Dbio (or -Dthreadpool or -Ddisruptor); this will open UDP port 9999.

The code.