Tuesday 25 February 2014

Updating the Disruptor sample

Quick update to Disruptor sample

A comment on the previous post asked how the Disruptor compares to the j.u.c. Executor implementations, so I thought it was time to refresh the sample a little and add some further examples, in particular a thread-pooled implementation (since that was the question).

The code simply sends 10,000 UDP "echo" packets, and the service is expected to reply with the characters uppercased. What I want to look at here is latency, while bringing an additional test into the mix.
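
As a rough illustration of the baseline, here is a minimal sketch of a blocking UDP echo loop that uppercases each payload. It is not the code from the repository: the class name, the helper names, the 1500-byte buffer and the ASCII-only payload are all assumptions for illustration. Port 9999 is the one the server opens.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical sketch of a blocking "uppercase echo" service, not the repository code.
public class BlockingUdpEcho {

    // Uppercase the payload; assumes the request is plain ASCII text.
    static byte[] toUpper(byte[] data, int offset, int length) {
        String text = new String(data, offset, length, StandardCharsets.US_ASCII);
        return text.toUpperCase(Locale.ROOT).getBytes(StandardCharsets.US_ASCII);
    }

    // Receive one datagram, uppercase it and send it back to the sender.
    static void serveOne(DatagramSocket socket) throws IOException {
        byte[] buffer = new byte[1500]; // assumed maximum request size
        DatagramPacket request = new DatagramPacket(buffer, buffer.length);
        socket.receive(request); // blocks until a packet arrives
        byte[] reply = toUpper(request.getData(), request.getOffset(), request.getLength());
        socket.send(new DatagramPacket(reply, reply.length, request.getSocketAddress()));
    }

    public static void main(String[] args) throws IOException {
        // A single thread does receive, process and send in one loop.
        try (DatagramSocket socket = new DatagramSocket(9999)) {
            while (true) {
                serveOne(socket);
            }
        }
    }
}
```

Everything happens on one thread with no handoff, which is why this form makes a useful baseline for the other two approaches.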

The test is pretty straightforward: it just looks to identify 99th percentile latency across a couple of scenarios. Since these are simple applications, I've included a baseline blocking IO implementation for comparison.
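
For readers unfamiliar with the metric, here is a small sketch of how a 99th percentile can be derived from recorded round-trip times. It uses the nearest-rank definition, which is an assumption; the actual client may compute it differently. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper, not taken from the sample code.
public class LatencyPercentile {

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        long[] rttNanos = new long[10_000];
        for (int i = 0; i < rttNanos.length; i++) {
            long start = System.nanoTime();
            // A real client would send a request and wait for the reply here.
            rttNanos[i] = System.nanoTime() - start;
        }
        System.out.println("99th percentile (ns): " + percentile(rttNanos, 99.0));
    }
}
```

With 10,000 samples, the 99th percentile under this definition is the 9,900th smallest round-trip time, so the slowest 100 requests do not affect it.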

Total runtime tests

The initial test measures total runtime. This is a naive, indicative number for relative performance. Each test runs a number of 10K UDP request-response loops.



With a single thread, the blocking IO send/receive loop is quicker than the Disruptor implementation, which is quicker than the threadpool. But there is one issue: with more than one thread, the results are almost identical. What's going on here?

My initial assumption was that logging to file was capping throughput, so let's remove it.

Latency with logging disabled


As you can see, similar results apply. Disruptor is only slightly slower than a blocking loop in the sleeping view. The threadpool is a bit further away but still close behind. Disruptor with a busy-spin wait strategy, and the non-blocking IO in a tight loop, give the best latency. I think this needs more investigation than I have time for, but I would welcome comments.

Key observations

Suffice to say, the same key points come up as usual:
  • Single-threaded processing is often the quickest approach. Avoid threads until there is a clear reason to use them. Not only that, but the code is significantly shorter and simpler.
  • Disruptor is very quick for handoff between threads. In fact, the message passes between three threads nearly as quickly as in the single-threaded code.
  • Disruptor response times are very sensitive to the choice of wait strategy. By choosing the wrong strategy, you will burn 100% CPU across multiple cores while getting worse latency.
  • Threadpool is "quick enough" for many cases. Unless your architecture and design are already very efficient, you're unlikely to notice magical performance gains from swapping it out; it's probable that other areas of the codebase need optimising first.
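
To make the threadpool point concrete, here is a minimal sketch of handing each request to a j.u.c. ExecutorService and collecting the replies. It leaves out the UDP sockets entirely and is not the repository's threadpool implementation; the class name, method name and pool size are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of a thread pool handoff, without any network IO.
public class ThreadPoolHandoff {

    // Submit one task per request and wait for the replies in submission order.
    static List<String> process(List<String> requests, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String request : requests) {
                // The "business logic" is the same uppercase step the echo service performs.
                Callable<String> task = () -> request.toUpperCase(Locale.ROOT);
                pending.add(pool.submit(task));
            }
            List<String> replies = new ArrayList<>();
            for (Future<String> future : pending) {
                replies.add(future.get()); // blocks until that task has finished
            }
            return replies;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(process(Arrays.asList("ping", "pong"), 4));
    }
}
```

Each submit is a handoff through the pool's internal queue, which is the cost being compared against the Disruptor's ring buffer in the tests above.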

The test code and bench

I've tried to write the code in a fairly "natural" style for each approach, so the comparisons are not strictly apples to apples; I think they are more indicative of real-world usage. For example, the threadpool approach creates a dedicated outbound socket, whereas the single-threaded and Disruptor code can quite easily share a single outbound UDP socket to transmit all responses. I'd be happy to take pull requests.

When running the samples, be aware that you will need a quiet network to avoid dropped UDP packets.

Running the client -> mvn clean package ; java -jar target/disruptor-1.0-SNAPSHOT.jar <<serveripaddress>> <<numclients>>
Running the server -> mvn exec:java -Dbio (or -Dthreadpool or -Ddisruptor), will open UDP port 9999.

The code.

4 comments:

  1. Hi Jason, I had a look at the source and see you are running 3 different disruptors (or did I miss something?). This way it is degraded to a queue. You should get much lower latencies using an assembly-line style Disruptor (= 1 Disruptor instance and 3 event processors, using processing dependencies to ensure correct processing order).

  2. Hi Rudiger, I see what you mean. Actually, there are only two Disruptor instances in use in this case. The reason the example is set up this way is that the middle "business logic" handler has to create new data that it's going to send. If we have the business logic thread writing back into the first ring buffer, then we end up with multiple writers on the single queue, which is likely to cause contention. In other words, with a single ring, the UDP packet reader writes into the buffer, then another thread modifies that buffer in place, then the third thread reads from the buffer. Since the UDP packet reader and the "business logic" both write to the buffer, this violates the single-writer principle. Diagram is here - http://fasterjava.blogspot.com.au/2013/04/disruptor-example-udp-echo-service-with.html

    Does it make sense or have I overlooked something? I haven't actually tested the cost of this in terms of latency, which would be an interesting exercise.

  3. At least in Disruptor 3.x there is a multiple-writer option. I was just wondering, since in my experience latency should be <200 microseconds without applying advanced magic. Pure socket UDP is like 10-30 micros.

  4. Interesting, I will definitely look into this. This is a remote host, not the same host, and old-ish hardware, but I will try a few things and see if I can isolate a cause for the long RTTs.
