Friday, August 16, 2013

High Speed Concurrent Framework, Disruptor

I have gone through many articles about the Disruptor, devised by the LMAX team. It caught my eye because of this: http://martinfowler.com/articles/lmax.html

LMAX is a new retail financial trading platform. As a result it has to process many trades with low latency. The system is built on the JVM platform and centers on a Business Logic Processor that can handle 6 million orders per second on a single thread. The Business Logic Processor runs entirely in-memory using event sourcing. The Business Logic Processor is surrounded by Disruptors - a concurrency component that implements a network of queues that operate without needing locks. During the design process the team concluded that recent directions in high-performance concurrency models using queues are fundamentally at odds with modern CPU design.

Impressive, isn't it? :) But the above is a little misleading: it is not a single-threaded system. To me, the key to the Disruptor's achievement is that the team brilliantly avoids the most common multi-threading traps. Today, instead of the old model of CPU, registers and memory, a high-performance program has to deal with the CPU, registers and cache; to a modern CPU, main memory is what a hard disk was in the old days. Another factor is that CPUs now have multiple cores, and out-of-order execution will affect your program too. This great work proves once again how important the fundamentals are: data structures, thread management in the OS, and a deep understanding of the hardware.



Besides the LMAX articles, you can get insights directly from the Intel® 64 and IA-32 Architectures Optimization Reference Manual. The L1 cache comes in two types, data and instruction; the data cache is only 32 KB, and the cache line size is 64 bytes. The Wikipedia article gives a very comprehensive explanation of the cache line. Another article from Microsoft on driver development clearly states the common issues on multi-processor architectures.

Sharing Is the Root of All Contention

from Herb Sutter's Dr. Dobb's blog


Cache Line & Atomic Operations


LOCKED ATOMIC OPERATIONS

The 32-bit IA-32 processors support locked atomic operations on locations in system memory. These operations are typically used to manage shared data structures (such as semaphores, segment descriptors, system segments, or page tables) in which two or more processors may try simultaneously to modify the same field or flag. The processor uses three interdependent mechanisms for carrying out locked atomic operations:
  • Guaranteed atomic operations
  • Bus locking, using the LOCK# signal and the LOCK instruction prefix
  • Cache coherency protocols that insure that atomic operations can be carried out on cached data structures (cache lock); this mechanism is present in the Pentium 4, Intel Xeon, and P6 family processors
These mechanisms are interdependent in the following ways. Certain basic memory transactions (such as reading or writing a byte in system memory) are always guaranteed to be handled atomically. That is, once started, the processor guarantees that the operation will be completed before another processor or bus agent is allowed access to the memory location. The processor also supports bus locking for performing selected memory operations (such as a read-modify-write operation in a shared area of memory) that typically need to be handled atomically, but are not automatically handled this way. Because frequently used memory locations are often cached in a processor’s L1 or L2 caches, atomic operations can often be carried out inside a processor’s caches without asserting the bus lock. Here the processor’s cache coherency protocols insure that other processors that are caching the same memory locations are managed properly while atomic operations are performed on cached memory locations.
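In Java you never write the LOCK prefix yourself; the java.util.concurrent.atomic classes are the usual doorway to these locked instructions (a CAS typically compiles down to a LOCK CMPXCHG, and getAndIncrement to a LOCK XADD or a CAS loop, depending on the JDK and CPU). A tiny sketch, just to show the idea:

import java.util.concurrent.atomic.AtomicLong;

public class AtomicCounterDemo {

    // The JVM emits the locked instructions for us; no explicit lock anywhere.
    private static final AtomicLong counter = new AtomicLong();

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                counter.getAndIncrement();   // atomic read-modify-write
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counter.get());   // always prints 2000000
    }
}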

Memory Barrier Semantics

  • Acquire semantics mean that the results of the operation are visible before the results of any operation that appears after it in code. 
  • Release semantics mean that the results of the operation are visible after the results of any operation that appears before it in code.
  • Fence semantics combine acquire and release semantics. The results of an operation with fence semantics are visible before those of any operation that appears after it in code and after those of any operation that appears before it.
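A rough Java analogue, for illustration only: a volatile read behaves like an acquire, a volatile write like a release (plus a trailing full fence), and AtomicLong.lazySet() is a release-only store. The class and field names below are mine, not from any of the quoted sources.

import java.util.concurrent.atomic.AtomicLong;

public class BarrierSemanticsDemo {

    static int payload;                              // plain field, published via the flag
    static final AtomicLong flag = new AtomicLong(0);

    static void producer() {
        payload = 42;        // ordinary write
        flag.lazySet(1);     // store with release semantics: the payload write above
                             // cannot be reordered past this store
    }

    static void consumer() {
        // Volatile read: acquire semantics, so the payload read below cannot float above it.
        while (flag.get() == 0) { /* busy-spin */ }
        System.out.println(payload);                 // prints 42
    }

    public static void main(String[] args) throws InterruptedException {
        Thread c = new Thread(BarrierSemanticsDemo::consumer);
        c.start();
        producer();
        c.join();
    }
}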

Caching Issues

The hardware always reads an entire cache line, rather than individual data items. If you think of the cache as an array, a cache line is simply a row in that array: a consecutive block of memory that is read and cached in a single operation. The size of a cache line is generally from 16 to 128 bytes, depending on the hardware; 

Each cache line has one of the following states:
  • Exclusive, meaning that this data does not appear in any other processor’s cache. When a cache line enters the Exclusive state, the data is purged from any other processor’s cache.
  • Shared, meaning that another cache line has requested the same data.
  • Invalid, meaning that another processor has changed the data in the line.
  • Modified, meaning that the current processor has changed the data in this line.
All architectures on which Windows runs guarantee that every processor in a multiprocessor configuration will return the same value for any given memory location. This guarantee, which is called cache coherency between processors, ensures that whenever data in one processor’s cache changes, all other caches that contain the same data will be updated. On a single-processor system, whenever the required memory location is not in the cache, the hardware must reload it from memory. On a multiprocessor system, if the data is not in the current processor’s cache, the hardware can read it from main memory or request it from other processors’ caches. If the processor then writes a new value to that location, all other processors must update their caches to get the latest data.
Some data structures have a high locality of reference. This means that the structure often appears in a sequence of instructions that reference adjacent fields. If a structure has a high locality of reference and is protected by a lock, it should typically be in its own cache line.
For example, consider a large data structure that is protected by a lock and that contains both a pointer to a data item and a flag indicating the status of that data item. If the structure is laid out so that both fields are in the same cache line, any time the driver updates one variable, the other variable is already present in the cache and can be updated immediately.

In contrast, consider another scenario. What happens if two data structures in the same cache line are protected by two different locks and are accessed simultaneously from two different processors? Processor 0 updates the first structure, causing the cache line in Processor 0 to be marked Exclusive and the data in that line to be purged from other processors’ caches. Processor 1 must request the data from Processor 0 and wait until its own cache is updated before it can update the second structure. If Processor 0 again tries to write the first structure, it must request the data from Processor 1, wait until the cache is updated, and so on. However, if the structures are not on the same cache line, neither processor must wait for these cache updates. Therefore, two data structures that can be accessed simultaneously on two different processors (because they are not protected by the same lock) should be on different cache lines.
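The same ping-pong is easy to provoke from Java. In this hypothetical sketch (the class and field names are mine), two threads update two logically independent counters that almost certainly sit on the same 64-byte cache line:

public class SameCacheLineDemo {

    // Two independent counters; as adjacent long fields they will
    // almost certainly share one cache line.
    static class Counters {
        volatile long a;
        volatile long b;
    }

    public static void main(String[] args) throws InterruptedException {
        final Counters c = new Counters();
        Thread t1 = new Thread(() -> { for (long i = 0; i < 50_000_000L; i++) c.a++; });
        Thread t2 = new Thread(() -> { for (long i = 0; i < 50_000_000L; i++) c.b++; });
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        // The threads never touch the same variable, yet the run is far slower than
        // two single-threaded loops because the line bounces between the cores.
        System.out.println((System.nanoTime() - start) / 1_000_000 + " ms");
    }
}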

Mechanical Sympathy

The first thing is to understand the CPU, your rice bowl: Mechanical Sympathy.

The following is from http://www.infoq.com/presentations/LMAX

Don't use locks

This traps your program on the bad side of Amdahl's law. A lock causes execution context switching, ring3 -> ring0 -> ring3... Refer to Trisha's blog for more detail. But how do you avoid locks in a multi-threaded environment? The idea is: don't share the data. A shared resource is the only reason a lock needs to exist, which means you have to consider data segregation for a high-performance system. Their whole ring buffer design revolves around this point, as sketched below.
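Here is a minimal sketch of the single-writer idea, assuming one dedicated writer thread; it is not the Disruptor's actual code. Because only one thread ever writes the sequence, the increment needs neither a lock nor a CAS; the new value is simply published with a release store for the readers.

import java.util.concurrent.atomic.AtomicLong;

public class SingleWriterSequence {

    private final AtomicLong sequence = new AtomicLong(-1);

    // Called by the one and only writer thread.
    public long next() {
        long next = sequence.get() + 1;  // safe: nobody else ever writes this field
        sequence.lazySet(next);          // release store, cheaper than a full volatile write
        return next;
    }

    // Called by any number of reader threads.
    public long get() {
        return sequence.get();           // volatile read
    }
}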

Don't copy data around for inter-thread communication

Their ring buffer is just like an infinite array, and the index works like a reference pointer, so objects aren't copied around and no dynamic memory management is involved. That is the most common wasteful habit of OO programmers today: always new Object(). I remember trying a similar idea before in my C++ project.
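A toy version of that idea, not the real Disruptor API (it assumes a single writer and leaves out the sequence barriers that coordinate producers and consumers): allocate every slot once up front, then reuse the slots by copying values into them, so the hot path allocates nothing.

public class PreallocatedRingBuffer {

    static final class Slot {        // mutable, reusable entry
        long value;
    }

    private final Slot[] entries;
    private final int mask;          // size is a power of two, so index = sequence & mask
    private long sequence = -1;      // owned by the single writer

    public PreallocatedRingBuffer(int sizePowerOfTwo) {
        entries = new Slot[sizePowerOfTwo];
        mask = sizePowerOfTwo - 1;
        for (int i = 0; i < sizePowerOfTwo; i++) {
            entries[i] = new Slot();     // all allocation happens once, up front
        }
    }

    // Publish by writing INTO an existing slot; no allocation, no copying of objects.
    public long publish(long value) {
        long next = ++sequence;
        entries[(int) (next & mask)].value = value;
        return next;
    }

    public long read(long seq) {
        return entries[(int) (seq & mask)].value;
    }
}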


Some valuable points from Trisha's blog http://mechanitis.blogspot.sg/2011/07/dissecting-disruptor-why-its-so-fast_22.html

Martin and Mike's QCon presentation gives some indicative figures for the cost of cache misses:

Latency from CPU to...           Approx. number of CPU cycles    Approx. time in nanoseconds
Main memory                      -                               ~60-80 ns
QPI transit (between sockets)    -                               ~20 ns
L3 cache                         ~40-45 cycles                   ~15 ns
L2 cache                         ~10 cycles                      ~3 ns
L1 cache                         ~3-4 cycles                     ~1 ns
Register                         1 cycle                         -


Cache Line


Volatile = Memory Barrier




This means if you write to a volatile field, you know that:
  • Any thread accessing that field after the point at which you wrote to it will get the updated value.
  • Anything you did before you wrote that field is guaranteed to have happened, and any updated data values will also be visible, because the memory barrier flushed all earlier writes to the cache.
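In Java terms this is the classic volatile publication pattern; a small self-contained sketch (names are mine):

public class VolatilePublication {

    static int data;                   // plain field, written before the flag
    static volatile boolean ready;     // the volatile field that carries the barrier

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!ready) { /* spin on the volatile read */ }
            System.out.println(data);  // guaranteed to print 42
        });
        reader.start();

        data = 42;       // happens-before the volatile write below
        ready = true;    // volatile write: everything written earlier becomes visible

        reader.join();
    }
}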

False Sharing 

from Herb Sutter's Dr. Dobb's blog
The general case to watch out for is when you have two objects or fields that are frequently accessed (either read or written) by different threads, at least one of the threads is doing writes, and the objects are so close in memory that they're on the same cache line because they are:


  • objects nearby in the same array
  • fields nearby in the same object
  • objects allocated close together in time (C++, Java) or by the same thread (C#, Java)
  • static or global objects that the linker decided to lay out close together in memory;
  • objects that become close in memory dynamically, as when during compacting garbage collection two objects can become adjacent in memory because intervening objects became garbage and were collected; or
  • objects that for some other reason accidentally end up close together in memory.

First, we can reduce the number of writes to the cache line. For example, writer threads can write intermediate results to a scratch variable most of the time, then update the variable in the popular cache line only occasionally as needed. This is the approach we took in Example 2, where we changed the code to update a local variable frequently and write into the popular result array only once per worker to store its final count.
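Translated into a hypothetical Java worker (the names are mine, not from the article), the first approach looks like this: accumulate in a local variable and touch the shared array only once at the end.

public class LocalAccumulation {

    static final long[] results = new long[16];   // shared, "popular" array

    // Each worker counts into a LOCAL variable and writes to the shared
    // cache line only once, at the very end.
    static void worker(int index, int[] input) {
        long count = 0;                  // lives in a register / on the stack, never shared
        for (int v : input) {
            if (v % 2 == 0) count++;
        }
        results[index] = count;          // single write to the popular line
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        worker(0, data);
        System.out.println(results[0]);  // 500000
    }
}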

Second, we can separate the variables so that they aren't on the same cache line. Typically the easiest way to do this is to ensure an object has a cache line to itself that it doesn't share with any other data. To achieve that, you need to do two things:
  • Ensure that no other object can precede your data in the same cache line by aligning it to begin at the start of the cache line or adding sufficient padding bytes before the object.
  • Ensure that no other object can follow your data in the same cache line by adding sufficient padding bytes after the object to fill up the line.
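In Java, which had no portable alignment control before the (internal) @Contended annotation in JDK 8, the classic trick, used by early Disruptor versions among others, is to pad the hot field with unused longs. The JVM may still reorder fields, so treat this as a best-effort sketch:

public class PaddedCounter {

    // 7 longs (56 bytes) on each side of the hot field so that `value` cannot
    // share its 64-byte cache line with any neighbouring hot data.
    @SuppressWarnings("unused")
    private long p1, p2, p3, p4, p5, p6, p7;
    private volatile long value;
    @SuppressWarnings("unused")
    private long q1, q2, q3, q4, q5, q6, q7;

    public long get()       { return value; }
    public void set(long v) { value = v; }
}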

References

Java Memory Model
