Monday, August 26, 2013

Truth Becomes Clear Only Through Debate

Over the past two days I followed the live text transcript of Bo Xilai's trial, and I have to admire his skill. I have never particularly liked him, and after he launched the "sing red" campaign I especially disliked his showmanship. Yet this trial made me see him in a new light. After reading his testimony and that of the witnesses, I felt an inexplicable sadness. What kind of mental state must these people live in every day? In the minute before falling asleep, after all the wining and dining, do they feel a sense of desolation? The office is full of scheming, and home is full of traps as well. Is that kind of life worth living?

I believe he is a very capable man. Is bringing him down like this, together with all his highly controversial "sing red, strike black" campaigns of the past, really just his personal fault? In my view, his ending was sealed the moment he entered a system that answers only to superiors and to absolute power. If grandstanding from below could change the minds of those at the top, it would not be the China of today.

In an officialdom where corruption is this rampant, if all he really took was 5 million in public funds and one villa, then for someone of his rank he was actually remarkably clean!!!



Friday, August 16, 2013

High Speed Concurrent Framework, Disruptor

I have gone through many articles about the Disruptor, devised by the LMAX team. It caught my eye because of this article: http://martinfowler.com/articles/lmax.html

LMAX is a new retail financial trading platform. As a result it has to process many trades with low latency. The system is built on the JVM platform and centers on a Business Logic Processor that can handle 6 million orders per second on a single thread. The Business Logic Processor runs entirely in-memory using event sourcing. The Business Logic Processor is surrounded by Disruptors - a concurrency component that implements a network of queues that operate without needing locks. During the design process the team concluded that recent directions in high-performance concurrency models using queues are fundamentally at odds with modern CPU design.

Impressive, isn't it? :) But the above is a little misleading: it is not a single-threaded system. To me, the key to the Disruptor's achievement is that the team brilliantly avoids the most common multi-threading traps. Today, instead of the old model of CPU, registers and memory, a high-performance program has to deal with the CPU, registers and cache; to a modern CPU, main memory is what a hard disk was in the old days. On top of that, CPUs now have multiple cores, and out-of-order execution will affect your program too. This great work proves once again how important the fundamentals are: data structures, thread management in the OS, and a deep understanding of the hardware.



Besides the LMAX articles, you can get insights directly from the Intel® 64 and IA-32 Architectures Optimization Reference Manual. The L1 cache comes in two types, data and instruction; the data cache is only 32 KB, and the cache line size is 64 bytes. The Wikipedia article provides a very comprehensive explanation of the cache line. Another article from Microsoft on driver development clearly states the common issues in multi-processor architectures.

Sharing Is the Root of All Contention

from Herb Sutter's Dr. Dobb's blog


Cache Line & Atomic Operations


LOCKED ATOMIC OPERATIONS

The 32-bit IA-32 processors support locked atomic operations on locations in system memory. These operations are typically used to manage shared data structures (such as semaphores, segment descriptors, system segments, or page tables) in which two or more processors may try simultaneously to modify the same field or flag. The processor uses three interdependent mechanisms for carrying out locked atomic operations:
  • Guaranteed atomic operations
  • Bus locking, using the LOCK# signal and the LOCK instruction prefix
  • Cache coherency protocols that insure that atomic operations can be carried out on cached data structures (cache lock); this mechanism is present in the Pentium 4, Intel Xeon, and P6 family processors
These mechanisms are interdependent in the following ways. Certain basic memory transactions (such as reading or writing a byte in system memory) are always guaranteed to be handled atomically. That is, once started, the processor guarantees that the operation will be completed before another processor or bus agent is allowed access to the memory location. The processor also supports bus locking for performing selected memory operations (such as a read-modify-write operation in a shared area of memory) that typically need to be handled atomically, but are not automatically handled this way. Because frequently used memory locations are often cached in a processor’s L1 or L2 caches, atomic operations can often be carried out inside a processor’s caches without asserting the bus lock. Here the processor’s cache coherency protocols insure that other processors that are caching the same memory locations are managed properly while atomic operations are performed on cached memory locations.
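
As a side note (my own illustration, not from the manual): in Java these guarantees surface through the java.util.concurrent.atomic classes, which the JIT typically compiles down to LOCK-prefixed instructions such as lock xadd or lock cmpxchg on x86, so for hot, cached data you usually get the cache-lock path described above. A minimal sketch:

import java.util.concurrent.atomic.AtomicLong;

// Two threads increment a shared counter without an explicit lock.
// getAndIncrement() is an atomic read-modify-write backed by the
// processor's locked atomic operations (bus lock or cache lock).
public class AtomicCounterDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicLong counter = new AtomicLong();

        Runnable work = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                counter.getAndIncrement();     // atomic read-modify-write
            }
        };

        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join();  t2.join();

        System.out.println(counter.get());     // always 2000000, never a lost update
    }
}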

Memory Barrier Semantics

  • Acquire semantics mean that the results of the operation are visible before the results of any operation that appears after it in code. 
  • Release semantics mean that the results of the operation are visible after the results of any operation that appears before it in code.
  • Fence semantics combine acquire and release semantics. The results of an operation with fence semantics are visible before those of any operation that appears after it in code and after those of any operation that appears before it.
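
The Java analogue of these semantics is easiest to see in the VarHandle API (JDK 9+, so it post-dates this post; in 2013 you would use volatile or Unsafe instead). A minimal sketch, with names of my own choosing:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Release/acquire publication: the writer publishes 'payload' by storing the
// flag with release semantics; once the reader's acquire load sees true, it is
// guaranteed to also see payload == 42.
public class AcquireReleaseDemo {
    static int payload;
    static boolean ready;
    static final VarHandle READY;

    static {
        try {
            READY = MethodHandles.lookup()
                    .findStaticVarHandle(AcquireReleaseDemo.class, "ready", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            payload = 42;               // plain write
            READY.setRelease(true);     // release store: earlier writes become visible first
        });
        Thread reader = new Thread(() -> {
            while (!(boolean) READY.getAcquire()) {   // acquire load
                Thread.onSpinWait();
            }
            System.out.println(payload);              // prints 42
        });
        reader.start();
        writer.start();
        writer.join();
        reader.join();
    }
}

VarHandle.fullFence() gives the combined fence semantics described in the last bullet.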

Caching Issues

The hardware always reads an entire cache line, rather than individual data items. If you think of the cache as an array, a cache line is simply a row in that array: a consecutive block of memory that is read and cached in a single operation. The size of a cache line is generally from 16 to 128 bytes, depending on the hardware; 

Each cache line has one of the following states:
  • Exclusive, meaning that this data does not appear in any other processor’s cache. When a cache line enters the Exclusive state, the data is purged from any other processor’s cache.
  • Shared, meaning that another cache line has requested the same data.
  • Invalid, meaning that another processor has changed the data in the line.
  • Modified, meaning that the current processor has changed the data in this line.
All architectures on which Windows runs guarantee that every processor in a multiprocessor configuration will return the same value for any given memory location. This guarantee, which is called cache coherency between processors, ensures that whenever data in one processor’s cache changes, all other caches that contain the same data will be updated. On a single-processor system, whenever the required memory location is not in the cache, the hardware must reload it from memory. On a multiprocessor system, if the data is not in the current processor’s cache, the hardware can read it from main memory or request it from other processors’ caches. If the processor then writes a new value to that location, all other processors must update their caches to get the latest data.
Some data structures have a high locality of reference. This means that the structure often appears in a sequence of instructions that reference adjacent fields. If a structure has a high locality of reference and is protected by a lock, it should typically be in its own cache line.
For example, consider a large data structure that is protected by a lock and that contains both a pointer to a data item and a flag indicating the status of that data item. If the structure is laid out so that both fields are in the same cache line, any time the driver updates one variable, the other variable is already present in the cache and can be updated immediately.

In contrast, consider another scenario. What happens if two data structures in the same cache line are protected by two different locks and are accessed simultaneously from two different processors? Processor 0 updates the first structure, causing the cache line in Processor 0 to be marked Exclusive and the data in that line to be purged from other processors’ caches. Processor 1 must request the data from Processor 0 and wait until its own cache is updated before it can update the second structure. If Processor 0 again tries to write the first structure, it must request the data from Processor 1, wait until the cache is updated, and so on. However, if the structures are not on the same cache line, neither processor must wait for these cache updates. Therefore, two data structures that can be accessed simultaneously on two different processors (because they are not protected by the same lock) should be on different cache lines.

Mechanical Sympathy

The first thing is to understand the CPU, your rice bowl: Mechanical Sympathy.

The following is from http://www.infoq.com/presentations/LMAX

Don't use locks

This will trap your program in Amdahl's Law (on the bad side, of course). A lock causes execution context switches, ring3 -> ring0 -> ring3... Refer to Trisha's blog for more detail. But how do you avoid locks in a multi-threaded environment? The idea is: don't share the data. A shared resource is the very reason a lock exists, which means you need to consider data segregation for a high-performance system. Their whole ring buffer design is built around this point.
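
A minimal sketch of the "don't share the data" idea (my own illustration, not LMAX code): each worker owns its own slot and is the only thread that ever writes it, so no lock is needed; results are combined only after join(), which provides the required happens-before edge.

// Data segregation instead of locking: one slot per thread, no contention.
public class SegregatedCounters {
    public static void main(String[] args) throws InterruptedException {
        final int workers = 4;
        final long[] counts = new long[workers];    // slot i is written only by thread i

        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            final int slot = i;
            threads[i] = new Thread(() -> {
                for (int n = 0; n < 1_000_000; n++) {
                    counts[slot]++;                 // single-writer, lock-free
                }
            });
            threads[i].start();
        }

        long total = 0;
        for (int i = 0; i < workers; i++) {
            threads[i].join();                      // happens-before: safe to read counts[i]
            total += counts[i];
        }
        System.out.println(total);                  // 4000000
    }
}

(Adjacent longs in the array do share a cache line, which is exactly the false-sharing problem discussed further down.)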

Don't copy data around for inter-thread communication

Their ring buffer is just like an infinite array, and the sequence index is just like a reference pointer. Objects are not copied around and no dynamic memory management is involved. Constantly calling new Object() is the most common wasteful habit of OO programmers today. I remember trying a similar idea before in my own C++ project.
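
A toy sketch of that idea (a simplified single-producer/single-consumer ring, not the real Disruptor API): the entry objects are allocated once up front, the two threads exchange only a long sequence number, and entries are mutated in place rather than re-allocated.

// Slots are preallocated; producer and consumer hand each other sequence
// numbers only, so no object is copied and new is never called on the hot path.
// No wrap-around/backpressure check is shown; this is an illustration only.
public class TinyRing {
    static final class Event { long value; }        // reused slot payload

    private final Event[] slots;
    private final int mask;
    private volatile long producerSeq = -1;         // last published sequence
    private long consumerSeq = -1;                  // owned by the consumer thread

    TinyRing(int sizePowerOfTwo) {
        slots = new Event[sizePowerOfTwo];
        for (int i = 0; i < slots.length; i++) {
            slots[i] = new Event();                 // allocate once, reuse forever
        }
        mask = sizePowerOfTwo - 1;
    }

    // Producer: fill the next slot in place, then publish its sequence.
    void publish(long value) {
        long next = producerSeq + 1;
        slots[(int) (next & mask)].value = value;   // mutate the existing object
        producerSeq = next;                         // volatile store publishes it
    }

    // Consumer: wait until a new sequence is visible, then read the slot in place.
    long take() {
        long next = consumerSeq + 1;
        while (producerSeq < next) { /* busy-spin */ }
        consumerSeq = next;
        return slots[(int) (next & mask)].value;
    }
}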


Some valuable points from Trisha's blog http://mechanitis.blogspot.sg/2011/07/dissecting-disruptor-why-its-so-fast_22.html

Martin and Mike's QCon presentation gives some indicative figures for the cost of cache misses:

Latency from CPU to...           Approx. number of CPU cycles   Approx. time in nanoseconds
Main memory                      -                              ~60-80 ns
QPI transit (between sockets)    -                              ~20 ns
L3 cache                         ~40-45 cycles                  ~15 ns
L2 cache                         ~10 cycles                     ~3 ns
L1 cache                         ~3-4 cycles                    ~1 ns
Register                         1 cycle                        -


Cache Line


Volatile = Memory Barrier




This means if you write to a volatile field, you know that:
  • Any thread accessing that field after the point at which you wrote to it will get the updated value.
  • Anything you did before you wrote that field is guaranteed to have happened and any updated data values will also be visible, because the memory barrier flushed all earlier writes to the cache.
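
A small sketch of that guarantee in plain Java (class and field names are mine):

// The reader can never observe started == true while config is still null,
// because the write to the volatile field acts as the barrier described above.
public class VolatilePublish {
    static String config;                  // plain field
    static volatile boolean started;       // volatile flag

    public static void main(String[] args) {
        new Thread(() -> {
            config = "loaded";             // happens-before the volatile write below
            started = true;                // volatile write flushes earlier writes
        }).start();

        new Thread(() -> {
            while (!started) { /* spin on the volatile read */ }
            System.out.println(config);    // guaranteed to print "loaded"
        }).start();
    }
}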

False Sharing 

from Herb Sutter's Dr. Dobb's blog
The general case to watch out for is when you have two objects or fields that are frequently accessed (either read or written) by different threads, at least one of the threads is doing writes, and the objects are so close in memory that they're on the same cache line because they are:


  • objects nearby in the same array
  • fields nearby in the same object
  • objects allocated close together in time (C++, Java) or by the same thread (C#, Java)
  • static or global objects that the linker decided to lay out close together in memory;
  • objects that become close in memory dynamically, as when during compacting garbage collection two objects can become adjacent in memory because intervening objects became garbage and were collected; or
  • objects that for some other reason accidentally end up close together in memory.

First, we can reduce the number of writes to the cache line. For example, writer threads can write intermediate results to a scratch variable most of the time, then update the variable in the popular cache line only occasionally as needed. This is the approach we took in Example 2, where we changed the code to update a local variable frequently and write into the popular result array only once per worker to store its final count.

Second, we can separate the variables so that they aren't on the same cache line. Typically the easiest way to do this is to ensure an object has a cache line to itself that it doesn't share with any other data. To achieve that, you need to do two things:
  • Ensure that no other object can precede your data in the same cache line by aligning it to begin at the start of the cache line or adding sufficient padding bytes before the object.
  • Ensure that no other object can follow your data in the same cache line by adding sufficient padding bytes after the object to fill up the line.
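
Java gives no direct control over alignment, but the usual trick, similar in spirit to what the Disruptor does for its sequence counters, is to surround the hot field with padding fields so it ends up on its own 64-byte line. A hedged sketch (JDK 8 also offers @sun.misc.Contended with -XX:-RestrictContended, but that is a JVM-internal annotation):

// Manual padding: the long fields before and after 'value' make it unlikely
// that another hot variable lands on the same 64-byte cache line.
// Note: the JVM may reorder fields within one class, which is why production
// implementations usually spread the padding across a class hierarchy instead.
public class PaddedCounter {
    @SuppressWarnings("unused")
    private long p1, p2, p3, p4, p5, p6, p7;   // left padding (7 x 8 bytes)

    private volatile long value;               // the hot, contended field

    @SuppressWarnings("unused")
    private long q1, q2, q3, q4, q5, q6, q7;   // right padding

    public long get()       { return value; }
    public void set(long v) { value = v; }
}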

References

Java Memory Model

Simple words always inspire people

Brian Goetz, http://mail.openjdk.java.net/pipermail/lambda-dev/2013-March/008435.html

When evaluating a language feature, you need to examine both the cost and the benefit side of the proposal.

Benefit: how would having this feature enable me to write code that is better than what I can write today.

Cost: how would having this feature enable other people to write WORSE code than they might write today.

I like this! When considering things or making decisions, we should always remind ourselves of the "Cost". The "Benefit" is just the lure that leads us into mistakes.

Wednesday, August 14, 2013

How HTTPS works, HTTP Tunneling & WebSocket

HTTPS

I finally understand how it works. HTTPS is just HTTP on top of SSL/TLS; it isn't a separate protocol at all. A web proxy is really just an HTTP proxy, and its working flow is:

Request message 
Client -> Proxy -> Server

Response message
Client <- Proxy <- Server

Because HTTP messages are just clear text, the proxy is able to cache the data when the same request is repeated. This is clearly defined in the HTTP protocol.

The interesting part is HTTPS. I mistakenly believed it behaves much like HTTP through a proxy, but in fact it does not. With HTTPS the HTTP message is wrapped inside SSL, so the proxy cannot inspect or cache it at all; instead it relies on HTTP tunneling (http://en.wikipedia.org/wiki/HTTP_tunnel, http://tools.ietf.org/html/draft-luotonen-web-proxy-tunneling-01).


CLIENT -> SERVER                        SERVER -> CLIENT
--------------------------------------  -----------------------------------
CONNECT home.netscape.com:443 HTTP/1.0
User-agent: Mozilla/4.0
<<< empty line >>>
                                        HTTP/1.0 200 Connection established
                                        Proxy-agent: Netscape-Proxy/1.1
                                        <<< empty line >>>
              <<< data tunneling to both directions begins >>>

From the above, :) we can easily tunnel any protocol through the proxy, such as SSH.
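
A minimal sketch of issuing that CONNECT handshake yourself from Java (the proxy host and port are placeholders; the target reuses the example above). After the 200 response the socket is a raw byte pipe to the target, so any protocol can ride on it:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Ask an HTTP proxy to CONNECT to a target host:port, then use the same
// socket as a plain tunnel. Response parsing is deliberately crude here;
// real code must not over-read past the blank line that ends the headers.
public class ConnectTunnel {
    public static void main(String[] args) throws Exception {
        String proxyHost = "proxy.example.com";          // placeholder
        int proxyPort = 8080;                            // placeholder
        String target = "home.netscape.com:443";         // the example from the draft above

        try (Socket socket = new Socket(proxyHost, proxyPort)) {
            OutputStream out = socket.getOutputStream();
            out.write(("CONNECT " + target + " HTTP/1.0\r\n"
                     + "User-Agent: tunnel-demo\r\n"
                     + "\r\n").getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // Read the proxy's reply byte by byte until the empty line that ends the headers.
            InputStream in = socket.getInputStream();
            StringBuilder reply = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                reply.append((char) c);
                if (reply.toString().endsWith("\r\n\r\n")) break;
            }
            System.out.print("Proxy replied:\n" + reply); // expect "HTTP/1.0 200 Connection established"

            // From here on, in/out talk directly to the target:
            // data tunnels in both directions.
        }
    }
}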

WebSocket

The interesting part is that WebSocket (http://www.ietf.org/rfc/rfc6455.txt) also relies on the HTTP CONNECT method when it needs to pass through a proxy.

URL format

 ws-URI = "ws:" "//" host [ ":" port ] path [ "?" query ]
 wss-URI = "wss:" "//" host [ ":" port ] path [ "?" query ]

Handshake

client request:

        GET /chat HTTP/1.1
        Host: server.example.com
        Upgrade: websocket
        Connection: Upgrade
        Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
        Origin: http://example.com
        Sec-WebSocket-Protocol: chat, superchat
        Sec-WebSocket-Version: 13

server response

        HTTP/1.1 101 Switching Protocols
        Upgrade: websocket
        Connection: Upgrade
        Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
        Sec-WebSocket-Protocol: chat
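
The Sec-WebSocket-Accept value above is not arbitrary: RFC 6455 says the server appends the fixed GUID 258EAFA5-E914-47DA-95CA-C5AB0DC85B11 to the client's Sec-WebSocket-Key, takes the SHA-1 hash, and Base64-encodes the digest. A quick check in Java (using JDK 8's java.util.Base64):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

// Reproduces the Sec-WebSocket-Accept value from the handshake above:
// Base64( SHA-1( Sec-WebSocket-Key + fixed RFC 6455 GUID ) )
public class WebSocketAccept {
    private static final String GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11";

    public static void main(String[] args) throws Exception {
        String key = "dGhlIHNhbXBsZSBub25jZQ==";    // value from the sample request
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest((key + GUID).getBytes(StandardCharsets.US_ASCII));
        System.out.println(Base64.getEncoder().encodeToString(digest));
        // prints: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
    }
}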

Message Frame

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-------+-+-------------+-------------------------------+
     |F|R|R|R| opcode|M| Payload len |    Extended payload length    |
     |I|S|S|S|  (4)  |A|     (7)     |             (16/64)           |
     |N|V|V|V|       |S|             |   (if payload len==126/127)   |
     | |1|2|3|       |K|             |                               |
     +-+-+-+-+-------+-+-------------+ - - - - - - - - - - - - - - - +
     |     Extended payload length continued, if payload len == 127  |
     + - - - - - - - - - - - - - - - +-------------------------------+
     |                               |Masking-key, if MASK set to 1  |
     +-------------------------------+-------------------------------+
     | Masking-key (continued)       |          Payload Data         |
     +-------------------------------- - - - - - - - - - - - - - - - +
     :                     Payload Data continued ...                :
     + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +
     |                     Payload Data continued ...                |
     +---------------------------------------------------------------+
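
As a reading aid for the diagram (my own sketch, not code from the RFC), decoding the first two bytes of a frame header looks like this:

// Decode the fixed part of a WebSocket frame header (first 2 bytes) per the
// layout above. The extended length and masking key would follow; they are
// noted in the comments but not read here.
public class FrameHeader {
    public static void decode(byte b0, byte b1) {
        boolean fin    = (b0 & 0x80) != 0;     // FIN: final fragment of a message
        int     rsv    = (b0 >> 4) & 0x07;     // RSV1-3: must be 0 unless an extension negotiated them
        int     opcode =  b0 & 0x0F;           // 0x1 text, 0x2 binary, 0x8 close, 0x9 ping, 0xA pong
        boolean masked = (b1 & 0x80) != 0;     // MASK: client-to-server frames are always masked
        int     len7   =  b1 & 0x7F;           // 0-125 literal; 126 -> 16-bit ext. length; 127 -> 64-bit

        System.out.printf("fin=%b rsv=%d opcode=0x%X masked=%b payloadLen7=%d%n",
                fin, rsv, opcode, masked, len7);
    }

    public static void main(String[] args) {
        // 0x81 0x85: FIN set, text frame, masked, payload length 5
        // (the masked "Hello" example from RFC 6455)
        decode((byte) 0x81, (byte) 0x85);
    }
}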

CometD, Bayeux Server

CometD is a framework that implements the Bayeux protocol for Comet messaging. Refer to http://docs.cometd.org/reference/ for the details.

Its 2.4 performance can be found here 

CometD components



Message flow

  1. It invokes BayeuxServer extensions (methods rcv() or rcvMeta()); if one extension denies processing, a reply is sent to the client indicating that the message has been deleted, and no further processing is performed for the message.

  2. It invokes ServerSession extensions (methods rcv() or rcvMeta(), only if a ServerSession for that client exists); if one extension denies processing, a reply is sent to the client indicating that the message has been deleted, and no further processing is performed for the message.

  3. It invokes authorization checks for both the security policy and the authorizers; if the authorization is denied, a reply is sent to the client indicating the failure, and no further processing is performed for the message.

  4. If the message is a service or broadcast message, the message passes through BayeuxServer extensions (methods send() or sendMeta()).

  5. It invokes server channel listeners; the application adds server channel listeners on the server, and offers the last chance to modify the message before it is eventually sent to all subscribers (if it is a broadcast message). All subscribers see any modification a server channel listener makes to the message, just as if the publisher has sent the message already modified. After the server channel listeners processing, the message is frozen and no further modifications should be made to the message. Applications should not worry about this freezing step, because the API clarifies whether the message is modifiable or not: the API has as a parameter a modifiable message interface or an unmodifiable one to represent the message object. This step is the last processing step for an incoming non-broadcast message, and it therefore ends its journey on the server. A reply is sent to publishers to confirm that the message made it to the server (see below), but the message is not broadcast to other server sessions.

  6. If the message is a broadcast message, for each server session that subscribes to the channel, the message passes through ServerSession extensions (methods send() or sendMeta()), then the server session queue listeners are invoked and finally the message is added to the server session queue for delivery.

  7. If the message is a lazy message (see Section 7.4.7, “Lazy Channels and Messages”), it is sent on first occasion. Otherwise the message is delivered immediately. If the server session onto which the message is queued corresponds to a remote client session, it is assigned a thread to deliver the messages in its queue through the server transport. The server transport drains the server session message queue, converts the messages to JSON and sends them on the conduit as the payloads of transport-specific envelopes (for example, an HTTP response or a WebSocket message). Otherwise, the server session onto which the message is queued corresponds to a local session, and the messages in its queue are delivered directly to the local session.

  8. For both broadcast and non-broadcast messages, a reply message is created, passes through BayeuxServer extensions and ServerSession extensions (methods send() or sendMeta()). It then passes to the server transport, which converts it to JSON through a JSONContext.Server instance (see Section 7.5.1, “JSONContext API”), and sends it on the conduit as the payload of a transport-specific envelope (for example, an HTTP response or a WebSocket message).

  9. The envelope travels back to the client, where the client transport receives it. The client transport converts the messages from the JSON format back to message objects, for the Java client via a JSONContext.Client instance (see Section 7.5.1, “JSONContext API”).

  10. Each message then passes through the extensions (methods send() or sendMeta()), and channel listeners and subscribers are notified of the message.
The round trip from client to server back to client is now complete.

Tuesday, August 13, 2013

AsyncHTTP, Comet

HTTP is a one-way, stateless protocol. In order to get real-time updates, we have to use polling. The article from Jetty lists the cost of polling, and it is huge. Luckily, we have Comet and the Servlet 3.0 async API.

The article from IBM gives a very comprehensive introduction to the various Comet solutions, such as polling, long-polling and streaming.

http://www.ibm.com/developerworks/web/library/wa-cometjava/

AJAX polling problem

Refer to the original for details: http://docs.codehaus.org/display/JETTY/Continuations
But there is a new problem. The advent of AJAX as a web application model is significantly changing the traffic profile seen on the server side. Because AJAX servers cannot deliver asynchronous events to the client, the AJAX client must poll for events on the server. To avoid a busy polling loop, AJAX servers will often hold onto a poll request until either there is an event or a timeout occurs. Thus an idle AJAX application will have an outstanding request waiting on the server which can be used to send a response to the client the instant an asynchronous event occurs. This is a great technique, but it breaks the thread-per-request model, because now every client will have a request outstanding in the server. Thus the server again needs to have one or more threads for every client and again there are problems scaling to thousands of simultaneous users.
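
For reference, a hedged sketch of what the Servlet 3.0 async API mentioned above looks like: the long-poll request is parked without holding a container thread and completed later when an event arrives. The event wiring (onEvent, the queue) is illustrative only:

import java.io.IOException;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import javax.servlet.AsyncContext;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Long-poll endpoint on Servlet 3.0 async I/O: startAsync() releases the
// container thread, and the parked AsyncContext is completed later by
// whatever produces events.
@WebServlet(urlPatterns = "/events", asyncSupported = true)
public class LongPollServlet extends HttpServlet {
    private final Queue<AsyncContext> waiting = new ConcurrentLinkedQueue<>();

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
        AsyncContext ctx = req.startAsync();
        ctx.setTimeout(10_000);            // roughly the 10 s poll duration in the table below
        waiting.add(ctx);                  // park the request; no thread is held while waiting
    }

    // Called by some application event source (not shown) when there is news.
    public void onEvent(String message) {
        AsyncContext ctx;
        while ((ctx = waiting.poll()) != null) {
            try {
                ctx.getResponse().getWriter().println(message);
            } catch (IOException ignored) {
                // client may have gone away; nothing more to do in this sketch
            }
            ctx.complete();                // finish the parked long-poll response
        }
    }
}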

                         Formula          Web 1.0    Web 2.0 + Comet    Web 2.0 + Comet + Continuations
Users                    u                10000      10000              10000
Requests/Burst           b                5          2                  2
Burst period (s)         p                20         5                  5
Request Duration (s)     d                0.200      0.150              0.175
Poll Duration (s)        D                0          10                 10

Request rate (req/s)     rr = u*b/p       2500       4000               4000
Poll rate (req/s)        pr = u/D         0          1000               1000
Total (req/s)            r = rr+pr        2500       5000               5000

Concurrent requests      c = rr*d+pr*D    500        10600              10700
Min Threads              T = c            500        10600              -
                         T = r*d          -          -                  875
Stack memory             S = 64*1024*T    32MB       694MB              57MB

HTTP, WebSocket, SPDY, HTTP/2.0 Evolution of Web Protocols

A very comprehensive document on the technical evolution of HTTP.