
Monday, January 27, 2014

Designing Reusable Classes

Ralph E. Johnson & Brian Foote

Software reuse does not happen by accident, even with object-oriented programming languages. System designers must plan to reuse old components and must look for new reusable components.


Evolutionary lifecycles are the rule rather than the exception. Software maintenance can be categorized as corrective, adaptive, and perfective.

Classes usually start out being application dependent. It is always worthwhile to examine a nearly-complete project to see if new abstract classes and frameworks can be discovered. They can probably be reused in later projects, and their presence in the current project will make later enhancements much easier. Thus, creating abstract classes and frameworks is both a way of scavenging components for later reuse and a way of cleaning up a design. The final class hierarchy is a description of how the system ought to have been designed, though it may bear little relation to the original design.

One sign that a good abstraction has been found is that code size decreases, indicating that code is being reused. Many Smalltalk projects have periods in which the size of the code increases at a steady rate, followed by periods in which little change occurs to the code, followed by a sharp decrease in the size of the code. Code size increases as the programmers add new classes and new methods to old classes. Eventually the programmers realize that they need to rearrange the class hierarchy. They spend a bit of time in debate and experimentation and then make the necessary changes, usually creating a new abstract class or two. Since Smalltalk programs tend to be compact, it is feasible to rewrite a system many times during its development. The result is much easier to understand and maintain than typical nonobject-oriented systems.

Finding new abstractions is difficult. In general, an abstraction is usually discovered by generalizing from a number of concrete examples.

Rules for Finding Standard Protocols

Rule 1: Recursion introduction

Rule 2: Eliminate case analysis

Rule 3: Reduce the number of arguments

Rule 4: Reduce the size of methods

Rules for Finding Abstract Classes

Rule 5: Class hierarchies should be deep and narrow

Rule 6: The top of the class hierarchy should be abstract

Rule 7: Minimize accesses to variables

Rule 8: Subclasses should be specializations

Rules for Finding Frameworks

Rule 9: Split large classes

Rule 10: Factor implementation differences into subcomponents

Rule 11: Separate methods that do not communicate

Rule 12: Send messages to components instead of to self

Rule 13: Reduce implicit parameter passing
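To make Rule 2 concrete, here is a hypothetical Java sketch (the paper's own examples are in Smalltalk; these Shape classes are mine). The case analysis disappears because each subclass of an abstract top class (Rule 6) supplies its own behavior:

abstract class Shape {                        // abstract top of the hierarchy (Rule 6)
    abstract double area();                   // standard protocol shared by all shapes
}

class Circle extends Shape {
    private final double radius;
    Circle(double radius) { this.radius = radius; }
    double area() { return Math.PI * radius * radius; }
}

class Square extends Shape {
    private final double side;
    Square(double side) { this.side = side; }
    double area() { return side * side; }     // callers never switch on a type tag (Rule 2)
}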

Monday, January 6, 2014

Open Source Project Types

http://zguide.zeromq.org/page%3aall#toc89
There are three main open source patterns. The first is the large firm dumping code to break the market for others. This is the Apache Foundation model. The second is tiny teams or small firms building their dream. This is the most common open source model, which can be very successful commercially. The last is aggressive and diverse communities that swarm over a problem landscape. This is the Linux model, and the one to which we aspire with ØMQ.

It's hard to overemphasize the power and persistence of a working open source community. There really does not seem to be a better way of making software for the long term. Not only does the community choose the best problems to solve, it solves them minimally, carefully, and it then looks after these answers for years, decades, until they're no longer relevant, and then it quietly puts them away.

Saturday, January 4, 2014

Reliability! What does it mean?

http://zguide.zeromq.org/page%3aall#High-Level-Messaging-Patterns
Most people who speak of "reliability" don't really know what they mean. We can only define reliability in terms of failure. That is, if we can handle a certain set of well-defined and understood failures, then we are reliable with respect to those failures. No more, no less. So let's look at the possible causes of failure in a distributed ØMQ application, in roughly descending order of probability:
  • Application code is the worst offender. It can crash and exit, freeze and stop responding to input, run too slowly for its input, exhaust all memory, and so on.
  • System code—such as brokers we write using ØMQ—can die for the same reasons as application code. System code should be more reliable than application code, but it can still crash and burn, and especially run out of memory if it tries to queue messages for slow clients.
  • Message queues can overflow, typically in system code that has learned to deal brutally with slow clients. When a queue overflows, it starts to discard messages. So we get "lost" messages.
  • Networks can fail (e.g., WiFi gets switched off or goes out of range). ØMQ will automatically reconnect in such cases, but in the meantime, messages may get lost.
  • Hardware can fail and take with it all the processes running on that box.
  • Networks can fail in exotic ways, e.g., some ports on a switch may die and those parts of the network become inaccessible.
  • Entire data centers can be struck by lightning, earthquakes, fire, or more mundane power or cooling failures.
To make a software system fully reliable against all of these possible failures is an enormously difficult and expensive job and goes beyond the scope of this book.

Because the first five cases in the above list cover 99.9% of real world requirements outside large companies (according to a highly scientific study I just ran, which also told me that 78% of statistics are made up on the spot, and moreover never to trust a statistic that we didn't falsify ourselves), that's what we'll examine. If you're a large company with money to spend on the last two cases, contact my company immediately! There's a large hole behind my beach house waiting to be converted into an executive swimming pool.

Key for Scalable System/Program


http://zguide.zeromq.org/page%3aall

To make utterly perfect MT programs (and I mean that literally), we don't need mutexes, locks, or any other form of inter-thread communication except messages sent across ØMQ sockets.

By "perfect MT programs", I mean code that's easy to write and understand, that works with the same design approach in any programming language, and on any operating system, and that scales across any number of CPUs with zero wait states and no point of diminishing returns.

If you've spent years learning tricks to make your MT code work at all, let alone rapidly, with locks and semaphores and critical sections, you will be disgusted when you realize it was all for nothing. If there's one lesson we've learned from 30+ years of concurrent programming, it is: just don't share state. It's like two drunkards trying to share a beer. It doesn't matter if they're good buddies. Sooner or later, they're going to get into a fight. And the more drunkards you add to the table, the more they fight each other over the beer. The tragic majority of MT applications look like drunken bar fights.

The list of weird problems that you need to fight as you write classic shared-state MT code would be hilarious if it didn't translate directly into stress and risk, as code that seems to work suddenly fails under pressure. A large firm with world-beating experience in buggy code released its list of "11 Likely Problems In Your Multithreaded Code", which covers forgotten synchronization, incorrect granularity, read and write tearing, lock-free reordering, lock convoys, two-step dance, and priority inversion.

Yeah, we counted seven problems, not eleven. That's not the point though. The point is, do you really want that code running the power grid or stock market to start getting two-step lock convoys at 3 p.m. on a busy Thursday? Who cares what the terms actually mean? This is not what turned us on to programming, fighting ever more complex side effects with ever more complex hacks.

Some widely used models, despite being the basis for entire industries, are fundamentally broken, and shared state concurrency is one of them. Code that wants to scale without limit does it like the Internet does, by sending messages and sharing nothing except a common contempt for broken programming models.

You should follow some rules to write happy multithreaded code with ØMQ (a small sketch follows the list):
  • Isolate data privately within its thread and never share data across multiple threads. The only exception to this is the ØMQ context, which is threadsafe.
  • Stay away from classic concurrency mechanisms such as mutexes, critical sections, and semaphores. These are an anti-pattern in ØMQ applications.
  • Create one ØMQ context at the start of your process, and pass that to all threads that you want to connect via inproc sockets.
  • Use attached threads to create structure within your application, and connect these to their parent threads using PAIR sockets over inproc. The pattern is: bind the parent socket, then create the child thread, which connects its socket.
  • Use detached threads to simulate independent tasks, with their own contexts. Connect these over tcp. Later you can move these to stand-alone processes without changing the code significantly.
  • All interaction between threads happens as ØMQ messages, which you can define more or less formally.
  • Don't share ØMQ sockets between threads. ØMQ sockets are not threadsafe. Technically it's possible to migrate a socket from one thread to another, but it demands skill. The only place where it's remotely sane to share sockets between threads is in language bindings that need to do magic like garbage collection on sockets.
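Here is a minimal sketch of the attached-thread pattern from the rules above, assuming the JeroMQ Java binding (org.zeromq); the "inproc://pipe" endpoint and class name are my own placeholders:

import org.zeromq.SocketType;
import org.zeromq.ZContext;
import org.zeromq.ZMQ;

public class PairPipe {
    public static void main(String[] args) throws InterruptedException {
        try (ZContext ctx = new ZContext()) {           // one context per process
            ZMQ.Socket parent = ctx.createSocket(SocketType.PAIR);
            parent.bind("inproc://pipe");               // parent binds BEFORE the child starts

            Thread child = new Thread(() -> {
                ZMQ.Socket pipe = ctx.createSocket(SocketType.PAIR);
                pipe.connect("inproc://pipe");          // child connects to the parent
                pipe.send("ready");                     // all interaction is ØMQ messages
            });
            child.start();

            System.out.println(parent.recvStr());       // prints "ready"
            child.join();
        }
    }
}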

Friday, January 3, 2014

Angular Testing

http://www.yearofmoo.com/2013/09/advanced-testing-and-debugging-in-angularjs.html#presentation-slides-plus-video
http://www.yearofmoo.com/2013/01/full-spectrum-testing-with-angularjs-and-karma.html 
What to test, and which tools to use:

  • Does each page load properly? Use Protractor (Selenium). A WebDriver/E2E test can run a full integration test through a given URL on your website and report whether any JavaScript errors occurred.
  • Does my backend API work as expected? Use server-side testing. Look up which server-side test framework fits your needs; don't rely solely on integration tests to see if things are working.
  • If I change my front-end JavaScript API code, how do I know things still work? Use unit testing (Jasmine or Mocha). Ideally you want a solid test spec for each feature (logical branch of execution) within each block of code (functions, objects, services, methods, etc.) to keep track of what goes on in that method.
  • How do I know things work in browser X? Use integration or unit testing. Typically you would set up a collection of integration tests to cover various pages/views in your application. Then, when a bug appears, try to isolate the broken code into its own service/subroutine and set up a unit test or two to cover what's going on. This way, if your page/view code changes, you'll still have the unit test as a reference (since unit tests are cheap) and you won't have to worry about that bug again (since the contained service/subroutine is where it will be located). This may be challenging with DOM code, but anything is possible.
  • I didn't read the CHANGELOG, and I want to know whether things will break if I bump the version of my third-party code. Use integration or unit testing. Integration tests should cover most if not all of the views in your application, so if you upgrade to a new version of framework X it should be easy to find out which features stop working. However, if the changes in the third-party code are internal and purely JavaScript, an integration test may not detect the broken code. If you have a good amount of unit test coverage you should be fine; if you have close to nothing, set up a simple test spec that exercises the basic inputs and outputs of the third-party code (method names and return values should be enough).

Thursday, December 26, 2013

Wonderful article on JS Delete

http://perfectionkills.com/understanding-delete/
Here’s a short summary of how deletion works in JavaScript:
  • Variables and function declarations are properties of either Activation or Global objects.
  • Properties have attributes, one of which — DontDelete — is responsible for whether a property can be deleted.
  • Variable and function declarations in Global and Function code always create properties with DontDelete.
  • Function arguments are also properties of Activation object and are created with DontDelete.
  • Variable and function declarations in Eval code always create properties without DontDelete.
  • New properties are always created with empty attributes (and so without DontDelete).
  • Host objects are allowed to react to deletion however they want.

Tuesday, December 17, 2013

Analysis vs. Design: What’s the Difference?

http://www.omg.org/news/meetings/workshops/presentations/eai_2001/tutorial_monday/tockey_tutorial/2-Analysis_vs_Design.pdf

Requirements 

  • Unambiguous 
    • Interpretable in only one way 
  • Testable 
    • Compliance (or, non-compliance) can be clearly demonstrated
  • Binding 
    • The customer is willing to pay for it and is unwilling to not have it 
Every requirement that is still necessary in spite of “perfect technology” is an essential requirement.



  • Requirements about speed, cost, and capacity go into the design bucket
  • Requirements about reliability (MTBF, MTTR) go into the design bucket
  • Requirements about I/O mechanisms and presentations go into the design bucket
  • Requirements about computer languages go into the design bucket
  • Requirements about archiving go into the design bucket
  • Requirements about the customer's business policy / business process go into the essential bucket

UML for Analysis


UML for Design


Benefits

Reduce apparent complexity: one large problem becomes two smaller ones 
  • Understand the customer’s business policy / business process 
  • Figure out how to automate that business policy / process with the available technology 
Isolate areas of expertise
Apply the principles of coupling and cohesion at the highest level of the software architecture 
  • More robust, less fragile systems 
  • Enable separate evolution of the business policy / business process and the implementation technology

Responsive Web Design - Study Memo

Articles

http://alistapart.com/article/responsive-web-design
http://blog.teamtreehouse.com/beginners-guide-to-responsive-web-design
http://alistapart.com/article/fluidgrids
http://www.javascriptkit.com/dhtmltutors/cssmediaqueries2.shtml

W3C

http://www.w3.org/TR/css3-mediaqueries/

Recommended Media Screen Width

  • 320px
  • 480px
  • 600px
  • 768px
  • 900px
  • 1200px
Use em, a relative unit, instead of pixels for font sizes.

Use fluid grids for responsive websites.


Tools

css3-mediaqueries-js 
https://code.google.com/p/css3-mediaqueries-js/

Thursday, October 31, 2013

Wonderful Performance Metrics Tool

As my crawler project gets close to the end, my concern about performance grows heavier. How do I measure my system's performance? There is no easy answer: there are so many modules and so many parameters, and most importantly, the measurement itself CAN'T affect system performance or increase module complexity.

Even for the JSON parsing function alone, I spent almost one working day coming up with the performance measurement, and that was just for the unit testing ;( . The result looks like the following.


stockprice (36648000 records) elapsed ms:376806.531 for 36000 avg:9.937 variance:5.652 Fastest:9.000 Slowest:244.000
[67, 9 x 16910, 10 x 13029, 11 x 4306, 12 x 868, 73, 13 x 248, 14 x 123, 15 x 49, 17 x 15, 16 x 33, 19 x 9, 18 x 8, 21 x 26, 20 x 11, 23 x 51, 22 x 47, 25 x 43, 24 x 41, 27 x 39, 26 x 38, 29 x 24, 28 x 28, 31 x 3, 30 x 18, 34, 35 x 2, 32 x 2, 33 x 6, 38, 39 x 2, 36 x 6, 37 x 3, 42 x 2, 43, 41, 50, 48, 54, 244]


But what about the other functions... I was almost frightened by the future workload; it seemed my system's launch day would need to be postponed. Then today I found this wonderful library, Metrics, through a Netty example. It is fantastic and saves me huge time on performance measurement and reporting.

With just a few lines, the following result is automatically printed to the system console. If you need, it can easily output the results to CSV, a log file, or JMX, and it even provides a servlet to expose the results remotely as JSON. Wonderful!!!

With all these tools, I finally have some assurance of my system's quality.


// Metrics 3.x (Codahale) API; the original snippet mixed 2.x and 3.x calls.
final MetricRegistry registry = new MetricRegistry();
final ConsoleReporter reporter = ConsoleReporter.forRegistry(registry)
                                                .convertRatesTo(TimeUnit.SECONDS)
                                                .convertDurationsTo(TimeUnit.MILLISECONDS)
                                                .build();
reporter.start(1, TimeUnit.MINUTES);

// Timer backed by a sliding window of the last nMax samples
Timer timer = registry.register(
        MetricRegistry.name(this.getClass(), "StockPrice Batch Parse", "timer"),
        new Timer(new SlidingWindowReservoir(nMax)));

timer.update(stopwatch2.stop().elapsed(TimeUnit.NANOSECONDS), TimeUnit.NANOSECONDS);



-- Timers ----------------------------------------------------------------------
test.JSONParserTest.StockPrice Batch Parse.timer
             count = 34356
         mean rate = 95.49 calls/second
     1-minute rate = 95.70 calls/second
     5-minute rate = 88.47 calls/second
    15-minute rate = 80.06 calls/second
               min = 9.32 milliseconds
               max = 244.26 milliseconds
              mean = 10.46 milliseconds
            stddev = 2.38 milliseconds
            median = 10.07 milliseconds
              75% <= 10.64 milliseconds
              95% <= 11.99 milliseconds
              98% <= 13.57 milliseconds
              99% <= 22.14 milliseconds
            99.9% <= 31.03 milliseconds 

Saturday, October 26, 2013

SCTP vs UDT

When I needed a better messaging protocol for my project DTCrawler, I found these two newer implementations. After studying them, and especially the book Networks for Grid Applications, I chose UDT because:

1. It is built on top of UDP (which is my preference).
2. It provides flow/congestion control, which the application needs.

The difference between the two is summarized like this:

"UDT borrows the messaging and partial reliability semantics from SCTP. However, SCTP are specially designed for VoIP and telephony, but UDT targets general purpose data transfer. UDT unifies both messaging and streaming semantics in one protocol."

Saturday, October 19, 2013

Java Performance Tuning Study Memo

Wonderful blogs from http://java-performance.info/!!! List of articles http://java-performance.com/

Java type memory usage


byte, boolean       1 byte
short, char         2 bytes
int, float          4 bytes
long, double        8 bytes
Byte, Boolean       16 bytes
Short, Character    16 bytes
Integer, Float      16 bytes
Long, Double        24 bytes

Per-element overhead of collections:

EnumSet, BitSet     1 bit per value
EnumMap             4 bytes (for the value, nothing for the key)
ArrayList           4 bytes (but may be more if the ArrayList capacity is seriously more than its size)
LinkedList          24 bytes (fixed)
ArrayDeque          4 to 8 bytes, 6 bytes on average

JDK collection      Size                             Possible Trove substitution    Size
HashMap             32 * SIZE + 4 * CAPACITY bytes   THashMap                       8 * CAPACITY bytes
HashSet             32 * SIZE + 4 * CAPACITY bytes   THashSet                       4 * CAPACITY bytes
LinkedHashMap       40 * SIZE + 4 * CAPACITY bytes   None
LinkedHashSet       32 * SIZE + 4 * CAPACITY bytes   TLinkedHashSet                 8 * CAPACITY bytes
TreeMap, TreeSet    40 * SIZE bytes                  None
PriorityQueue       4 * CAPACITY bytes               None
All Java objects start with 8 bytes of service information, such as the object's class and its identity hash code (returned by the System.identityHashCode method). Arrays have 4 more bytes (one int field) containing the array length. It looks like all user-written classes (as opposed to JDK classes) carry another reference to the object's Class. These fields are followed by all declared fields. All objects are aligned on an 8-byte boundary, and all primitive fields must be aligned by their size (for example, chars are aligned on a 2-byte boundary). An object reference (including any array reference) occupies 4 bytes. What does this mean for us? To make the most of available memory, all of an object's fields should occupy N*8+4 bytes (4, 12, 20, 28, and so on); in that case 100% of the memory contains useful data.
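As a back-of-the-envelope illustration of these rules (hypothetical classes of mine; 4-byte references as described above):

class Point {        // 8-byte header
    int x;           // +4 bytes -> offsets  8..11
    int y;           // +4 bytes -> offsets 12..15
}                    // 16 bytes total: already a multiple of 8, no padding wasted

class Named {        // 8-byte header
    long id;         // +8 bytes -> offsets  8..15
    String name;     // +4 bytes -> offsets 16..19 (one object reference)
}                    // 20 bytes of data, padded up to 24 (next 8-byte boundary)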

Java Boxing Type Caching

Byte, Short, Long    cached from -128 to 127
Character            cached from 0 to 127
Integer              cached from -128 to java.lang.Integer.IntegerCache.high or 127, whichever is bigger
Float, Double        no caching
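A quick demonstration of the cache at work (autoboxing goes through Integer.valueOf, so == behaves differently inside and outside the cached range):

Integer a = Integer.valueOf(127);
Integer b = Integer.valueOf(127);
System.out.println(a == b);   // true: both references come from the cache

Integer c = Integer.valueOf(128);
Integer d = Integer.valueOf(128);
System.out.println(c == d);   // false: 128 is outside the default cache range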

Java Performance Tips

Never use exceptions as a return-code replacement or for any likely-to-happen event (especially in non-IO-bound methods!). Throwing an exception is too expensive; you may experience a 100x slowdown in simple methods.

Throwing an exception in Java is a very slow operation. Expect throwing an exception to cost you somewhere between 100 and 1000 ticks in most cases.
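A hypothetical illustration of the tip: the first method abuses an exception for an event that happens on every single call, while the second simply tests the condition:

import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class ExceptionCost {
    // Anti-pattern: the exception is the normal way the loop ends.
    static int sumViaException(List<Integer> xs) {
        Iterator<Integer> it = xs.iterator();
        int sum = 0;
        try {
            while (true) sum += it.next();     // NoSuchElementException ends the loop
        } catch (NoSuchElementException end) {
            return sum;                        // exception used as a return code
        }
    }

    // Testing the condition costs a branch, not a stack-trace capture.
    static int sumViaCheck(List<Integer> xs) {
        int sum = 0;
        for (int x : xs) sum += x;
        return sum;
    }
}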

Case Study

Tuesday, October 1, 2013

MYSQL Tips

1. Enable Log

SET GLOBAL log_output = 'TABLE';
SET GLOBAL general_log = 'ON';
SELECT * FROM mysql.general_log ORDER BY event_time DESC LIMIT 100;

Sunday, September 22, 2013

Memory Hierarchy

Get this from http://dank.qemfd.net/dankwiki/images/d/dc/Memoryhierarchy.png


Thursday, September 12, 2013

MySQL INSERT ON DUPLICATE UPDATE IS FASTER THAN UPDATE!!!

It is very strange, but the test results show it is.

MySQL

innodb_version                  5.6.13
protocol_version                10
version                               5.6.13-enterprise-commercial-advanced
version_compile_machine x86_64
version_compile_os           osx10.7

Result

SELECT udf_CreateCounterID(0,CURRENT_DATE);

SELECT @update,@updateend,@updatediff,@insertupdate,@insertupdate_end,@insertupdatediff,@keyval,@countlmt;

@update=2013-09-12 17:32:27
@updateend=2013-09-12 17:33:01
@updatediff=34

@insertupdate=2013-09-12 17:32:00
@insertupdate_end=2013-09-12 17:32:27
@insertupdatediff=27

@keyval=13
@countlmt=1000000

Table

CREATE TABLE `sys_CounterID` (
  `exch_year` int(11) NOT NULL,
  `nextID` int(11) NOT NULL,
  PRIMARY KEY (`exch_year`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


Test Function

CREATE DEFINER=`root`@`localhost` FUNCTION `udf_CreateCounterID`(exchID SMALLINT, listyear DATE) RETURNS int(10) unsigned
BEGIN
 /**
 counter ID is 32 bits:
 highest 9 bits: exchange ID (as of 2013, there are 317 operator MICs; any ID > 511 is taken modulo 512)
 middle 7 bits: 2-digit year (max: 99)
 remaining bits: counter number
 */
 DECLARE keyvalue INT UNSIGNED DEFAULT 0;
 
 SET @countlmt = 1000000;
 SET keyvalue = ((exchID % 512) << 9 ) + EXTRACT(YEAR FROM listyear) % 100;

 SET @keyval = keyvalue;
 SET @retVal =  0;

 SET @count = @countlmt;
 SET @insertupdate = SYSDATE();

 WHILE @count > 0 DO

  INSERT INTO `sys_CounterID`(`exch_year`,nextID)
  VALUE( keyvalue, 1)
  ON DUPLICATE KEY UPDATE 
   nextID = (@retVal := nextID + 1);

  SET @count = @count - 1;

 END WHILE;

 SET @insertupdate_end = SYSDATE();
 SET @insertupdatediff = TIMESTAMPDIFF(SECOND, @insertupdate,@insertupdate_end);

 
 SET @count = @countlmt;
 SET @update = SYSDATE();
 
 WHILE @count > 0 DO

  UPDATE sys_CounterID 
  SET nextID = (@retVal := nextID + 1)
  WHERE exch_year = keyvalue;

  SET @count = @count - 1;

 END WHILE;

 SET @updateend = SYSDATE();
 SET @updatediff = TIMESTAMPDIFF(SECOND, @update,@updateend);


 RETURN @retVal;

END


Friday, August 16, 2013

High Speed Concurrent Framework, Disruptor

I have gone through many articles about the Disruptor, devised by the LMAX team. It caught my eye because of this: http://martinfowler.com/articles/lmax.html

LMAX is a new retail financial trading platform. As a result it has to process many trades with low latency. The system is built on the JVM platform and centers on a Business Logic Processor that can handle 6 million orders per second on a single thread. The Business Logic Processor runs entirely in-memory using event sourcing. The Business Logic Processor is surrounded by Disruptors - a concurrency component that implements a network of queues that operate without needing locks. During the design process the team concluded that recent directions in high-performance concurrency models using queues are fundamentally at odds with modern CPU design.

Impressive, isn't it? :) But the above is a little misleading: it is not a single-threaded system. To me, the key to the Disruptor's achievements is that the team brilliantly avoids the most common multithreading traps. Today, instead of the old model of CPU, registers, and memory, a high-performance program has to deal with the CPU, registers, and cache; to today's CPU, main memory is like a hard disk was in the old days. Another factor is that CPUs now have multiple cores, and out-of-order execution will affect your program too. This great work again proves how important the fundamentals are: data structures, thread management in the OS, and a deep understanding of the hardware.



Besides the LMAX articles, you can get insights directly from the Intel® 64 and IA-32 Architectures Optimization Reference Manual. The L1 cache comes in two types, data and instruction; the data cache is only 32 KB, and the cache line size is 64 bytes. The wiki provides a very comprehensive explanation of cache lines. Another article, from Microsoft on driver development, clearly states the common issues in multiprocessor architectures.

Sharing Is the Root of All Contention

from Herb Sutter's Dr. Dobb's blog


Cache Line & Atomic Operations


LOCKED ATOMIC OPERATIONS

The 32-bit IA-32 processors support locked atomic operations on locations in system memory. These operations are typically used to manage shared data structures (such as semaphores, segment descriptors, system segments, or page tables) in which two or more processors may try simultaneously to modify the same field or flag. The processor uses three interdependent mechanisms for carrying out locked atomic operations:
  • Guaranteed atomic operations
  • Bus locking, using the LOCK# signal and the LOCK instruction prefix
  • Cache coherency protocols that insure that atomic operations can be carried out on cached data structures (cache lock); this mechanism is present in the Pentium 4, Intel Xeon, and P6 family processors
These mechanisms are interdependent in the following ways. Certain basic memory transactions (such as reading or writing a byte in system memory) are always guaranteed to be handled atomically. That is, once started, the processor guarantees that the operation will be completed before another processor or bus agent is allowed access to the memory location. The processor also supports bus locking for performing selected memory operations (such as a read-modify-write operation in a shared area of memory) that typically need to be handled atomically, but are not automatically handled this way. Because frequently used memory locations are often cached in a processor’s L1 or L2 caches, atomic operations can often be carried out inside a processor’s caches without asserting the bus lock. Here the processor’s cache coherency protocols insure that other processors that are caching the same memory locations are managed properly while atomic operations are performed on cached memory locations.

Memory Barrier Semantics

  • Acquire semantics mean that the results of the operation are visible before the results of any operation that appears after it in code. 
  • Release semantics mean that the results of the operation are visible after the results of any operation that appears before it in code.
  • Fence semantics combine acquire and release semantics. The results of an operation with fence semantics are visible before those of any operation that appears after it in code and after those of any operation that appears before it.

Caching Issues

The hardware always reads an entire cache line, rather than individual data items. If you think of the cache as an array, a cache line is simply a row in that array: a consecutive block of memory that is read and cached in a single operation. The size of a cache line is generally from 16 to 128 bytes, depending on the hardware.

Each cache line has one of the following states:
  • Exclusive, meaning that this data does not appear in any other processor’s cache. When a cache line enters the Exclusive state, the data is purged from any other processor’s cache.
  • Shared, meaning that another cache line has requested the same data.
  • Invalid, meaning that another processor has changed the data in the line.
  • Modified, meaning that the current processor has changed the data in this line.
All architectures on which Windows runs guarantee that every processor in a multiprocessor configuration will return the same value for any given memory location. This guarantee, which is called cache coherency between processors, ensures that whenever data in one processor’s cache changes, all other caches that contain the same data will be updated. On a single-processor system, whenever the required memory location is not in the cache, the hardware must reload it from memory. On a multiprocessor system, if the data is not in the current processor’s cache, the hardware can read it from main memory or request it from other processors’ caches. If the processor then writes a new value to that location, all other processors must update their caches to get the latest data.
Some data structures have a high locality of reference. This means that the structure often appears in a sequence of instructions that reference adjacent fields. If a structure has a high locality of reference and is protected by a lock, it should typically be in its own cache line.
For example, consider a large data structure that is protected by a lock and that contains both a pointer to a data item and a flag indicating the status of that data item. If the structure is laid out so that both fields are in the same cache line, any time the driver updates one variable, the other variable is already present in the cache and can be updated immediately.

In contrast, consider another scenario. What happens if two data structures in the same cache line are protected by two different locks and are accessed simultaneously from two different processors? Processor 0 updates the first structure, causing the cache line in Processor 0 to be marked Exclusive and the data in that line to be purged from other processors’ caches. Processor 1 must request the data from Processor 0 and wait until its own cache is updated before it can update the second structure. If Processor 0 again tries to write the first structure, it must request the data from Processor 1, wait until the cache is updated, and so on. However, if the structures are not on the same cache line, neither processor must wait for these cache updates. Therefore, two data structures that can be accessed simultaneously on two different processors (because they are not protected by the same lock) should be on different cache lines.

Mechanical Sympathy

The first thing is to understand the CPU, your rice bowl: Mechanical Sympathy.

The following is from http://www.infoq.com/presentations/LMAX

Don't use locks

Locks trap your program under "Amdahl's law" (on the bad side, of course). A lock causes execution context switches, ring3 -> ring0 -> ring3... Refer to Trisha's blog for more detail. But how do you avoid locks in a multithreaded environment? The idea is: don't share the data. Shared resources are the reason locks exist, which means you need to consider data segregation in a high-performance system. Their whole ring buffer design revolves around this point.

Don't copy data around for inter-thread communication

Their ring buffer is just like an infinite array, and the index acts like a reference pointer, so objects aren't copied around and no dynamic memory management is involved. That avoids today's most common wasteful habit among OO programmers: always calling new Object(). I remember trying a similar idea before in my C++ project. A minimal sketch of the idea follows.
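A minimal sketch of the preallocate-and-reuse idea (this is not the actual Disruptor API; the names are mine): entries are allocated once, and threads exchange only sequence numbers.

final class Ring {
    static final class Slot { long value; }         // mutable entry, reused forever

    private final Slot[] slots;
    private final int mask;

    Ring(int sizePowerOfTwo) {
        slots = new Slot[sizePowerOfTwo];
        for (int i = 0; i < slots.length; i++)
            slots[i] = new Slot();                  // allocate up front, never again
        mask = sizePowerOfTwo - 1;
    }

    Slot claim(long sequence) {                     // a sequence number wraps into the array
        return slots[(int) (sequence & mask)];      // no copying, no new Object()
    }
}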


Some valuable points from Trisha's blog http://mechanitis.blogspot.sg/2011/07/dissecting-disruptor-why-its-so-fast_22.html

Martin and Mike's QCon presentation gives some indicative figures for the cost of cache misses:

Latency from CPU to...            Approx. CPU cycles    Approx. time
Main memory                                             ~60-80 ns
QPI transit (between sockets)                           ~20 ns
L3 cache                          ~40-45 cycles         ~15 ns
L2 cache                          ~10 cycles            ~3 ns
L1 cache                          ~3-4 cycles           ~1 ns
Register                          1 cycle


Cache Line


Volatile = Memory Barrier

This means that if you write to a volatile field, you know that:
  • Any thread accessing that field after the point at which you wrote to it will get the updated value.
  • Anything you did before you wrote that field is guaranteed to have happened, and any updated data values will also be visible, because the memory barrier flushed all earlier writes to the cache.
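A small hypothetical sketch of that guarantee: the volatile write to ready publishes the earlier plain write to data.

class Publisher {
    int data;                         // plain field
    volatile boolean ready;           // the volatile field acts as the barrier

    void produce() {
        data = 42;                    // happens-before the volatile write below
        ready = true;                 // flushes earlier writes
    }

    void consume() {
        if (ready)                            // volatile read
            System.out.println(data);         // guaranteed to see 42
    }
}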

False Sharing 

from Herb Sutter's Dr. Dobb's blog:
The general case to watch out for is when you have two objects or fields that are frequently accessed (either read or written) by different threads, at least one of the threads is doing writes, and the objects are so close in memory that they're on the same cache line because they are:


  • objects nearby in the same array
  • fields nearby in the same object
  • objects allocated close together in time (C++, Java) or by the same thread (C#, Java)
  • static or global objects that the linker decided to lay out close together in memory;
  • objects that become close in memory dynamically, as when during compacting garbage collection two objects can become adjacent in memory because intervening objects became garbage and were collected; or
  • objects that for some other reason accidentally end up close together in memory.

First, we can reduce the number of writes to the cache line. For example, writer threads can write intermediate results to a scratch variable most of the time, then update the variable in the popular cache line only occasionally as needed. This is the approach we took in Example 2, where we changed the code to update a local variable frequently and write into the popular result array only once per worker to store its final count.

Second, we can separate the variables so that they aren't on the same cache line. Typically the easiest way to do this is to ensure an object has a cache line to itself that it doesn't share with any other data. To achieve that, you need to do two things (see the sketch after this list):
  • Ensure that no other object can precede your data in the same cache line by aligning it to begin at the start of the cache line or adding sufficient padding bytes before the object.
  • Ensure that no other object can follow your data in the same cache line by adding sufficient padding bytes after the object to fill up the line.
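A hedged Java sketch of the padding fix (the layout is illustrative; some JVMs reorder fields or strip apparently unused ones, which is why later JDKs added the @Contended annotation for exactly this purpose):

class PaddedCounter {
    volatile long value;                  // the hot field, written by one thread
    long p1, p2, p3, p4, p5, p6, p7;      // 56 padding bytes so that 'value' plus the
                                          // object header fill a 64-byte cache line alone
}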

References

Java Memory Model

Wednesday, August 14, 2013

How HTTPS works, HTTP Tunneling & WebSocket

HTTPS

I finally understand how it works. HTTPS is just HTTP on top of SSL/TLS; HTTPS isn't a separate protocol at all. All web proxies are just HTTP proxies. Their working flow is as follows:

Request message 
Client -> Proxy -> Server

Response message
Client <- Proxy <- Server

Because HTTP is just clear-text messages, the proxy is able to cache the data if the request is the same. This is clearly defined in the HTTP protocol.

The interesting part is HTTPS. I mistakenly believed it was similar to HTTP, but in fact it is completely different: an HTTPS message is an HTTP message packaged as an SSL message, so it can't be proxied or cached at all. Instead it relies on HTTP tunneling (http://en.wikipedia.org/wiki/HTTP_tunnel, http://tools.ietf.org/html/draft-luotonen-web-proxy-tunneling-01).


CLIENT -> SERVER                        SERVER -> CLIENT
--------------------------------------  -----------------------------------
CONNECT home.netscape.com:443 HTTP/1.0
User-agent: Mozilla/4.0
<<< empty line >>>
                                        HTTP/1.0 200 Connection established
                                        Proxy-agent: Netscape-Proxy/1.1
                                        <<< empty line >>>
              <<< data tunneling to both directions begins >>>

From the above, :) we can easily tunnel any protocol over the proxy, such as SSH. Below is a small sketch of speaking CONNECT by hand.
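A hedged Java sketch (the proxy host and port are placeholders):

import java.io.*;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class ConnectTunnel {
    public static void main(String[] args) throws IOException {
        try (Socket proxy = new Socket("proxy.example.com", 8080)) {
            Writer out = new OutputStreamWriter(proxy.getOutputStream(), StandardCharsets.US_ASCII);
            out.write("CONNECT home.netscape.com:443 HTTP/1.0\r\n");
            out.write("User-agent: Mozilla/4.0\r\n\r\n");
            out.flush();

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(proxy.getInputStream(), StandardCharsets.US_ASCII));
            System.out.println(in.readLine());   // expect "HTTP/1.0 200 Connection established"
            // From here the socket is a raw tunnel: wrap it in an SSLSocket for HTTPS,
            // or run any other protocol (such as SSH) over it.
        }
    }
}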

WebSocket

Interestingly, WebSocket (http://www.ietf.org/rfc/rfc6455.txt) also relies on HTTP CONNECT when it needs to pass through a proxy.

URL format

 ws-URI = "ws:" "//" host [ ":" port ] path [ "?" query ]
 wss-URI = "wss:" "//" host [ ":" port ] path [ "?" query ]

Handshake

client request:

        GET /chat HTTP/1.1
        Host: server.example.com
        Upgrade: websocket
        Connection: Upgrade
        Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
        Origin: http://example.com
        Sec-WebSocket-Protocol: chat, superchat
        Sec-WebSocket-Version: 13

server response:

        HTTP/1.1 101 Switching Protocols
        Upgrade: websocket
        Connection: Upgrade
        Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
        Sec-WebSocket-Protocol: chat
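The Sec-WebSocket-Accept value above can be reproduced directly from RFC 6455: the server appends a fixed GUID to the client's Sec-WebSocket-Key, hashes with SHA-1, and base64-encodes the result. A small Java sketch (java.util.Base64 assumes JDK 8+):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

class WsAccept {
    static final String GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11";   // fixed by RFC 6455

    public static void main(String[] args) throws Exception {
        String key = "dGhlIHNhbXBsZSBub25jZQ==";                 // from the request above
        byte[] sha1 = MessageDigest.getInstance("SHA-1")
                .digest((key + GUID).getBytes(StandardCharsets.US_ASCII));
        System.out.println(Base64.getEncoder().encodeToString(sha1));
        // prints s3pPLMBiTxaQ9kYGzzhZRbK+xOo= -- matching the response above
    }
}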

Message Frame

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-------+-+-------------+-------------------------------+
     |F|R|R|R| opcode|M| Payload len |    Extended payload length    |
     |I|S|S|S|  (4)  |A|     (7)     |             (16/64)           |
     |N|V|V|V|       |S|             |   (if payload len==126/127)   |
     | |1|2|3|       |K|             |                               |
     +-+-+-+-+-------+-+-------------+ - - - - - - - - - - - - - - - +
     |     Extended payload length continued, if payload len == 127  |
     + - - - - - - - - - - - - - - - +-------------------------------+
     |                               |Masking-key, if MASK set to 1  |
     +-------------------------------+-------------------------------+
     | Masking-key (continued)       |          Payload Data         |
     +-------------------------------- - - - - - - - - - - - - - - - +
     :                     Payload Data continued ...                :
     + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +
     |                     Payload Data continued ...                |
     +---------------------------------------------------------------+

CometD, Bayeux Server

CometD is a framework that implements the Bayeux protocol for Comet messaging. Refer to http://docs.cometd.org/reference/ for details.

Its 2.4 performance can be found here 

CometD components



Message flow

  1. It invokes BayeuxServer extensions (methods rcv() or rcvMeta()); if one extension denies processing, a reply is sent to the client indicating that the message has been deleted, and no further processing is performed for the message.

  2. It invokes ServerSession extensions (methods rcv() or rcvMeta(), only if a ServerSession for that client exists); if one extension denies processing, a reply is sent to the client indicating that the message has been deleted, and no further processing is performed for the message.

  3. It invokes authorization checks for both the security policy and the authorizers; if the authorization is denied, a reply is sent to the client indicating the failure, and no further processing is performed for the message.

  4. If the message is a service or broadcast message, the message passes through BayeuxServer extensions (methods send() or sendMeta()).

  5. It invokes server channel listeners; the application adds server channel listeners on the server, and offers the last chance to modify the message before it is eventually sent to all subscribers (if it is a broadcast message). All subscribers see any modification a server channel listener makes to the message, just as if the publisher has sent the message already modified. After the server channel listeners processing, the message is frozen and no further modifications should be made to the message. Applications should not worry about this freezing step, because the API clarifies whether the message is modifiable or not: the API has as a parameter a modifiable message interface or an unmodifiable one to represent the message object. This step is the last processing step for an incoming non-broadcast message, and it therefore ends its journey on the server. A reply is sent to publishers to confirm that the message made it to the server (see below), but the message is not broadcast to other server sessions.

  6. If the message is a broadcast message, for each server session that subscribes to the channel, the message passes through ServerSession extensions (methods send() or sendMeta()), then the server session queue listeners are invoked and finally the message is added to the server session queue for delivery.

  7. If the message is a lazy message (see Section 7.4.7, “Lazy Channels and Messages”), it is sent on first occasion. Otherwise the message is delivered immediately. If the server session onto which the message is queued corresponds to a remote client session, it is assigned a thread to deliver the messages in its queue through the server transport. The server transport drains the server session message queue, converts the messages to JSON and sends them on the conduit as the payloads of transport-specific envelopes (for example, an HTTP response or a WebSocket message). Otherwise, the server session onto which the message is queued corresponds to a local session, and the messages in its queue are delivered directly to the local session.

  8. For both broadcast and non-broadcast messages, a reply message is created, passes through BayeuxServer extensions and ServerSession extensions (methods send() or sendMeta()). It then passes to the server transport, which converts it to JSON through a JSONContext.Server instance (see Section 7.5.1, “JSONContext API”), and sends it on the conduit as the payload of a transport-specific envelope (for example, an HTTP response or a WebSocket message).

  9. The envelope travels back to the client, where the client transport receives it. The client transport converts the messages from the JSON format back to message objects, for the Java client via a JSONContext.Client instance (see Section 7.5.1, “JSONContext API”).

  10. Each message then passes through the extensions (methods send() or sendMeta()), and channel listeners and subscribers are notified of the message.
The round trip from client to server back to client is now complete.