Monday, December 30, 2013

Product, Result of Idea Collision

When I am asked about the project completion time, I am also a little curious why the stock alert project, a simple project, took so long. So I counted the recorded task time in my Google Doc. "Study" + "Design" is 70%+, while coding is only about 27%. My initial estimate for the project development was 3 months, 90 days, and the actual coding matches it quite well. The delay mainly comes from "Study".

Admin          0
Coding         75     (27.08%)
Deployment     2
Design         42     (15.16%)
Study          151    (54.51%)
Testing        7
Grand Total    277

Study! Yes, study. I studied so many things in order to address the technical issues. The whole development process, to me, was more like a study process. I can still remember how I dozed off while watching the Google Cloud tutorial video. :) 

The following is what I had to study. I know, the diagram is quite complex, which also surprised me. Initially I just wanted to come up with a simple list, but its complexity passes on the message: this was not an easy journey.


In my idol's words, a product is a collision of ideas. It starts from an unpolished idea; it is just like opening a window, and then you enter another new world, which is full of windows. This belief also implies how important it is to give staff room for "mistakes", because a diamond is just a rock without polishing.




Thursday, December 26, 2013

Wonderful article on JS Delete

http://perfectionkills.com/understanding-delete/
Here’s a short summary of how deletion works in Javascript:
  • Variables and function declarations are properties of either Activation or Global objects.
  • Properties have attributes, one of which — DontDelete — is responsible for whether a property can be deleted.
  • Variable and function declarations in Global and Function code always create properties with DontDelete.
  • Function arguments are also properties of Activation object and are created with DontDelete.
  • Variable and function declarations in Eval code always create properties without DontDelete.
  • New properties are always created with empty attributes (and so without DontDelete).
  • Host objects are allowed to react to deletion however they want.

Tuesday, December 17, 2013

Analysis vs. Design: What’s the Difference?

http://www.omg.org/news/meetings/workshops/presentations/eai_2001/tutorial_monday/tockey_tutorial/2-Analysis_vs_Design.pdf

Requirements 

  • Unambiguous 
    • Interpretable in only one way 
  • Testable 
    • Compliance (or, non-compliance) can be clearly demonstrated
  • Binding 
    • The customer is willing to pay for it and is unwilling to not have it 
Every requirement that is still necessary in spite of “perfect technology” is an essential requirement.



Requirements about speed, cost, and capacity go into the design bucket

Requirements about reliability (MTBF, MTTR) go into the design bucket

Requirements about I/O mechanisms and presentations go into the design bucket

Requirements about computer languages go into the design bucket
Requirements about archiving go into the design bucket

Requirements about the customer's business policy / business process go into the essential bucket

UML for Analysis


UML for Design


Benefits

Reduce apparent complexity: one large problem becomes two smaller ones 
  • Understand the customer’s business policy / business process 
  • Figure out how to automate that business policy / process with the available technology 
Isolate areas of expertise
Apply the principles of coupling and cohesion at the highest level of the software architecture 
  • More robust, less fragile systems 
  • Enable separate evolution of the business policy / business process and the implementation technology

Responsive Web Design - Study Memo

Articles

http://alistapart.com/article/responsive-web-design
http://blog.teamtreehouse.com/beginners-guide-to-responsive-web-design
http://alistapart.com/article/fluidgrids
http://www.javascriptkit.com/dhtmltutors/cssmediaqueries2.shtml

W3C

http://www.w3.org/TR/css3-mediaqueries/

Recommended Media Screen Width

  • 320px
  • 480px
  • 600px
  • 768px
  • 900px
  • 1200px
Use EM, a relative unit, instead of PIXEL for font sizes. 

Use a fluid grid for a responsive website


Tools

css3-mediaqueries-js 
https://code.google.com/p/css3-mediaqueries-js/





Saturday, November 30, 2013

No Wonder Hong Kong Films Are Going Downhill

Today I finished watching Tsui Hark's Detective Dee and the Mystery of the Phantom Flame (狄仁杰之通天帝国), and it was... too awful to watch! The story is far-fetched and the plot develops without any logic. I honestly didn't know what I was watching. They even worked in a casting-couch scene: the Empress orders her to "serve the lord", which apparently just means undressing!!! I nearly fainted... Are there so many unspoken rules in showbiz that a film without one would feel abnormal? Although old Dee has already agreed to help, they still have to throw in a woman as insurance!!! in case he doesn't deliver. And if he really did deliver, the Empress would then need to plant a mistress for an anti-corruption purge!!!!

So Tsui Hark's work has sunk to this level... I mourn my younger days... It seems I am really getting old!

Outside world is wonderful!!!

Today I just finished studying the Require.js module concept and realized that what I have been seeking is already implemented, and that my direction is right: divide the whole website into modules (widgets) and load them only when necessary. This idea emerged 2 years ago from the mobile stock trading system. Although it was my first web/mobile project, my network system development experience made me focus more on the system's data volume & network traffic. Because of the tight schedule, I didn't really fully implement it. Not until now, in my own system, did I decide to implement the dynamic loading feature, whose formal name should be Asynchronous Module Definition. After finishing the JS/CSS/HTML loading based on the jQuery widget object, I happened to re-study require.js. 

Then, shit! That is exactly what I need: asynchronous module loading, i18n... And all of it is nicely implemented and well organised. After reading its history, I feel a little upset to have found this nice tool so late. 

Instead of struggling over why the boss doesn't judge staff based on performance, or how those guys could be so shameless as to take others' credit or shift responsibility, these guys are who I am looking for. I suddenly recall my answer to my boss's exit-interview question, 

Boss: "well, you can't work with peoples"
Me: "Well, how can I work with them?! like xxx incident, even black and white, they can lie!!!"

For sure, I can't work with them. Because,

Outside world is so wonderful!  

Tuesday, November 26, 2013

Finally Connected!!!

The modules of the DTStockAlert system are finally all linked up and able to communicate!... At last I see a ray of light!

Wednesday, November 20, 2013

Another HR Person

After the Monday lunch with the resigning HR manager from my ex-company, I wonder whether it is because HR people are overly "professional" that most companies can't build up the kind of environment taught in the management books. They seem to lack basic sincerity. Maybe it is because of their job; they have to deal with all kinds of people, and you know, that is an area not always under the sun.

Most of the HR people I have dealt with before were not pleasant. On one hand, they talk about being supportive, passion... all the glorious stuff; on the other hand, they don't really believe it. 

I suddenly recall another HR person in my ex-company, who was so "cool" and emotionless when processing your requests, yet would show you the warmest smile when talking about the company culture, the working environment... 

So what happened this time? Shit, it is so fake!!!

I was touched when she insisted on a farewell with me back when I was leaving the company 6 months ago. And I really appreciated what she shared with me, 

  1. Listening to people is the key to teamwork & cooperation, which is lacking in this company. And it is far from as easy as it sounds.
  2. Most HR people work for their beliefs. The company culture must match them; otherwise they will quit immediately. 
  3. She is strong-minded and still believes she can improve the company culture. After all, she joined the company only 1+ year ago.
Wow! A wise and kind person! I liked it. I even regretted not getting to know such people earlier; it was such a big loss. But luckily, at the last minute, I still got the chance to catch the wisdom.

So when I recently learned that she is leaving, I was so surprised!!! My ex-company loses talent again!? I couldn't help inviting her to lunch, hoping she was all right. Then I got all the following surprises. Is this the same person?  

  1. The financial industry isn't her industry. She dislikes it and can't settle in it. In October (4 months after my leaving), she finally made it out.
  2. This company's main problem is that its decision process isn't clear or professional, unlike her previous employers, NTUC child care, NUS...
  3. This company is full of empathy, although that may be an issue, and has very strong teamwork! which you may miss when you really get the chance to see those big-name companies, like Apple, from the inside, since they lack it.
At the beginning I was shocked and couldn't get her point, then slowly realized what a fool I am! HR is just her job. Now I start to believe my initial doubt was right: her sudden farewell invitation was just work for her boss, to confirm whether my leaving was because of that staff-firing incident and to collect comments on the company. 

But I am still inspired by "her wise" words: listening is the key to teamwork and cooperation. Like my favourite saying: "No one can do anything to a person who keeps lying. It is just that, in the end, he/she can't believe anyone anymore". 

Such a pitiful job HR is!

Thursday, October 31, 2013

Wonderful Performance Metrics Tool

As my crawler project draws close to its end, my concern about performance grows heavier. How do I measure my system's performance? There is no easy answer. So many modules, so many parameters, and most importantly, the measurement itself must NOT affect the system performance or increase the module complexity.  

Even for the JSON parsing function alone, I spent almost one working day to come up with the performance measurement, and that is just for the unit testing ;( . The result looks like the following. 


stockprice (36648000 records) elapsed ms:376806.531 for 36000 avg:9.937 variance:5.652 Fastest:9.000 Slowest:244.000
[67, 9 x 16910, 10 x 13029, 11 x 4306, 12 x 868, 73, 13 x 248, 14 x 123, 15 x 49, 17 x 15, 16 x 33, 19 x 9, 18 x 8, 21 x 26, 20 x 11, 23 x 51, 22 x 47, 25 x 43, 24 x 41, 27 x 39, 26 x 38, 29 x 24, 28 x 28, 31 x 3, 30 x 18, 34, 35 x 2, 32 x 2, 33 x 6, 38, 39 x 2, 36 x 6, 37 x 3, 42 x 2, 43, 41, 50, 48, 54, 244]


But what about the other functions... I was almost frightened by the future workload. It seemed my system launch day would need to be postponed. Until today, when I found this wonderful library, Metrics, through the netty examples. It is fantastic and saves me huge time on performance measurement and reporting. 

With just a few lines, the following result is automatically printed to the system console. If you need it, it can easily output the results to CSV, a log file, or JMX, and it even provides a servlet to remotely expose the results as JSON. Wonderful!!! 

With all these tools, I almost have insurance on my system quality.


import java.util.concurrent.TimeUnit;
import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.SlidingWindowReservoir;
import com.codahale.metrics.Timer;

// Metrics 3.x: one registry holds all metrics; the reporter prints them every minute.
final MetricRegistry registry = new MetricRegistry();
final ConsoleReporter reporter = ConsoleReporter.forRegistry(registry)
                                                .convertRatesTo(TimeUnit.SECONDS)
                                                .convertDurationsTo(TimeUnit.MILLISECONDS)
                                                .build();
reporter.start(1, TimeUnit.MINUTES);

// Timer over a sliding window of the last nMax samples; stopwatch2 is a Guava Stopwatch.
Timer timer = new Timer(new SlidingWindowReservoir(nMax));
registry.register(MetricRegistry.name(this.getClass(), "StockPrice Batch Parse", "timer"), timer);
timer.update(stopwatch2.stop().elapsed(TimeUnit.NANOSECONDS), TimeUnit.NANOSECONDS);



-- Timers ----------------------------------------------------------------------
test.JSONParserTest.StockPrice Batch Parse.timer
             count = 34356
         mean rate = 95.49 calls/second
     1-minute rate = 95.70 calls/second
     5-minute rate = 88.47 calls/second
    15-minute rate = 80.06 calls/second
               min = 9.32 milliseconds
               max = 244.26 milliseconds
              mean = 10.46 milliseconds
            stddev = 2.38 milliseconds
            median = 10.07 milliseconds
              75% <= 10.64 milliseconds
              95% <= 11.99 milliseconds
              98% <= 13.57 milliseconds
              99% <= 22.14 milliseconds
            99.9% <= 31.03 milliseconds 





Saturday, October 26, 2013

SCTP vs UDT

While looking for a better messaging protocol for my project DTCrawler, I found these 2 new implementations. After studying them, especially via the book Networks for Grid Applications, I chose UDT for two reasons:

1. It is built upon UDP (which is my preference). 
2. It provides flow/congestion management, which is necessary for the application.  

The difference is summed up like this: 

"UDT borrows the messaging and partial reliability semantics from SCTP. However, SCTP are specially designed for VoIP and telephony, but UDT targets general purpose data transfer. UDT unifies both messaging and streaming semantics in one protocol."

Saturday, October 19, 2013

Java Performance Tuning Study Memo

A wonderful blog: http://java-performance.info/!!! List of articles: http://java-performance.com/

Java type memory usage


byte, boolean        1 byte
short, char          2 bytes
int, float           4 bytes
long, double         8 bytes
Byte, Boolean        16 bytes
Short, Character     16 bytes
Integer, Float       16 bytes
Long, Double         24 bytes

Per-element memory usage of collections:

EnumSet, BitSet      1 bit per value
EnumMap              4 bytes (for the value, nothing for the key)
ArrayList            4 bytes (but may be more if the ArrayList capacity is seriously more than its size)
LinkedList           24 bytes (fixed)
ArrayDeque           4 to 8 bytes, 6 bytes on average

JDK collection      Size (bytes)                Possible Trove substitution   Size (bytes)
HashMap             32 * SIZE + 4 * CAPACITY    THashMap                      8 * CAPACITY
HashSet             32 * SIZE + 4 * CAPACITY    THashSet                      4 * CAPACITY
LinkedHashMap       40 * SIZE + 4 * CAPACITY    none                          -
LinkedHashSet       32 * SIZE + 4 * CAPACITY    TLinkedHashSet                8 * CAPACITY
TreeMap, TreeSet    40 * SIZE                   none                          -
PriorityQueue       4 * CAPACITY                none                          -
All Java objects start with 8 bytes containing service information such as the object's class and its identity hash code (returned by the System.identityHashCode method). Arrays have 4 more bytes (one int field) containing the array length. It looks like all user-written classes (unlike JDK classes) have another reference to the object's Class. These fields are followed by all declared fields. All objects are aligned on an 8-byte boundary, and all primitive fields must be aligned by their size (for example, chars are aligned on a 2-byte boundary). An object reference (including any array reference) occupies 4 bytes. What does this mean for us? In order to make the most of the available memory, all object fields should occupy N*8+4 bytes (4, 12, 20, 28 and so on). In that case 100% of the memory will contain useful data.
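To make the arithmetic concrete, here is a made-up class annotated with the sizing rules above (a sketch; the real layout varies by JVM and by settings such as compressed oops):

class Tick {
    long  time;   // 8 bytes, aligned on an 8-byte boundary
    int   price;  // 4 bytes
    int   volume; // 4 bytes
    float bid;    // 4 bytes
}
// header (8) + class reference (4) + fields (8 + 4 + 4 + 4 = 20) = 32 bytes.
// 32 is already a multiple of 8, so no padding is wasted: the fields occupy
// exactly N*8+4 = 20 bytes, the "ideal" size described above.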

Java Boxing Type Caching

Byte, Short, Long    from -128 to 127
Character            from 0 to 127
Integer              from -128 to java.lang.Integer.IntegerCache.high, or 127, whichever is bigger
Float, Double        no caching
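A quick demonstration of the caching above (a sketch; note that the Integer cache upper bound can be raised with -XX:AutoBoxCacheMax):

Integer a = 127, b = 127;   // inside the cache range: same cached instance
System.out.println(a == b); // true

Integer c = 128, d = 128;   // outside the default cache range: two objects
System.out.println(c == d); // false

Double e = 0.5, f = 0.5;    // Float/Double are never cached
System.out.println(e == f); // false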

Java Performance Tips

Never use exceptions as a return-code replacement or for any likely-to-happen events (especially in non-IO-bound methods!). Throwing an exception is too expensive: you may experience a 100x slowdown for simple methods.

Throwing an exception in Java is a very slow operation. Expect that throwing an exception costs you something between 100 and 1000 CPU ticks in most cases.
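As a sketch of that advice (the names are made up): on a hot path, report a likely failure with a sentinel value instead of throwing, so no stack trace is ever filled in:

// Parses a non-negative decimal int; returns INVALID instead of throwing
// NumberFormatException (overflow handling omitted for brevity).
static final int INVALID = Integer.MIN_VALUE;

static int parseIntOrInvalid(String s) {
    if (s == null || s.isEmpty()) return INVALID;
    int result = 0;
    for (int i = 0; i < s.length(); i++) {
        char ch = s.charAt(i);
        if (ch < '0' || ch > '9') return INVALID; // failure as a return code
        result = result * 10 + (ch - '0');
    }
    return result;
}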

Case Study

Tuesday, October 1, 2013

MYSQL Tips

1. Enable Log

SET GLOBAL log_output = 'TABLE';
SET GLOBAL general_log = 'ON';
SELECT * FROM mysql.general_log ORDER BY event_time DESC LIMIT 100;

Sunday, September 22, 2013

Memory Hierarchy

Got this from http://dank.qemfd.net/dankwiki/images/d/dc/Memoryhierarchy.png


Thursday, September 12, 2013

MySQL INSERT ON DUPLICATE UPDATE IS FASTER THAN UPDATE!!!

It is very strange, but the test results show it is.

MySQL

innodb_version                  5.6.13
protocol_version                10
version                               5.6.13-enterprise-commercial-advanced
version_compile_machine x86_64
version_compile_os           osx10.7

Result

SELECT udf_CreateCounterID(0,CURRENT_DATE);

SELECT @update,@updateend,@updatediff,@insertupdate,@insertupdate_end,@insertupdatediff,@keyval,@countlmt;

@update=2013-09-12 17:32:27
@updateend=2013-09-12 17:33:01
@updatediff=34

@insertupdate=2013-09-12 17:32:00
@insertupdate_end=2013-09-12 17:32:27
@insertupdatediff=27

@keyval=13
@countlmt=1000000

Table

CREATE TABLE `sys_CounterID` (
  `exch_year` int(11) NOT NULL,
  `nextID` int(11) NOT NULL,
  PRIMARY KEY (`exch_year`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


Test Function

CREATE DEFINER=`root`@`localhost` FUNCTION `udf_CreateCounterID`(exchID SMALLINT, listyear DATE) RETURNS int(10) unsigned
BEGIN
 /**
 counter ID is 32 bits:
 highest 9 bits: exchange ID (as of 2013 there are 317 operator MICs; for any > 511, take it modulo 512)
 middle 7 bits: 2-digit year (max: 99)
 remaining bits: counter number
 */
 DECLARE keyvalue INT UNSIGNED DEFAULT 0;
 
 SET @countlmt = 1000000;
 SET keyvalue = ((exchID % 512) << 9 ) + EXTRACT(YEAR FROM listyear) % 100;

 SET @keyval = keyvalue;
 SET @retVal =  0;

 SET @count = @countlmt;
 SET @insertupdate = SYSDATE();

 WHILE @count > 0 DO

  INSERT INTO `sys_CounterID`(`exch_year`,nextID)
  VALUE( keyvalue, 1)
  ON DUPLICATE KEY UPDATE 
   nextID = (@retVal := nextID + 1);

  SET @count = @count - 1;

 END WHILE;

 SET @insertupdate_end = SYSDATE();
 SET @insertupdatediff = TIMESTAMPDIFF(SECOND, @insertupdate,@insertupdate_end);

 
 SET @count = @countlmt;
 SET @update = SYSDATE();
 
 WHILE @count > 0 DO

  UPDATE sys_CounterID 
  SET nextID = (@retVal := nextID + 1)
  WHERE exch_year = keyvalue;

  SET @count = @count - 1;

 END WHILE;

 SET @updateend = SYSDATE();
 SET @updatediff = TIMESTAMPDIFF(SECOND, @update,@updateend);


 RETURN @retVal;

END


Monday, August 26, 2013

Truth Does Not Become Clear Without Debate

Over the past two days I followed the live text coverage of Bo Xilai's trial to the end. I have to admire his skill. I have never liked him much, and after he launched the "red songs" campaign I especially disliked his showmanship, but this trial made me see him in a new light. After reading his testimony and the witnesses', I felt an inexplicable sadness. What kind of mental state must these people be in every day! In the minute before falling asleep, after all the wining and dining, do they feel a kind of desolation? The office is full of scheming, and home is full of traps too. Is a life like that any fun? 

I believe he is a very capable man. Was pulling him down like this, together with all his highly controversial "sing red, strike black" campaigns, merely his personal fault? In my view, his ending was already sealed the moment he entered a system of absolute power where only those above matter. If grandstanding from the bottom up could change the minds at the top, it would not be today's China.

In an officialdom where corruption is so rampant, if it really was just 5 million in public funds and one villa, then for someone of his rank he was actually clean enough!!!



Friday, August 16, 2013

High Speed Concurrent Framework, Disruptor

I have gone through many articles about the Disruptor, devised by the LMAX team. It caught my eye because of this: http://martinfowler.com/articles/lmax.html 

LMAX is a new retail financial trading platform. As a result it has to process many trades with low latency. The system is built on the JVM platform and centers on a Business Logic Processor that can handle 6 million orders per second on a single thread. The Business Logic Processor runs entirely in-memory using event sourcing. The Business Logic Processor is surrounded by Disruptors - a concurrency component that implements a network of queues that operate without needing locks. During the design process the team concluded that recent directions in high-performance concurrency models using queues are fundamentally at odds with modern CPU design.

Impressive, isn't it? :) But the above is a little bit misleading: it is not a single-thread system. To me, the key to the Disruptor's achievements is that they brilliantly avoid the most common multi-thread traps. Today, instead of the old model of CPU, registers & memory, a high-performance program needs to deal with the CPU, registers & cache. Memory, to today's CPU, is just like a hard disk in the old days. Another point is that today's CPUs have multiple cores, and out-of-order execution will affect your program too. This great work again proves how important the fundamentals are, such as data structures, thread management in the OS, and a deep understanding of the hardware.



Besides the LMAX articles, you can get insights directly from the Intel® 64 and IA-32 Architectures Optimization Reference Manual. The L1 cache has 2 types, data & instruction; the data cache is only 32 KB, and the cache line size is 64 bytes. The wiki provides a very comprehensive explanation of the cache line. Another article from Microsoft on driver development clearly states the common issues in multiple-processor architectures.

Sharing Is the Root of All Contention

from Herb Sutter's Dr. Dobb's blog


Cache Line & Atomic Operations


LOCKED ATOMIC OPERATIONS

The 32-bit IA-32 processors support locked atomic operations on locations in system memory. These operations are typically used to manage shared data structures (such as semaphores, segment descriptors, system segments, or page tables) in which two or more processors may try simultaneously to modify the same field or flag. The processor uses three interdependent mechanisms for carrying out locked atomic operations:
  • Guaranteed atomic operations
  • Bus locking, using the LOCK# signal and the LOCK instruction prefix
  • Cache coherency protocols that insure that atomic operations can be carried out on cached data structures (cache lock); this mechanism is present in the Pentium 4, Intel Xeon, and P6 family processors
These mechanisms are interdependent in the following ways. Certain basic memory transactions (such as reading or writing a byte in system memory) are always guaranteed to be handled atomically. That is, once started, the processor guarantees that the operation will be completed before another processor or bus agent is allowed access to the memory location. The processor also supports bus locking for performing selected memory operations (such as a read-modify-write operation in a shared area of memory) that typically need to be handled atomically, but are not automatically handled this way. Because frequently used memory locations are often cached in a processor’s L1 or L2 caches, atomic operations can often be carried out inside a processor’s caches without asserting the bus lock. Here the processor’s cache coherency protocols insure that other processors that are caching the same memory locations are managed properly while atomic operations are performed on cached memory locations.
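In Java, these hardware facilities surface through java.util.concurrent.atomic; a minimal sketch (on x86 the JIT typically compiles this increment down to a single LOCK-prefixed instruction rather than a mutex):

import java.util.concurrent.atomic.AtomicLong;

public class HitCounter {
    private final AtomicLong hits = new AtomicLong();

    long record() {
        return hits.incrementAndGet(); // atomic read-modify-write, no lock object
    }
}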

Memory Barrier Semantics

  • Acquire semantics mean that the results of the operation are visible before the results of any operation that appears after it in code. 
  • Release semantics mean that the results of the operation are visible after the results of any operation that appears before it in code.
  • Fence semantics combine acquire and release semantics. The results of an operation with fence semantics are visible before those of any operation that appears after it in code and after those of any operation that appears before it.

Caching Issues 

The hardware always reads an entire cache line, rather than individual data items. If you think of the cache as an array, a cache line is simply a row in that array: a consecutive block of memory that is read and cached in a single operation. The size of a cache line is generally from 16 to 128 bytes, depending on the hardware; 

Each cache line has one of the following states:
  • Exclusive, meaning that this data does not appear in any other processor’s cache. When a cache line enters the Exclusive state, the data is purged from any other processor’s cache.
  • Shared, meaning that another cache line has requested the same data.
  • Invalid, meaning that another processor has changed the data in the line.
  • Modified, meaning that the current processor has changed the data in this line.
All architectures on which Windows runs guarantee that every processor in a multiprocessor configuration will return the same value for any given memory location. This guarantee, which is called cache coherency between processors, ensures that whenever data in one processor’s cache changes, all other caches that contain the same data will be updated. On a single-processor system, whenever the required memory location is not in the cache, the hardware must reload it from memory. On a multiprocessor system, if the data is not in the current processor’s cache, the hardware can read it from main memory or request it from other processors’ caches. If the processor then writes a new value to that location, all other processors must update their caches to get the latest data.
Some data structures have a high locality of reference. This means that the structure often appears in a sequence of instructions that reference adjacent fields. If a structure has a high locality of reference and is protected by a lock, it should typically be in its own cache line.
For example, consider a large data structure that is protected by a lock and that contains both a pointer to a data item and a flag indicating the status of that data item. If the structure is laid out so that both fields are in the same cache line, any time the driver updates one variable, the other variable is already present in the cache and can be updated immediately.

In contrast, consider another scenario. What happens if two data structures in the same cache line are protected by two different locks and are accessed simultaneously from two different processors? Processor 0 updates the first structure, causing the cache line in Processor 0 to be marked Exclusive and the data in that line to be purged from other processors’ caches. Processor 1 must request the data from Processor 0 and wait until its own cache is updated before it can update the second structure. If Processor 0 again tries to write the first structure, it must request the data from Processor 1, wait until the cache is updated, and so on. However, if the structures are not on the same cache line, neither processor must wait for these cache updates. Therefore, two data structures that can be accessed simultaneously on two different processors (because they are not protected by the same lock) should be on different cache lines.

Mechanical Sympathy

The first thing is to understand the CPU, your rice bowl: Mechanical Sympathy.

The following is from http://www.infoq.com/presentations/LMAX

Don't use locks 

This will trap your program in "Amdahl's law" (the bad side, of course). A lock causes execution context switching, ring3 -> ring0 -> ring3... Refer to Trisha's blog for more detail. But how do we avoid locks in a multi-threaded environment? The idea is "don't share the data": the shared resource is the reason locks exist. That means you need to consider data segregation for a high-performance system. Their whole ring buffer design is built around this point. 

Don't copy data around for inter-thread communication

Their ring buffer is just like an infinite array, and the index is just like a reference pointer. So objects aren't copied around and don't involve dynamic memory management. That is the most common stupid habit of OO programmers today: always new object(). I remember trying a similar idea before in my C++ project. 
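A minimal sketch of that idea (not the real Disruptor API, and with all concurrency control omitted): slots are allocated once up front, and an ever-growing sequence number is mapped onto the fixed array, so no objects are created or copied per message:

// Mutable slot that is reused instead of re-allocated per event.
final class ValueEvent { long value; }

final class SimpleRingBuffer {
    private final ValueEvent[] slots;
    private final int mask;        // capacity must be a power of two
    private long sequence = -1;    // ever-increasing "infinite array" index

    SimpleRingBuffer(int capacityPowerOfTwo) {
        slots = new ValueEvent[capacityPowerOfTwo];
        for (int i = 0; i < slots.length; i++) slots[i] = new ValueEvent();
        mask = capacityPowerOfTwo - 1;
    }

    ValueEvent claimNext() {       // producer fills the returned slot in place
        return slots[(int) (++sequence & mask)];
    }

    ValueEvent get(long seq) {     // consumer reads the slot by sequence
        return slots[(int) (seq & mask)];
    }
}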


Some valuable points from Trisha's blog http://mechanitis.blogspot.sg/2011/07/dissecting-disruptor-why-its-so-fast_22.html

Martin and Mike's QCon presentation gives some indicative figures for the cost of cache misses:

Latency from CPU to...                      Approx. CPU cycles   Approx. time
Main memory                                 -                    ~60-80 ns
QPI transit (between sockets, not drawn)    -                    ~20 ns
L3 cache                                    ~40-45 cycles        ~15 ns
L2 cache                                    ~10 cycles           ~3 ns
L1 cache                                    ~3-4 cycles          ~1 ns
Register                                    1 cycle              -


Cache Line


Volatile = Memory Barrier




This means if you write to a volatile field, you know that:
  • Any thread accessing that field after the point at which you wrote to it will get the updated value. 
  • Anything you did before you wrote that field is guaranteed to have happened, and any updated data values will also be visible, because the memory barrier flushed all earlier writes to the cache.
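A minimal sketch of that guarantee (the class and field names are made up):

class Publisher {
    private int payload;            // plain field
    private volatile boolean ready; // the volatile write/read acts as the barrier

    void writer() {                 // writer thread
        payload = 42;               // happens-before the volatile write below
        ready = true;               // volatile write flushes earlier writes
    }

    void reader() {                 // reader thread
        if (ready) {                     // volatile read
            System.out.println(payload); // guaranteed to see 42
        }
    }
}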

False Sharing 

from Herb Sutter's Dr. Dobb's blog
The general case to watch out for is when you have two objects or fields that are frequently accessed (either read or written) by different threads, at least one of the threads is doing writes, and the objects are so close in memory that they're on the same cache line because they are:


  • objects nearby in the same array
  • fields nearby in the same object
  • objects allocated close together in time (C++, Java) or by the same thread (C#, Java)
  • static or global objects that the linker decided to lay out close together in memory;
  • objects that become close in memory dynamically, as when during compacting garbage collection two objects can become adjacent in memory because intervening objects became garbage and were collected; or
  • objects that for some other reason accidentally end up close together in memory.

First, we can reduce the number of writes to the cache line. For example, writer threads can write intermediate results to a scratch variable most of the time, then update the variable in the popular cache line only occasionally as needed. This is the approach we took in Example 2, where we changed the code to update a local variable frequently and write into the popular result array only once per worker to store its final count.

Second, we can separate the variables so that they aren't on the same cache line. Typically the easiest way to do this is to ensure an object has a cache line to itself that it doesn't share with any other data. To achieve that, you need to do two things (a sketch follows this list):
  • Ensure that no other object can precede your data in the same cache line by aligning it to begin at the start of the cache line or adding sufficient padding bytes before the object.
  • Ensure that no other object can follow your data in the same cache line by adding sufficient padding bytes after the object to fill up the line.
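A sketch of that second remedy in Java, assuming 64-byte cache lines (note that the JVM may reorder or strip unused fields, which is why later JVMs added @sun.misc.Contended to request such isolation reliably):

class PaddedCounter {
    volatile long value;             // the hot, frequently written field
    long p1, p2, p3, p4, p5, p6, p7; // 56 bytes of padding so that no
                                     // neighbour shares value's cache line
}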

References

Java Memory Model

Simple words always inspire people

Brian Goetz, http://mail.openjdk.java.net/pipermail/lambda-dev/2013-March/008435.html

When evaluating a language feature, you need to examine both the cost and the benefit side of the proposal.

Benefit: how would having this feature enable me to write code that is better than what I can write today.

Cost: how would having this feature enable other people to write WORSE code than they might write today.

I like them! When considering things or making decisions, we should always remind ourselves of the "Cost". The "Benefit" is just the allure that leads us into mistakes.

Wednesday, August 14, 2013

How HTTPS works, HTTP Tunneling & WebSocket

HTTPS

Finally I understand how it works. HTTPS is just HTTP on top of SSL/TLS; HTTPS isn't a separate protocol at all. All web proxies are just HTTP proxies. Their working flow is as follows: 

Request message 
Client -> Proxy -> Server

Response message
Client <- Proxy <- Server

Because HTTP is just clear-text messages, the proxy is able to cache the data if the request is the same. This is clearly defined in the HTTP protocol. 

The interesting part is HTTPS. I mistakenly believed it is similar to HTTP, but in fact it is completely different. With HTTPS, the HTTP message is packaged as an SSL message. It can't be proxied/cached at all. It relies on HTTP tunneling (http://en.wikipedia.org/wiki/HTTP_tunnel, http://tools.ietf.org/html/draft-luotonen-web-proxy-tunneling-01).


CLIENT -> SERVER                        SERVER -> CLIENT
--------------------------------------  -----------------------------------
CONNECT home.netscape.com:443 HTTP/1.0
User-agent: Mozilla/4.0
<<< empty line >>>
                                        HTTP/1.0 200 Connection established
                                        Proxy-agent: Netscape-Proxy/1.1
                                        <<< empty line >>>
              <<< data tunneling to both directions begins >>>

From the above, :) we can easily tunnel any protocol over a proxy, such as SSH.
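For example, a minimal sketch in Java (host names and ports are placeholders) that performs the CONNECT handshake above and hands back a raw tunnel socket:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ProxyTunnel {

    public static Socket open(String proxyHost, int proxyPort,
                              String targetHost, int targetPort) throws IOException {
        Socket socket = new Socket(proxyHost, proxyPort);
        OutputStream out = socket.getOutputStream();
        out.write(("CONNECT " + targetHost + ":" + targetPort + " HTTP/1.0\r\n"
                + "User-agent: Mozilla/4.0\r\n\r\n").getBytes(StandardCharsets.US_ASCII));
        out.flush();

        InputStream in = socket.getInputStream();
        String status = readLine(in); // e.g. "HTTP/1.0 200 Connection established"
        if (!status.contains(" 200")) {
            socket.close();
            throw new IOException("Proxy refused CONNECT: " + status);
        }
        while (!readLine(in).isEmpty()) { } // skip headers up to the empty line
        return socket;                      // data tunneling begins here
    }

    // Reads one CRLF-terminated line without buffering past it, so no
    // tunneled bytes are accidentally consumed.
    private static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            if (b != '\r') sb.append((char) b);
        }
        return sb.toString();
    }
}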

WebSocket

The interesting part is that WebSocket (http://www.ietf.org/rfc/rfc6455.txt) also relies on HTTP CONNECT when it needs to pass through a proxy.

URL format

 ws-URI = "ws:" "//" host [ ":" port ] path [ "?" query ]
 wss-URI = "wss:" "//" host [ ":" port ] path [ "?" query ]

Handshake

client request:

        GET /chat HTTP/1.1
        Host: server.example.com
        Upgrade: websocket
        Connection: Upgrade
        Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
        Origin: http://example.com
        Sec-WebSocket-Protocol: chat, superchat
        Sec-WebSocket-Version: 13

server response

        HTTP/1.1 101 Switching Protocols
        Upgrade: websocket
        Connection: Upgrade
        Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
        Sec-WebSocket-Protocol: chat
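The Sec-WebSocket-Accept value above is not random: RFC 6455 defines it as the Base64 of the SHA-1 of the client's key concatenated with a fixed GUID. A small sketch (requires Java 8 for java.util.Base64) that reproduces the handshake above:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

public class WebSocketAccept {
    // Fixed GUID from RFC 6455, section 1.3.
    private static final String GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11";

    public static String accept(String secWebSocketKey) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(
                (secWebSocketKey + GUID).getBytes(StandardCharsets.US_ASCII));
        return Base64.getEncoder().encodeToString(digest);
    }

    public static void main(String[] args) throws Exception {
        // Prints s3pPLMBiTxaQ9kYGzzhZRbK+xOo= - matching the response above.
        System.out.println(accept("dGhlIHNhbXBsZSBub25jZQ=="));
    }
}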

Message Frame

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-------+-+-------------+-------------------------------+
     |F|R|R|R| opcode|M| Payload len |    Extended payload length    |
     |I|S|S|S|  (4)  |A|     (7)     |             (16/64)           |
     |N|V|V|V|       |S|             |   (if payload len==126/127)   |
     | |1|2|3|       |K|             |                               |
     +-+-+-+-+-------+-+-------------+ - - - - - - - - - - - - - - - +
     |     Extended payload length continued, if payload len == 127  |
     + - - - - - - - - - - - - - - - +-------------------------------+
     |                               |Masking-key, if MASK set to 1  |
     +-------------------------------+-------------------------------+
     | Masking-key (continued)       |          Payload Data         |
     +-------------------------------- - - - - - - - - - - - - - - - +
     :                     Payload Data continued ...                :
     + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +
     |                     Payload Data continued ...                |
     +---------------------------------------------------------------+

CometD, Bayeux Server

CometD is a framework that implements the Bayeux protocol for Comet messaging. Refer to http://docs.cometd.org/reference/ for the details.

Its 2.4 performance can be found here 

CometD components



Message flow

  1. It invokes BayeuxServer extensions (methods rcv() or rcvMeta()); if one extension denies processing, a reply is sent to the client indicating that the message has been deleted, and no further processing is performed for the message.

  2. It invokes ServerSession extensions (methods rcv() or rcvMeta(), only if a ServerSession for that client exists); if one extension denies processing, a reply is sent to the client indicating that the message has been deleted, and no further processing is performed for the message.

  3. It invokes authorization checks for both the security policy and the authorizers; if the authorization is denied, a reply is sent to the client indicating the failure, and no further processing is performed for the message.

  4. If the message is a service or broadcast message, the message passes through BayeuxServer extensions (methods send() or sendMeta()).

  5. It invokes server channel listeners; the application adds server channel listeners on the server, and offers the last chance to modify the message before it is eventually sent to all subscribers (if it is a broadcast message). All subscribers see any modification a server channel listener makes to the message, just as if the publisher has sent the message already modified. After the server channel listeners processing, the message is frozen and no further modifications should be made to the message. Applications should not worry about this freezing step, because the API clarifies whether the message is modifiable or not: the API has as a parameter a modifiable message interface or an unmodifiable one to represent the message object. This step is the last processing step for an incoming non-broadcast message, and it therefore ends its journey on the server. A reply is sent to publishers to confirm that the message made it to the server (see below), but the message is not broadcast to other server sessions.

  6. If the message is a broadcast message, for each server session that subscribes to the channel, the message passes through ServerSession extensions (methods send() or sendMeta()), then the server session queue listeners are invoked and finally the message is added to the server session queue for delivery.

  7. If the message is a lazy message (see Section 7.4.7, “Lazy Channels and Messages”), it is sent on first occasion. Otherwise the message is delivered immediately. If the server session onto which the message is queued corresponds to a remote client session, it is assigned a thread to deliver the messages in its queue through the server transport. The server transport drains the server session message queue, converts the messages to JSON and sends them on the conduit as the payloads of transport-specific envelopes (for example, an HTTP response or a WebSocket message). Otherwise, the server session onto which the message is queued corresponds to a local session, and the messages in its queue are delivered directly to the local session.

  8. For both broadcast and non-broadcast messages, a reply message is created, passes through BayeuxServer extensions and ServerSession extensions (methods send() or sendMeta()). It then passes to the server transport, which converts it to JSON through a JSONContext.Server instance (see Section 7.5.1, “JSONContext API”), and sends it on the conduit as the payload of a transport-specific envelope (for example, an HTTP response or a WebSocket message).

  9. The envelope travels back to the client, where the client transport receives it. The client transport converts the messages from the JSON format back to message objects, for the Java client via a JSONContext.Client instance (see Section 7.5.1, “JSONContext API”).

  10. Each message then passes through the extensions (methods send() or sendMeta()), and channel listeners and subscribers are notified of the message.
The round trip from client to server back to client is now complete.

Tuesday, August 13, 2013

AsyncHTTP, Comet

HTTP is a one-way, stateless protocol. In order to get real-time updates, we have to use polling. The article from Jetty lists the costs of polling; they are huge. But luckily, we have Comet and the Servlet 3.0 async API.

The article from IBM has a very comprehensive introduction to the various Comet solutions, such as polling, long-polling and streaming.

http://www.ibm.com/developerworks/web/library/wa-cometjava/

AJAX polling problem

Refer to the original for the details: http://docs.codehaus.org/display/JETTY/Continuations
But there is a new problem. The advent of AJAX as a web application model is significantly changing the traffic profile seen on the server side. Because AJAX servers cannot deliver asynchronous events to the client, the AJAX client must poll for events on the server. To avoid a busy polling loop, AJAX servers will often hold onto a poll request until either there is an event or a timeout occurs. Thus an idle AJAX application will have an outstanding request waiting on the server which can be used to send a response to the client the instant an asynchronous event occurs. This is a great technique, but it breaks the thread-per-request model, because now every client will have a request outstanding in the server. Thus the server again needs to have one or more threads for every client and again there are problems scaling to thousands of simultaneous users.
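The Servlet 3.0 fix looks roughly like this (a sketch; the URL and timeout are made up, and the event wiring is only hinted at): the poll request is suspended without pinning a thread until an event or the timeout arrives:

import javax.servlet.AsyncContext;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet(urlPatterns = "/events", asyncSupported = true)
public class LongPollServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
        AsyncContext ctx = req.startAsync(); // detach from the worker thread
        ctx.setTimeout(30000);               // the "poll duration" D below
        // Elsewhere, when the application event occurs:
        //   ctx.getResponse().getWriter().write(eventJson);
        //   ctx.complete();                 // resume and finish the response
    }
}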

                          Formula            Web 1.0    Web 2.0 + Comet    Web 2.0 + Comet + Continuations
Users                     u                  10000      10000              10000
Requests/Burst            b                  5          2                  2
Burst period (s)          p                  20         5                  5
Request Duration (s)      d                  0.200      0.150              0.175
Poll Duration (s)         D                  0          10                 10
Request rate (req/s)      rr = u*b/p         2500       4000               4000
Poll rate (req/s)         pr = u/D           0          1000               1000
Total (req/s)             r = rr + pr        2500       5000               5000
Concurrent requests       c = rr*d + pr*D    500        10600              10700
Min Threads               T = c (T = r*d with continuations)   500   10600   875
Stack memory              S = 64*1024*T      32MB       694MB              57MB

HTTP, WebSocket, SPDY, HTTP/2.0 Evolution of Web Protocols

A very comprehensive doc on the technical evolution of HTTP.