Concurrent Patterns and Best Practices
上QQ阅读APP看书,第一时间看更新

The MapReduce pattern

The MapReduce pattern is an example of a common case where concurrency is needed. The following diagram shows a word frequency counter; given a huge text stream of trillions of words, we need to see how many times every word occurs in the text. The algorithm is super simple: we keep the running count of each word in a hash table, with the word as the key and the counter as the value. The hash table allows us to quickly look up the next word and increment the associated value (counter). 

Given the size of the input text, one single node does not have the memory for the entire hash table! Concurrency provides a solution, using the MapReduce pattern, as shown in the following diagram:

The solution is divide and conquer: we maintain a distributed hash table and run the same algorithm, suitably adapted for a cluster.

The master node reads—parses—the text and pushes it to a set of slave processing nodes. The idea is to distribute the text in such a way that one word is processed by one slave node only. For example, given three processing nodes, as shown in the preceding diagram, we would divide rangewise: push nodes starting with the characters {a..j} to node 1, {k..r} to node 2 and the rest—{s..z}—onto node 3. This is the map part (distribution of work).

Once the stream is exhausted, each slave node sends its frequency result back to the master, which prints the result. 

The slave nodes are all doing the same processing concurrently. Note that the algorithm would run faster if we add, more slave nodes; that is, if we scale it horizontally.