The similarities between Unix and MapReduce

In ‘Designing Data-Intensive Applications’, Martin Kleppmann explains MapReduce by showing how similar it is to Unix’s approach to data processing. It’s an effective way to introduce MapReduce, since so many developers are already familiar with Unix.

A key similarity is that Unix and MapReduce share a philosophy of simple, composable functions. For example, the two take a similar approach to the task of counting the occurrences of each word in a document.

This is how Unix does it:

```shell
cat input.txt | tr -cs '[:alpha:]' '[\n*]' | sort | uniq -c | sort -nr
```

And these are the functions used by MapReduce to achieve the same thing:

```python
def map_function(document):
    for word in document.split():
        yield (word.lower(), 1)

def reduce_function(word, counts):
    yield (word, sum(counts))
```
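The book’s example shows only the two functions, not the framework that runs them. As a minimal in-memory sketch (the `run_word_count` driver and its grouping dict are assumptions for illustration, not MapReduce’s actual API), the map, group-by-key, and reduce steps might be wired together like this:

```python
from collections import defaultdict

def map_function(document):
    for word in document.split():
        yield (word.lower(), 1)

def reduce_function(word, counts):
    yield (word, sum(counts))

def run_word_count(documents):
    # Map stage: emit (word, 1) pairs from every document,
    # grouping the values by key as they arrive (the "shuffle").
    groups = defaultdict(list)
    for doc in documents:
        for word, count in map_function(doc):
            groups[word].append(count)
    # Reduce stage: collapse each key's list of counts into a total.
    results = {}
    for word, counts in groups.items():
        for key, total in reduce_function(word, counts):
            results[key] = total
    return results
```

Calling `run_word_count(["look at this look"])` returns `{"look": 2, "at": 1, "this": 1}` — the same result the Unix pipeline prints, just held in a dict instead of on stdout.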

In Unix, you would first use tr -cs '[:alpha:]' '[\n*]' to split the input into one word per line, squeezing each run of non-alphabetic characters into a single newline. For an input document that looks like this:

```
look at this sentence
```

The output would be:

```
look
at
this
sentence
```
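The same splitting step can be mimicked in Python (a rough, ASCII-only sketch of what tr is doing here; `split_words` is a name chosen for illustration):

```python
import re

def split_words(text):
    # Mimic `tr -cs '[:alpha:]' '[\n*]'`: treat every run of
    # non-alphabetic characters as a single word boundary,
    # and drop any empty strings left at the edges.
    return [w for w in re.split(r"[^A-Za-z]+", text) if w]

print("\n".join(split_words("look at this sentence")))
```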

This achieves something similar to the map stage of MapReduce, which turns an input text document into a long list of key-value pairs that look like this: {example_word:1} (where 1 is the count).

After the map stage, these pairs are grouped together by key. Something similar is achieved in Unix by the sort command, which puts the lines in alphabetical order so that identical words end up adjacent (although the Unix lines carry no count and simply look like example_word, so the analogy is not exact!).

In Unix, the next stage is to collapse identical adjacent lines and count their occurrences, which is what uniq -c does. This is akin to the reduce stage of MapReduce, which combines all the pairs sharing a key into a single pair whose values have been summed. For example, if your data looked like this after the map stage:

```
{example_word:1}
{example_word:1}
```

It would look like this after the reduce stage:

```
{example_word:2}
```
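The sort-then-collapse idea behind sort | uniq -c translates directly to Python. A small sketch (the `pairs` data is hypothetical, mirroring the {example_word:1} lines above):

```python
from itertools import groupby

# Mapped key-value pairs, as produced by the map stage.
pairs = [("example_word", 1), ("example_word", 1), ("other", 1)]

# Sort by key so identical keys become adjacent (Unix `sort`),
# then collapse each run of adjacent keys and sum its counts
# (Unix `uniq -c`, or MapReduce's reduce).
pairs.sort(key=lambda kv: kv[0])
reduced = {word: sum(c for _, c in group)
           for word, group in groupby(pairs, key=lambda kv: kv[0])}
# reduced == {"example_word": 2, "other": 1}
```

Note that, just like uniq, groupby only merges adjacent runs, which is why the sort must come first.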

Kleppmann makes the point that it is remarkable how well these Unix commands still perform, given how long ago they were written. However, the key difference between Unix and MapReduce (and the reason we need MapReduce at all) is that Unix tools are built to run on a single computer, whereas MapReduce is built to run across many.