What is unique about Spark?

Recently, I’ve been learning a bit about how Spark works internally, which led me to Matei Zaharia’s PhD dissertation on Resilient Distributed Datasets (RDDs): https://www2.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf

This feels like a key insight:

“The insight behind RDDs is that although the workloads that MapReduce was unsuited for (e.g., iterative, interactive and streaming queries) seem at first very different, they all require a common feature that MapReduce lacks: efficient data sharing across parallel computation stages.”

RDDs keep data in RAM between computation steps, in contrast to MapReduce, which writes intermediate data to disk between stages.

For an iterative algorithm made of hundreds of computation steps, this can save a lot of I/O. Similarly, interactive workloads involve many reads (lots of exploratory queries over the same data), while streaming workloads involve many writes (constantly arriving data), so keeping data in memory is very helpful there too.
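To make the iterative case concrete, here is a minimal PySpark sketch of that pattern (assuming pyspark is installed and run locally; the data and the one-weight logistic-regression update are synthetic stand-ins, not code from the dissertation):

```python
import math
import random

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-cache-sketch")

# Synthetic labelled points: (feature, label), with label in {-1.0, 1.0}.
data = [(random.uniform(-1.0, 1.0), random.choice([-1.0, 1.0]))
        for _ in range(10_000)]

# cache() tells Spark to keep this RDD in executor memory once it is first
# materialised, so every later pass reads from RAM instead of recomputing
# it (or, in MapReduce's model, re-reading intermediate data from disk).
points = sc.parallelize(data).cache()

w = 0.0  # model weight
for _ in range(20):  # each iteration is one full pass over the cached data
    gradient = points.map(
        lambda p: p[1] * p[0] * (1.0 / (1.0 + math.exp(-p[1] * w * p[0])) - 1.0)
    ).reduce(lambda a, b: a + b)
    w -= 0.001 * gradient

print("trained weight:", w)
sc.stop()
```

Without the cache() call the loop would still work, because Spark can always recompute an RDD from its lineage, but every iteration would rebuild the dataset from scratch; with it, only the first pass pays the materialisation cost.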

Before Spark, there were specialised alternatives to MapReduce for iterative, interactive and streaming workloads: HaLoop, Apache Impala and Apache Storm respectively. A lot of Spark's success comes from the fact that it is a single solution to all of these cases.
