“Spark … is what you might call a Swiss Army knife of Big Data analytics tools”
– Reynold Xin (@rxin), Berkeley AmpLab Shark Development Lead

The above quote – from the Wired article “Spark: Open Source Superstar Rewrites Future of Big Data” – captures why I am a fan of Spark.  If you are an avid hiker or outdoors-person, you already appreciate the flexibility of a Swiss Army Knife (or Leatherman).  It is the perfect compact tool for a variety of simple but necessary tasks – bordering on life-saving (below is a picture from my ascent to Mount Rainier’s Camp Schurman base camp where the latter is far truer than I care to admit).

(Photo from Sunny Sunday: Mount Rainier)

Without getting deep into the debate over what a Data Scientist is (as a former pseudo-Biostatistician, my response would be that it is a Statistician who actually understands how to make sense of data, whether by coding, pivoting, or visualizing), Spark is one of those handy tools that helps you solve a lot of difficult data problems.  It may not necessarily be the best tool for everything, but it is a very good tool for many things.  Some examples of Spark functionality include:

- Flexibility of Map Reduce but with multiple APIs including Scala, Java, and Python

- Fast, because calculations are done in memory (via RDDs)

- Shark, an Apache Hive-compatible data warehousing system built on Spark

- Process Live Data streams with Spark Streaming

- Associated project Tachyon: reliable in-memory cross-cluster file system

- Associated project Apache Mesos: Dynamic cross-cluster resource sharing

- Graph Analytics on Spark using GraphX

For more information, please reference http://spark-project.org.
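Part of what makes the Scala API so approachable is that RDD transformations read just like operations on ordinary Scala collections. Below is a rough local sketch of that programming model – no Spark cluster required, and the List and its values are hypothetical stand-ins for an RDD – showing the same chained style on an in-memory list:

```scala
// A local sketch of the Spark programming model: RDD transformations such as
// filter, map, and reduce deliberately mirror Scala's collection API, so the
// same chained style works on a plain List. (In a real Spark shell you would
// build the data with sc.textFile or sc.parallelize instead.)
object SparkStyleSketch {
  // Hypothetical stand-in for lines of a log file in an RDD
  val lines = List("10 error", "20 ok", "30 error")

  // Equivalent in spirit to: rdd.filter(...).map(...).reduce(_ + _)
  def errorIdSum: Int =
    lines
      .filter(line => line.contains("error"))   // keep only the error lines
      .map(line => line.split(" ")(0).toInt)    // extract the numeric id
      .reduce(_ + _)                            // sum the ids

  def main(args: Array[String]): Unit =
    println(errorIdSum)  // 40
}
```

In a real Spark shell the only change is how the data is built; the transformation chain itself is identical, which is why moving an analysis from a local prototype to a cluster is so natural.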


“Play where the Data Lays”

– Dave Mariani (@dmariani), Serial Big Data Analytics Entrepreneur

This simple (but awesome) quote explains why Big Data Analytics is moving towards technologies like the Hive Stinger Initiative, Drill, Impala, and Spark.  It is the desire to analyze an entire cluster of data without actually moving all of the data, as the physics of moving all of this data from storage to compute becomes cost prohibitive.  We have seen some pretty large extremes like the 24TB Yahoo! TAO cube whose source is 2PB in a 14PB Hadoop cluster (where Dave and I first worked together).  While we are not there yet, at some point the high cost, lengthy durations, and constant resource utilization make it too difficult to move all of this data.  Therefore, we will want to be selective and limit the amount of data that is moved at any point in time.  This is where frameworks like Hadoop become important, because they give you the ability to store the data first and then query the expansive amount of data.

The Definition of Analytics

A common theme among many analytics solutions is that they typically involve relatively simple queries: counts, aggregations, and trends. While these are very important, analytics (at least in my book) is so much more. It involves statistics, machine learning, data mining, graph algorithms, real-time streaming, and many other concepts that are in the minds of brilliant PhD scientists that I barely understand.  Hadoop and Big Data have ignited the data scientist / analytics ecosystem with systems like R-Hadoop integration, Pegasus, Mahout, Giraph, GraphLab, and many other frameworks.  Because many of these systems are written in or can easily interface with Java MapReduce, these same systems can more easily be integrated into a system like Spark so that they can run faster.  Just as important, with Scala and Python support it becomes easier to create your own algorithms to push the boundaries of analytics.

You said Fast, How Fast?

Before diving into some testing details, let me call out the test setup – I have a Mac Mini (2011), Core i7, 16GB RAM, 256GB SSD / 750GB Hybrid SSD where I am using audit data from Project Chateau Kebob (gzipped files: 59.8MB compressed, 8.88GB raw).  For these tests, I am using Hadoop 1.1.1, Pig 0.10, Scala 2.9.3, and Spark 0.7.2.

To get a rowcount using a Pig script against this data, I am using the following script:
-- Get rowcount from SQLAuditLog
A = LOAD '/audit/gz/*' USING TextLoader as (line:chararray);
B = group A all;
C = foreach B generate COUNT(A);
dump C;

To do the same calculation using Spark 0.7.2 from the Spark shell:
val auditLog = sc.textFile("hdfs://localhost:9000/audit/gz/*")
auditLog.count()

Both the single thread and 8 core calculations are included in the table below.

                           Pig          Spark (single thread)   Spark (8 cores)
Number of Rows             54,874,995   54,874,995              54,874,995
Duration                   7m 41s       2m 4s                   1m 19s

While ~2min may seem slow to folks who use traditional relational databases and/or analytics tools (e.g. a PowerPivot workbook can perform this calculation in <1s), it is still much faster than running this on Pig / Hadoop.

But what if I wanted to do a count of the distinct words that are in my audit log (not your traditional RDBMS / BI query)?

My Pig script (reusing the relation A loaded in the rowcount script above) looks like below:
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
E = group D all;
F = foreach E generate COUNT(D);
dump F;

while my Spark Scala script looks like:
val wc = auditLog.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wc.count()
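To see what that one-liner is actually doing, here is a local sketch with plain Scala collections, using a tiny hypothetical sample in place of the audit log – groupBy plays the role of Spark’s reduceByKey shuffle, and the size of the resulting map is the distinct word count:

```scala
// Local sketch of the Spark word count, on an in-memory stand-in for the
// audit log. groupBy + size is the single-machine analog of Spark's
// reduceByKey(_ + _); counting the resulting keys gives distinct words.
object WordCountSketch {
  // Hypothetical sample lines standing in for the HDFS audit data
  val sample = List("select from audit", "select from log")

  // (word, count) pairs -- the local analog of reduceByKey(_ + _)
  def wordCounts: Map[String, Int] =
    sample
      .flatMap(line => line.split(" "))
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  // Number of distinct words, i.e. what wc.count() returns in the Spark shell
  def distinctWords: Int = wordCounts.size

  def main(args: Array[String]): Unit =
    println(distinctWords)  // 4
}
```

On the cluster, Spark performs the same grouping, but distributed: the map side emits (word, 1) pairs and reduceByKey combines counts per word across partitions before the final count.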

                           Pig          Spark (single thread)   Spark (8 cores)
Number of Distinct Words   2,318,059    2,318,059               2,318,059
Duration                   28min 31s    9min 24s                5min 23s

Similar to above, Spark solves the problem faster than Pig / Hadoop.

Do I need traditional RDBMS / BI systems?

For the foreseeable future, I still very much see traditional RDBMS/BI systems and Hadoop/Spark systems as complementary rather than competing.  Over time, I can see the intersection between these two systems overlapping more as relational systems take on semi-structured datasets and BASE properties while “Big Data” systems take on more structured data sets and ACID properties.

But for now, if your primary concern is speed of the query, provided you have the time and resources to process the data, traditional RDBMS and BI systems will typically return their query results substantially faster.  If your primary concern is flexibility and the ability to query in a distributed environment, then these “Big Data” systems are the right tool of choice.  It’s back to choosing the right tool for the right problem or job (as I had noted in Scale Up or Scale Out Your Data Problems: A Space Analogy).

For example, if I wanted to quickly do counts and aggregations against this same dataset (54M rows of audit data) where I can easily define the dimensions (group bys) using the available columns, I can use PowerPivot to get the query back in a matter of seconds (yes, I’m still a fan of PowerPivot – full disclosure, I’m a co-author of the book Microsoft PowerPivot for Excel and SharePoint).  But if I want to easily determine how many of those 54M rows have any column containing the text “<tsql_stack>”, this would be far easier in Hadoop or Spark.  In the case of Spark, I can determine this by running the query:
auditLog.filter(line => line.contains("<tsql_stack>")).count()
and in ~1m 30s determine that 9,634,779 lines contain the text “<tsql_stack>”.

Choose the right tool for the right job, eh?!