One of the very exciting things about Spark is the potential to have one ubiquitous tool to solve my aggregation, machine learning, graph, and other statistical / analytics problems. And while I am proud of my time with the SQL Server team, where we achieved some amazingly lofty goals (e.g. the Yahoo! 24TB Analysis Services cube), I had been drawn back to my statistical roots.
It may surprise you that I had been bouncing between the path of becoming a Doctor (…you know, Asian parents) or a statistician (my father was a Statistics professor). I had even pursued – but never completed – a Masters in Biostatistics at the University of Washington. I never finished it as Microsoft had drawn me back (the internal debate was that I could either not make money and pay to earn a degree … or I could get paid money to do what I loved doing – not much of a debate).
More than a 24TB cube?!
But even with all the cool solutions and technologies that I got the chance to work with – I still had that statistical “itch” that couldn’t be scratched even after helping to build the largest known Analysis Services cube. As noted in my post Why all this interest in Spark?, analytics is more than just aggregations, counts, and trends. It’s about statistics, machine learning, data mining, graph algorithms, real time streaming, and many other concepts.
And the promise of Spark – among many other things – was that it would be easier for everyone to create the scale of a 24TB Analysis Services cube without actually needing specialized equipment. To build a distributed cube of sorts!
And within the context of data sciences and distribution – Spark has the potential to provide that solution.
Interactive OLAP Queries with Spark and Cassandra
This is just one case study, but it shows the promise of solving a large traditional analytics problem at scale. It achieved many of the goals that our original 24TB Yahoo! cube was able to achieve, yet it was done on a distributed platform, allowing us to scale farther and faster. And it was done with a technology that only recently (as of July 2014) released its 1.0 version.
It’s a great read, and here are some enjoyable callouts from Evan’s Seattle Spark Meetup session (of the same name):
“ApacheSpark and #Cassandra – separate and optimize the query and storage layers!”
Yes, it’s my own tweet, but this is called out in the “Separate Storage and Query Layers” slide in Evan’s presentation. This is very much in line with our push towards distributed cloud computing. Separate the storage and query layers, and ensure that there is a fast method for storage and query communications. This allows you to optimize the layers separately, since storage and query are inherently different things with different ways to optimize.
“Write custom functions in Scala … Take that Hive UDFs!!”
As you build more complex systems in your distributed environments, yet want to keep them easy for analysts to use, the world of Hive lets you create UDFs. Anyone who has spent time creating Hive UDFs will also note that while they may be easy to use (and even that statement is a bit of a stretch at times), they can be a pain to write.
Instead, why not write those complex functions in Scala? There is less code to write, it is easier to debug, and it still runs on the JVM, so performance is on par with Java.
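To make the "custom functions in Scala" point a little more concrete, here is a minimal sketch. It is not taken from Evan's presentation: the `CustomFunctions` object, the `bucket` function, and the thresholds are all hypothetical, and a local `Seq` stands in for a distributed dataset so the snippet runs without Spark on the classpath.

```scala
// Hypothetical example: a custom bucketing function written in plain Scala.
// No Spark dependency is needed to define it or to unit-test it.
object CustomFunctions {

  // Classify a numeric amount into a labeled bucket.
  // The thresholds are made up for illustration.
  def bucket(amount: Double): String =
    if (amount < 0) "refund"
    else if (amount < 100) "small"
    else if (amount < 1000) "medium"
    else "large"

  def main(args: Array[String]): Unit = {
    val amounts = Seq(-5.0, 42.0, 250.0, 5000.0)

    // Assumption: in Spark the same function could be handed to an RDD's
    // map (something like rdd.map(bucket)); here a local Seq stands in.
    val labeled = amounts.map(a => (a, bucket(a)))

    labeled.foreach { case (a, b) => println(s"$a -> $b") }
  }
}
```

The function is an ordinary Scala method, so it can be tested on a laptop with plain collections before it ever touches a cluster, which is a large part of why this tends to feel lighter than packaging a Hive UDF.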
“When in doubt, use brute force” — Ken Thompson
At some point, your problem becomes too complicated and too concurrent, with far too much variety, and the brute force method ends up being the most efficient. That is the promise of Hadoop and ultimately – even with all of its inefficiencies – why Hadoop is synonymous with the Big Data movement.
But anyone who has worked with the Hadoop ecosystem knows that the system is inherently batch-oriented and not well designed for interactive queries. Spark takes this MapReduce paradigm but gives it a serious boost by placing the data into RDDs (Resilient Distributed Datasets) and caching it in memory. You get the speed of in-memory calculations but the flexibility of MapReduce. What’s not to love?
“Spark is the Swiss Army Knife of Big Data Analytics” — Reynold Xin