Ramblings of a data dork: from BI and Big Data to Travel and Food
“There must be some way out of here, said the joker to the thief / There’s too much confusion, I can’t get no relief”
– Bob Dylan’s ‘All Along the Watchtower’
As some of you may know, I have been around for a long time in the world of SQL Server Analysis Services. I built the first Analysis Services cube in production with my compatriots Daniel Reh and Dave Schuba (plus a call out to Allen McDowell and Jim Bergh) back when it was still called OLAP Services 7.0 – while it was still in beta. Over the years, I dove into web analytics through startups like digiMine – long before the term “internet scale” was even coined. Trudging through massive amounts of internet-scale data was where I fought my first battles against Analysis Services Distinct Count. And before I joined the SQL Customer Advisory Team, I continued my fight against “big data” as part of adCenter Engineering – itself part of Bing.
Yet, throughout all these various cool battles, in one form or another the end result (for me anyway) has always been Tier-1 Analysis Services. Always going for bigger and better, my friends at adCenter Engineering (an important call out to Bilal Obeidat and Mike Anderson [blog]) – with a little help from yours truly – went big and built multi-terabyte SSAS MOLAP cubes, as noted in Accelerating Microsoft adCenter with Microsoft SQL Server 2008 Analysis Services. Just when I thought it couldn’t get any better, I had the honor of working with the amazingly talented folks over at Yahoo! on their 12TB SSAS MOLAP cube, as noted in the PASS Summit 2010 Day One Keynote.
A lot of the great learnings from these experiences culminated in the SSAS Maestros course – the Level 500 SQL Server 2008 R2 Analysis Services UDM deep dive. This was the brainchild of myself, Daniel Yu, and Akshai Mirchandani – and with the full support of SQLCAT, SQL Product Planning, and the Analysis Services product group, we forged ahead. We are currently on our SSAS Maestros v1.2 course and continuing efforts well into v2.0. BTW, hats off to my partner-in-crime Thomas Kejser [blog] and to John Sirmon for some amazing last-minute heroics on the v1.0 course.
Don’t worry, I’m still with Microsoft and SQLCAT – and still doing SSAS. Just follow the story to BigData…ok?
And through all this craziness, I have been fortunate enough to dive into PowerPivot (such as the cool book PowerPivot for Excel and SharePoint) and push the boundaries of Tier-1 Analysis Services UDM. I’ve also been fortunate enough to now lead the SQLCAT DW/BI team, diving into all sorts of DW and BI – Parallel Data Warehouse, SSAS, PowerPivot, Apollo, Crescent, Reporting Services, SSIS, DQS, Project Barcelona, etc.
And why do I say the web keeps calling me back? Because this is where I was first introduced to the concept of “BigData” long before we had coined the term. It was still a time of parsing through thousands of Apache and IIS web logs and trying to make sense of millions of events … when a 2-CPU box was still considered powerful! It was then that the idea of distributed computing became close to my heart. After all, the ultimate idea is the ability to run jobs on hundreds or thousands of nodes to solve these complex problems without having to deal with the network and (more importantly) disk latencies that can bring many analytics projects to a complete standstill.
When I joined adCenter a few lifetimes ago, Microsoft Search had been diving deep into making BigData a reality for Bing. This has now been productized in the form of Dryad. Some great information on Dryad can be found here:
But the basic principle is that Dryad is an execution engine – optimized for distributed computing – that models a job as a directed acyclic graph.
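To make the directed-acyclic-graph idea concrete, here is a toy sketch (not the Dryad API – the stage names and graph shape are made up for illustration): each vertex is a unit of work, the edges are its data dependencies, and the engine simply runs vertices in a topological order so every stage sees its inputs before it starts.

```python
from graphlib import TopologicalSorter

# Hypothetical four-stage job: two extract vertices feed a join,
# which feeds an aggregate. Each key maps to its predecessors.
graph = {
    "join": {"extract_logs", "extract_users"},
    "aggregate": {"join"},
}

results = []

def run_vertex(name):
    # In a real engine each vertex would be a program scheduled onto
    # a cluster node; here we just record the execution order.
    results.append(name)

# static_order() yields vertices so predecessors always come first.
for vertex in TopologicalSorter(graph).static_order():
    run_vertex(vertex)

print(results)
```

The payoff of the DAG model is that any vertices with no path between them (here, the two extracts) can safely run in parallel on different machines.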
And yet, Hadoop is something I’ve always been digging around in. Sure, it helps that Yahoo! is a huge proponent of and investor in Hadoop. The basic principle behind Hadoop is map-reduce, and I love the simplicity of it all (pun intended). For those diving into the realm of BigData, you have to know about Hadoop.
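That simplicity is easy to show. Here is a minimal, single-process sketch of the map-reduce pattern using the classic word-count example – this is the shape of the computation, not Hadoop’s actual Java API: map emits key/value pairs, a shuffle groups them by key, and reduce collapses each group.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit (word, 1) for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group the intermediate pairs by key (the word).
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [count for _, count in group]

def reduce_phase(grouped):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"])  # 3
```

The reason this scales is that map and reduce are both embarrassingly parallel: input splits can be mapped on thousands of nodes independently, and each key’s group can be reduced on a different node – which is exactly the “run jobs on hundreds or thousands of nodes” idea mentioned earlier.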
And in the middle of all of this, SQL Server is in the midst of a major – and very cool – transformation to go all-in on the cloud with SQL Azure. We’re doing some pretty amazing work here. Sure, it’s a great marketing tag line – but the vision is great nevertheless:
Focus on your application, not the infrastructure
In the vortex of confusion – as can be seen from the vortex painting behind Leoben and Starbuck (again, Battlestar Galactica reference here) – how do I grok and merge all of this together?
After all, I’m jumping around from BI to BigData to Cloud to Battlestar Galactica. It doesn’t make any frakkin’ sense!
So back to Bob Dylan’s quote from All Along the Watchtower:
“There must be some way out of here, said the joker to the thief / There’s too much confusion, I can’t get no relief”
Thankfully, the epiphany came to me from Mario Kosmiskas’ blog post: Hadoop in Azure.
Note, this is just the beginning but the basic principle is that you can run Hadoop within Windows Azure. And then it all came together:
So no worries, I’m not leaving Analysis Services or BI any time soon. After all, BigData and BI are rooted in the same principle: we need to solve complex analytics over large amounts of data.
But as I gladly dive more into the abyss that is BigData, you’ll start seeing me talk more about the world of NoSQL.