SQL Server Analysis Services to Hive

Over the last few months, we’ve chatted about how Hadoop and BI are better together – it’s a common theme that Dave Mariani (@dmariani) and I have spoke together about in our session How Klout changed the landscape of social media with Hadoop and BI.

image

While this all sounds great, the typical question is “How do I do it”?  I am proud to say that after months of intense writing – all of this is due to the amazing work by Kay Unkroth who regularly blogs on the official Analysis Services and PowerPivot Team blog – we now have the first of our great collateral: SQL Server Analysis Services to Hive: A Klout Case Study.

Enjoy!

7 Comments

  1. […] Lee (@dennylee) described SQL Server Analysis Services to Hive [with SSIS] in a 9/26/2012 post: Over the last few months, we’ve chatted about how Hadoop and BI […]

  2. For distinct count measure group, if my understanding is correct, SSAS will issue a query with a order by, that will make sql server fetch all the data from hadoop, and then sort the data inside sql server, that might be time consuming, is it possible to fetch the data in hadoop in order, then sql server doesn’t need to sort it again, is this doable?

    1. Sorry about missing your comment Daniel! You are correct in your assessment in that when you run a distinct count query, it will automatically issue an ORDER BY statement. Right now there isn’t any way to force SSAS to not issue the ORDER BY unless you issue a cartridge change to remove the statement – not recommended at all. Saying this, if you want to be efficient about it, you may want in this case actually put the data into SQL Server first. By grabbing the data from Hadoop in an ordered fashion and placing the data into a table already ordered – if you prop the indexes into SQL indicating as such, while SQL will still run the ORDER BY statement, it will be pretty efficient about it once it realizes that the data is already ordered as per the ORDER BY clause. HTH!

  3. Given the large amount of data typically found in a hadoop cluster, when querying that data using the Hive connector wouldn’t the result sets tend to be far too large for PowerPivot to handle? If all returned data is copied into PowerPivot and in turn loaded into memory and you are pulling from a cluster that contains terabytes or more data stored there and as a result many query results will be extremely large as well how is this feasible in PowerPivot?

    1. Actually, in the case of this particular case study – we’re using the HiveODBC provider as a mechanism to populate an Analysis Services UDM (MOLAP) cube as opposed to a PowerPivot (Tabular) model. Saying this, yes, this technique can be used (and is in fact easier) when populating a PowerPivot (Tabular) model. When going this route, please note that the goal is not to pull down all of the data from Hadoop. Even in the case of the Yahoo! cube which is 24TB (largest known OLAP cube), the source is 2PB of Hadoop data. Instead the goal is to pull down the data only the data that you will want to perform adhoc analytics on. HTH!

  4. […] the paper that Dave Mariani (@dmariani) and I had contributed to the whitepaper / case study SQL Server Analysis Services to Hive; below is an aggregate of those tips – hope this […]

  5. […] Services (e.g. How Klout changed the landscape of social media with Hadoop and BI Slides Updated, SQL Server Analysis Services to Hive, etc.), this post calls out the reverse – how to quickly extract SQL Server data to Hadoop / […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s