Over the last few weeks I’ve fielded some questions about the whitepaper / case study SQL Server Analysis Services to Hive that Dave Mariani (@dmariani) and I contributed to; below is a roundup of those tips – hope this helps!
Q: I’m running into the HiveODBC Error message “..expected data length is 334…”
A: Check out this post for details on how to potentially resolve it: HiveODBC error message “..expected data length is 334…”
Q: Can I connect Analysis Services Tabular to Hive instead of Multidimensional?
A: Yes! Ayad Shammout (@aashammout) has a couple of great blog posts that dig into exactly how to do this; it is similar to how PowerPivot can connect to Hive.
Q: The original SQL Server Analysis Services to Hive whitepaper had Analysis Services Multidimensional go against Cloudera’s CDH3, but can this technique work against other flavors of Hadoop?
A: Yes, the key point is that if you have the right HiveODBC driver, you should be able to make a connection. When we originally published the whitepaper, the Microsoft HiveODBC driver was still very much in beta and many flavors of Hadoop were on Hive 0.7. A year-plus later, some Hadoop distributions are on Hive 0.9 while others are on Hive 0.11 (which has ORCFile!). To ensure connectivity, I would highly advise getting the HiveODBC driver that matches your flavor of Hadoop – i.e. the Cloudera x64 driver for CDH4, the HDInsight / Simba driver for HDInsight, MapR’s 64-bit driver, etc. Please note, I have not worked with the most recent 64-bit drivers lately, so this is not the final word on the topic.
Q: Processing the cube is taking too long!
A: Try creating a Hive fact table that contains only the current day of data so that when the cube is processed, it obtains only one day of data. Whether it is Analysis Services Tabular or Multidimensional, the query against the SQL view that goes against the Hive table is, at its most fundamental level, a “select * from [table]”. If you’re asking for all of the data in the Hive table, that could be very large, which would explain the very lengthy query. Instead, query a single day of data and alter your SQL view to go against a Hive table containing only the current day of data. This works under the assumption that you are creating daily partitions within your cube.
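As a rough sketch of that pattern (all table, view, and linked-server names here are hypothetical, and the linked-server-over-HiveODBC approach is the one the whitepaper describes): build a one-day table on the Hive side, then have the SQL Server view fetch from it.

```sql
-- HiveQL (run on the Hadoop cluster): materialize just the current day
-- of facts from the full events table. Names are illustrative.
CREATE TABLE FactEvents_Today AS
SELECT *
FROM FactEvents
WHERE event_date = '2013-07-16';

-- T-SQL (run on SQL Server): the view the cube partition queries.
-- HIVE_LINK is a hypothetical linked server using the HiveODBC driver.
ALTER VIEW dbo.vFactEvents
AS
SELECT *
FROM OPENQUERY(HIVE_LINK, 'SELECT * FROM FactEvents_Today');
```

Because the view now covers a single day, the cube’s “select * ” against it is a fetch of millions of rows rather than a scan of the entire Hive table.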
Q: Should I use ProcessFull or ProcessData and ProcessIndexes when processing an Analysis Services Multidimensional cube against Hive?
A: A little background on what the different processing enumerations do when processing an Analysis Services Multidimensional cube: ProcessFull is in fact ProcessData and ProcessIndexes executed together – the Analysis Services server decides what to run concurrently, though you can provide some hints on how that works. Nevertheless, I would typically recommend running ProcessData and ProcessIndexes separately because they do quite different things. As noted in the diagram below, ProcessData performs the task of reading data from the database, generating the lookup dimension keys, and creating the *.fact.data files that contain the cube’s fact data. In the case of SSAS to Hive, ProcessData is a Hive-heavy task because it takes the data from Hive, transfers it through the SQL Server view into Analysis Services, and then creates the *.fact.data files. The ProcessIndexes task uses those *.fact.data files to create the map, aggregations, and indexes that are ultimately used by Analysis Services Multidimensional to improve query performance.
Going back to the specifics of bringing Hive data into Analysis Services Multidimensional – the basics are:
- First, run ProcessData so that you bring the facts from Hive into SSAS Multidimensional. Along the way:
- The select queries generated by Analysis Services go against the SQL view (refer to page 20 of Analysis Services ROLAP for SQL Server Data Warehouses for an example).
- If you are able to make it a select * from query, then the view is doing a simple fetch operation from Hive into memory or tempdb. If it is a complex query, then it will often involve a Hive query running multiple MapReduce jobs, which makes the query run longer than a plain fetch would.
- The select col1, col2, measure1, measure2, … query generated by Analysis Services is performed against memory / tempdb relatively quickly because it is against millions of rows for one day instead of the tens or hundreds of billions (or more) in the entire Hive table.
- Then run ProcessIndexes once ProcessData has successfully created the *.fact.data files.
- Running the two steps separately allows you to control which data-pull / aggregation tasks are performed in parallel.
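To make the two-step approach concrete, here is a minimal sketch of the two processing commands issued separately as XMLA (the database, cube, and measure-group IDs are hypothetical placeholders – substitute your own):

```xml
<!-- Step 1: ProcessData - reads from Hive via the SQL view and builds the *.fact.data files -->
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>HiveCubeDB</DatabaseID>
    <CubeID>HiveCube</CubeID>
    <MeasureGroupID>FactEvents</MeasureGroupID>
  </Object>
  <Type>ProcessData</Type>
</Process>

<!-- Step 2: ProcessIndexes - builds the maps, aggregations, and indexes from the *.fact.data files -->
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>HiveCubeDB</DatabaseID>
    <CubeID>HiveCube</CubeID>
    <MeasureGroupID>FactEvents</MeasureGroupID>
  </Object>
  <Type>ProcessIndexes</Type>
</Process>
```

Running step 1 on its own lets you see how long the Hive-heavy pull actually takes before the purely local index/aggregation build in step 2 begins.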
Q: How do I control the job flow for all of this processing?
A: There are a number of options you can use to sync the processing of the Hive table with the cube processing. In fact, in an early prototype that I had done with HDInsight and Analysis Services, I used Windows Task Scheduler to control the Hive processing and then used SQL Server Agent to run the Analysis Services jobs. This is a bit of a disconnected way of processing, but it’s an old-school way of getting things done. Another mechanism I have used is Oozie – a workflow scheduler for Hadoop:
- Execute Oozie job flows to create the daily Hive tables so that the select * from [HiveTable] issued from SQL Server is only doing a fetch against a day’s worth of data.
- Have the Oozie job complete its processing of the data (Pig, MR, Hive, etc.).
- Its final action is then to generate the new table (e.g. FactEvents_20130716).
- When processing the cube for the current day, the [Hive View] within SQL Server points to that new table.
- You can use SQL Server Agent / SSIS / ascmd to perform the tasks of creating the cube partition, pointing that partition at the view (which technically should have the same name anyway), and then processing the data.
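For the SQL Server Agent / ascmd route, a sketch of the daily hand-off might look like the following CmdExec job steps (the server name, database ID, and script paths are hypothetical; the XMLA scripts would contain your partition-creation and processing commands):

```shell
REM SQL Server Agent CmdExec steps (illustrative names/paths):
REM 1) Create the new daily cube partition pointing at the Hive view.
ascmd.exe -S SSASSERVER -d HiveCubeDB -i C:\jobs\create_partition.xmla

REM 2) Process that partition (ProcessData, then ProcessIndexes).
ascmd.exe -S SSASSERVER -d HiveCubeDB -i C:\jobs\process_partition.xmla
```

The key design point is the ordering: these steps should only fire after the Oozie workflow has finished generating the day’s table (e.g. FactEvents_20130716) and the SQL Server view has been re-pointed at it, so the partition processing is a pure fetch of one day’s data.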