Over the last few weeks I’ve fielded some questions about the whitepaper / case study SQL Server Analysis Services to Hive that Dave Mariani (@dmariani) and I contributed to; below is an aggregate of those tips – hope this helps!
Q: I’m running into the HiveODBC error message “..expected data length is 334…”
A: Check out the post for details on how to potentially resolve this: HiveODBC error message “..expected data length is 334…”
.
Q: Can I connect Analysis Services Tabular to Hive instead of Multidimensional?
A: Yes! Ayad Shammout (@aashammout) has a couple of great blog posts that dig into exactly how to do this; it is similar to how PowerPivot can connect to Hive.
Import Hadoop Data into Analysis Services Tabular
SSAS Tabular using HiveODBC connection
.
Q: The original SQL Server Analysis Services to Hive whitepaper had Analysis Services Multidimensional go against Cloudera’s CDH3, but can this technique work against other flavors of Hadoop?
A: Yes, the key point is that if you have the right HiveODBC driver, you should be able to make a connection. When we originally published the whitepaper, the Microsoft HiveODBC driver was still very much beta and the Hive version was 0.7 for many flavors of Hadoop. A year-plus later, some Hadoop distributions are on Hive 0.9 while others may be on Hive 0.11 (which has ORCFile!). To ensure connectivity, I would highly advise you to get the HiveODBC driver that matches your flavor of Hadoop – i.e. the Cloudera x64 driver for CDH4, the Simba-based driver for HDInsight, MapR’s 64-bit driver, etc. Please note, I have not worked with the most recent 64-bit drivers lately, so this is not the final word on this topic.
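For context, the connection path in the whitepaper runs from Analysis Services through a SQL Server view over a linked server that uses the HiveODBC driver. Below is a minimal sketch of that linked server setup, assuming you have already created a 64-bit system DSN with your distribution’s driver; the DSN name (HiveDSN), linked server name (HIVE_LINK), and table name (factevents) are all hypothetical.

```sql
-- Minimal sketch: a SQL Server linked server over a HiveODBC DSN (names are hypothetical).
EXEC master.dbo.sp_addlinkedserver
     @server     = N'HIVE_LINK',   -- hypothetical linked server name
     @srvproduct = N'Hive',
     @provider   = N'MSDASQL',     -- OLE DB Provider for ODBC
     @datasrc    = N'HiveDSN';     -- 64-bit system DSN built on the vendor's HiveODBC driver

-- Quick connectivity check against a (hypothetical) Hive table
SELECT *
FROM   OPENQUERY(HIVE_LINK, 'SELECT * FROM factevents LIMIT 10');
```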
.
Q: Processing the cube is taking too long!
A: Try creating a Hive fact table that contains only the current day of data so that when the cube is processed, it only has to pull one day of data. Whether it is Analysis Services Tabular or Multidimensional, the query against the SQL view that goes against the Hive table is, at its most fundamental level, a “select * from [table]”. If you are asking for all of the data in the Hive table, that could be very large, which would explain the very lengthy query. Instead, query just a single day of data and alter your SQL view to go against a Hive table that holds only the current day of data. This works under the assumption that you are creating daily partitions within your cube.
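As a rough sketch of the daily-table approach (all table and column names are hypothetical), the Hive side can use CREATE TABLE AS SELECT to build the one-day fact table; your SQL view then points at this table (see the OPENQUERY sketch under the job-flow question below).

```sql
-- Hive: build a fact table containing only the current day so the cube's
-- "select * from [table]" only fetches one day of data.
-- factevents and its columns are hypothetical names.
CREATE TABLE factevents_20130716 AS
SELECT eventdate,
       customerid,
       productid,
       eventcount
FROM   factevents
WHERE  eventdate = '2013-07-16';
```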
.
Q: Should I use ProcessFull or ProcessData and ProcessIndexes when processing an Analysis Services Multidimensional cube against Hive?
A: First, a little background on what the different processing options do when processing an Analysis Services Multidimensional cube. ProcessFull is in fact both ProcessData and ProcessIndexes executed – the Analysis Services server will decide what to run concurrently, though you can provide some hints on how that works. Nevertheless, I would typically recommend running ProcessData and ProcessIndexes separately because they do quite different things. ProcessData performs the task of reading data from the source database, generating the dimension key lookups, and creating the *.fact.data files that contain the cube’s fact data. In the case of SSAS to Hive, ProcessData is a Hive-heavy task because it pulls the data from Hive, transfers it through the SQL Server view into Analysis Services, and then creates the *.fact.data files. The ProcessIndexes task uses those *.fact.data files to create the maps, aggregations, and indexes that are ultimately used by Analysis Services Multidimensional to improve query performance.
Going back to the specifics of getting Hive data into Analysis Services Multidimensional – the basics are:
- First, run ProcessData so that you bring the facts from Hive into SSAS Multidimensional:
- Analysis Services generates select queries that go against the SQL view (refer to page 20 of Analysis Services ROLAP for SQL Server Data Warehouses for an example).
- If you are able to make it a select * query, the view is doing a simple fetch operation from Hive into memory or tempdb. If it is a complex query, it will often involve a Hive query running multiple MapReduce jobs, which makes the query run longer than a straight fetch.
- The select col1, col2, measure1, measure2, … query generated by Analysis Services then runs against memory / tempdb relatively quickly because it is hitting millions of rows for one day instead of the tens to hundreds of billions (or more) in the entire Hive table.
- Then run ProcessIndexes once ProcessData has successfully created the *.fact.data files; a minimal XMLA sketch of these two steps follows this list.
- This allows you to control which data pull / aggregation tasks are performed in parallel.
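To make the separation concrete, here is a minimal XMLA sketch of running ProcessData first and ProcessIndexes second against a daily partition; the database, cube, measure group, and partition IDs are hypothetical, and the same commands could be submitted from SQL Server Agent, SSIS, or ascmd.

```xml
<!-- Step 1: ProcessData - pull the day's facts from Hive (via the SQL view)
     and build the *.fact.data files. All IDs are hypothetical. -->
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Process>
    <Object>
      <DatabaseID>EventsOlap</DatabaseID>
      <CubeID>Events</CubeID>
      <MeasureGroupID>FactEvents</MeasureGroupID>
      <PartitionID>FactEvents_20130716</PartitionID>
    </Object>
    <Type>ProcessData</Type>
  </Process>
</Batch>

<!-- Step 2: ProcessIndexes - build the maps, aggregations, and indexes from the
     *.fact.data files; no Hive query is issued for this step. -->
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Process>
    <Object>
      <DatabaseID>EventsOlap</DatabaseID>
      <CubeID>Events</CubeID>
      <MeasureGroupID>FactEvents</MeasureGroupID>
      <PartitionID>FactEvents_20130716</PartitionID>
    </Object>
    <Type>ProcessIndexes</Type>
  </Process>
</Batch>
```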
.
Q: How do I control the job flow for all of this processing?
A: There are a number of options that you can use to sync the processing of the Hive table and the cube processing. In fact, in an early prototype that I had done with HDInsight and Analysis Services, I used Windows Task Scheduler to control the Hive processing and then used SQL Server Agent to run the Analysis Services jobs. This is a bit of a disconnected way of processing, but it is an old-school way of getting things done. Another mechanism that I have used is Oozie, a workflow scheduler for Hadoop:
- Execute Oozie job flows to create the daily Hive tables so that the select * from [HiveTable] issued from SQL Server is only doing a fetch against a day’s worth of data.
- Have the Oozie job complete its processing of the data (Pig, MR, Hive, etc.).
- Its final step is then to generate the new table (e.g. FactEvents_20130716).
- When processing the cube for the current day, the [Hive View] within SQL Server points to that new table.
- You can use SQL Server Agent / SSIS / ascmd to perform the task of creating the cube partition, pointing that partition to the view (which technically should have the same name anyways), and then processing the data; a T-SQL sketch of re-pointing the view follows this list.
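The “point the view at the new table” step can be as simple as the following T-SQL, which the SQL Server Agent / SSIS / ascmd step would run right before processing the new partition; the view, linked server, table, and column names are hypothetical and continue the earlier sketches.

```sql
-- SQL Server: once the Oozie flow has landed FactEvents_20130716 in Hive,
-- re-point the view the cube partition reads from, then kick off ProcessData.
-- dbo.vwFactEvents_Today, HIVE_LINK, and the column list are hypothetical.
ALTER VIEW dbo.vwFactEvents_Today
AS
SELECT eventdate,
       customerid,
       productid,
       eventcount
FROM   OPENQUERY(HIVE_LINK,
       'SELECT eventdate, customerid, productid, eventcount FROM factevents_20130716');
GO
```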
Enjoy!
Great tips!
You mention doing dailies for pulling data in is the key. But how do you then handle a ProcessFull, or updating your cube schema (which might require a ProcessFull)? Do you let these hit the raw Hive tables and pull the data directly from there?
Also, any tips on maximizing query performance of OPENQUERY and the linked server connection to Hive? It seems my queries going through the ODBC driver hitting Hive are 4 times slower than running the query directly from the query console. Now I understand that the data is being pulled over the network, but 4 times slower?
Thx for any insights
In this case, we would have a second server (or instance, depending on the size) that would perform the full process or update the schema. Users could hit the original cube while the processing occurred on the second cube. As noted, while there is a simplicity to this type of processing, the disadvantage is definitely its slowness.
At this point in time, I would most likely suggest that you connect to Spark instead of Hive so you can get faster performance. But once I go down that path, I might just suggest querying Spark SQL directly – you might not even need the cube. Full disclosure: please note that I have been working with Apache Spark since 2011 and currently work for Databricks (the founders of Databricks are the creators of Spark).
Thanks for the information.
I have been considering loading the data into HBase instead of Hadoop, and using HBase as the source for the cube. I believe that would cut down the latency greatly.
I am interested in learning more about the last note you mentioned. Can I contact you directly for more information?
Thanks
If you’re going to push the data into HBase, you may want to consider looking at the Apache Kylin project but do note the issues with maintaining HBase (don’t get me wrong, I like HBase but the operational issues can be a tad complicated – but then again, so is everything else).
Oh sure – by all means do not hesitate to ping me. You can ping my personal email denny .at. dennyglee.com or my work email denny .at. databricks.com.
HTH!