One of the odd HiveODBC error messages that I recently encountered on a project is that when I am extracting data from my Hive/Hadoop cluster using the HiveODBC driver, I end up getting an error message similar to:
OLE DB provider ‘MSDASQL’ for linked server ‘[MyHadoopCluster]‘ returned data that does not match expected data length for column ‘[MSDASQL].error_message’. The (maximum) expected data length is 334, while the returned data length is 387.
In this case, the error message was generated from connecting my SQL Server to a Mac OSX Hadoop 1.0.1 cluster running Hive 0.8.1 using the HiveODBC driver (why I insist on experimenting on my Macbook Air is a blog post for another time!). This is similar to the design as called by the SQL Server Analysis Services to Hive- A Klout Case Study. The first I thought that was curious was that if I had created a HiveODBC linked server connection from my sql server, and then ran a select * into statement,
select * from openquery(‘MyHadoopCluster’,
select * from hive_weblogs
)
then all the string columns were automatically set to varchar(334).
Digging into the data, I had in fact had a string column where a bunch of rows had a length much greater than 334 characters. In my case, I had error messages that were potentially thousands of characters long. As reminded to me by Dave Mariani (@dmariani) and some colleagues over at Platon (thanks Stig Torngaard Hammeken and Morten Post), this in fact is a bug noted in jira Hive-3382: Strings truncated to length 334 when hive data is accessed through hive ODBC driver.
So until this is fixed, what can you do about it?
A tad frustrating at times, but in many BI-related cases, the full string isn’t required so the workarounds suggested above work fine.
Hopefully this can help you too!
Hi Denny,
I found that in some cases even if you use substr() in your hql query somehow corrupt strings make it through to the ODBC and it fails. For example I have a hive table where if I test my str column with LENGTH() no strings are longer than 150 chars. But when I SELECT * through the ODBC i still receive and error that some strings are longer than 334.
-Morten
Its interesting set of issues I have to admit. The problem with some of the functions is that it “just fraks up” when querying through the HiveODBC driver (that’s the technical term). This is the same whether talking about the Apache HiveODBC or the Windows version so this isn’t an OS specific problem. I haven’t quite figured out the pattern either so if you have any insight – comment away, eh?!