One of the odd HiveODBC error messages that I recently encountered on a project is that when I am extracting data from my Hive/Hadoop cluster using the HiveODBC driver, I end up getting an error message similar to:
OLE DB provider ‘MSDASQL’ for linked server ‘[MyHadoopCluster]’ returned data that does not match expected data length for column ‘[MSDASQL].error_message’. The (maximum) expected data length is 334, while the returned data length is 387.
In this case, the error message was generated from connecting my SQL Server to a Mac OSX Hadoop 1.0.1 cluster running Hive 0.8.1 using the HiveODBC driver (why I insist on experimenting on my Macbook Air is a blog post for another time!). This is similar to the design as called by the SQL Server Analysis Services to Hive- A Klout Case Study. The first I thought that was curious was that if I had created a HiveODBC linked server connection from my sql server, and then ran a select * into statement,
select * from openquery(‘MyHadoopCluster’,
select * from hive_weblogs
then all the string columns were automatically set to varchar(334).
Digging into the data, I had in fact had a string column where a bunch of rows had a length much greater than 334 characters. In my case, I had error messages that were potentially thousands of characters long. As reminded to me by Dave Mariani (@dmariani) and some colleagues over at Platon (thanks Stig Torngaard Hammeken and Morten Post), this in fact is a bug noted in jira Hive-3382: Strings truncated to length 334 when hive data is accessed through hive ODBC driver.
So until this is fixed, what can you do about it?
- You could always using the substring function to shorten it such as substring(errormessage, 1, 334)
- When exporting data out of HiveODBC, instead of exporting out all of the columns, just export out the columns you need. Using the above sql_hive_weblogs example, I only needed the errormessage column when I needed to dig into the data, so what I did was export out the server, eventtime, ipaddress, and errorcode columns into my PowerPivot workbook/ SSAS cube / SQL database via the HiveODBC driver. When I needed the actual error message, I had the errorcode handy and then I ran a Hive query to get the actual error message.
- Related to the above approach, use SQOOP or export the table (or just the errorcode and errormessage columns) out as CSV to ultimately load that data into your final destination.
- If you do not have an errorcode handy, perhaps build a hash of the errormessage field so you can do the above two tricks. Examples of hashes in Hive include (but are not limited to) Hive MD5 UDF, Hive-1262 patch to add security/checksum UDFs, etc.
A tad frustrating at times, but in many BI-related cases, the full string isn’t required so the workarounds suggested above work fine.
Hopefully this can help you too!
I found that in some cases even if you use substr() in your hql query somehow corrupt strings make it through to the ODBC and it fails. For example I have a hive table where if I test my str column with LENGTH() no strings are longer than 150 chars. But when I SELECT * through the ODBC i still receive and error that some strings are longer than 334.
Its interesting set of issues I have to admit. The problem with some of the functions is that it “just fraks up” when querying through the HiveODBC driver (that’s the technical term). This is the same whether talking about the Apache HiveODBC or the Windows version so this isn’t an OS specific problem. I haven’t quite figured out the pattern either so if you have any insight – comment away, eh?! 🙂
[…] A: Check out the post for details on how to potentially resolve this: HiveODBC error message “..expected data length is 334…” […]