One of the odd HiveODBC error messages that I recently encountered on a project is that when I am extracting data from my Hive/Hadoop cluster using the HiveODBC driver, I end up getting an error message similar to:
OLE DB provider ‘MSDASQL’ for linked server ‘[MyHadoopCluster]’ returned data that does not match expected data length for column ‘[MSDASQL].error_message’. The (maximum) expected data length is 334, while the returned data length is 387.
In this case, the error message was generated by connecting my SQL Server to a Mac OS X Hadoop 1.0.1 cluster running Hive 0.8.1 using the HiveODBC driver (why I insist on experimenting on my MacBook Air is a blog post for another time!). This is similar to the design described in SQL Server Analysis Services to Hive - A Klout Case Study. The first thing I found curious was that if I created a HiveODBC linked server connection from my SQL Server and then ran a select * into statement,
select * into sql_hive_weblogs
from openquery(MyHadoopCluster, 'select * from hive_weblogs')
then all the string columns were automatically set to varchar(334).
Digging into the data, I did in fact have a string column where a bunch of rows were much longer than 334 characters. In my case, I had error messages that were potentially thousands of characters long. As Dave Mariani (@dmariani) and some colleagues over at Platon reminded me (thanks Stig Torngaard Hammeken and Morten Post), this is in fact a known bug, tracked as JIRA HIVE-3382: Strings truncated to length 334 when Hive data is accessed through Hive ODBC driver.
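If you want to confirm that truncation (and not something else) is what you are hitting, a quick query run directly in the Hive CLI, bypassing the ODBC driver, can count how many rows would be cut off. This is just a sketch against the hive_weblogs example table used in this post:

```sql
-- Run in the Hive CLI, not through HiveODBC (which does the truncating).
-- Counts the rows whose errormessage would be clipped at 334 characters.
select count(*)
from hive_weblogs
where length(errormessage) > 334;
```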
So until this is fixed, what can you do about it?
- You could always use the substring function to shorten it, such as substring(errormessage, 1, 334)
- When exporting data out of HiveODBC, instead of exporting all of the columns, just export the columns you need. Using the above sql_hive_weblogs example, I only needed the errormessage column when I needed to dig into the data, so what I did was export the server, eventtime, ipaddress, and errorcode columns into my PowerPivot workbook / SSAS cube / SQL database via the HiveODBC driver. When I needed the actual error message, I had the errorcode handy and ran a Hive query to get it.
- Related to the above approach, use Sqoop or export the table (or just the errorcode and errormessage columns) as CSV to ultimately load that data into your final destination.
- If you do not have an errorcode handy, perhaps build a hash of the errormessage field so you can apply the above two tricks. Examples of hashes in Hive include (but are not limited to) the Hive MD5 UDF, the HIVE-1262 patch that adds security/checksum UDFs, etc.
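As a sketch of that last trick: assuming you have registered an MD5 UDF under the name md5 (an assumption on my part; Hive 0.8.1 has no built-in MD5 function, so you would bring your own, e.g. via HIVE-1262), you could pull a fixed-length key through the driver instead of the full message:

```sql
-- Hypothetical: assumes an MD5 UDF has been registered as md5().
-- The 32-character digest is well under the 334-character limit, so it
-- survives the ODBC trip intact and acts as a lookup key for the full
-- errormessage, which stays behind in Hive.
select server, eventtime, ipaddress, errorcode,
       md5(errormessage) as errormessage_key
from hive_weblogs;
```

When you later need the full text, query Hive directly with a predicate like where md5(errormessage) = '&lt;key&gt;'.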
A tad frustrating at times, but in many BI-related cases the full string isn’t required, so the workarounds suggested above work fine.
Hopefully this can help you too!