Add Built-In Hive UDFs on HDInsight Azure

In the last few weeks, I have had a number of customers ping me about how to utilize various Hive UDFs.  The first ask was how to use some of the UDFs that are already built into Hive.  For example, if you wanted a generated row sequence number (i.e. an IDENTITY column), you can use the Hive UDF UDFRowSequence.  This UDF is already built and included in the hive-contrib-0.9.0.jar that is not already loaded in the distributed cache (run list jars from the Hive CLI to verify).  Below is a quick code snippet that allows you to run the generated row sequence by accessing the UDFRowSequence Hive UDF.

— Compression codecs (not necessary for Hive UDFs, just habit)
set hive.exec.compress.intermediate=true;

— Drop the table
DROP TABLE hivesampletable_rowseq;

— Add the built-in Hive jar that contains the Hive UDF UDFRowSequence
ADD JAR /apps/dist/hive-0.9.0/lib/hive-contrib-0.9.0.jar;

— Create a function that references UDFRowSequence  that can be used within the Hive statement directly
CREATE TEMPORARY FUNCTION rowSequence AS ‘org.apache.hadoop.hive.contrib.udf.UDFRowSequence’;

— Running the Hive statement with rowSequence() call
CREATE TABLE hivesampletable_rowseq
rowSequence() as lid,
FROM hivesampletable;

Note that running the ADD JAR statement still uses Unix paths (even though we’re on Windows) which references the local file path (vs. HDFS or ASV).


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s