Add Built-In Hive UDFs on HDInsight Azure

In the last few weeks, I have had a number of customers ping me about how to utilize various Hive UDFs.  The first ask was how to use some of the UDFs that are already built into Hive.  For example, if you wanted a generated row sequence number (i.e. an IDENTITY column), you can use the Hive UDF UDFRowSequence.  This UDF is already built and included in the hive-contrib-0.9.0.jar that is not already loaded in the distributed cache (run list jars from the Hive CLI to verify).  Below is a quick code snippet that allows you to run the generated row sequence by accessing the UDFRowSequence Hive UDF.

– Compression codecs (not necessary for Hive UDFs, just habit)
set mapred.compress.map.output=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set hive.exec.compress.intermediate=true;

– Drop the table
DROP TABLE hivesampletable_rowseq;

– Add the built-in Hive jar that contains the Hive UDF UDFRowSequence
ADD JAR /apps/dist/hive-0.9.0/lib/hive-contrib-0.9.0.jar;

– Create a function that references UDFRowSequence  that can be used within the Hive statement directly
CREATE TEMPORARY FUNCTION rowSequence AS ‘org.apache.hadoop.hive.contrib.udf.UDFRowSequence';

– Running the Hive statement with rowSequence() call
CREATE TABLE hivesampletable_rowseq
AS
SELECT
rowSequence() as lid,
clientid,
querytime
FROM hivesampletable;

Note that running the ADD JAR statement still uses Unix paths (even though we’re on Windows) which references the local file path (vs. HDFS or ASV).

Enjoy!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s