Using Avro with HDInsight on Azure at 343 Industries

By Michael Wetzel, Tamir Melamed, Mark Vayman, Denny Lee
Reviewed by Pedro Urbina Escos, Brad Sarsfield, Rui Martins
Thanks to Krishnan Kaniappan, Che Chou, Jennifer Yi, and Rob Semsey

As noted in the Windows Azure Customer Solution Case Study, Halo 4 Developer 343 Industries Gets New User Insights from Big Data in the Cloud, a critical component in achieving faster Hadoop query and processing performance while keeping file sizes small (yielding Azure storage savings, faster queries, and reduced network overhead) was the use of Avro sequence files.

Avro was designed to make Hadoop more interoperable with other languages; in our context, Avro has a C# API. Another popular format is protobuf (Google’s data interchange format), which was also under consideration. When we compared the various formats (Avro, protobuf, compressed JSON, compressed CSV, etc.) for our specific scenarios, we found Avro to be the smallest and fastest. In our case, Avro resulted in Hive queries with 1/3 the duration (i.e. 3x faster) at only 4% of the original data size. For more information about Avro, please refer to the latest Apache Avro documentation (as of this post, it is Apache Avro 1.7.3).

We will review some of the key components to get Avro up and running with HDInsight on Azure.

The Architecture

Many of the queries we executed against our HDInsight on Azure cluster used the Hive data warehousing framework. Within the context of HDInsight on Azure, all of the data is stored in Azure Blob Storage using block blobs (the top set of boxes in the diagram below). In the HDInsight on Azure preview, Azure Blob Storage is accessed via the ASV protocol (Azure Storage Vault).

[Diagram: HDInsight on Azure compute nodes reading from and writing to Azure Blob Storage (ASV)]

HDInsight itself runs on Azure compute nodes, so the Map Reduce computation is performed there while the source and target “tables” live in ASV.

To improve query performance, HDInsight on Azure utilizes Azure’s Flat Network storage – a grid network between compute and storage – to provide high bandwidth connectivity for compute and storage. For more information, please refer to Brad Calder’s informative post: Windows Azure’s Flat Network Storage and 2012 Scalability Targets.

Hive and Sequence Files

What’s great about Hive is that it is a data warehouse framework on top of Hadoop. Hive allows you to write HiveQL (a SQL-like language) that is ultimately translated into Map Reduce jobs. This means Hive can take advantage of Hadoop’s flexibility to read and store files of different formats and compressions.

In our particular case, we had our own binary format of data files that we had to process and store within Hadoop. Since we had to convert the files, we decided to test various formats including CSV, JSON, Avro, and protobuf. We had the option to build a serializer/deserializer (SerDe) so Hive could read our binary format directly, but in our case it was faster and easier to write our own conversion mechanism than a custom SerDe.

Ultimately, whichever format we chose, the files would be stored in Azure Blob Storage, and HDInsight on Azure would have a Hive table pointing to these files so we could query them as tables.

Jump into Avro with Haivvreo

A slight detour: because the 343 Industries Foundation Services Team went into production on a preview instance of HDInsight on Azure, to get Avro working with HDInsight we needed to utilize Jakob Homan’s GitHub project Haivvreo, a Hive SerDe that LinkedIn developed to process Avro-encoded data in Hive.

Note that Haivvreo is part of the standard Hive installation as of Hive 0.9.1 via HIVE-895. If you are using a version of Hive prior to 0.9.1, you will need to build Haivvreo yourself. Fortunately, the documentation for Haivvreo is straightforward and very complete. For more information on Haivvreo, please go to: https://github.com/jghoman/haivvreo

A quick configuration note: Haivvreo uses two Jackson libraries:

- jackson-core-asl-1.4.2.jar

- jackson-mapper-asl-1.4.2.jar

As of this post, the Jackson libraries in HDInsight are version 1.8 and are fully backward compatible, so you will not need to manually add these jars.

Jump into Protobuf with Elephant-Bird

While we ultimately chose Avro, protobuf was a solid choice, and had our data profile been different we may very well have chosen it. The format is popular enough that it is now native within Hadoop 2.0 / YARN (which as of this writing is still in beta). For systems currently running on the Hadoop 1.0 tree (such as the current version of HDInsight), one of the most complete implementations is Kevin Weil’s Elephant-Bird GitHub project – Twitter’s collection of LZO and Protocol Buffer code for Hadoop, Pig, Hive, and HBase.

Format Performance Review

As noted earlier, we tested the various file formats for compression and query performance using their respective C# libraries. Below is a sample of the same data set in JSON (with Newtonsoft), Avro (with .NET 4.5 Deflate Fastest), and protobuf (with LZO) formats.

Format      Time (ms)   Size (KB)
Raw JSON    18,405      634,070.09
Avro         5,269       30,979.52
Protobuf     6,952       36,163.22

Based on these and other tests, we observed that Avro was both faster and smaller. Faster is always good, but smaller is more important than most may realize: a smaller size means less data to send across the network between storage and compute, which helps query performance. More importantly, it means you can save on Azure Blob Storage costs because you are using less storage space.
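As an illustrative sketch of the size reduction involved (not the team’s actual benchmark harness – the sample records and compression level here are placeholders, using stdlib deflate rather than the C# libraries above), repetitive JSON-style event data compresses dramatically:

```python
import json
import zlib

# Hypothetical repetitive game-event records, standing in for the real
# Halo 4 telemetry (the actual data profile is not public).
records = [{"match_id": i % 50, "player": "player_%d" % (i % 8),
            "event": "kill", "x": 1.5 * i, "y": 2.5 * i}
           for i in range(10_000)]

raw_json = json.dumps(records).encode("utf-8")

# Deflate at the lowest level approximates the ".NET 4.5 Deflate Fastest"
# option mentioned in the post; higher levels trade CPU time for size.
compressed = zlib.compress(raw_json, 1)

print("raw JSON bytes:", len(raw_json))
print("deflate bytes: ", len(compressed))
print("ratio: %.1f%%" % (100.0 * len(compressed) / len(raw_json)))
```

The absolute numbers depend entirely on how repetitive your data is, which is why testing against your own data profile matters.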

Additional tests were performed and the overall pro/con comparison can be seen below.

Method: Sequence (Avro)
  Pros:
    1. Smallest size
    2. Compresses a block at a time; splittable
    3. Object structure maintained
    4. Supports reading old data with a new schema
  Cons:
    1. Need .NET 4.5 to make best use of it
    2. Potentially slower serialization
    3. A schema is required to read/write data

Method: JSON
  Pros:
    1. Smaller size than TSV
    2. Object structure maintained
    3. Ubiquitous
  Cons:
    1. Relatively large size

Method: Text (TSV, CSV)
  Pros:
    1. Simple and fast serialization
  Cons:
    1. Largest size
    2. NULL issues
    3. Cannot handle arrays, binaries
    4. Object format issues

After prototyping and evaluating, we decided to use Avro for the following reasons:

· Expressive

· Smaller and faster

· Dynamic (the schema is stored with the data, and the APIs permit reading and creating)

· Includes a file format and a textual encoding

· Leverages versioning support

· Provides cross-language access for Hadoop services

Please note that these results were based on the tests we performed with our data profile. Yours may be different, so it will be important for you to test before committing.

Getting Avro data into Azure Blob Storage

The Foundation Services Team at 343 Industries already uses Windows Azure’s elastic infrastructure for maximum flexibility, and much of that code is written in C# .NET. Therefore, to make sense of the raw game data collected for Halo 4, the Halo engineering team wrote Azure-based services that converted the raw game data into Avro.

[Diagram: Azure-based services converting raw Halo 4 game data into Avro files stored in Azure Blob Storage]

The basic workflow is as follows:

1) Take the raw data and use the patched C# Avro library to convert it into an Avro object.

2) Because Avro data is always serialized with its schema (allowing you to have different versions of Avro data but still read them, since the schema is embedded), we perform a data serialization step to create a serialized data block.

3) This block of data is then compressed using the deflate options within .NET 4.5 (in our case, the Fastest option). A quick note: we started with .NET 4.0, but switching to .NET 4.5 Deflate Fastest gave us slightly better compression and approximately a 3x speed improvement. Ultimately, multiple blocks (with the same schema) are placed into a single Avro data file.

4) Finally, this Avro data file is sent to Azure Blob Storage.
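As a rough sketch of steps 2 and 3 (stdlib Python standing in for the patched C# Avro library, with JSON encoding as a placeholder for Avro binary serialization – the `write_container` helper and its length-prefixed framing are illustrative, not the actual Avro container format):

```python
import io
import json
import zlib

# Hypothetical record schema; the real Halo 4 schema is not public.
schema = {"type": "record", "name": "GameEvent",
          "fields": [{"name": "player", "type": "string"},
                     {"name": "score", "type": "int"}]}

def write_container(records, out, block_size=2):
    """Write the schema once, then deflate-compressed blocks of records."""
    header = json.dumps(schema).encode("utf-8")
    out.write(len(header).to_bytes(4, "big") + header)  # schema travels with the data
    for start in range(0, len(records), block_size):
        block = json.dumps(records[start:start + block_size]).encode("utf-8")
        payload = zlib.compress(block, 1)  # "fastest" deflate level
        out.write(len(payload).to_bytes(4, "big") + payload)

buf = io.BytesIO()
write_container([{"player": "p1", "score": 10}, {"player": "p2", "score": 7},
                 {"player": "p3", "score": 3}], buf)
print("container bytes:", len(buf.getvalue()))
```

The resulting file (step 4) would then be uploaded to Azure Blob Storage as a single blob.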

A quick configuration note: you will need to apply the patch from AVRO-823 (Support Avro data files in C#) to get this up and running.

Configuration Quick Step by Step

As noted earlier, the documentation within Haivvreo is excellent and we definitely used it as our primary source. Therefore, before doing anything, please start with the Avro and Haivvreo documentation.

The configuration steps below are specific to getting the Avro jars up and running within HDInsight.

1) Ensure that you’ve added the Hive bin directory to your PATH variable:

;%HIVE_HOME%\bin

2) Ensure that you have configured your ASV account within core-site.xml:

<property>

  <name>fs.azure.account.key.$myStorageAccountName</name>

  <value>$myStorageAccountKey</value>

</property>

3) Copy the Avro jars built with Haivvreo to your ASV storage account

4) Edit hive-site.xml to include the path to your jars within ASV:

<property>

  <name>hive.aux.jars.path</name>

  <value>asv://$container@account/$folder/avro-1.7.2.jar,asv://$container@account/$folder/avro-mapred-1.7.2.jar,asv://$container@account/$folder/haivvreo-1.0.7-cdh.jar</value>

</property>

5) Ensure that the Avro schema (when you create an Avro object, you can also create an Avro schema that defines that object) is placed into HDFS so you can access it. Refer to the Avro Tips and Tricks section below concerning the use of schema.literal and schema.url.
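For reference, an Avro schema is a JSON document (conventionally an .avsc file). A minimal sketch for a hypothetical game-event record (the names and fields here are illustrative, not the actual Halo 4 schema) might look like:

```json
{
  "type": "record",
  "name": "GameEvent",
  "namespace": "com.example.halo",
  "fields": [
    {"name": "player", "type": "string"},
    {"name": "event", "type": "string"},
    {"name": "score", "type": "int"}
  ]
}
```

This is the file you would place into HDFS and reference via schema.url (or inline via schema.literal).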

Querying the Data

What’s great about this technique is that once you have put your Avro data files into a folder within Azure Blob Storage, you need only create a Hive EXTERNAL table to access and query the data. The Hive external table DDL is of the form:

CREATE EXTERNAL TABLE GameDataAvro (
)
ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
STORED AS INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat'
LOCATION 'asv://container/folder'

For a quick HDInsight on Azure tutorial with Hive, please check the post Hadoop on Azure: HiveQL query against Azure Blob Storage.

Avro Tips and Tricks

Some quick tips and tricks based on our experiences with Avro:

- Compression not only saved disk space, but also reduced the CPU resources used to process the data

- Mixing compressed and uncompressed Avro in the same folder works!

- A corrupt file in ASV may cause jobs to fail. To work around this problem, either manually delete the corrupt files and/or configure Hadoop/Hive to skip a set number of errors.

- Passing long string literals into the Hive metastore DB may result in truncation. To avoid this issue, you can use schema.literal instead of schema.url if the literal is under 8K. If the schema is too big, store it within HDFS and reference it via schema.url, as below.

CREATE EXTERNAL TABLE GameDataAvro (
)
ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES (
  'schema.url'='hdfs://namenodehost:9000/schema/schema.avsc')
STORED AS INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat'
LOCATION 'asv://container/folder'

Comments

  1. Great article Denny.

    “-Compression not only resulted in saving disk space, but less CPU resources were used to process the data”

I would think it would be the opposite – it would reduce the time spent on IO and increase the time spent on CPU. That would make sense since these kinds of jobs are often IO-bound?

    -Morten

    • Thanks Morten – much appreciated! As for the CPU resources, just like you noted in our situation a lot of this is IO bound and the CPU was just waiting, eh?!

      • So is a better way to say it that you were able to use fewer compute nodes because the better IO performance resulted in you being able to make better use of the CPU in each node?

Did you then use fewer nodes, or use the same number and get the answer quicker?

• Sort of – we didn’t really try to use fewer compute nodes. More like we were able to do more with the resources we had, since that was our case, eh?!

2. Great write-up! Quick note Denny; instead of
    ‘schema.url’=’hdfs://x.x.x.x:9000/schema/schema.avsc’
    you should be able to use
    ‘schema.url’=’hdfs://namenodehost:9000/schema/schema.avsc’

    No need for IP address!

