Using Avro with HDInsight on Azure at 343 Industries

By Michael Wetzel, Tamir Melamed, Mark Vayman, Denny Lee
Reviewed by Pedro Urbina Escos, Brad Sarsfield, Rui Martins
Thanks to Krishnan Kaniappan, Che Chou, Jennifer Yi, and Rob Semsey

As noted in the Windows Azure Customer Solution Case Study, Halo 4 Developer 343 Industries Gets New User Insights from Big Data in the Cloud, a critical component in achieving faster Hadoop query and processing performance while keeping file sizes small (yielding Azure storage savings, faster queries, and reduced network overhead) was the use of Avro sequence files.

Avro was designed to make Hadoop more interoperable with other languages; in our context, Avro has a C# API. Another popular format is protobuf (Google’s data interchange format), which was also under consideration. When we compared the various formats (Avro, protobuf, compressed JSON, compressed CSV, etc.) for our specific scenarios, we found Avro to be the smallest and fastest. In our case, Avro resulted in Hive queries with 1/3 the duration (i.e. 3x faster) at only 4% of the original data size. For more information about Avro, please refer to the latest Apache Avro documentation (as of this post, it is Apache Avro 1.7.3).

We will review some of the key components to get Avro up and running with HDInsight on Azure.

The Architecture

Many of the queries we executed against our HDInsight on Azure cluster used the Hive data warehousing framework. Within the context of HDInsight on Azure, all of the data is stored in Azure Blob Storage using block blobs (the top set of boxes in the diagram below). In the HDInsight on Azure preview, Azure Blob Storage is accessed via the ASV protocol (Azure Storage Vault).

[Diagram: HDInsight on Azure compute nodes reading from and writing to Azure Blob Storage (ASV)]

HDInsight itself runs on Azure compute nodes, so the Map Reduce computation is performed there while the source and target “tables” live in ASV.

To improve query performance, HDInsight on Azure utilizes Azure’s Flat Network storage – a grid network between compute and storage – to provide high bandwidth connectivity for compute and storage. For more information, please refer to Brad Calder’s informative post: Windows Azure’s Flat Network Storage and 2012 Scalability Targets.

Hive and Sequence Files

What’s great about Hive is that it is a data warehouse framework on top of Hadoop. Hive allows you to write HiveQL (a SQL-like language) that is ultimately translated into Map Reduce jobs. This means Hive can take advantage of Hadoop’s flexibility to read and store files of different formats and compressions.

In our particular case, we had our own binary format of data files that we had to process and store within Hadoop. Since we had to convert the files, we decided to test various formats including CSV, JSON, Avro, and protobuf. We had the option to build a serializer/deserializer (SerDe) so Hive could read our binary format directly, but in our case it was faster and easier to write our own conversion mechanism than a custom SerDe.

Ultimately, whichever format we chose, the files would be stored in Azure Blob Storage, and HDInsight on Azure would have a Hive table pointing to these files so we could query them as tables.

Jump into Avro with Haivvreo

A slight detour: because the 343 Industries Foundation Services Team went into production on a preview instance of HDInsight on Azure, to get Avro working with HDInsight we needed to utilize Jakob Homan’s GitHub project Haivvreo, a Hive SerDe that LinkedIn developed to process Avro-encoded data in Hive.

Note that Haivvreo is part of the standard Hive installation as of Hive 0.9.1 via HIVE-895. If you are using a version of Hive prior to 0.9.1, you will need to build Haivvreo yourself. Fortunately, the documentation for Haivvreo is straightforward and very complete. For more information on Haivvreo, please go to: https://github.com/jghoman/haivvreo

A quick configuration note: Haivvreo uses two Jackson libraries:

- jackson-core-asl-1.4.2.jar

- jackson-mapper-asl-1.4.2.jar

As of this post, the Jackson libraries in HDInsight are version 1.8 and are fully backward compatible, so you will not need to manually add these jars.

Jump into Protobuf with Elephant-Bird

While we ultimately chose Avro, protobuf was a solid choice, and had our data profile been different we may very well have chosen it. The format is popular enough that it is now native within Hadoop 2.0 / YARN (which as of this writing is still in beta). For systems currently running on the Hadoop 1.0 tree (such as the current version of HDInsight), one of the most complete implementations is Kevin Weil’s Elephant-Bird GitHub project – Twitter’s collection of LZO and Protocol Buffer code for Hadoop, Pig, Hive, and HBase.

Format Performance Review

As noted earlier, we tested the various file formats for compression and query performance using their respective C# libraries. Below is a sample of the same data set in JSON (with Newtonsoft), Avro (with .NET 4.5 Deflate Fastest), and protobuf (with LZO) formats.

Format      Time (ms)   Size (KB)
Raw JSON    18,405      634,070.09
Avro         5,269       30,979.52
Protobuf     6,952       36,163.22

Based on these and other tests, we observed that Avro was both faster and smaller. Faster is always good, but smaller is more important than most may realize: a smaller size means less data to send across the network between storage and compute, which helps query performance. More importantly, it means you can save on Azure Blob Storage costs because you are using less storage space.
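As an illustrative sketch of the size reduction involved (not the team’s actual benchmark harness – the sample records and compression level here are placeholders, using stdlib deflate rather than the C# libraries above), repetitive JSON-style event data compresses dramatically:

```python
import json
import zlib

# Hypothetical repetitive game-event records, standing in for the real
# Halo 4 telemetry (the actual data profile is not public).
records = [{"match_id": i % 50, "player": "player_%d" % (i % 8),
            "event": "kill", "x": 1.5 * i, "y": 2.5 * i}
           for i in range(10_000)]

raw_json = json.dumps(records).encode("utf-8")

# Deflate at the lowest level approximates the ".NET 4.5 Deflate Fastest"
# option mentioned in the post; higher levels trade CPU time for size.
compressed = zlib.compress(raw_json, 1)

print("raw JSON bytes:", len(raw_json))
print("deflate bytes: ", len(compressed))
print("ratio: %.1f%%" % (100.0 * len(compressed) / len(raw_json)))
```

The absolute numbers depend entirely on how repetitive your data is, which is why testing against your own data profile matters.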

Additional tests were performed and the overall pro/con comparison can be seen below.

Method: Sequence (Avro)
  Pros:
    1. Smallest size
    2. Compresses a block at a time; splittable
    3. Object structure maintained
    4. Supports reading old data with a new schema
  Cons:
    1. Need .NET 4.5 to make best use of it
    2. Potentially slower serialization
    3. A schema is required to read/write data

Method: JSON
  Pros:
    1. Smaller size than TSV
    2. Object structure maintained
    3. Ubiquitous
  Cons:
    1. Relatively large size

Method: Text (TSV, CSV)
  Pros:
    1. Simple and fast serialization
  Cons:
    1. Largest size
    2. NULL issues
    3. Cannot handle arrays, binaries
    4. Object format issues

After prototyping and evaluating, we decided to use Avro for the following reasons:

· Expressive

· Smaller and faster

· Dynamic (the schema is stored with the data, and the APIs permit reading and creating)

· Includes a file format and a textual encoding

· Leverages versioning support

· Provides cross-language access for Hadoop services

Please note that these results were based on the tests we performed with our data profile. Yours may be different, so it will be important for you to test before committing.

Getting Avro data into Azure Blob Storage

The Foundation Services Team at 343 Industries already uses Windows Azure’s elastic infrastructure for maximum flexibility, and much of that code is written in C# .NET. Therefore, to make sense of the raw game data collected for Halo 4, the Halo engineering team wrote Azure-based services that converted the raw game data into Avro.

[Diagram: Azure-based services converting raw Halo 4 game data into Avro files stored in Azure Blob Storage]

The basic workflow is as follows:

1) Take the raw data and use the patched C# Avro library to convert it into an Avro object.

2) Because Avro data is always serialized with its schema (allowing you to have different versions of Avro data but still read them, since the schema is embedded), we perform a data serialization step to create a serialized data block.

3) This block of data is then compressed using the deflate options within .NET 4.5 (in our case, the Fastest option). A quick note: we started with .NET 4.0, but switching to .NET 4.5 Deflate Fastest gave us slightly better compression and approximately a 3x speed improvement. Ultimately, multiple blocks (with the same schema) are placed into a single Avro data file.

4) Finally, this Avro data file is sent to Azure Blob Storage.
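As a rough sketch of steps 2 and 3 (stdlib Python standing in for the patched C# Avro library, with JSON encoding as a placeholder for Avro binary serialization – the `write_container` helper and its length-prefixed framing are illustrative, not the actual Avro container format):

```python
import io
import json
import zlib

# Hypothetical record schema; the real Halo 4 schema is not public.
schema = {"type": "record", "name": "GameEvent",
          "fields": [{"name": "player", "type": "string"},
                     {"name": "score", "type": "int"}]}

def write_container(records, out, block_size=2):
    """Write the schema once, then deflate-compressed blocks of records."""
    header = json.dumps(schema).encode("utf-8")
    out.write(len(header).to_bytes(4, "big") + header)  # schema travels with the data
    for start in range(0, len(records), block_size):
        block = json.dumps(records[start:start + block_size]).encode("utf-8")
        payload = zlib.compress(block, 1)  # "fastest" deflate level
        out.write(len(payload).to_bytes(4, "big") + payload)

buf = io.BytesIO()
write_container([{"player": "p1", "score": 10}, {"player": "p2", "score": 7},
                 {"player": "p3", "score": 3}], buf)
print("container bytes:", len(buf.getvalue()))
```

The resulting file (step 4) would then be uploaded to Azure Blob Storage as a single blob.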

A quick configuration note: you will need to apply the patch from AVRO-823 (Support Avro data files in C#) to get this up and running.

Configuration Quick Step by Step

As noted earlier, the documentation within Haivvreo is excellent and we definitely used it as our primary source. Therefore, before doing anything, please start with the Avro and Haivvreo documentation.

The configuration steps below are specific to getting the Avro jars up and running within HDInsight.

1) Ensure that you’ve added the Hive bin directory to your PATH variable:

;%HIVE_HOME%\bin

2) Ensure that you have configured your ASV account within core-site.xml:

<property>

  <name>fs.azure.account.key.$myStorageAccountName</name>

  <value>$myStorageAccountKey</value>

</property>

3) Copy the Avro jars built with Haivvreo to your ASV storage account

4) Edit hive-site.xml to include the path to your jars within ASV:

<property>

  <name>hive.aux.jars.path</name>

  <value>asv://$container@account/$folder/avro-1.7.2.jar,asv://$container@account/$folder/avro-mapred-1.7.2.jar,asv://$container@account/$folder/haivvreo-1.0.7-cdh.jar</value>

</property>

5) Ensure that the Avro schema (when you create an Avro object, you can also create an Avro schema that defines that object) is placed into HDFS so you can access it. Refer to the Avro Tips and Tricks section below concerning the use of schema.literal and schema.url.
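For reference, an Avro schema is a JSON document (conventionally an .avsc file). A minimal sketch for a hypothetical game-event record (the names and fields here are illustrative, not the actual Halo 4 schema) might look like:

```json
{
  "type": "record",
  "name": "GameEvent",
  "namespace": "com.example.halo",
  "fields": [
    {"name": "player", "type": "string"},
    {"name": "event", "type": "string"},
    {"name": "score", "type": "int"}
  ]
}
```

This is the file you would place into HDFS and reference via schema.url (or inline via schema.literal).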

Querying the Data

What’s great about this technique is that once you have put your Avro data files into a folder within Azure Blob Storage, you need only create a Hive EXTERNAL table to access and query the data. The Hive external table DDL is of the form:

CREATE EXTERNAL TABLE GameDataAvro (
)
ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
STORED AS INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat'
LOCATION 'asv://container/folder'

For a quick HDInsight on Azure tutorial with Hive, please check the post Hadoop on Azure: HiveQL query against Azure Blob Storage.

Avro Tips and Tricks

Some quick tips and tricks based on our experiences with Avro:

- Compression not only saved disk space, but also reduced the CPU resources used to process the data

- Mixing compressed and uncompressed Avro in the same folder works!

- A corrupt file in ASV may cause jobs to fail. To work around this problem, either manually delete the corrupt files and/or configure Hadoop/Hive to skip a set number of errors.

- Passing long string literals into the Hive metastore DB may result in truncation. To avoid this issue, you can use schema.literal instead of schema.url if the literal is under 8K. If the schema is too big, store it within HDFS and reference it via schema.url, as below.

CREATE EXTERNAL TABLE GameDataAvro (
)
ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES (
  'schema.url'='hdfs://namenodehost:9000/schema/schema.avsc')
STORED AS INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat'
LOCATION 'asv://container/folder'

Comments

  1. Great article Denny.

    “-Compression not only resulted in saving disk space, but less CPU resources were used to process the data”

I would think it would be the opposite – it would reduce the time spent on IO and increase the time spent on CPU. That would make sense since these kinds of jobs are often IO-bound?

    -Morten

    • Thanks Morten – much appreciated! As for the CPU resources, just like you noted in our situation a lot of this is IO bound and the CPU was just waiting, eh?!

      • So is a better way to say it that you were able to use fewer compute nodes because the better IO performance resulted in you being able to make better use of the CPU in each node?

Did you then use fewer nodes, or use the same number and get the answer quicker?

• Sort of – we didn’t really try to use fewer compute nodes. More like we were able to do more with the resources we had, since that was our case, eh?!

2. Great write-up! Quick note Denny; instead of
    ‘schema.url’=’hdfs://x.x.x.x:9000/schema/schema.avsc’
    you should be able to use
    ‘schema.url’=’hdfs://namenodehost:9000/schema/schema.avsc’

    No need for IP address!

