By Brad Sarsfield and Denny Lee

One of the questions we are commonly asked concerning HDInsight, Azure, and Azure Blob Storage is why one should store data in Azure Blob Storage instead of in HDFS on the HDInsight Azure compute nodes.  After all, Hadoop is all about moving compute to data rather than the traditional approach of moving data to compute, as noted in Moving data to compute or compute to data? That is the Big Data question.  The network is often the bottleneck, and making it performant can be expensive.  Yet the practice for HDInsight on Azure is to place the data into Azure Blob Storage (also known by the moniker ASV – Azure Storage Vault); these storage nodes are separate from the compute nodes that Hadoop uses to perform its calculations.  This seems to conflict with the idea of moving compute to the data.


It’s all about the Network

As noted in the above diagram, the typical HDInsight infrastructure places HDInsight on the compute nodes while the data resides in Azure Blob Storage.  To ensure that the transfer of data from storage to compute is fast, Azure recently deployed the Azure Flat Network Storage (also known as Quantum 10 or Q10 network), a mesh grid network that allows very high bandwidth connectivity for storage clients. For more information, please refer to Brad Calder’s very informative post: Windows Azure’s Flat Network Storage and 2012 Scalability Targets. Suffice it to say, the performance of HDFS with local disk and HDFS using ASV is comparable, and in some cases we have seen jobs run faster on ASV due to the fast performance of the Q10 network.

But all that data movement?!

But if I’m moving all of this data from the storage nodes to the compute nodes and then back (for storage), wouldn’t that be a lot of data to move, thus slowing query performance?  While that is true, there are technical and business reasons why something like Q10 would satisfy most requirements.


When HDInsight is performing its task, it is streaming data from the storage nodes to the compute nodes.  But much of the map, sort, shuffle, and reduce work that Hadoop performs is done on the local disks of the compute nodes themselves.  The map, reduce, and sort tasks are typically performed on the compute nodes with minimal network load, while the shuffle tasks use some network to move the data from the mapper nodes to the smaller number of reducer nodes.  The final step of storing the data back to storage typically involves a much smaller dataset (e.g. a query dataset or report).  In the end, the network is most heavily utilized during the initial and final streaming phases, while most of the other tasks are performed intra-nodally (i.e. with minimal network utilization).



Another important aspect is that when one queries data from a Hadoop cluster, they are typically not asking for all of the data in the cluster.  Common queries involve the current day, week, or month of data.  That is, a much smaller subset of data is being transferred from the storage node to compute nodes.

So how is the performance?

The quick summary on performance is:

  • Azure Blob storage provides near identical HDFS access characteristics for reading (performance and task splitting) into map tasks.
  • Azure Blob storage provides faster write access for Hadoop HDFS, allowing jobs to complete faster when writing data to disk from reduce tasks.

HDFS: Azure Blob Storage vs. Local Disk

Map Reduce uses HDFS, which itself is actually just a file system abstraction.  There are two implementations of the HDFS file system when running Hadoop in Azure: one is the local file system, the other is Azure Blob.  Both are still HDFS; the code path for map reduce against local file system HDFS and against the Azure Blob file system is identical.  You can specify the file split size (minimum 64MB, max 100GB, default 5GB).  So a single file will be split and read by different mappers, just like local disk HDFS.
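To make the split behavior concrete, here is a minimal sketch of how many splits (and therefore map tasks) a single file would yield. It uses only the split-size bounds and default quoted above; the function name count_splits is ours for illustration, and the model is simplified (real input formats may, for example, fold a small trailing remainder into the last split).

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3

# Bounds and default taken from the text above.
MIN_SPLIT = 64 * MB
MAX_SPLIT = 100 * GB
DEFAULT_SPLIT = 5 * GB


def count_splits(file_size_bytes, split_size_bytes=DEFAULT_SPLIT):
    """Return how many input splits a file of the given size would yield.

    Hypothetical helper: a simplified model, not the Hadoop implementation.
    """
    if not MIN_SPLIT <= split_size_bytes <= MAX_SPLIT:
        raise ValueError("split size must be between 64 MB and 100 GB")
    if file_size_bytes <= 0:
        return 0
    # Each split is handed to one mapper; the last split may be partial.
    return math.ceil(file_size_bytes / split_size_bytes)


if __name__ == "__main__":
    print(count_splits(20 * GB))           # default 5 GB splits
    print(count_splits(20 * GB, 64 * MB))  # minimum 64 MB splits
```

Under these assumptions a 20 GB file yields 4 splits at the default size and 320 splits at the 64 MB minimum, so a smaller split size trades more map tasks for finer-grained parallelism.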

Where processing happens…

The processing in both cases happens on the Hadoop cluster’s task tracker/worker nodes.  So if you have a cluster with 5 worker nodes (medium VMs) and one head node, you will have 2 map slots and 1 reduce slot per worker; when map reduce processes data from HDFS, the work happens on the 5 worker nodes across all of the map / reduce task slots.  Data is read from HDFS, regardless of location (local disk, remote disk, Azure Blob).
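The slot arithmetic for that example cluster can be sketched as follows. The per-worker slot counts come from the text; the helper names and the 320-task job are hypothetical, and the "waves" figure assumes every map task takes about the same time.

```python
import math


def cluster_slots(worker_nodes, map_slots_per_worker=2, reduce_slots_per_worker=1):
    """Total map and reduce slots across all worker nodes (head node excluded)."""
    return {
        "map": worker_nodes * map_slots_per_worker,
        "reduce": worker_nodes * reduce_slots_per_worker,
    }


def map_waves(num_map_tasks, worker_nodes, map_slots_per_worker=2):
    """Rounds of map tasks needed when all map slots run in parallel."""
    total_map_slots = worker_nodes * map_slots_per_worker
    return math.ceil(num_map_tasks / total_map_slots)


if __name__ == "__main__":
    print(cluster_slots(5))   # the 5-worker cluster described above
    print(map_waves(320, 5))  # hypothetical job with 320 map tasks
```

For the 5-worker cluster this gives 10 map slots and 5 reduce slots, so a hypothetical job with 320 map tasks would run in 32 waves.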

Data locality?

As noted above, we have re-architected our networking infrastructure with Q10 in our datacenters to accommodate the Hadoop scenario.  All up we have an incredibly low overhead / subscription ratio for networking, therefore we can have a lot of throughput between Hadoop and Blob. The worker nodes, Medium VMs, will each read up to 800Mbps from Azure Blob (which is running remotely); this is equivalent to how fast the VM can read off of disk.  With the right storage account placement and settings we can achieve disk speed for approximately 50 worker nodes. It’s screaming fast today, and there are some mind-bendingly fast networking speeds coming down the pipe in the next year that will likely triple that number. The question at that point is whether the maps can consume the data faster than it can be read off of disk. Sometimes they can; at other times, computationally intensive jobs will be bottlenecked on the map CPU rather than on HDFS bandwidth.
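As a back-of-the-envelope check on those figures, the sketch below multiplies the per-node read rate by the node count. It assumes ideal linear scaling (which the text suggests holds up to roughly 50 worker nodes), decimal units (1 Mbps = 1,000,000 bits per second), and a hypothetical 1 TB input; the function names are ours.

```python
PER_NODE_MBPS = 800  # per-worker read rate from Azure Blob quoted in the text


def aggregate_read_mbps(worker_nodes, per_node_mbps=PER_NODE_MBPS):
    """Aggregate read bandwidth in Mbps, assuming ideal linear scaling."""
    return worker_nodes * per_node_mbps


def seconds_to_read(total_bytes, worker_nodes, per_node_mbps=PER_NODE_MBPS):
    """Time to stream total_bytes from storage if every node reads at full rate."""
    bits_per_second = aggregate_read_mbps(worker_nodes, per_node_mbps) * 1_000_000
    bytes_per_second = bits_per_second / 8
    return total_bytes / bytes_per_second


if __name__ == "__main__":
    print(aggregate_read_mbps(50))          # Mbps across 50 workers
    print(seconds_to_read(10 ** 12, 50))    # hypothetical 1 TB input
```

With 50 workers this works out to 40,000 Mbps (40 Gbps, or about 5 GB/s), so streaming a hypothetical 1 TB input would take on the order of 200 seconds before any CPU cost is counted.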

Where Azure Blob HDFS starts to win is on writing data out, i.e. when your map reduce program finishes and writes to HDFS.  With local disk HDFS, to write 3 copies, copy #1 is written first and then copies #2 and #3 are written in parallel to two other nodes, remotely.  The write is not completed or “sealed” until the two remote copies have been written.  With Azure Blob storage the write is completed and sealed after #1 is written; Azure Blob then takes care of replication on its own time, by default 6x (three copies in the local datacenter, three in the remote).  This gives us high performance, high durability, and high availability.
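The difference in when a write is considered complete can be expressed as a toy latency model. This follows the description above (first copy, then two remote copies in parallel) rather than the actual HDFS or Azure Blob implementations, and the timings in the example are made-up numbers used only for illustration.

```python
def local_hdfs_write_seconds(first_copy, remote_copy_2, remote_copy_3):
    """Time until a 3-copy local HDFS write is sealed.

    Copy #1 is written first; copies #2 and #3 are then written in parallel,
    so the slower of the two remote copies decides when the write completes.
    """
    return first_copy + max(remote_copy_2, remote_copy_3)


def blob_write_seconds(first_copy):
    """Time until an Azure Blob write is sealed.

    Only copy #1 is on the job's critical path; replication happens later,
    outside the job.
    """
    return first_copy


if __name__ == "__main__":
    # Hypothetical timings in seconds, not measurements.
    print(local_hdfs_write_seconds(1.0, 1.5, 2.0))
    print(blob_write_seconds(1.0))
```

With these made-up timings the local HDFS write is sealed after 3.0 seconds while the Blob write is sealed after 1.0 second; the replication work still happens, but it no longer delays job completion.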

As well, it is important to note that we do not have a data durability service level objective for data stored in local disk HDFS, while Azure provides a high level of data durability and availability.  Keep in mind that all of these parameters are highly dependent on the actual workload characteristics.


Nasuni’s The State of Cloud Storage 2013 Industry Report notes the following concerning Azure Blob Storage:

Speed: Azure was 56% faster than the No. 2 Amazon S3 in write speed, and 39% faster at reading files than the No. 2 HP Cloud Object Storage in read speed.

Availability: Azure’s average response time was 25% faster than Amazon S3, which had the second fastest average time.

Scalability: Amazon S3 varied only 0.6% from its average in the scaling tests, with Microsoft Windows Azure varying 1.9% (both very acceptable levels of variance). The two OpenStack-based clouds – HP and Rackspace – showed variance of 23.5% and 26.1%, respectively, with performance becoming more and more unpredictable as object counts increased.