By Brad Sarsfield and Denny Lee
A commonly asked question is why one should store data in Azure Blob Storage instead of HDFS for HDInsight on Azure. After all, Hadoop is all about moving compute to the data, versus the traditional approach of moving data to the compute.
Note that this post was published in March 2013. This information may not be current, but I have kept this post for posterity.
The network is often the bottleneck, and making it performant can be expensive. Yet the recommended practice for HDInsight on Azure is to place the data in Azure Blob Storage (also known by the moniker ASV, or Azure Storage Vault). These storage nodes are separate from the compute nodes that Hadoop uses to perform its calculations, which seems to conflict with the idea of moving compute to the data.
It’s all about the Network
As noted above, the typical HDInsight infrastructure places HDInsight on the compute nodes while the data resides in Azure Blob Storage. To ensure fast data transfer speeds, Azure recently deployed the Azure Flat Network Storage (also known as Quantum 10, or Q10), a flat mesh network providing high-bandwidth connectivity between storage clients and storage. For more information, please refer to Brad Calder's very informative post: Windows Azure's Flat Network Storage and 2012 Scalability Targets. Suffice it to say that the performance of HDFS on local disk and HDFS on ASV is comparable, and in some cases ASV is faster, thanks to the Q10 network.
But all that data movement?!
But if I'm moving all of this data from the storage nodes to the compute nodes and then back again (for storage), wouldn't that be a lot of data to move? And wouldn't this translate to slow query performance? There is indeed data movement, but there are technical and business reasons why something like Q10 satisfies most requirements.
When HDInsight performs its task, it streams data from the storage nodes to the compute nodes. But many of the map, sort, and reduce tasks are processed on local disk on the compute nodes themselves, with minimal network load. The shuffle phase does use some network to move intermediate data from the (many) mapper nodes to the (fewer) reducer nodes, and the final step of transferring results back to storage typically involves a much smaller dataset (e.g., a query result or report). In the end, the network is most heavily utilized during the initial and final streaming phases, while most of the other tasks are performed intra-nodally (i.e., with minimal network utilization).
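The phase-by-phase picture above can be summarized with a back-of-the-envelope model. The ratios below are illustrative assumptions, not HDInsight measurements:

```python
# Back-of-the-envelope model of network traffic per MapReduce phase.
# All figures are illustrative assumptions, not HDInsight measurements.

def network_traffic_gb(input_gb, shuffle_ratio=0.3, output_ratio=0.01):
    """Estimate data crossing the network in each phase.

    shuffle_ratio: assumed fraction of input emitted by mappers and moved
                   to reducers (combiners typically shrink this further).
    output_ratio:  assumed fraction of input that survives to the final
                   result (e.g., an aggregate or report).
    """
    read = input_gb                      # stream from storage to compute
    shuffle = input_gb * shuffle_ratio   # mapper -> reducer transfer
    write = input_gb * output_ratio      # small result back to storage
    return {"read": read, "shuffle": shuffle, "write": write}

traffic = network_traffic_gb(1000)  # a 1 TB input
print(traffic)  # most bytes move in the initial read; the final write is tiny
```

Under these assumptions, the initial read dominates and the write-back is roughly two orders of magnitude smaller, which is why the streaming phases are where the Q10 bandwidth matters most.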
Another important aspect is that a query against a Hadoop cluster typically does not ask for all of the data in the cluster. Common queries cover the current day, week, or month of data. That is, a much smaller subset of data is transferred from the storage nodes to the compute nodes.
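To illustrate (with hypothetical paths, not an actual HDInsight layout), a date-partitioned storage layout lets a job read only the files for the requested day rather than the whole dataset:

```python
# Illustrative only: selecting one day's partition from date-laid-out
# paths, the way a partitioned Hive table would. Paths are hypothetical.
paths = [
    "data/logs/2013/03/01/part-00000",
    "data/logs/2013/03/01/part-00001",
    "data/logs/2013/03/02/part-00000",
]

def partition_for_day(paths, year, month, day):
    """Return only the files under the given day's directory prefix."""
    prefix = f"data/logs/{year:04d}/{month:02d}/{day:02d}/"
    return [p for p in paths if p.startswith(prefix)]

print(partition_for_day(paths, 2013, 3, 2))  # only one file is transferred
```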
So how is the performance?
A quick summary of performance is as follows:
- Azure Blob Storage provides near-identical HDFS access characteristics for reading (performance and task splitting) into map tasks.
- Azure Blob Storage provides faster write access than local-disk HDFS, allowing jobs to complete faster when writing their output back to storage.
HDFS: Azure Blob Storage vs. Local Disk
MapReduce uses HDFS, which is itself just a file system abstraction. There are two implementations of the HDFS file system when running Hadoop in Azure: one backed by local disk and one backed by Azure Blob Storage. Both are still HDFS; the code path for map/reduce against local-disk HDFS and against the Azure Blob file system is identical. You can specify the file split size (minimum 64 MB, maximum 100 GB, default 5 GB), so a single file will be split up and read by different mappers, just as with local-disk HDFS.
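As a rough sketch of the split arithmetic (the exact Hadoop configuration property names, such as mapred.min.split.size, vary by version and are not taken from this post):

```python
import math

def num_splits(file_size_mb, split_size_mb=5 * 1024):
    """Number of input splits (and hence map tasks) for a single file.

    The default mirrors the 5 GB split size mentioned above; a file
    smaller than one split still gets one mapper.
    """
    return max(1, math.ceil(file_size_mb / split_size_mb))

# A 100 GB file with the default 5 GB split size -> 20 parallel mappers
print(num_splits(100 * 1024))
```

This is why a single large blob still parallelizes across the cluster: each split is handed to a different map task, exactly as with local-disk HDFS.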
Where processing happens…
The processing in both cases happens on the Hadoop cluster's task tracker/worker nodes. So if you have a cluster with five worker nodes (medium VMs) and one head node, you will have two map slots and one reduce slot per worker. When you process data, it will utilize all five worker nodes across all of their map/reduce task slots. Data is read through HDFS regardless of where it resides (local disk, remote disk, or Azure Blob).
As noted above, we have re-architected our networking infrastructure with Q10 in our data centers to accommodate the Hadoop scenario. All up, we have an incredibly low oversubscription ratio for networking, so there is a great deal of throughput available between Hadoop and Blob storage. The worker nodes will each read up to 800 Mbps from Azure Blob (which is remote), which is equivalent to how fast the VM can read off of local disk. With the right storage account placement and settings, we can achieve disk speed for approximately 50 worker nodes. It's screaming fast today.
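Doing the arithmetic on those figures (illustrative, not a service guarantee):

```python
# Aggregate read throughput implied by the figures above.
MBPS_PER_WORKER = 800   # per-node read rate from Azure Blob, per the post
workers = 50            # the cluster size the post cites

aggregate_mbps = MBPS_PER_WORKER * workers
aggregate_gb_per_s = aggregate_mbps / 8 / 1000  # megabits -> gigabytes

print(f"{aggregate_mbps} Mbps, or about {aggregate_gb_per_s} GB/s aggregate")
```

At 800 Mbps per node, 50 workers can together pull on the order of 5 GB/s from Blob storage, which is the sense in which remote reads keep pace with local disk.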
There are some mind-bendingly fast networking speeds coming down the pipe in the next year that will likely triple that number. The question at that point becomes: can the maps consume the data faster than it can be read off disk? At that point, computationally intensive calculations will bottleneck on map CPU rather than on HDFS bandwidth.
Where Azure Blob HDFS starts to win is in writing data out, i.e., when your map/reduce program finishes and writes its results to HDFS. With local-disk HDFS, to write three copies, replica #1 is written first, and then #2 and #3 are written in parallel to two other nodes remotely; the write is not completed, or "sealed," until the two remote copies have been written. With Azure Blob Storage, the write is completed and sealed as soon as copy #1 is written; Azure Blob then performs, by default, 6x replication (three copies in the local data center and three remote). This gives us high performance, high durability, and high availability.
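A toy model of the difference in time-to-seal makes the point; the latencies are made-up illustrative numbers, not measurements:

```python
# Toy model of write "seal" time (all latencies in ms are hypothetical).

def seal_time_local_hdfs(local_write_ms, remote_write_ms):
    """Local-disk HDFS: replica #1 is written, then #2 and #3 go to two
    remote nodes in parallel; the write seals only after both remote
    copies land, so their (overlapping) latency is on the critical path."""
    return local_write_ms + remote_write_ms

def seal_time_asv(local_write_ms):
    """Azure Blob: the write seals after copy #1; the remaining
    replication happens outside the job's critical path."""
    return local_write_ms

print(seal_time_local_hdfs(10, 25))  # remote replication delays the seal
print(seal_time_asv(10))             # the job moves on immediately
```

Whatever the actual latencies, the structural difference is the same: ASV removes the remote-replica round trip from the job's critical path.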
It is also worth noting that there is no data durability service level objective for data stored in local-disk HDFS, while Azure provides a high level of data durability and availability. All of these characteristics, of course, depend heavily on the actual workload.
Nasuni's The State of Cloud Storage 2013 Industry Report notes the following concerning Azure Blob Storage:
- Speed: Azure's write speed was 56% faster than that of the No. 2, Amazon S3, and its read speed was 39% faster than that of the No. 2, HP Cloud Object Storage.
- Availability: Azure’s average response time was 25% faster than Amazon S3, which had the second fastest average time.
- Scalability: Amazon S3 varied only 0.6% from its average scaling tests, with Microsoft Windows Azure varying by 1.9% (both very acceptable levels of variance). The two OpenStack-based clouds – HP and Rackspace – showed a variance of 23.5% and 26.1%, respectively, with performance becoming more and more unpredictable as object counts increased.