When working with Hadoop on Azure, you may be used to the idea of putting your data in the cloud. In addition to using Azure Blob Storage, another option is to connect your Hadoop on Azure cluster to Amazon S3 and query the data stored there. Below are the steps to configure Hadoop on Azure to connect to S3 (with the presumption that you already have an Amazon AWS / S3 account and have uploaded data into it).
1) Log into your Amazon AWS account and click on Security Credentials.
2) Obtain your access credentials – you’ll need both your Access Key ID and Secret Access Key.
3) Next, log into your Hadoop on Azure account, click the Manage Cluster live tile, and click on Set up S3. Enter your Access Key and Secret Key and click Save Settings.
4) Once you have successfully saved your Amazon S3 settings, you can access your Amazon S3 files from Hadoop on Azure. For example, I have a bucket called tardis6 that contains a folder called weblog with a sample weblog file in it.
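As a rough sketch of what accessing that bucket could look like, the snippet below builds the URI and the `hadoop fs -ls` command for the tardis6 / weblog example. It assumes the `s3n://` scheme (Hadoop's native S3 filesystem) and that the credentials saved in step 3 are already part of the cluster configuration. The helper names `s3n_uri` and `hadoop_ls_command` are hypothetical, and the command is only run if a `hadoop` binary is found on the PATH.

```python
import shutil
import subprocess


def s3n_uri(bucket, key=""):
    """Build an s3n:// URI for a bucket and an optional folder or file key."""
    return "s3n://{}/{}".format(bucket, key.lstrip("/"))


def hadoop_ls_command(uri):
    """Return the argument list for listing a path with the Hadoop shell."""
    return ["hadoop", "fs", "-ls", uri]


if __name__ == "__main__":
    # Bucket and folder names are taken from the example in step 4.
    cmd = hadoop_ls_command(s3n_uri("tardis6", "weblog"))
    print(" ".join(cmd))
    # Only execute on a machine that actually has the Hadoop CLI installed.
    if shutil.which("hadoop"):
        subprocess.run(cmd, check=True)
```

If the settings were saved correctly, the listing should show the sample weblog file; an authentication error at this point usually means the Access Key or Secret Key needs to be re-entered.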
What if you want to use publicly available data in S3? Is this possible? And if so, how do you configure Azure to access that data? An example can be found here: http://aws.amazon.com/1000genomes/
What if you want to connect to public data in S3? Can that be done? An example can be found here: s3.amazonaws.com/1000genomes
Good question! Let me get back to you on this one, eh?!
Quick question – What are the internals of this process? Is the data moved to Azure Hadoop HDFS, or are we only using the computing power (running MapReduce jobs) while the data stays on S3? I understand how HDFS and MapReduce work on a Hadoop cluster, but I am trying to wrap my mind around how it works in the context of S3 or Azure Blob storage. Thanks.
Actually, data is streamed from the blob store over to Azure so the MR jobs can be executed against it. The temporary files that are created will then reside in HDFS. You can first physically copy the files over from your S3 or Azure Blob Storage account using hadoop fs -copyToLocal, so that the processing can be done without the latency of streaming from the storage account to your compute nodes. But if you have a lot of data, don’t care about the time, and the cost structure of data transfer is okay, then go for it, eh?!
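To make the "copy first, then process" option a bit more concrete, here is a minimal sketch. Note that `-copyToLocal` writes to the local filesystem of the node where it runs, so this example uses `hadoop distcp` instead, which is one common way to stage data directly into HDFS. The source path reuses the tardis6 bucket from the post; the HDFS target path and the helper name `distcp_command` are assumptions made for illustration only.

```python
import shutil
import subprocess


def distcp_command(source_uri, hdfs_target):
    """Return the argument list for a distributed copy into HDFS."""
    return ["hadoop", "distcp", source_uri, hdfs_target]


if __name__ == "__main__":
    # Hypothetical target directory; adjust it to your own cluster layout.
    cmd = distcp_command("s3n://tardis6/weblog", "/user/hadoop/weblog")
    print(" ".join(cmd))
    # Only execute on a machine that actually has the Hadoop CLI installed.
    if shutil.which("hadoop"):
        subprocess.run(cmd, check=True)
```

Once the copy finishes, jobs can read from the HDFS path rather than the `s3n://` URI, trading a one-time transfer for lower read latency on later runs.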
Got it. Thanks for your response.