Quick Tip for Compressing Many Small Text Files within HDFS via Pig

One of the good (or bad, depending on your point of view) habits when working with Hadoop is pushing your files into the cluster and worrying about making sense of the data later. One of the many issues with this approach is that you may rapidly run out of disk space on your cluster or in your cloud storage. A good way to alleviate this issue (short of deleting the data) is to compress the data within HDFS. More information on how the script works is embedded within the comments.
/* ** Pig Script:…


Getting your Pig to eat ASV blobs in Windows Azure HDInsight

Recently I was asked how I could get my Pig scripts to access files stored in Azure Blob Storage from the command line prompt. While it is possible to do this from the HDInsight Interactive JavaScript console, it is easier to run these commands from the command line if you want to automate scripts and use the grunt interactive shell. To do this, you will need to:

Ensure your HDInsight Azure cluster is connected to an Azure Blob Storage subscription / account
Familiarize yourself with the pig / grunt interactive shell

Connecting HDInsight Azure to Azure Blob Storage

1) To do this, go to the…
