Quick Tip for Compressing Many Small Text Files within HDFS via Pig

One of the good (or bad, depending on your point of view) habits when working with Hadoop is that you can push your files into the cluster and worry about making sense of the data later.  One of the many issues with this approach is that you may rapidly run out of disk space on your cluster or in your cloud storage.  A good way to alleviate this (short of deleting the data) is to compress the data within HDFS. The Pig script below does this; more information on how it works is embedded in the comments.

/*
** Pig Script: Compress HDFS data
**
** Purpose:
** Compress HDFS data while keeping the original folder structure
**
** Parameter:
** $date - in the format YYYYMMDD (following HDFS folder structure)
**
** Example call:
** pig -param date=20140106 CompressHDFSData.pig
*/

-- set compression
-- Chose Gzip for its high compression ratio, although it is not the fastest codec
-- More info at: http://dennyglee.com/2012/09/06/thursday-techtips-hadoop-1-01-and-compression-codecs/
set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

-- set large split size to merge small files together and then compress
-- This number is in bytes - which translates to 2.5GB
-- This is a large number but only because my files had a 10:1 compression ratio
-- i.e. each combined split of up to 2.5GB shrinks down to about 256MB when compressed
set pig.maxCombinedSplitSize 2684354560;

-- load files and store them again (using compression codec)
-- creates the ${date}_gz folder so now you have $date and ${date}_gz folders
-- (the braces stop Pig from reading "date_gz" as the parameter name)
inputFiles = LOAD '/lib/continuum/$date/' USING PigStorage();
STORE inputFiles INTO '/lib/continuum/${date}_gz/' USING PigStorage();

-- remove the original $date folder and rename the ${date}_gz folder to $date
-- this way prior scripts that depend on the $date hierarchy do not have to be rewritten
rm hdfs://[HADOOP_CLUSTER]:[PORT]/lib/continuum/$date
mv hdfs://[HADOOP_CLUSTER]:[PORT]/lib/continuum/${date}_gz hdfs://[HADOOP_CLUSTER]:[PORT]/lib/continuum/$date
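
If your files compress at a different ratio, the split size in the script needs to change with it. Here is a minimal Python sketch of that arithmetic, assuming you know your target compressed file size and a rough compression ratio. The function name max_combined_split_size is made up for illustration and is not part of Pig.

```python
def max_combined_split_size(target_compressed_mb: int, compression_ratio: float) -> int:
    """Return the uncompressed split size in bytes.

    target_compressed_mb: desired size of each compressed output file, in MB
    compression_ratio: uncompressed size divided by compressed size (e.g. 10 for 10:1)
    """
    bytes_per_mb = 1024 * 1024
    return int(target_compressed_mb * bytes_per_mb * compression_ratio)


if __name__ == "__main__":
    # 256MB compressed files at a 10:1 ratio, as in the script comments
    size = max_combined_split_size(256, 10)
    print(f"set pig.maxCombinedSplitSize {size};")
```

With the numbers from the script comments (256MB at 10:1) this gives 2684354560 bytes, i.e. the 2.5GB used above.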

Note: for better performance and possibly better compression, another option is to convert your text files into a binary or columnar file format (e.g. SequenceFile, Parquet, ORC, Avro, Protocol Buffers, etc.).  But this is a good quick way (coding-wise anyways) to compress your text files.

Enjoy!
