I have been playing around with compression codecs on Hadoop 1.0.1 over the last few months and wanted to share some quick tech tips on compression codecs and Hadoop. The key piece of advice is to get Tom White’s (@tom_e_white) Hadoop: The Definitive Guide; it is easily the must-have guide for everyone from Hadoop novices to experts.
The key fundamental concerning compression codecs is that not all of them are immediately available within Hadoop. Some are bundled with Hadoop (remember to compile the native libraries), while others must be obtained in source form and compiled in separately.
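Codecs that do not ship with Hadoop, such as the LZO classes from the separately built hadoop-lzo project, also have to be registered before Hadoop will load them. A minimal core-site.xml sketch, assuming the hadoop-lzo jar and its native libraries are already installed:

```xml
<!-- core-site.xml: register the codec implementations Hadoop should load. -->
<!-- The com.hadoop.compression.lzo classes assume hadoop-lzo is installed. -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,
         org.apache.hadoop.io.compress.DefaultCodec,
         org.apache.hadoop.io.compress.BZip2Codec,
         com.hadoop.compression.lzo.LzoCodec,
         com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```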
Below is a handy reference table based on Tom’s book and some observations from my own tests.
| Compression Format | Codec | Splittable | Compression Space | Compression Time |
|---|---|---|---|---|
| DEFLATE | oahic.DefaultCodec | No | Medium | Medium |
| gzip* | oahic.GzipCodec | No | Medium | Medium |
| bzip2 | oahic.BZip2Codec | Yes | High | Slow |
| LZO+ | com.hadoop.compression.lzo.LzopCodec | No | Low | Fast |
| Snappy | oahic.SnappyCodec | No | Low | Fast |
- oahic in the Codec column is shorthand for org.apache.hadoop.io.compress
- The Compression Space and Compression Time columns are relative rankings (least to most space saved, slow to fast), based on the book and my own tests
- + LZO is not natively splittable, but you can use the lzop tool to pre-process (index) the file.
- * While gzip is not natively splittable, there is an open JIRA (HADOOP-7076) for this, and you can install the patch yourself from the SplittableGzip GitHub project.
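The space/time tradeoff in the table shows up even within a single codec family. A small, self-contained sketch using plain java.util.zip (no Hadoop dependencies; the class name and sample data are just for illustration) compares DEFLATE at its fastest and strongest settings:

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.Deflater;

public class CompressionCompare {
    // Compress input with DEFLATE at the given level (1 = fastest, 9 = best).
    static byte[] deflate(byte[] input, int level) {
        Deflater d = new Deflater(level);
        d.setInput(input);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) {
            out.write(buf, 0, d.deflate(buf));
        }
        d.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Highly compressible sample data, purely for illustration.
        byte[] data = new byte[64 * 1024];
        Arrays.fill(data, (byte) 'a');
        int fast = deflate(data, Deflater.BEST_SPEED).length;
        int best = deflate(data, Deflater.BEST_COMPRESSION).length;
        System.out.println("BEST_SPEED: " + fast
                + " bytes, BEST_COMPRESSION: " + best + " bytes");
    }
}
```

The same knob exists in Hadoop’s DEFLATE-based codecs via the zlib compression level, which is why gzip/DEFLATE sit in the middle of both columns.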
While each project has its own profile, some key best practices, paraphrased from Hadoop: The Definitive Guide and listed in order of effectiveness, are:
- Use a container file format (SequenceFile, RCFile, or Avro)
- Use a compression format that supports splitting, such as bzip2
- Manually split large files into chunks close to the HDFS block size and compress each chunk individually
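For the container-format route, compression granularity matters: BLOCK compression in a SequenceFile compresses many records together, which saves much more space than compressing each record alone. A sketch of the job-level settings (Hadoop 1.x property names; the codec choice is just one example) when writing SequenceFile output:

```xml
<!-- mapred-site.xml (or per-job configuration): compress SequenceFile output. -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <!-- RECORD compresses each record alone; BLOCK groups records together -->
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
```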
Some other handy compression tips are noted below.
- LZO vs Snappy vs LZF vs ZLIB, A comparison of compression algorithms for fat cells in HBase
- 10 MapReduce Tips (good in general, not just compression)
- Use Compression with MapReduce
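As a concrete starting point for compression in MapReduce, intermediate (map) output is usually the easiest win: it only needs a fast codec and never has to be splittable, since each reducer fetches whole map output partitions. A sketch of the Hadoop 1.x properties (Snappy here is just one reasonable choice, assuming its native library is installed):

```xml
<!-- mapred-site.xml: compress intermediate map output with a fast codec. -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```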
Hope this helps!