Thursday TechTips: Hadoop 1.01 and Compression Codecs

I have been experimenting with compression codecs in Hadoop 1.01 over the last few months and wanted to share some quick tech tips. The key piece of advice is to get Tom White’s (@tom_e_white) Hadoop: The Definitive Guide. It is easily the must-have guide for everyone from Hadoop novices to experts.

The key fundamental concerning compression codecs is that not all codecs are immediately available within Hadoop. Some of them are native to Hadoop (one needs to remember to compile the native libraries), while others need to be built from their source and compiled in.

Below is a handy reference table based on Tom’s book and on some observations from my own tests.

Compression Format   Codec                                  Splittable   Compression Space [1]   Compression Time [2]
gzip                 oahic.GzipCodec                        No*          |---x---|               |---x---|
bz2                  oahic.BZip2Codec                       Yes          |----x--|               |x------|
lzo                  com.hadoop.compression.lzo.LzopCodec   No+          |-x-----|               |-----x-|
lz4                  oahic.Lz4Codec                         No           |-x-----|               |-----x-|
snappy               oahic.SnappyCodec                      No           |-x-----|               |-----x-|
  • oahic in the Codec column represents org.apache.hadoop.io.compress
  • [1] Left to right represents least to most compression (further right means smaller output)
  • [2] Left to right represents slow to fast compression time
  • + LZO files are not natively splittable, but you can use the lzop tool to pre-process the file so that it can be split.
  • * While gzip is not natively splittable, there is an open Jira, HADOOP-7076, for this, and you can install the patch yourself from the SplittableGzip GitHub project.
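
To get a feel for the space versus time trade-off in the table, here is a minimal sketch that compares gzip and bz2 using only the Python standard library. It is not Hadoop code and does not exercise the Hadoop codec classes; it only uses the same underlying algorithms, and the sample data and the measure helper are made up for illustration. LZO, LZ4, and Snappy are left out because they are not in the standard library. Your numbers will vary with the data, so treat this as a starting point for your own tests.

```python
import bz2
import gzip
import time


def measure(name, compress, decompress, data):
    """Compress data once and report the size ratio and elapsed time."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    # Round-trip check so a ratio is never reported for corrupt output.
    assert decompress(packed) == data
    return {"codec": name, "ratio": len(packed) / len(data), "seconds": elapsed}


# Hypothetical sample: repetitive log-like text, a little over 1 MB.
sample = b"".join(
    b"2012-09-06 12:00:%02d INFO task_%06d finished\n" % (i % 60, i)
    for i in range(30000)
)

results = [
    measure("gzip", gzip.compress, gzip.decompress, sample),
    measure("bz2", bz2.compress, bz2.decompress, sample),
]

for row in results:
    print("%-5s ratio=%.3f time=%.4fs" % (row["codec"], row["ratio"], row["seconds"]))
```

A lower ratio means smaller output. On real data sets, run this against a representative slice of your own files rather than synthetic text.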

While each project has its own profile, here are some key best practices, paraphrased from Hadoop: The Definitive Guide and listed in order of effectiveness:

  1. Use a container file format (SequenceFile, RCFile, or Avro)
  2. Use a compression format that supports splitting, such as bz2
  3. Manually split large files into HDFS block-size chunks and compress each chunk individually
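
The third option can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the split_and_compress function, the sample records, and the tiny 4 KB "block" are all hypothetical, and the chunks are cut on line boundaries so that no record is split in two. It limits the uncompressed size of each chunk; in practice you would tune the chunk size so that the compressed output lands close to your HDFS block size.

```python
import gzip


def split_and_compress(lines, block_size):
    """Group lines into chunks of at most block_size bytes and gzip each chunk.

    A single line longer than block_size still gets a chunk of its own.
    """
    chunks, current, current_size = [], [], 0
    for line in lines:
        # Close the current chunk before it would exceed the block size.
        if current and current_size + len(line) > block_size:
            chunks.append(gzip.compress(b"".join(current)))
            current, current_size = [], 0
        current.append(line)
        current_size += len(line)
    if current:
        chunks.append(gzip.compress(b"".join(current)))
    return chunks


# Hypothetical input: 1,000 records, with a tiny 4 KB "block" for illustration.
records = [b"record-%05d,some,comma,separated,values\n" % i for i in range(1000)]
parts = split_and_compress(records, block_size=4096)

# Each part decompresses on its own, so it can be processed
# without reading any other part.
restored = b"".join(gzip.decompress(p) for p in parts)
print(len(parts), "chunks,", restored == b"".join(records))
```

Because every chunk is a complete gzip file, the fact that gzip itself is not splittable no longer matters: each chunk can be handed to a separate map task.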

Some other handy compression tips are noted below.

Hope this helps!

3 Comments

  1. […] Lee (@dennylee) posted Thursday TechTips: Hadoop 1.01 and Compression Codecs on 9/6/2012: I have been playing around with the compression codecs with Hadoop 1.01 over the last […]

  2. […] Gzip as it has the best compression, but it may not have the best performance – More info at: https://dennyglee.com/2012/09/06/thursday-techtips-hadoop-1-01-and-compression-codecs/ set output.compression.enabled true; set output.compression.codec […]

  3. […] fun with Hue – specifically Beeswax to execute Hive queries from a nice web UI. As noted in Hadoop compression codecs and optimizing Hive joins (and using compression to do it), using compression gives you more space […]
