A Quick HBase Primer from a SQLBI Perspective

One of the questions I’m often asked, especially from a BI perspective, is how a BI person should look at HBase. After all, HBase is often described quickly as an in-memory column store database; isn’t that what SSAS Tabular is? Yet calling HBase an in-memory column store database isn’t quite right, because here the terms column, database, table, and row do not mean quite what they do in a relational database.

Setting the Context

How I usually start off is by providing a completely different context before going back to BI. The best way to kick this off is to note that HBase is an integral part of Facebook’s messaging system. Facebook’s New Real-time Messaging System: HBase to Store 135+ Billion Messages a Month is a great blog post covering the architectural details of how HBase allows Facebook to deal with extremely large volumes of volatile messages. This isn’t something you would typically see in the BI world, eh?!

Understanding HBase and BigTable

With the context set, let’s go back to understanding more about HBase by reviewing Google’s BigTable. Understanding HBase and BigTable is a must-read; its key concepts are noted below, with attribution to Jim R. Wilson.

As a refresher, the definition of BigTable is:

A BigTable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

With the keywords here being:
map | persistent | distributed | sorted | multidimensional

Following JSON notation, a map can be seen as:

{
  "zzzzz" : "woot",
  "xyz" : "hello",
  "aaaab" : "world",
  "1" : "x",
  "aaaaa" : "y"
}

while a sorted map is

{
  "1" : "x",
  "aaaaa" : "y",
  "aaaab" : "world",
  "xyz" : "hello",
  "zzzzz" : "woot"
}
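To see why the ordering matters, here is a small Python sketch. It assumes a plain in-memory list of keys and is not HBase-specific; the `scan` helper is hypothetical. Once keys are kept in sorted order, reading a contiguous range of keys becomes a binary search followed by a sequential read.

```python
from bisect import bisect_left

# The unsorted map from above, as a plain dict.
data = {"zzzzz": "woot", "xyz": "hello", "aaaab": "world", "1": "x", "aaaaa": "y"}

# Keep the keys in lexicographic order. "1" sorts before "aaaaa" because
# the comparison is character by character, not numeric.
sorted_keys = sorted(data)

def scan(start, stop):
    """Return (key, value) pairs with start <= key < stop, in key order."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    return [(k, data[k]) for k in sorted_keys[lo:hi]]

print(sorted_keys)
print(scan("aaaaa", "b"))
```

Neighbouring keys such as "aaaaa" and "aaaab" end up next to each other, which is what makes range reads cheap.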

and a multidimensional sorted map looks like

{
  "1" : {
    "A" : "x",
    "B" : "z"
  },
  "aaaaa" : {
    "A" : "y",
    "B" : "w"
  },
  "aaaab" : {
    "A" : "world",
    "B" : "ocean"
  },
  "xyz" : {
    "A" : "hello",
    "B" : "there"
  },
  …
}
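The BigTable definition also indexes each value by a timestamp. A minimal sketch of that extra dimension, using nested Python dicts: the `put` and `get` helpers and the value `b"y2"` are illustrative assumptions, not the HBase API. Values are stored as raw bytes, matching the "uninterpreted array of bytes" in the definition.

```python
# row key -> column key -> {timestamp: value}; values are raw bytes.
table = {}

def put(row, column, timestamp, value):
    """Store one version of a cell."""
    table.setdefault(row, {}).setdefault(column, {})[timestamp] = value

def get(row, column, timestamp=None):
    """Return the newest value at or before `timestamp` (newest overall if None)."""
    versions = table.get(row, {}).get(column, {})
    candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
    return versions[max(candidates)] if candidates else None

put("aaaaa", "A", 1, b"y")
put("aaaaa", "A", 5, b"y2")   # a later version of the same cell
put("aaaaa", "B", 1, b"w")

print(get("aaaaa", "A"))      # newest version of the cell
print(get("aaaaa", "A", 3))   # the cell as of timestamp 3
```

Note that the map is sparse: a row only carries the columns that were actually written to it.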

As for persistent and distributed, the best way to see this is through the picture in Lars George’s great post: HBase Architecture 101 – Storage

Some of the key concepts here are:

  • HBase is extremely efficient at random reads and writes
  • It is a distributed, large-scale data store
  • Availability and distribution are defined by regions (more info at: http://hbase.apache.org/book/regions.arch.html)
  • It uses Hadoop (HDFS) for persistence
  • Both HBase and Hadoop are distributed
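To make the region idea a little more concrete, here is a hedged Python sketch. Each region covers a contiguous range of sorted row keys, so finding the region for a key is a search over the region start keys. The boundaries, server names and `locate` helper below are made up for illustration; this is a toy model, not how the HBase client is actually implemented.

```python
from bisect import bisect_right

# Hypothetical layout: each region starts at the given row key and runs
# up to the next region's start key. "" means "from the very first key".
region_starts = ["", "g", "p"]
region_servers = ["server-1", "server-2", "server-3"]

def locate(row_key):
    """Return the (assumed) server hosting the region that covers row_key."""
    index = bisect_right(region_starts, row_key) - 1
    return region_servers[index]

for key in ["1", "aaaaa", "g", "xyz"]:
    print(key, "->", locate(key))
```

Because the row keys are sorted, adjacent keys usually live in the same region, which is why row key design matters so much for how load is spread.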

If the concepts are still a little vague, please do read Jim R. Wilson’s post Understanding HBase and BigTable, and then read it again!

So what about Analytics?

Through all this terminology and architecture design, can HBase do analytics? The answer is yes, but its design is not about creating star schemas in the relational sense; it is about creating column families, which fit nicely into real-time analytics. A great way to dive into this is to check out Dani Abel Rayan’s Making Sense of Streaming Big Data Flume- HBase.
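One pattern that shows how column families can serve real-time analytics is keeping running counters: the row key identifies what is being measured, and columns within a family hold per-interval counts. The sketch below is an in-memory illustration only; the row keys, the `hourly` family name, the sample events and the `increment` helper are all assumptions, and it is not taken from the post referenced above.

```python
from collections import defaultdict

# row key -> "family:qualifier" -> running count (all names are illustrative).
counters = defaultdict(lambda: defaultdict(int))

def increment(row, family, qualifier, amount=1):
    """Add `amount` to one counter cell and return the new value."""
    column = f"{family}:{qualifier}"
    counters[row][column] += amount
    return counters[row][column]

# Simulated event stream of (page, hour) pairs.
events = [
    ("home", "2012-08-29T10"),
    ("home", "2012-08-29T10"),
    ("cart", "2012-08-29T11"),
]
for page, hour in events:
    increment(page, "hourly", hour)

print(dict(counters["home"]))
```

Contrast this with a star schema: there is no fact table to load and aggregate later; the aggregate is updated in place as each event arrives.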

There are many other excellent references, and this is hardly an exhaustive post. The key point is that, while there are many similarities, HBase analytics and SQL Business Intelligence have different contexts. Because of its in-memory column families, it’s easy to adapt HBase for real-time analytics (in addition to its ability to handle volatile messages). BI, on the other hand, is about the ability to slice and dice immense amounts of historical data using familiar BI tools. Over time we may see the concepts merge (Google BigQuery comes immediately to mind), but it’ll be a while.

Meanwhile, I encourage folks who are comfortable with SQLBI to stretch the envelope a bit into the world of NoSQL stores like HBase. It’s a paradigm shift in some ways… yet it also isn’t!

2 Comments

  1. […] Lee (@dennylee) posted A Quick HBase Primer from a SQLBI Perspective on 8/29/2012: One of the questions I’m often asked – especially from a BI perspective – is […]

  2. Great intro, thanks!
