Scale Up or Scale Out your Data Problems? A Space Analogy

As I write more about Big Data, I’ve been asked whether we still need traditional relational or cube systems now that we have Big Data / NoSQL / Hadoop.  My response is that these are different systems that serve different purposes, even though both are used to better understand data.

But before we dive into the specifics of relational databases compared to Hadoop / Big Data, we first need to talk about the difference between solving a data problem by scaling up and solving it by scaling out.

One way to understand the difference is to use a space analogy.  (If you’re part of the SQL Twitter community, you’ll notice a prevalence of NASA and Space tweets)

Scale Up Space Analogy

Antennae Galaxies

The amazing image above is The Antennae Galaxies / NGC 4038-4039 from the Hubble Telescope – click on the image to see the full extra-large image, it is magnificent.

Database

In terms of space analogies, the Hubble Telescope is analogous to a scale up technology such as your traditional relational database.

  • It often utilizes non-commodity hardware (or, if it is commodity, it’s enterprise commodity hardware).
  • Specialized equipment was designed and built for the Hubble telescope.  While not as astronomic (!) in price, enterprise database performance requires more specialized (and expensive) hardware.
  • The Hubble telescope in itself is a single point of failure in that if we were to lose the telescope (or a lens), we would lose the ability to get all of these amazing images.

This is NOT to say scale up is a bad thing.  After all, we get these amazing images from the Hubble telescope precisely because we (well, NASA scientists) have focused on creating Hubble with non-commodity specialized hardware.  Following the same analogy, relational database systems (or cube systems) also benefit from this scale up approach because the specialized hardware makes it possible to return query results quickly.

Scale Out Space Analogy

Galaxies to SETI

The flow of the above image is my representation of the Search for Extraterrestrial Intelligence (SETI) project.  SETI itself has the ultimate example of scale out distribution with its SETI@Home project.  The search for extraterrestrial intelligence:

  • Begins with the search for radio waves throughout the galaxies.  The image above is that of the Giant Galactic Nebula NGC 3603 (also from the Hubble telescope).
  • Those radio waves are detected by radio observatories located in various locations on the planet.
  • All of this data is stored and initially crunched by super computers like the Barcelona Supercomputing Center.
  • Yet a lot of this crunching is done by BOINC-based projects like SETI@Home – i.e. making use of screensavers on one’s home machine.

That is, a sizeable chunk of the search for radio waves in the astronomical heavens is being done by home computer screensavers: around 5.2 million participants contributing 769 teraFLOPS of processing power (as of 11/14/2009)!

The key facets of commoditized distributed computing are then:

  • The problems can be broken down into small enough chunks so they can be distributed and calculated locally.
  • The system is designed – right from the beginning – to engage with hundreds or thousands of machines.
  • The system can easily handle many points of failure transparently.  Auto-replication is one of the ways to prevent this from being a problem.
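
To make the first two facets a little more concrete, here is a minimal Python sketch. It is only an illustration under some loud assumptions: the function names are made up for this post, a thread pool stands in for the hundreds or thousands of machines, and a sum of squares stands in for the real signal processing. It is not how BOINC or Hadoop is actually implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, chunk_size):
    """Break the full problem into small, independent work units."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def process_chunk(chunk):
    """Work that one machine can do locally, without seeing the other chunks.

    Assumption: a sum of squares is a placeholder for the real analysis.
    """
    return sum(x * x for x in chunk)


def scale_out_sum_of_squares(data, chunk_size=1000, workers=4):
    """Distribute the chunks, then combine the partial results."""
    chunks = split_into_chunks(data, chunk_size)
    # Assumption: threads stand in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    return sum(partial_results)
```

Calling `scale_out_sum_of_squares(list(range(10_000)))` gives the same answer as computing the sum on one machine.  The point is that each chunk can be handed to a different worker because no chunk depends on any other.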

distro

BOINC projects like SETI@Home are designed to handle the problems associated with distributed computing – network latency, loss of connectivity, tracking tasks, tracking jobs, etc.  The fundamental requirement is the ability to break the problem down into small enough chunks that data can be easily transferred, processed, and transferred back – while keeping track of all of those chunks to ensure that data processing has been completed.
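
The bookkeeping side of this (keeping track of every chunk and handing it out again when a machine drops off) can be sketched roughly as follows. Again, this is a hedged illustration and not BOINC's actual scheduler: `FlakyWorker` and `run_all_chunks` are hypothetical names, the connection loss is simulated, and taking the maximum of each chunk is a placeholder for the real work.

```python
from collections import deque


class FlakyWorker:
    """Hypothetical volunteer machine that loses its first attempt at some chunks."""

    def __init__(self, drop_first_attempt_for):
        self.drop_first_attempt_for = set(drop_first_attempt_for)

    def run(self, chunk_id, chunk):
        if chunk_id in self.drop_first_attempt_for:
            # Simulated loss of connectivity; the next attempt will succeed.
            self.drop_first_attempt_for.discard(chunk_id)
            raise ConnectionError(f"lost contact while processing chunk {chunk_id}")
        return max(chunk)  # placeholder for the real processing


def run_all_chunks(chunks, worker, max_attempts=3):
    """Hand out every chunk, re-queue failures, and return results plus attempt counts."""
    pending = deque(enumerate(chunks))
    attempts = {chunk_id: 0 for chunk_id in range(len(chunks))}
    results = {}
    while pending:
        chunk_id, chunk = pending.popleft()
        attempts[chunk_id] += 1
        try:
            results[chunk_id] = worker.run(chunk_id, chunk)
        except ConnectionError:
            if attempts[chunk_id] >= max_attempts:
                raise RuntimeError(f"chunk {chunk_id} failed {max_attempts} times")
            pending.append((chunk_id, chunk))  # hand it out again later
    return results, attempts
```

Because the coordinator only considers the job done once every chunk id has a result, a lost machine costs a retry rather than the whole job.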

Elephant

Bringing this back to data systems, Hadoop and other distributed data systems are able to take a problem and distribute it across tens / hundreds / thousands of machines with ease.  This is because they were designed with the idea of distributed processing in the first place (e.g. replication, fault tolerance, task restart-ability, etc.).

Discussion

There are many more concepts that need to be covered when we really dive into relational databases compared to Hadoop / Big Data systems.  But the fundamental concept to start with is that of “scale up” versus “scale out”.

As you can see with the Hubble Telescope / SETI projects analogy – both are important and both solve their respective problems in different ways.  This doesn’t mean one is right and one is wrong – this is really more about the adage of “Use the right tool for the right problem”.   After all, it would be really hard to get the amazing images from the Hubble telescope by using hundreds or thousands of smaller commodity telescopes from Costco.   Nor would it be possible for a single powerful telescope to examine all of the radio waves from the observatories on Earth.

So when it comes to scaling up or scaling out for data problems – it’s not about which one to use, it’s about which one to use when.

9 Comments

  1. An interesting and accurate analogy, Denny.

    A little typo in the 7th paragraph – chuck, i think you meant chunk.

    1. Thanks! on both the positive sentiment and the typo!

  2. So where would this place SAP HANA? It being an appliance I would place it in the scale up category, what are your thoughts?

    1. Well, do note that I am an employee of Microsoft, so there probably is a bias here. One of the key aspects of HANA is its ability to perform calculations against a column store in memory, while also being able to utilize a row store and persist the data. By itself, I would consider this a scale up technology because improving performance is still about adding more RAM or faster CPUs to a single box.

  3. I couldn’t agree more. The answer is as usual “it depends” since the number of factors to be considered and goals for scaling differ from case to case.
    And I love the concept of S@H as a fan of distributed computing – participated in Seti, Folding or Prime95 for years.

  4. — around 5.2 Million participants processing 769 teraFLOPS (11/14/2009) of data!

    Bad analogy: a teraFLOP is a measure of floating point processing, not byte footprint.

    The “network is the computer” meme, associated with Scott McNealy, embodies the scientific LAN paradigm: lots o cpu cycles (distributed to many cpus) processing small amounts of data. The current NoSql, et al, paradigm is to ship Godzilla quantities of data around for minimal processing. Not a smart paradigm.

    The RM/RDBMS paradigm is rooted in the TPM, classically embodied in CICS. If transaction doesn’t matter to you, or you’re willing to ignore transaction, then any file based system will do.

    1. Fair enough on the definition of teraFLOP, but that’s still a lot of floating point processing. And unless SETI is processing via wasted cycles, that’s still a lot of work. But you are right, that isn’t the amount of data per se.

      As for the current paradigm of NoSQL, while we’re talking about Godzilla amounts of data, the whole point is to not move the data around – it’s really about moving compute to the data vs. data to the compute. I have a blog post coming up on Tuesday that gets into this.

      And no disagreement on the RM/RDBMS paradigm. In fact, while I’m currently espousing the benefits of Hadoop / Big Data / NoSQL in general, there was never an attempt to minimize the utter importance of the RDBMS paradigm. In fact, my background is that of the SQL world, eh?!

  5. […] There are two blog posts that go with the above slides that provide the details. Concerning the concepts of Scaling Up or Scaling Out, check out Scale Up or Scale Out your Data Problems? A Space Analogy. […]

  6. […] But for now, if your primary concern is speed of the query, provided you have the time and resources to process the data, traditional RDBMS and BI systems will typically return their query results substantially faster.  If your primary concern is flexibility and the ability to query in a distributed environment, then these “Big Data” systems are the right tool of choice.  It’s back to choosing the right tool for the right problem or job (as I had noted in Scale Up or Scale Out Your Data Problems: A Space Analogy). […]
