Denny Lee

Scale Up or Scale Out your Data Problems? A Space Analogy

Now that we have Big Data, do we still need traditional relational systems? Simply put, different systems serve different purposes; both are useful for better understanding data. But before we dive into those specifics, let’s discuss the differences between scaling a problem up and scaling it out.

One way to understand the difference is to use a space analogy.  (If you’re part of the SQL Twitter community, you’ll notice a prevalence of NASA and space tweets.)

Scale Up Space Analogy

In terms of space analogies, the Hubble Space Telescope is analogous to scaling up, much like your traditional relational database.

The Antennae Galaxies / NGC 4038-4039 from the Hubble Telescope is my analogy of scale up
  • It often utilizes non-commodity hardware (or if it is commodity, it’s enterprise commodity hardware).
  • Specialized equipment was designed and built for the Hubble telescope.  While not as astronomical (!) in price, enterprise database performance requires more specialized (and expensive) hardware.
  • The Hubble telescope is a single point of failure. If we lose the telescope (or a lens), we would lose the ability to get all of these amazing images.

This is NOT to say scale up is a bad thing. After all, we get these amazing images from the Hubble telescope precisely because we (well, NASA scientists) focused on building Hubble with non-commodity, specialized hardware.  Following the same analogy, relational database systems benefit from scaling up to provide fast queries.

Scale Out Space Analogy

The flow of the following image is my representation of the Search for Extraterrestrial Intelligence (SETI) project.  SETI itself has the ultimate example of scale-out distribution with its SETI@Home project.  The search for extraterrestrial intelligence:

Flow from SETI project to SETI @ Home is my analogy for scale out
  • Begins with the search for radio waves throughout the galaxies.  The image above is that of the Giant Galactic Nebula NGC 3603 (also from the Hubble telescope).
  • Observatories around the world detect these radio waves.
  • Supercomputers, such as the ones at the Barcelona Supercomputing Center (BSC), store and crunch all of this data.

Amazingly, home computer screensavers via SETI@Home process a sizeable chunk of the search for radio waves in the astronomical heavens. As of this writing, there are 5.2 million participants processing 769 teraFLOPS (11/14/2009) of data!

The key facets of commoditized distributed computing are then:

  • Breaking down the problems into small chunks to distribute and calculate locally.
  • The system is designed – right from the beginning – to engage with hundreds or thousands of machines.
  • The system can easily handle many points of failure transparently.  Auto-replication is one of the ways to prevent this from being a problem.

BOINC projects like SETI@Home are designed to handle the problems associated with distributed computing: network latency, loss of connectivity, tracking tasks, tracking jobs, etc. That is, they break down the problem into small chunks so the data can be easily transferred and processed, while the system keeps track of these chunks to ensure that processing has been completed.
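
To make that chunk-and-track idea concrete, here is a minimal Python sketch (purely illustrative, not the actual BOINC protocol): a coordinator splits the work into chunks, hands them to worker processes, and re-issues any chunk whose result never comes back. The `process_chunk` function and the simulated failures are hypothetical stand-ins.

```python
# Minimal sketch of chunk-and-track distributed processing (illustrative only).
from concurrent.futures import ProcessPoolExecutor, as_completed
import random

def process_chunk(chunk_id, samples):
    """Worker: crunch one chunk locally (a stand-in computation)."""
    if random.random() < 0.1:                 # simulate a lost or failed worker
        raise RuntimeError(f"chunk {chunk_id} lost")
    return chunk_id, sum(x * x for x in samples)

def run(signal, chunk_size=1000):
    # 1. Break the big problem into small, independent chunks.
    chunks = {i: signal[i:i + chunk_size]
              for i in range(0, len(signal), chunk_size)}
    results, pending = {}, dict(chunks)

    # 2. Distribute the chunks; 3. track them and re-issue any that fail.
    with ProcessPoolExecutor() as pool:
        while pending:
            futures = {pool.submit(process_chunk, cid, data): cid
                       for cid, data in pending.items()}
            for fut in as_completed(futures):
                cid = futures[fut]
                try:
                    _, value = fut.result()
                    results[cid] = value
                    del pending[cid]          # chunk accounted for
                except Exception:
                    pass                      # stays in pending -> retried
    return results

if __name__ == "__main__":
    data = [random.random() for _ in range(10_000)]
    print(f"processed {len(run(data))} chunks")
```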

Elephant

Let’s bring this back to data systems such as Hadoop (or Apache Spark). They take a problem and distribute it across tens to thousands of machines with ease, because they were designed for distributed processing (e.g. replication, fault tolerance, task restart-ability, etc.).
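
For example, here is a hedged PySpark sketch (assuming a local Spark installation; the file name "events.csv" and the "status" column are hypothetical placeholders). You describe the computation once, and the engine handles partitioning the data and scheduling, replicating, and retrying the work across however many machines the cluster has.

```python
# A sketch of scale-out processing with Apache Spark (PySpark).
# "events.csv" and the "status" column are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scale-out-sketch").getOrCreate()

# Spark splits the file into partitions and distributes them; fault
# tolerance and task retries are handled by the engine, not by us.
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# The same code runs unchanged on a laptop or a thousand-node cluster.
counts = df.groupBy("status").count()
counts.show()

spark.stop()
```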

Discussion

There are many more concepts that need to be covered when we really dive into relational databases compared to Hadoop / Big Data systems.  But the fundamental concept to start with is that of “scale up” versus “scale out”.

As you can see with the Hubble Telescope / SETI analogy, both are important and both solve their respective problems in different ways.  This doesn’t mean one is right and one is wrong. This is really about the adage “use the right tool for the right problem”.   After all, it is virtually impossible to get the amazing images of the Hubble telescope by using hundreds or thousands of smaller commodity telescopes from Costco.   Nor would it be possible for a single powerful telescope to examine all of the radio waves captured by the observatories on Earth.

So when it comes to scaling up or scaling out for data problems, it’s not about which one to use; it’s about which one to use when.

9 responses to “Scale Up or Scale Out your Data Problems? A Space Analogy”

  1. An interesting and accurate analogy, Denny.

    A little typo in the 7th paragraph – chuck, I think you meant chunk.

    1. Thanks! on both the positive sentiment and the typo!

  2. So where would this place SAP HANA? It being an appliance, I would place it in the scale-up category; what are your thoughts?

    1. Well, do note that I am an employee of Microsoft, so there probably is a bias here. One of the key aspects of HANA is its ability to perform calculations against a column store in memory, while also being able to utilize a row store and persist the data. By itself, I would consider this a scale-up technology because to improve performance, it is still about adding more RAM or faster CPUs to a single box.

  3. I couldn’t agree more. The answer is as usual “it depends” since the number of factors to be considered and goals for scaling differ from case to case.
    And I love the concept of S@H as a fan of distributed computing – participated in Seti, Folding or Prime95 for years.

  4. — around 5.2 Million participants processing 769 teraFLOPS (11/14/2009) of data!

    Bad analogy: a teraFLOP is a measure of floating point processing, not byte footprint.

    The “network is the computer” meme, associated with Scott McNealy, embodies the scientific LAN paradigm: lots o cpu cycles (distributed to many cpus) processing small amounts of data. The current NoSql, et al, paradigm is to ship Godzilla quantities of data around for minimal processing. Not a smart paradigm.

    The RM/RDBMS paradigm is rooted in the TPM, classically embodied in CICS. If transaction doesn’t matter to you, or you’re willing to ignore transaction, then any file based system will do.

    1. Fair enough on the definition of teraFLOP, but that’s still a lot of floating point processing. And unless SETI is processing via wasted cycles, that’s still a lot of work. But you are right, that isn’t the amount of data per se.

      As for the current paradigm of NoSQL, while we’re talking about Godzilla amounts of data, the whole point is to not move the data around; it’s really about moving compute to the data vs. data to the compute. I have a blog post coming up on Tuesday that gets into this.

      And no disagreement on the RM/RDBMS paradigm. In fact, while I’m currently espousing the benefits of Hadoop / Big Data / NoSQL in general, there was never an attempt to minimize the utter importance of the RDBMS paradigm. In fact, my background is that of the SQL world, eh?!

  5. […] There are two blog posts that go with the above slides that provide the details. Concerning the concepts of Scaling Up or Scaling Out, check out Scale Up or Scale Out your Data Problems? A Space Analogy. […]

  6. […] But for now, if your primary concern is speed of the query, provided you have the time and resources to process the data, traditional RDBMS and BI systems will typically return their query results substantially faster.  If your primary concern is flexibility and the ability to query in a distributed environment, then these “Big Data” systems are the right tool of choice.  It’s back to choosing the right tool for the right problem or job (as I had noted in Scale Up or Scale Out Your Data Problems: A Space Analogy). […]
