As I write more about Big Data, I've been asked whether we still need traditional relational or cube systems now that we have Big Data / NoSQL / Hadoop. My response is to note that these are different systems that serve different purposes, even though both are used to better understand data.
But before we dive into the specifics of relational databases compared to Hadoop / Big Data, we first need to talk about the difference between solving a data problem by scaling up and solving it by scaling out.
One way to understand the difference is to use a space analogy. (If you're part of the SQL Twitter community, you'll have noticed a prevalence of NASA and space tweets.)
Scale Up Space Analogy
The amazing image above is the Antennae Galaxies / NGC 4038-4039 from the Hubble Telescope – click on the image to see the full extra-large version; it is magnificent.
In terms of space analogies, the Hubble Telescope is analogous to a scale-up technology such as your traditional relational database.
- It often utilizes non-commodity hardware (or if it is commodity, it's enterprise commodity hardware).
- Specialized equipment was designed and built for the Hubble telescope. While not as astronomical (!) in price, enterprise database performance likewise requires more specialized (and expensive) hardware.
- The Hubble telescope is itself a single point of failure: if we were to lose the telescope (or a lens), we would lose the ability to capture all of these amazing images.
This is NOT to say scaling up is a bad thing; after all, we get these amazing images from the Hubble telescope precisely because we (well, NASA scientists) focused on building Hubble with non-commodity, specialized hardware. Following the same analogy, relational database systems (and cube systems) benefit from the scale-up approach because it makes it possible to return query results quickly.
Scale Out Space Analogy
The flow of the above image is my representation of the Search for Extraterrestrial Intelligence (SETI) project. SETI offers the ultimate example of scale-out distribution with its SETI@Home project. The search for extraterrestrial intelligence:
- Begins with the search for radio waves throughout the galaxies. The image above is that of the Giant Galactic Nebula NGC 3603 (also from the Hubble telescope).
- Those radio waves are detected by radio observatories at various locations around the planet.
- All of this data is stored and initially crunched by supercomputers like those at the Barcelona Supercomputing Center.
- Yet much of this crunching is done by BOINC-based projects like SETI@Home – i.e., by making use of screensavers on people's home machines.
That is, a sizeable chunk of the search for radio waves in the astronomical heavens is being done by home computer screensavers – around 5.2 million participants processing data at 769 teraFLOPS (as of 11/14/2009)!
The key facets of commoditized distributed computing are then:
- The problems can be broken down into small enough chunks so they can be distributed and calculated locally.
- The system is designed – right from the beginning – to engage with hundreds or thousands of machines.
- The system can transparently handle many points of failure; auto-replication is one of the ways to keep failures from becoming a problem.
BOINC projects like SETI@Home are designed to handle the problems associated with distributed computing – network latency, loss of connectivity, tracking tasks, tracking jobs, etc. The fundamental requirement is the ability to break the problem into small enough chunks so data can be easily transferred, processed, and transferred back – while keeping track of all of those chunks to ensure that data processing has been completed.
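That chunk-and-track pattern can be sketched in a few lines of Python – here a thread pool on a single machine stands in for the volunteer nodes, and all the names (`process_chunk`, `run_distributed`) are illustrative rather than any real BOINC or Hadoop API:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_chunk(chunk):
    # Stand-in for the real work a volunteer node would do,
    # e.g. scanning one slice of radio-telescope data.
    return sum(chunk)

def run_distributed(data, chunk_size=3):
    # 1. Break the problem into small, independent chunks.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    results = {}
    # 2. Hand the chunks out to many workers at once.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(process_chunk, c): i for i, c in enumerate(chunks)}
        # 3. Track every chunk; if a "node" fails, reissue its work.
        for fut in as_completed(futures):
            i = futures[fut]
            try:
                results[i] = fut.result()
            except Exception:
                results[i] = process_chunk(chunks[i])  # simple retry
    # 4. Only combine once every chunk is accounted for.
    assert len(results) == len(chunks)
    return sum(results.values())

print(run_distributed(list(range(10))))  # → 45
```

The point is not the arithmetic – it's that each chunk is independent, so it doesn't matter which worker processes it, when, or whether it has to be re-sent after a failure.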
Bringing this back to data systems: Hadoop and other distributed data systems are able to take a problem and distribute it across tens, hundreds, or thousands of machines with ease. This is because they were designed for distributed processing in the first place (e.g. replication, fault tolerance, task restart-ability, etc.).
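Hadoop's MapReduce programming model is built around exactly this decomposition. A toy word count – the canonical MapReduce example – written in plain Python shows the map / shuffle / reduce shape without any of the actual distribution machinery (the function names here are mine, not Hadoop's API):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Mapper: emit a (word, 1) pair for each word in an input split.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework would
    # before routing each key to a reducer.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reducer: combine all the values for one key.
    return key, sum(values)

lines = ["scale up scale out", "scale out"]
pairs = chain.from_iterable(map_phase(l) for l in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # → {'scale': 3, 'up': 1, 'out': 2}
```

Because each mapper sees only its own split and each reducer sees only its own key, a real framework can run thousands of them on different machines and simply rerun any that fail.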
There are many more concepts to cover when we really dive into relational databases compared to Hadoop / Big Data systems. But the fundamental concept to start with is that of “scale up” versus “scale out”.
As you can see from the Hubble Telescope / SETI analogy – both are important, and both solve their respective problems in different ways. This doesn't mean one is right and one is wrong – it's really about the adage “use the right tool for the right problem”. After all, it would be really hard to get the amazing images of the Hubble telescope by using hundreds or thousands of smaller commodity telescopes from Costco. Nor would it be possible for a single powerful telescope to examine all of the radio waves received by the observatories on Earth.
So when it comes to scaling up or scaling out for data problems – it's not about which one to use, it's about which one to use when.