Within the context of Big Data, do we need traditional relational systems now that we have Big Data? Simply put, different systems serve different purposes; both are useful to better understand data. But before we dive into these specifics, let’s discuss the differences between scaling up the problem or scaling it out.
One way to understand the difference is to use a space analogy. (If you’re part of the SQL Twitter community, you’ll notice a prevalence of NASA and space tweets.)
Scale Up Space Analogy
In terms of space analogies, the Hubble Telescope is analogous to scaling up, such as your traditional relational database.
- It often utilizes non-commodity hardware (or if it is commodity, it’s enterprise commodity hardware).
- Specialized equipment was designed and built for the Hubble telescope. While not as astronomical (!) in price, enterprise database performance requires more specialized (and expensive) hardware.
- The Hubble telescope is a single point of failure. If we lose the telescope (or a lens), we would lose the ability to get all of these amazing images.
This is NOT to say scale up is a bad thing. After all, we get these amazing images from the Hubble telescope precisely because we (well, NASA scientists) have focused on creating Hubble with non-commodity specialized hardware. Following the same analogy, relational database systems benefit from this scale up to provide fast queries.
Scale Out Space Analogy
The following flow is my representation of the Search for Extraterrestrial Intelligence (SETI) project. SETI itself provides the ultimate example of scale-out distribution with its SETI@Home project. The search for extraterrestrial intelligence:
- Begins with the search for radio waves throughout the galaxies. The image above is of the Giant Galactic Nebula NGC 3603 (also from the Hubble telescope).
- Observatories around the world detect these radio waves.
- Supercomputers, such as the ones in the Barcelona Supercomputer Center (BSC) store and crunch all of this data.
Amazingly, home computer screensavers via SETI@Home calculate a sizeable chunk of the search for radio waves in the astronomical heavens. As of this writing (11/14/2009), there are 5.2 million participants providing 769 teraFLOPS of processing power!
The key facets of commoditized distributed computing are then:
- Breaking down the problems into small chunks to distribute and calculate locally.
- The system is designed – right from the beginning – to engage with hundreds or thousands of machines.
- The system can transparently handle many points of failure; auto-replication is one of the ways to prevent failures from becoming a problem.
BOINC projects like SETI@Home are designed to handle the problems associated with distributed computing: network latency, loss of connectivity, tracking tasks, tracking jobs, etc. That is, they break the problem down into small chunks so the data can be easily transferred and processed. Meanwhile, the system keeps track of these chunks to ensure that data processing has been completed.
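To make the chunk-and-track idea concrete, here is a minimal, single-process sketch (not how BOINC is actually implemented) of a coordinator that splits a problem into chunks, hands them to an unreliable worker, and re-dispatches any chunk that fails until everything reports completion. The `unreliable_worker` function and its 30% failure rate are invented for illustration:

```python
import random

def split_into_chunks(data, chunk_size):
    """Break the problem into small chunks that can be shipped out independently."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def unreliable_worker(chunk):
    """Stand-in for a volunteer machine: may drop offline, else returns a partial result."""
    if random.random() < 0.3:           # simulate loss of connectivity
        raise ConnectionError("worker went offline")
    return sum(chunk)                   # the 'local calculation' on this chunk

def coordinator(data, chunk_size=4):
    """Track every chunk and re-dispatch until all of them complete."""
    pending = {i: c for i, c in enumerate(split_into_chunks(data, chunk_size))}
    results = {}
    while pending:                      # keep re-issuing failed chunks
        for chunk_id, chunk in list(pending.items()):
            try:
                results[chunk_id] = unreliable_worker(chunk)
                del pending[chunk_id]   # mark chunk as completed
            except ConnectionError:
                pass                    # leave it pending; it will be retried
    return sum(results.values())

print(coordinator(list(range(20))))     # → 190
```

The bookkeeping (which chunks are out, which have completed, which need a retry) is exactly what lets the system tolerate workers vanishing mid-task.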
Let’s bring this back to data systems such as Hadoop (or Apache Spark). They take the problem and distribute this across tens to thousands of machines with ease. This is because they were designed for distributed processing (e.g. replication, fault tolerance, task restart-ability, etc.).
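The core pattern these systems implement is MapReduce: map locally on each node's slice of the data, shuffle results by key between nodes, then reduce. Below is a single-process sketch of that pattern (a real cluster runs each phase in parallel across machines); the partition contents are made up for illustration:

```python
from collections import defaultdict

def map_phase(partition):
    """Map: each machine emits (word, 1) pairs for its local slice of the data."""
    return [(word, 1) for line in partition for word in line.split()]

def shuffle_phase(mapped_partitions):
    """Shuffle: group all pairs by key, as the framework routes them between nodes."""
    groups = defaultdict(list)
    for pairs in mapped_partitions:
        for word, count in pairs:
            groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a final count."""
    return {word: sum(counts) for word, counts in groups.items()}

# Two "partitions" standing in for data blocks stored on different nodes.
partitions = [["big data big systems"], ["big data scale out"]]
mapped = [map_phase(p) for p in partitions]   # would run locally on each node
counts = reduce_phase(shuffle_phase(mapped))
print(counts["big"])                           # → 3
```

Because each map task touches only its own partition, adding more machines just means adding more partitions, which is what makes scale out work.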
There are many more concepts that need to be covered when we really dive into relational databases compared to Hadoop / Big Data systems. But the fundamental concept to start with is that of “scale up” versus “scale out”.
As you can see with the Hubble Telescope / SETI analogy, both are important and both solve their respective problems in different ways. This doesn’t mean one is right and one is wrong. This is really more about the adage “Use the right tool for the right problem”. After all, it is virtually impossible to get the amazing images of the Hubble telescope by using hundreds or thousands of smaller commodity telescopes from Costco. Nor would it be possible for a single powerful telescope to examine all of the radio waves from the observatories on Earth.
So when it comes to scaling up or scaling out for data problems, it’s not about which one to use; it’s about which one to use when.