Dorky attempts at geek Shakespeare aside, as the volume, complexity, and variability of your data systems increase in … entropy …, a fundamental question becomes whether you scale up or scale out your data problem.
Apologies in advance for the nerdy chemistry references – which start with this picture of Dr. Arthur Grosser (more on him later).
As noted in the previous post Scale Up or Scale Out your Data Problems? A Space Analogy, the decision to scale up or scale out your data problem is a key facet of your Big Data problem. But just as important as the ability to distribute the data across commodity hardware is another key facet: the movement of data.
Latencies (i.e. slower performance) are introduced whenever you need to move data from one location to another. Within the data world, you can address this either by making the data faster to move (e.g. compression, delta transfers, faster connectivity, etc.) or by designing a system that reduces the need to move the data in the first place (i.e. moving data to compute versus moving compute to data).
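To make the first option concrete, here is a minimal sketch of the "make the data faster to move" approach using compression. The repetitive web-log payload is made up for illustration; the point is simply that fewer bytes on the wire means lower transfer latency.

```python
import zlib

# Hypothetical payload: a repetitive "web log" that must move between nodes.
# Compressing before transfer reduces the bytes on the wire, which is one
# way to lower data-movement latency.
log_lines = b"GET /index.html 200 0.013s\n" * 10_000
compressed = zlib.compress(log_lines, level=6)

print(f"raw bytes:        {len(log_lines)}")
print(f"compressed bytes: {len(compressed)}")
assert zlib.decompress(compressed) == log_lines  # lossless round trip
```

Highly repetitive data like logs compresses extremely well, so the trade of CPU time for network time is often a win.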
Scaling Up the Problem / Moving Data to Compute
To help describe the problem, the diagram below is a representation of a scale up traditional RDBMS. The silver database boxes on the left represent the database servers (each with blue platters representing local disks), the box with 9 blue platters represents a disk array (e.g. SAN, DAS, etc.), the blue arrows represent fiber channel connections (between the server and disk array), and the green arrows represent the network connectivity.
In an optimized scale-up RDBMS, we often set up DAS or SANs to quickly transfer data from the disk array to the RDBMS server or compute node (often allocating the local disk for the compute node to hold temp/backup/cache files). This design works great provided you can ensure low latencies.
And this is where things can get complicated: if you were to lose disks on the array and/or fiber channel connectivity to the disk array, the RDBMS would go offline. But as shown in the above diagram, perhaps you set up active clustering so the secondary RDBMS can take over.
Yet, if you were to lose network connectivity (e.g. the secondary RDBMS is not aware the primary is offline) or lose fiber channel connectivity, you would also lose the secondary.
The Importance of ACID
It is important to note that many RDBMS systems have features or designs that work around these problems. But to ensure availability and redundancy, it often requires more expensive hardware to work around the problematic network and disk failure points.
As well, this is not to say that RDBMSs are badly designed – they are designed with ACID in mind – atomicity, consistency, isolation, and durability – to guarantee the reliability and robustness of database transactions (for more info, check out the Wikipedia entry: ACID).
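To see what ACID buys you, here is a minimal sketch of atomicity (the "A") using SQLite's transaction support. The account names and the simulated crash are made up for illustration; the behavior – either both legs of a transfer commit or neither does – is the real guarantee.

```python
import sqlite3

# Two accounts, each starting with a balance of 100.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 100)])
conn.commit()

try:
    with conn:  # opens a transaction; rolls back automatically on exception
        conn.execute(
            "UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        # Simulate a crash after the debit but before the credit.
        raise RuntimeError("simulated crash mid-transaction")
except RuntimeError:
    pass

# Atomicity: the partial debit was rolled back, so both balances are intact.
print(dict(conn.execute("SELECT name, balance FROM accounts")))
```

Without the transaction, the crash would have left alice debited and bob never credited – exactly the inconsistency ACID is designed to prevent.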
Scaling Out the Problem / Moving Compute to Data
In a scale-out or distributed solution, the idea is to have many commodity servers; there are many points of failure, but there are also many paths to success.
Key to a distributed system is that as data comes in (the blue file icon on the right represents data such as web logs), the data is distributed and replicated in chunks to many nodes within the cluster. In the case of Hadoop, files are broken into 64 MB / 128 MB chunks and each of these chunks is placed into three different locations (if you set the replication factor to 3).
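The chunk-and-replicate idea can be sketched as follows. This is not the actual HDFS implementation – the node names are made up and HDFS uses rack-aware placement policies rather than simple round-robin – but it shows the shape of the bookkeeping.

```python
# Sketch: split a file into fixed-size blocks and assign each block to
# REPLICATION_FACTOR of the cluster's nodes. All names are illustrative.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB
REPLICATION_FACTOR = 3
NODES = ["node1", "node2", "node3", "node4", "node5"]

def place_blocks(file_size_bytes):
    """Return a mapping of block index -> list of nodes holding a replica."""
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        # Round-robin placement; real HDFS is rack-aware.
        placement[block] = [NODES[(block + r) % len(NODES)]
                            for r in range(REPLICATION_FACTOR)]
    return placement

# A 300 MB web log -> 3 blocks, each replicated to 3 different nodes.
layout = place_blocks(300 * 1024 * 1024)
for block, replicas in layout.items():
    print(f"block {block}: {replicas}")
```

Note that losing any single node still leaves two replicas of every block it held – which is exactly the redundancy discussed next.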
While replicating the data uses more disk space, once you have placed the data into the system, you have ensured its redundancy.
What is great about these types of distributed systems is that they are designed from the beginning to handle latency issues – from disk or network connectivity problems to outright losing a node. In the above diagram, a user is requesting data, but some disks and some network connections have been lost.
Nevertheless, there are other nodes that do have network connectivity, and because the data has been replicated, it is still available. Distributed scale-out systems like Hadoop can ensure the availability of the data and will complete the query as long as the data exists (it may take longer if nodes are lost, but the query will be completed).
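The failover behavior is simple to sketch: a reader just tries the next replica when a node is unreachable. The node names, block data, and failure set below are all made up for illustration.

```python
# Sketch: replication keeps a query alive because the reader falls through
# to the next healthy replica. All names and data are illustrative.
replicas = {"block0": ["node1", "node2", "node3"]}
node_data = {"node1": "web log chunk",
             "node2": "web log chunk",
             "node3": "web log chunk"}
down_nodes = {"node1", "node2"}  # simulate two failures

def read_block(block_id):
    for node in replicas[block_id]:
        if node not in down_nodes:
            return node_data[node]  # first healthy replica wins
    raise IOError(f"all replicas of {block_id} are unavailable")

print(read_block("block0"))  # succeeds via node3 despite two failures
```

Only when every replica of a block is gone does the read actually fail – with a replication factor of 3, that requires three simultaneous failures.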
The Importance of BASE
By using many commodity boxes, you distribute and replicate your data to multiple systems. But as there are many moving parts, distributed systems like these cannot ensure the reliability and robustness of database transactions. Instead, they fall under the domain of eventual consistency where over a period of time (i.e. eventually) the data within the entire system will be consistent (e.g. all data modifications will be replicated throughout the cluster). This concept is also known as BASE (as opposed to ACID) – Basically Available, Soft State, Eventually Consistent. For more information, check out the Wikipedia reference: Eventual Consistency.
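Eventual consistency can be sketched with a toy model: writes land on one replica first, and a background anti-entropy (gossip) pass propagates the newest version until every replica agrees. The replica count, keys, and last-write-wins rule below are illustrative assumptions, not any particular system's protocol.

```python
# Toy model of eventual consistency: each replica maps key -> (timestamp, value).
replicas = [{}, {}, {}]  # three nodes

def write(node, key, value, ts):
    """Apply a write to a single replica only."""
    replicas[node][key] = (ts, value)

def anti_entropy():
    """Gossip every key's newest (timestamp, value) to all replicas
    (last write wins, by timestamp)."""
    for key in {k for r in replicas for k in r}:
        newest = max(r[key] for r in replicas if key in r)
        for r in replicas:
            r[key] = newest

write(0, "user:42", "clicked", ts=1)    # only node 0 knows about this yet
write(1, "user:42", "purchased", ts=2)  # a later write lands on node 1
# Right now the replicas disagree (BASE: soft state)...
anti_entropy()
# ...but after the anti-entropy pass they converge on the latest value.
print([r["user:42"] for r in replicas])
```

Between the writes and the anti-entropy pass, a read could return stale or missing data – that window is the price BASE pays for staying available across many commodity boxes.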
Similar to the post Scale Up or Scale Out your Data Problems? A Space Analogy, choosing whether ACID or BASE works for you is not a matter of which one to use – but which one to use when. For example, as noted in the post What’s so BIG about “Big Data”?, the Yahoo! Analysis Services cube is 24 TB (certainly a case of moving data to compute, with my obsession on random IO with SSAS), and the source of this cube is 2 PB of data from a huge Hadoop cluster (moving compute to data).
Each one has its own set of issues – scaling out increases the complexity of maintaining so many nodes, scaling up becomes more expensive to ensure availability and reliability, etc. It will be important to understand the pros/cons of each type – often it will be a combination of these two. Another great example can be seen in Dave Mariani (@mariani)’s post: Big Data, Bigger Brains at Klout’s blog.
ACID and BASE each have their own set of problems; the good news is that mixing them together often neutralizes those problems.