Big Data’s Extra Node Paradox
The behavior and requirements of distributed data storage systems are frequently counterintuitive (eventual consistency, anyone?). One of these counterintuitive behaviors is something I call “the extra node paradox”: if you add an extra node to your cluster, you actually increase your likelihood of data loss. This follows from two simple observations:
Say you have a cluster of n nodes, with each data block replicated to r of those. Then
For reasonable configurations, any r-node simultaneous failure results in data loss.
There are only (n choose r) possible ways to replicate a data block in a cluster. This number is probably far smaller than the number of data blocks that you’re storing, which implies that if your data is evenly distributed across the available replications, any r-node loss will hit some blocks replicated to only those r nodes. To make it concrete: say you have a 20 node cluster with 3TB storage per node, a block size of 512kB, and a replication factor of 3, so that you can store 1TB (~2m blocks) per node, or about 40m blocks total. The number of unique replications available is only (20 choose 3)=1140, implying a loss of roughly 35k blocks for any 3 node failure. Holding everything else constant, if you bump the cluster size to 200 nodes, you jump to 1,313,400 unique replications, guaranteeing a loss of 30 blocks. To have any chance of not losing any of your 40m blocks, your cluster would need to have 623 nodes, each storing only ~64k blocks (~32GB). For the same data set but with larger blocks (say 64MB, the hdfs default), you would have a total of 3.125k blocks. With 20 nodes you would lose 274 blocks, and you’d need 125 nodes to have any chance of not losing data, each storing only ~2.5k blocks (~162GB).
As n goes up, so does the probability of at least r nodes failing simultaneously.
By analogy, if you flip 2 coins and fewer than 2 land on heads, you’re not very surprised, but if you flip 50 coins and fewer than 2 land on heads, you are. If you have more nodes, there are more nodes that can fail, and so more will (read up on the binomial distribution if you need to convince yourself).
The ‘paradox’ is that if you’re considering the impact of adding an extra node on your chance of losing any given block, that probability goes down – you’re getting an increasing overall likelihood of data loss from a group of blocks that are individually growing less likely to be lost.
Finding the actual probability of a simultaneous r-node failure in a real-world system is complicated by the fact that ‘simultaneous failure’ actually means ‘within the window of time when the data from the first failed node is being redistributed to good nodes’, but it doesn’t fundamentally change the conclusion. The main takeaways from this are that you can’t stop worrying about single-node reliability just because you’re distributed, and that the actual level of safety you get from a given replication factor is sensitive to the size of the cluster.