Big Data’s Extra Node Paradox
The behavior and requirements of distributed data storage systems are frequently counterintuitive (eventual consistency, anyone?). One of these counterintuitive behaviors is something I call “the extra node paradox”: if you add an extra node to your cluster, you actually increase your likelihood of data loss. This follows from two simple observations:
Say you have a cluster of n nodes, with each data block replicated to r of those. Then

For reasonable configurations, any simultaneous r-node failure results in data loss.
There are only (n choose r) possible ways to replicate a data block in a cluster. This number is probably far smaller than the number of data blocks you’re storing, which implies that if your data is evenly distributed across the available replica placements, any r-node loss will hit some blocks replicated to only those r nodes.

To make it concrete: say you have a 20-node cluster with 3TB of storage per node, a block size of 512kB, and a replication factor of 3, so that you can store 1TB (~2m blocks) of unique data per node, or about 40m blocks total. The number of unique replica placements available is only (20 choose 3) = 1140, implying a loss of roughly 35k blocks for any 3-node failure. Holding everything else constant, if you bump the cluster size to 200 nodes, you jump to 1,313,400 unique placements, still guaranteeing a loss of roughly 30 blocks. To have any chance of not losing any of your 40m blocks, your cluster would need 623 nodes, each storing only ~64k blocks (~32GB). For the same data set but with larger blocks (say 64MB, the HDFS default), you would have ~312.5k blocks in total. With 20 nodes you would lose roughly 274 blocks per 3-node failure, and you’d need 125 nodes to have any chance of not losing data, each storing only 2.5k blocks (~160GB).
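The arithmetic above is easy to check with a few lines of Python. The figures and the even-spread assumption come straight from the example, not from any real cluster:

```python
from math import comb

def blocks_lost_per_failure(n_nodes, repl_factor, total_blocks):
    """Expected blocks wholly contained on one set of repl_factor nodes,
    assuming blocks are spread evenly over all possible replica placements."""
    placements = comb(n_nodes, repl_factor)   # distinct replica sets available
    return placements, total_blocks / placements

total_blocks = 40_000_000                     # 20 nodes * ~2m unique blocks each
for n in (20, 200):
    placements, lost = blocks_lost_per_failure(n, 3, total_blocks)
    print(f"n={n}: {placements} placements, ~{lost:.0f} blocks lost per 3-node failure")
```

Running it reproduces the ~35k and ~30 block figures, and `comb(623, 3)` is the first value to exceed 40m, which is where the 623-node figure comes from.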

As n goes up, so does the probability of at least r nodes failing simultaneously.
By analogy, if you flip 2 coins and fewer than 2 land on heads, you’re not very surprised, but if you flip 50 coins and fewer than 2 land on heads, you are. If you have more nodes, there are more nodes that can fail, and so more will (read up on the binomial distribution if you need to convince yourself).
The ‘paradox’ is that if you consider the impact of adding an extra node on your chance of losing any given block, that probability goes down: you get an increasing overall likelihood of data loss from a collection of blocks that are individually growing less likely to be lost.
Finding the actual probability of a simultaneous r-node failure in a real-world system is complicated by the fact that ‘simultaneous failure’ actually means ‘within the window of time when the data from the first failed node is being redistributed to good nodes’, but it doesn’t fundamentally change the conclusion. The main takeaways from this are that you can’t stop worrying about single-node reliability just because you’re distributed, and that the actual level of safety you get from a given replication factor is sensitive to the size of the cluster.
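Both halves of the paradox can be seen in one toy model: condition on exactly three nodes being down, and assume blocks are spread uniformly over all possible replica placements, as in the example above:

```python
from math import comb

def given_three_down(n_nodes, total_blocks, r=3):
    """Given exactly r nodes are down: (P a specific block is lost,
    P at least one block is lost), under uniform random placement."""
    placements = comb(n_nodes, r)
    p_block = 1 / placements                          # this block lived on exactly the dead trio
    p_any = 1 - (1 - 1 / placements) ** total_blocks  # some block lived only there
    return p_block, p_any

for n in (20, 200, 623):
    p_block, p_any = given_three_down(n, 40_000_000)
    print(f"n={n}: per-block loss {p_block:.1e}, chance of some loss {p_any:.2f}")
```

The per-block figure falls by orders of magnitude as n grows, while the chance of losing something stays near 1 until the number of placements overtakes the number of blocks; even at 623 nodes a random trio still has a better-than-even chance of killing some block.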
Yes. By keeping replication constant and increasing nodes you become less reliable. I think of it this way: in a massive datacenter of 10,000 nodes, what are the odds that 3 nodes are down? Pretty high. What are the odds in a 100,000-node datacenter? Even higher.
Last I read, Google used an rf of 5, which makes sense for massive clusters. Also, one thing to take into account is that techniques like Reed-Solomon coding exist. Essentially they store parity, similar to RAID 5. They can do things like make the effective replication 3 when the physical replication is around 2.33. On a small scale this is negligible, but on a larger cluster that is a big win.
It would be very interesting to do a chart which shows rf 2, 3, 4, …, taking into account mean time to failure for servers and disks. This could be used to judge when it is acceptable to move from rf 3 to rf 4.
Posted by edward capriolo on 20 August 2012 @ 4pm
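The chart the comment asks for could be sketched with a crude model like the one below. The MTTF, recovery-window, and cluster-size figures are invented purely for illustration, and a Poisson approximation stands in for the exact binomial tail:

```python
from math import exp, factorial

# Invented figures for illustration only.
MTTF_HOURS = 3 * 365 * 24        # assume ~3-year mean time to failure per node
WINDOW_HOURS = 2                 # assumed re-replication (recovery) window
N_NODES = 1000
lam = N_NODES * WINDOW_HOURS / MTTF_HOURS   # expected node failures per window

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam); a good approximation for rare failures."""
    return 1 - sum(exp(-lam) * lam**i / factorial(i) for i in range(k))

for rf in (2, 3, 4, 5):
    print(f"rf={rf}: P(>={rf} failures per window) ~ {poisson_tail(rf, lam):.2e}")
```

Each step up in rf buys a large drop in the per-window loss probability, which is the trade-off such a chart would make visible against storage cost.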