Hadoop Map/Reduce can put a heavy toll on your network. Just how heavy, though, isn’t obvious. This is an especially important consideration when you are expanding your cluster. LiveRamp recently encountered this situation, and in the process we devised a neat theoretical model for analyzing how network topology affects Map/Reduce.
When does Hadoop put the most stress on the network?
The two phases of a Map/Reduce job that are candidates for high network stress are shuffling and reduce output. During the shuffle phase, each of your reducers will contact every other machine in the cluster to collect intermediate files. During the reduce output phase, the final results of the whole job will be written to HDFS, usually with three replicas.
Intuitively, since the data is written out three times, it might seem like the reduce output stage is the most intense period of network traffic; however, it turns out that the shuffle has the most potential to max out your network due to the fact that each node contacts each other node, rather than exactly two other nodes. Note, however, that the reduce output phase might take longer than the shuffle – it just won’t stress the network as much.
Just what do you mean, “stress the network”?
When I speak of “stressing the network”, I’m talking about throughput. More specifically, I am talking about the average aggregate peak throughput. There’s a lot in that term, so let’s break it down. “Throughput” refers to the rate at which we can transfer data, usually measured in Gbps. “Aggregate” refers to the fact that we’re summing up all the throughput across the whole cluster. “Average” refers to the fact that while some components may see higher or lower values, we want the average. Finally, “peak” refers to the fact that we’re interested in the throughput at the peak of stress, rather than at some other time.
Put another way, the average aggregate peak throughput is the aggregate throughput at which some component in the network saturates, that is when it is carrying its maximum throughput capacity. For a link like an Ethernet cable, the max capacity is determined by the rating of the cable and the ports it’s plugged into. For instance, Gigabtit Ethernet is rated for 1Gbps of throughput. For a switch, the rating of the backplane, also specified in Gbps, determines its capacity. Once one component in the network saturates, even if there are other unsaturated components, the job as a whole won’t be able to go any faster.
The objective of our model is to figure out exactly what average aggregate peak throughput (henceforth abbreviated T) a given network topology can bear before some component saturates. To do this, we’ll figure out what proportion of T each link and switch in the network has to carry, then solve some simple equations to determine what value of T causes each component to saturate. The one with the lowest value of T is the one that we care about.
First, let’s quickly look at what happens during the shuffle phase. Let’s assume that we’re operating on a common star-topology network, with four individual top-of-rack switches (labeled A through D) connected via a central backbone switch:
Each of the racks in this cluster contains the same number of nodes, so when the map phase is over, each rack has exactly 1/4 of the total intermediate data, represented in the diagram by the small boxes behind each cloud of nodes. On an individual rack, say Rack A, 1/4 of the data will be transferred to another host in the same rack, with the remaining 3/4 to be split evenly amongst racks B, C, and D. Likewise, each of the other racks will be sending 1/4 of their data to Rack A.
Since this network is symmetrical, each rack behaves identically. And since each rack has 1/4 of the total data, you can multiply things out to determine the following: 1/16 of the total data is going to stay in place in each rack, 3/16 will be transferred in, and 3/16 will be transferred out.
If you sum up all the numbers, you can annotate the diagram like so:
There are a few things worth noting here. Firstly, the links between switches have two separate numbers, one for each direction. This is because virtually all network connections are full-duplex, meaning they can carry their rated capacity in both directions. If you happen to have a half-duplex connection, then you must sum both of the numbers together to get the true proportion. Second, the backbone switch carries more of the throughput than the top-of-rack switches. This is because the backbone has to carry all of the traffic that leaves each of the rack switches. Also, it’s important to avoid double-counting the traffic that passes through the backbone. Think of a switch as a toll booth that can take traffic in either direction – you only need to count traffic that comes in and out once. Thus, the 3/4 proportion in the backbone switch is computed by summing up only the traffic that is coming in.
Now that we have computed what proportion of T passes through each component, we can solve for the value of T that causes the component to saturate. To do that, we’ll need concrete numbers for the capacities of our links and switches. Let’s assume that the links between the rack switches and the backbone are 8Gbps, and that all the switches are inexpensive 48-port Dell switches that have a backplane capacity of 184Gbps. The connections to individual machines are made via 1Gbps Ethernet. If a switch can handle up to 184Gbps, and it must carry 7/16 of T, then for what T does the switch carry 184Gbps? You can produce an equation from this question: 7/16 T = 184Gbps, which you can solve easily: T = 184 * 16 / 7 = 420Gbps. Here’s what the diagram looks like with this step applied to all switches and links:
I’ve highlighted the inter-switch links in this version, because as you can see, they become saturated when T is only 42Gbps, as opposed to the switches, which saturate at between 250Gbps and 420Gbps. This means that at its most intense, the shuffle phase will average about 42Gbps of throughput across the entire cluster. This architecture could support up to 160 machines; if they are sharing fairly, this means that each machine would have about 42Gbps / 160 = 250Mbps of throughput available for their transfers.