Analyzing some interesting networks for Map/Reduce clusters


In a previous post, I described how LiveRamp had built a conceptual model for determining the average aggregate peak throughput (which we’ll call T) that a given network architecture could support. This post applies that model to a variety of network topologies you might consider for your cluster.

Just as a brief refresher, T represents the max throughput that the shuffle operation will be able to demand during its peak, and we compute it by determining which component of the network will saturate first.
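
As a rough illustration of how that works, here is a minimal sketch in Python (the component names and numbers below are placeholders, not figures from the model): describe each network component by its capacity and the fraction of T that crosses it at shuffle peak, and the smallest capacity-to-fraction ratio is where the network saturates.

```python
# Minimal sketch of the "which component saturates first" idea.
# Each component is (capacity_gbps, fraction_of_T_it_carries_at_peak).

def max_throughput(components):
    """T (in Gbps) at which the most-loaded component hits its capacity."""
    # A component carrying `fraction` of T saturates when T * fraction == capacity,
    # so the network as a whole tops out at the smallest capacity / fraction.
    return min(capacity / fraction for capacity, fraction in components)

# Placeholder example: a 10Gbps uplink carrying 1/8 of T and a 184Gbps
# switch fabric carrying 1/2 of T.
print(max_throughput([(10.0, 1.0 / 8), (184.0, 1.0 / 2)]))  # 80.0 -- the uplink wins
```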

The “triangle”
When LiveRamp had only 120 machines in three racks, our network looked like this:

The Dell switches we’d purchased have a configuration that allows them to be “stacked” together via a proprietary 48Gbps link. The interesting thing about this configuration is that communications from any one rack switch to another rack switch only have to cross one link and two of the three total switches. If we compute the T numbers for this architecture, we get:

This architecture provides a whopping 331Gbps T! That’s more than all the machines connected to the network can even use – with all 144 ports in use, we could only produce 288Gbps of traffic. What’s really interesting is that this network is so speedy despite being so cheap to put together – in the neighborhood of $6000 for the entire network. This is a great starter network: you can build it up organically from one to three switches and never have to worry about your network being a bottleneck.
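
For what it’s worth, here is one plausible way to arrive at that figure, sketched in Python. It assumes (my assumption, not something stated above) that the limiting component is the 6248’s published 184Gbps switching fabric, with each rack switch forwarding the traffic its own third of the machines send plus the traffic arriving for them from the other two racks.

```python
# Hedged reconstruction of the triangle's T. Assumes uniform shuffle traffic,
# one third of the machines behind each switch, and the Dell 6248's published
# 184Gbps switching-fabric capacity as the binding constraint.

FABRIC_GBPS = 184.0  # per-switch fabric capacity (from the 6248 spec sheet)
STACK_GBPS = 48.0    # each proprietary stacking link

# Each switch forwards what its own machines send (1/3 of T) plus what the
# other two racks send to its machines (2/3 * 1/3 of T) -> 5/9 of T.
fabric_fraction = 1.0 / 3 + (2.0 / 3) * (1.0 / 3)
# Each stacking link carries the traffic between one pair of racks, per direction.
link_fraction = (1.0 / 3) * (1.0 / 3)

print(FABRIC_GBPS / fabric_fraction)  # ~331 Gbps -- the fabric saturates first
print(STACK_GBPS / link_fraction)     # 432 Gbps -- the stacking links have headroom
```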

It has some downsides, though. Using both of the provided stacking modules prevents you from taking advantage of any of the SFP ports, so you have no high-speed options for connecting to an external, non-cluster network. That means getting data in and out of the cluster can be a challenge, and there isn’t really any way to expand this architecture with more computing power. In practice, you’ll have to sacrifice 4-8 ports on each switch and bond them together into an uplink to another network. Also, the stacking cables max out at 10 feet, so your racks will have to be physically close to each other.

Naive expansion of the “triangle”
When we were planning to expand from three to four racks, the first thing we considered was connecting a fourth switch to our existing triangle network. The best uplink we could have managed was an 8Gbps bonded Ethernet link like so:

You might already see where this is going. Take a look at the T numbers:

It turns out that the weakest link in the entire network, the new 8Gbps uplink, has to carry a very large proportion of T, and it saturates at a measly 42Gbps! That’s a massive reduction from the configuration described above. If you were to implement this network, you might find yourself in the unenviable position of having your jobs run more slowly despite having added more machines.
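
Here’s a quick back-of-the-envelope check on that number, under the assumption (mine) that shuffle traffic is spread uniformly across the four racks:

```python
# Sketch of why the naive fourth-rack expansion saturates so early, assuming
# uniform shuffle traffic and a quarter of the machines on the new rack.

UPLINK_GBPS = 8.0   # bonded Ethernet link to the fourth switch
STACK_GBPS = 48.0   # links inside the original triangle

# The fourth rack holds 1/4 of the machines and 3/4 of its traffic leaves the
# rack, so the 8Gbps link carries (1/4) * (3/4) = 3/16 of T in each direction.
new_link_fraction = (1.0 / 4) * (3.0 / 4)
# Each triangle link now only carries the traffic between one pair of racks.
triangle_link_fraction = (1.0 / 4) * (1.0 / 4)

print(UPLINK_GBPS / new_link_fraction)      # ~42.7 Gbps -- saturates first
print(STACK_GBPS / triangle_link_fraction)  # 768 Gbps -- barely loaded
```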

I don’t think that anyone should ever build this network. Even if 42Gbps is enough for your application, this network just won’t grow with you.

Daisy chains
A network topology you might find tempting when reading about various “stacking” options is a daisy chain. Per Dell’s product sheet, you can stack up to 12 of their switches together via the 48Gbps links:

With the T values computed:

It should come as no surprise that the network link in the dead center of the chain will be where saturation occurs first – about 1/4 of all traffic is going to have to cross this link in each direction, leading to a saturation point of 192Gbps. An interesting observation, though, is that no matter the number of racks in the chain, the middle link will always have the same saturation point, since it will always see 1/4 of the traffic.
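
A small sketch makes the “chain length doesn’t matter” point concrete (again assuming uniform shuffle traffic and the machines spread evenly across the racks):

```python
# The center link of a daisy chain, assuming uniform shuffle traffic and the
# machines spread evenly across the racks.

STACK_GBPS = 48.0  # per stacking link

def middle_link_t(num_racks):
    # Half the machines sit on each side of the center link, so it carries
    # (1/2) * (1/2) = 1/4 of T in each direction, whatever the chain length.
    left = num_racks // 2
    fraction = (left / num_racks) * ((num_racks - left) / num_racks)
    return STACK_GBPS / fraction

for racks in (4, 8, 12):
    print(racks, middle_link_t(racks))  # 192.0 Gbps every time
```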

Since it has such a high saturation point, this topology is somewhat attractive. However, it can be impractical to implement. Because the stacking cables are so short, you have to locate all your cabinets right next to one another – or keep all of the switches in one rack and run 480 Ethernet cables out to the individual racks! Neither of these approaches is conducive to incremental growth and easy management in a shared facility. Also, once you expand to the maximum of 12 switches, you’re left with no incremental growth path. Finally, you’re still subject to all the other limitations of the stacking configuration described above.

This style of network really only makes sense when you know you won’t grow beyond a certain size and you have tight control over your racks’ physical positioning.

“Star” networks
The star network topology is one of the most common, and it benefits from being easy to assemble and well understood. Consider this sample from my original post on the analysis technique, pre-annotated with T values:

It consists of four rack switches and one central switch, all of them the aforementioned Dell 6248s, connected by 8Gbps links. It saturates at a relatively low 42Gbps. Unlike the triangle network, all the traffic headed to another rack must pass through a single link to the backbone, and unlike the daisy chain network, the links aren’t strong enough to make up for that.
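
The arithmetic is the same as for the naive triangle expansion above: each rack holds a quarter of the machines and sends three quarters of its traffic off-rack, so each 8Gbps uplink carries 3/16 of T. A one-line check, under my same uniform-traffic assumption:

```python
UPLINK_GBPS = 8.0  # bonded link from each rack switch to the backbone
print(UPLINK_GBPS / ((1.0 / 4) * (3.0 / 4)))  # ~42.7 Gbps before an uplink saturates
```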

However, this topology still has room for more racks. Let’s see what happens if we plug in another two racks’ worth of machines like so:

Note that while the uplinks still saturate first, the T it takes to saturate them has gone up! That’s because the proportion of the total traffic crossing each individual link has gone down. The takeaway is that some network topologies, like this one, keep scaling as you add new nodes. It might make sense to leave yourself room to grow rather than buying the minimum that fits your needs.
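
To see the trend, here’s the same per-uplink calculation swept over a few rack counts (my uniform-traffic assumption again, with 8Gbps uplinks throughout):

```python
# How a star's saturation point moves as racks are added, assuming uniform
# shuffle traffic and an 8Gbps uplink from every rack switch.

UPLINK_GBPS = 8.0

def star_t(num_racks):
    # Each rack holds 1/num_racks of the machines and sends
    # (num_racks - 1)/num_racks of its traffic off-rack.
    fraction = (1.0 / num_racks) * ((num_racks - 1.0) / num_racks)
    return UPLINK_GBPS / fraction

for racks in (4, 5, 6, 8):
    print(racks, round(star_t(racks), 1))  # 42.7, 50.0, 57.6, 73.1 Gbps
```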

LiveRamp’s current network has taken these lessons to heart. Here’s what we’re using today:

We switched from Dell to Juniper for our hardware, using this switch as our backbone and one of these at the top of each rack. Each rack switch connects to the backbone over a single 10G Ethernet link. We also reduced the size of a rack from 40 machines to 20, since we’re power- and cooling-limited to 20 nodes per cabinet anyway. If you run the T numbers, you end up with:

The links still saturate first, at a respectable 92Gbps. This performance seems to be suitable for our application, but the really great thing about this topology is its flexibility. We can expand from our current 160 machines all the way up to 480 machines without any of the switches ever becoming overwhelmed. Or, if we discover that our applications are especially bandwidth-hungry, we can bond an extra 10G Ethernet link from each rack switch to the backbone, doubling our T for the price of new cables.
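
Plugging our current numbers into the same uniform-traffic sketch (eight racks of 20 machines, one 10G uplink per rack; the per-rack traffic split is my assumption) reproduces the figure above and shows the headroom:

```python
# LiveRamp's current star, under the same uniform-shuffle assumption:
# 160 machines in eight racks of 20, one 10G uplink per rack switch.

UPLINK_GBPS = 10.0

def star_t(num_racks, uplinks_per_rack=1):
    # Each rack holds 1/num_racks of the machines and sends
    # (num_racks - 1)/num_racks of its traffic off-rack over its uplink(s).
    fraction = (1.0 / num_racks) * ((num_racks - 1.0) / num_racks)
    return uplinks_per_rack * UPLINK_GBPS / fraction

print(round(star_t(8), 1))                      # ~91.4 Gbps -- the "92Gbps" above
print(round(star_t(24), 1))                     # ~250 Gbps at 480 machines (24 racks)
print(round(star_t(8, uplinks_per_rack=2), 1))  # ~182.9 Gbps with a second bonded 10G uplink
```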