SFP, fastlink and funnies with a Dell switch

A couple of weeks ago I dropped in on the UWC Student Cluster Competition team to see how they were progressing with their cluster configuration, and I discovered that they were struggling with the networking. As a test, they’d set up two Dell rack-mounted servers, connected them to a 10 Gb switch (a Dell 8100 series switch as I recall) and then connected the switch to the campus network to try and get an IP via DHCP. The switch was getting an IP, but the servers weren’t. The servers are running CentOS, by the way.

To investigate, we set up a DHCP server on Nicole’s laptop (we tried on Eugene’s first, but I just couldn’t quite get my head around how Arch Linux does things) and watched the traffic. After some time, we saw DHCP traffic and got DHCP working, but in a mysterious way: if we restarted networking, the interface would fail to acquire an IP, yet if we ran ifup some time later, it would acquire one just fine. After I left I decided to google around a bit (having discovered the LINKDELAY setting in the CentOS network scripts), and lo and behold, someone else had reported exactly the same problem and suggested that fastlink be enabled.

So what is this fastlink thing? It seems that in other contexts it is called Port Fast, and it’s a known way to solve DHCP negotiation issues. By default, network switches run the Spanning Tree Protocol (STP) on their ports so that the switches organise themselves into a spanning tree (and avoid loops or unreachable ports). This involves a delay before a newly active port starts forwarding traffic, and that delay is what caused the DHCP query to time out and produced the problem we saw. If you know you have a host connected to a port, you can set Port Fast on that port, thereby avoiding the delay. Ah well, everyone has to encounter some funny or other the first time they set up a server. And by the way, for a rhythmic description of STP, consult the Algorhyme (or listen to it).
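To make the timing argument concrete, here is a rough Python sketch of why the DHCP requests were lost. It is a toy model, not a measurement of our switch: I'm assuming the classic STP default forward delay of 15 seconds (so roughly 30 seconds of listening plus learning before a port forwards), and the DHCP retry schedule is made up for illustration. The function names are hypothetical too.

```python
# Toy model of STP port start-up versus DHCP retries.
# All timer values are assumptions for illustration, not values
# read from the Dell switch or the CentOS DHCP client.

FORWARD_DELAY = 15  # seconds; assumed classic STP default


def port_forwarding_at(port_fast, forward_delay=FORWARD_DELAY):
    """Seconds after link-up at which the port starts forwarding frames."""
    # Without Port Fast the port sits in the listening state and then the
    # learning state, each lasting one forward-delay interval.
    return 0 if port_fast else 2 * forward_delay


def dhcp_succeeds(port_fast, link_delay=0, discover_times=(0, 4, 12)):
    """True if at least one DHCP discover is sent once the port forwards.

    link_delay models waiting after link-up before starting DHCP
    (similar in spirit to LINKDELAY); discover_times is a hypothetical
    retry schedule, in seconds after the client starts.
    """
    ready = port_forwarding_at(port_fast)
    return any(link_delay + t >= ready for t in discover_times)


if __name__ == "__main__":
    print("no Port Fast, no delay:", dhcp_succeeds(port_fast=False))
    print("Port Fast enabled:     ", dhcp_succeeds(port_fast=True))
    print("no Port Fast, wait 30s:", dhcp_succeeds(port_fast=False, link_delay=30))
```

Under these assumptions the client gives up before the port forwards anything, which matches what we saw: restarting networking failed, while running ifup a while later (after the port had settled) worked. Either enabling Port Fast or waiting long enough before asking for an address gets a request through.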