To replace my aging Alcatel Speedtouch 510, I'd recently bought a Draytek Vigor 2800G. The fact that this included a wireless base station meant I could dispense with the additional wired and wireless hubs I'd had to install to supplement the Speedtouch - two fewer wall warts.
The other benefit of the Vigor 2800G was that the four wired ports ran at 100Mb, rather than the sluggish 10Mb of the Speedtouch. While this improved the throughput of the LAN, it turned out that it tickled an obscure bug in FreeBSD.
The network interface for crimson, the FreeBSD-based web, email, DNS and DHCP server, is a NetGear FA331. It's based on the National Semiconductor DP83815 chipset and is managed by the sis driver. When I bounced the Vigor router as part of some configuration changes, I found that crimson had dropped off the network. While ifconfig claimed it was up, pinging crimson from outside (and pinging anything from crimson) failed.
After a little experimentation, I discovered that a break in the LAN connection of less than around five seconds hung the sis driver. If I disconnected the cable for more than five seconds, the interface worked properly on re-connection. At first, I thought this might have something to do with the message:
sis0: Applying short cable fix (reg=0)
which appeared after a disconnection (and also at initial boot). No such message appeared when running the FA331 at 10Mb speeds. A short disconnect, which caused interface failure, always had a reg value of 0. After a longer disconnection, the reg value was some non-zero value (e.g. fe).
However, the web told me this was a known problem with the NatSemi DP83815 chip. Despite this, I patched the if_sis.c driver to ignore the short cable fix if the reg value was zero. Nope, didn't change the behaviour one jot.
I decided not to worry about it, as the only time this would likely be a problem was if there were a power outage, and in that case the router would be up well before crimson, so the issue would not occur.
And that would have been true if the router had not started resetting spontaneously. This caused all the clients to fail, as they could not renew their DHCP leases while crimson was inaccessible. To make the interface work again, I had to manually unplug the LAN cable to crimson for more than five seconds. Why the router was rebooting is a mystery and under investigation. As a workaround, I dropped the crimson interface to 10Mb using:
ifconfig sis0 media 10baseT/UTP
which I also placed into the /etc/rc.conf file.
When I had a little time to spare, I decided to investigate the FreeBSD sis driver a little further. Looking at the source code did not provide any instant enlightenment - no surprise there. However, I remembered that gold, the Debian box, had the same Netgear FA331 NIC. Did Debian exhibit the same problem? No, it did not. I could pull gold's ethernet cable for a brief period and the interface worked properly on reconnection. I figured I could compare the FreeBSD if_sis.c and Linux natsemi.c source code to try and identify the difference. Luckily, there was a very explicit comment in the natsemi.c driver which gave me a clue:
/* On page 78 of the spec, they recommend some settings for "optimum
   performance" to be done in sequence. These settings optimize some
   of the 100Mbit autodetection circuitry. They say we only want to
   do this for rev C of the chip, but engineers at NSC (Bradley
   Kennedy) recommends always setting them. If you don't, you get
   errors on some autonegotiations that make the device unusable.

   It seems that the DSP needs a few usec to reinitialize after
   the start of the phy. Just retry writing these values until they
   stick. */
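The "retry writing these values until they stick" idea from that comment can be sketched in isolation. This is only an illustration against a simulated register, not code from natsemi.c or if_sis.c: the fake_phy_reg structure, its helper functions, and the value written are all hypothetical names invented here.

```c
#include <stdint.h>

/* Hypothetical stand-in for a PHY register whose writes are dropped
 * while the DSP is still reinitialising. Not real driver code. */
struct fake_phy_reg {
    uint16_t value;        /* current register contents */
    int      busy_writes;  /* number of initial writes that are ignored */
};

static void fake_reg_write(struct fake_phy_reg *r, uint16_t v)
{
    if (r->busy_writes > 0) {
        r->busy_writes--;  /* DSP not ready yet: the write is lost */
        return;
    }
    r->value = v;
}

static uint16_t fake_reg_read(const struct fake_phy_reg *r)
{
    return r->value;
}

/* Write `want` repeatedly until a read-back confirms it, giving up
 * after max_tries attempts. Returns the number of writes issued on
 * success, or -1 if the value never stuck. */
static int write_until_stick(struct fake_phy_reg *r, uint16_t want,
                             int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        fake_reg_write(r, want);
        if (fake_reg_read(r) == want)
            return i + 1;
    }
    return -1;
}
```

A real driver would read and write the chip's registers through its bus-access macros and would normally pause briefly between attempts; the bounded loop with a read-back check is the part this sketch is meant to show.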
Comparing what looked like the same code in FreeBSD (helpfully, it also referenced page 78 of the spec), I could see that the DP83815D chip in the FA331 was treated differently than in Linux. A D rev chip was not given the same initialisation as the C rev. So, following the advice in the natsemi.c code, I ensured that my D rev based NIC was treated exactly the same as the C rev chip, using the following patch:
--- if_sis.c.orig	Fri Apr 27 12:41:14 2007
+++ if_sis.c	Fri Apr 27 12:41:32 2007
@@ -1909,7 +1909,7 @@
 	if (sc->sis_type == SIS_TYPE_83815 && sc->sis_srr <= NS_SRR_15D) {
 		CSR_WRITE_4(sc, NS_PHY_PAGE, 0x0001);
 		CSR_WRITE_4(sc, NS_PHY_CR, 0x189C);
-		if (sc->sis_srr == NS_SRR_15C) {
+		if (sc->sis_srr <= NS_SRR_15D) {
 			/* set val for c2 */
 			CSR_WRITE_4(sc, NS_PHY_TDATA, 0x0000);
 			/* load/kill c2 */
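To make the effect of that one-line change concrete, here is a small standalone sketch of the two conditions. The helper function names are made up for illustration, and the revision values are an assumption: I believe they match the definitions in FreeBSD's if_sisreg.h, but check your own source tree.

```c
#include <stdint.h>

/* Silicon revision codes. ASSUMPTION: these values are believed to
 * match FreeBSD's if_sisreg.h; verify against your source tree. */
#define NS_SRR_15C 0x0302
#define NS_SRR_15D 0x0403

/* Original inner condition: only a rev C chip gets the extra
 * PHY settings from page 78 of the spec. */
static int applies_before_patch(uint32_t srr)
{
    return srr == NS_SRR_15C;
}

/* Patched inner condition: rev D (and anything earlier) gets the
 * same settings as rev C. */
static int applies_after_patch(uint32_t srr)
{
    return srr <= NS_SRR_15D;
}
```

With these definitions, applies_before_patch(NS_SRR_15D) is false while applies_after_patch(NS_SRR_15D) is true, which is the whole point of the patch. Since the enclosing if already restricts the block to revisions up to 15D, the patched inner test is always true there and could arguably be dropped altogether.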
I rebuilt the kernel, using -DNO_CLEAN to shorten the build time:
make buildkernel -DNO_CLEAN KERNCONF=CRIMSON
installed it, and tested. Success - the problem no longer occurred.
I've submitted a PR (112179) with the fix attached. If I'm lucky, maybe it will work its way into baseline.