Dr. I Doctor's Informational Juggernaut
Dear Doctor,
We have several T1 and T3 circuits between our main office and branch locations. These all connect to a central router, which relays traffic between the branches and to our centralized server farm. Lately we've experienced a series of circuit outages that we can't diagnose. We suspect the telco provider, but it always runs a "loop-back" test to our equipment at each location and reports that the failed circuit is operating normally, indicating a hardware failure with our gear. But after the provider conducts its tests, the circuit mysteriously comes back online. We ask for support from our hardware vendor, and the vendor says it's the telco's fault. How can we stop the finger pointing and get to the bottom of this?
Gentle User,
The failure is not just with your routers and circuits, but with the hardware vendor and telco. You are the victim of misdirection from both parties.
In its loop-back test, the telco sends a special code to its premises equipment, the Network Interface Unit (NIU), which creates a temporary connection between the send side and receive side of the circuit, so that test data transmitted by the telco "loops back" to the telco, where it can be verified. The telco's misdirection is the claim that a successful loop-back test proves a circuit is working normally. It does no such thing, because the telco loop-back test doesn't exercise two critical components in a circuit: the cabling between your router and the NIU, and the "customer" side of the NIU's electronics and connectors.
The second bit of misdirection is from your hardware vendor, which, rather than pointing fingers, should be performing its own end-to-end loop-back tests. Only when both tests are correlated can you identify the failing component in the circuit. Dr. I Doctor recommends that you get the hardware vendor and telco on a three-way call, then insist that they run the full set of tests before unholstering their fingers.
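To make the idea of correlating the two tests concrete, here is a minimal Python sketch. The component names and the coverage assigned to each test are Dr. I Doctor's illustrative assumptions, not anything taken from a telco or vendor specification, and the logic assumes a single persistent fault.

```python
# Hypothetical component names; the coverage sets are assumptions drawn
# from the description above, not from any telco or vendor specification.
TELCO_LOOPBACK = {"telco circuit", "NIU telco side"}
END_TO_END = {
    "router port",
    "router-to-NIU cabling",
    "NIU customer side",
    "NIU telco side",
    "telco circuit",
}


def suspects(results):
    """Return components that every failed test covers and no passed test clears.

    results is a list of (covered_components, passed) pairs.
    Assumes a single, persistent fault; intermittent faults need repeated runs.
    """
    failed = [covered for covered, passed in results if not passed]
    if not failed:
        return set()
    cleared = set().union(*(covered for covered, passed in results if passed))
    return set.intersection(*failed) - cleared


# Telco loop-back passes, vendor end-to-end test fails.
print(sorted(suspects([(TELCO_LOOPBACK, True), (END_TO_END, False)])))
```

In this scenario the suspects narrow to the customer-side components that the telco test never touches, which is exactly why neither test is conclusive on its own.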
Posted by mbeckman on March 1, 2008 at 4:42 PM
Dear Doctor,
Our System i is in a secure collocation facility with redundant upstream Internet feeds connected to redundant firewalls. Between the firewalls and the System i are two Ethernet switches, one connected to each of the two System i Ethernet NICs. I'm trying to use the i5/OS Virtual IP Addressing feature to automatically fail over from one port to the other should any one of the firewalls, switches, or Ethernet NICs fail. To that end, I've configured NIC A on the System i with IP address 10.0.0.1 and NIC B with 10.0.0.2; the VIPA IP address is 10.0.0.3. Alas, when I test this by failing any one of the redundant components, failover doesn't seem to happen.
Gentle User,
Ethernet resilience is one of those technical achievements that has engendered many approaches but few agreed-upon standards. The failover problem is simple to state: other devices in the network must be made to send traffic to a different physical interface after the Ethernet hardware (MAC) address for a particular IP address has already been resolved using Address Resolution Protocol (ARP). Dr. I Doctor is certain that in your situation the devices haven't agreed upon a solution; hence, the failure to fail over.
One common solution to this problem is to generate a fictitious MAC address and assign that to the VIPA IP address. In normal operation, the primary Ethernet NIC responds to the fictitious MAC, but when a failure occurs, the secondary NIC takes on that responsibility. Intervening switches will see the MAC change as a device move and reroute traffic accordingly. Other devices in the network are none the wiser, and the failover is completely transparent to them.
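A toy Python model may help show why a shared fictitious MAC makes failover transparent. The `LearningSwitch` class, the port numbers, and the MAC value are all hypothetical, and a real switch does far more than this; the sketch only captures the "learn the port from the source address" behavior.

```python
class LearningSwitch:
    """Toy model of a switch's MAC address table (MAC -> port)."""

    def __init__(self):
        self.mac_table = {}

    def frame_seen(self, src_mac, port):
        # A switch learns, or relearns, where a MAC lives from the
        # source address of each frame it receives.
        self.mac_table[src_mac] = port

    def port_for(self, dst_mac):
        # None stands for "unknown destination" in this toy model.
        return self.mac_table.get(dst_mac)


VIRTUAL_MAC = "02:00:00:00:00:03"  # hypothetical fictitious MAC

switch = LearningSwitch()
switch.frame_seen(VIRTUAL_MAC, port=1)  # primary NIC is active
before = switch.port_for(VIRTUAL_MAC)
switch.frame_seen(VIRTUAL_MAC, port=2)  # secondary NIC takes over the MAC
after = switch.port_for(VIRTUAL_MAC)
print(before, after)
```

The other hosts never change the MAC they send to; only the switch's idea of which port that MAC lives on changes.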
i5/OS uses an older approach called Proxy ARP. Proxy ARP lets a particular NIC respond to ARP requests for an IP address even if that address is not physically assigned to the NIC. In normal operation, the primary NIC performs Proxy ARP, and thus ARP requests resolve to the physical MAC of that interface. If that NIC fails, the secondary NIC takes over Proxy ARP, giving out the secondary NIC's MAC for ARP queries to the VIPA IP address.
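The Proxy ARP behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how i5/OS implements it: the `Nic` class, the `arp_reply` function, and the MAC addresses are hypothetical, while the VIPA address 10.0.0.3 comes from the question.

```python
from dataclasses import dataclass


@dataclass
class Nic:
    name: str
    mac: str  # hypothetical MAC address
    up: bool = True


def arp_reply(vipa, nics, requested_ip):
    """Return the MAC that answers an ARP request for requested_ip, or None.

    nics is listed in order of preference; the first NIC that is up
    answers on behalf of the virtual address.
    """
    if requested_ip != vipa:
        return None
    for nic in nics:
        if nic.up:
            return nic.mac
    return None


nic_a = Nic("NIC A", "00:00:5e:00:53:0a")
nic_b = Nic("NIC B", "00:00:5e:00:53:0b")

normal = arp_reply("10.0.0.3", [nic_a, nic_b], "10.0.0.3")
nic_a.up = False  # simulate a primary NIC failure
failed_over = arp_reply("10.0.0.3", [nic_a, nic_b], "10.0.0.3")
print(normal, failed_over)
```

Note that the answer changes only for hosts that ask again; anything that already holds the first reply keeps using it.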
The problem at this point is that other devices on the LAN have cached the MAC address of the primary NIC, and they may not re-ARP the address for many minutes, or even hours. In failover, these other devices continue to address packets to the failed primary NIC's MAC. The fix, then, is to somehow force these devices to re-ARP more often.
Usually this is a configurable option in the TCP/IP settings of each such device, with an attribute name of "ARP timer" or "ARP timeout." You should set this value to the longest time you're willing to delay failover, keeping in mind that shorter timeouts will increase the amount of broadcast traffic on your LAN. A value of 30 seconds isn't unreasonable and shouldn't greatly increase broadcast traffic. With this setting, failover should take no more than about 30 seconds.
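The effect of the ARP timeout on failover delay can be illustrated with a small simulation. This is a simplified sketch under stated assumptions: the `ArpCache` class and the MAC values are hypothetical, time is a plain number of seconds, and real ARP implementations age and refresh entries in more elaborate ways.

```python
class ArpCache:
    """Toy ARP cache that re-resolves an entry once it is timeout_s old."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.entries = {}  # ip -> (mac, time the entry was learned)

    def resolve(self, ip, now, ask_network):
        entry = self.entries.get(ip)
        if entry is not None and now - entry[1] < self.timeout_s:
            return entry[0]  # still fresh: no ARP request is sent
        mac = ask_network(ip)  # stand-in for an ARP broadcast
        self.entries[ip] = (mac, now)
        return mac


MAC_A = "00:00:5e:00:53:0a"  # hypothetical primary NIC MAC
MAC_B = "00:00:5e:00:53:0b"  # hypothetical secondary NIC MAC

owner = {"10.0.0.3": MAC_A}  # which MAC currently answers for the VIPA
cache = ArpCache(timeout_s=30)

t0 = cache.resolve("10.0.0.3", now=0, ask_network=owner.get)
owner["10.0.0.3"] = MAC_B  # failover happens at t=5
t10 = cache.resolve("10.0.0.3", now=10, ask_network=owner.get)
t30 = cache.resolve("10.0.0.3", now=30, ask_network=owner.get)
print(t0, t10, t30)
```

At 10 seconds the cache still hands out the failed NIC's MAC; only once the entry reaches the 30-second timeout does the device ask again and learn the secondary NIC's MAC.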
Posted by mbeckman on March 1, 2008 at 4:40 PM
