
 

1.8.- Failure detection  

By specification, MPLS is required to recover from a failure within 60 ms. The specification also requires that, after a failure, traffic continue to flow with the same quality as before the failure occurred. An MPLS network therefore has to detect a problem and switch the traffic from the faulty path to a new path of equal quality within 60 ms.
The first problem to solve is how to detect a failure. Two methods are used for this: heartbeat detection (polling) and error messaging.
Heartbeat detection
This is a keep-alive method. Each device in the network advertises to a network manager that it is alive at a prescribed interval, driven by timers. If a heartbeat is missed, the path, link, device or node is declared failed and a switchover is performed. This method requires considerable overhead: to detect a failure within 40-50 ms, the heartbeat or keep-alive messages have to be flooded at least every 10 ms.
Error messaging
In this method, when a network device detects an error, it sends a message to its neighbor to redirect traffic to a working path or router. The network overhead is low, but it takes some time for the error-and-redirect message to reach the network components, and the message may never arrive at its destination. This method is the preferred option when the switchover time is not critical; when it is critical, the heartbeat method is the better choice.
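The timing arithmetic above can be illustrated with a minimal sketch. It assumes a 10 ms heartbeat interval and a tolerance of four missed intervals, which gives a 40 ms detection time; the class name, the peer names and the threshold are hypothetical and are not taken from any MPLS implementation.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Declares a peer failed after max_missed heartbeat intervals of silence."""
    interval_ms: float = 10.0   # assumed heartbeat period
    max_missed: int = 4         # assumed number of missed intervals tolerated
    last_seen_ms: dict = field(default_factory=dict)

    @property
    def detection_time_ms(self) -> float:
        # Silence tolerated before a peer is declared failed.
        return self.interval_ms * self.max_missed

    def heartbeat(self, peer: str, now_ms: float) -> None:
        # Record the arrival time of a keep-alive message from a peer.
        self.last_seen_ms[peer] = now_ms

    def failed_peers(self, now_ms: float) -> list:
        # Peers that have been silent for at least the detection time.
        return sorted(
            peer for peer, seen in self.last_seen_ms.items()
            if now_ms - seen >= self.detection_time_ms
        )


monitor = HeartbeatMonitor()
monitor.heartbeat("lsr-a", 0.0)
monitor.heartbeat("lsr-b", 0.0)
monitor.heartbeat("lsr-a", 30.0)      # lsr-b stays silent
print(monitor.failed_peers(45.0))     # lsr-b has been silent for 45 ms
```

A shorter interval or a lower tolerance detects failures sooner, at the price of more keep-alive traffic and a higher risk of false alarms.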
 
Network protection
 
In a network there are several possible points of failure. The two major kinds are link failure and node failure. Minor failures include switching hardware and software faults and link degradation.
 
On MPLS networks you should pre-provision a spare path with exactly the same QoS and traffic characteristics as the path it protects. This path should be spatially diverse from the original one and should be continuously exercised and tested. It should not be placed online unless the protected primary path fails.
 
In this scheme, called the one-to-one protection scheme, one spare is reserved to protect each primary path. It yields the most protection and reliability, but its cost can be prohibitive. The one-to-many redundancy protection scheme, on the other hand, maintains one spare path for all the paths to be protected. If one path fails, the backup path takes over. This scheme can handle a single path failure, but not two or more simultaneous path failures. However, it is affordable to implement.
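The difference between a dedicated spare per path and one shared spare can be sketched as follows. This is only an illustration under simple assumptions: a spare carries at most one failed primary at a time, and the class and path names are hypothetical.

```python
class ProtectionGroup:
    """Maps primary paths to spare paths; a spare carries one failed primary at a time."""

    def __init__(self, spare_for):
        self.spare_for = dict(spare_for)  # primary path -> its spare path
        self.in_use = {}                  # spare path -> primary currently using it

    def fail(self, primary):
        """Return the spare that takes over, or None if the spare is already busy."""
        spare = self.spare_for[primary]
        holder = self.in_use.get(spare)
        if holder is not None and holder != primary:
            return None                   # second failure on a shared spare: unprotected
        self.in_use[spare] = primary
        return spare


# Dedicated spares: every primary has its own backup.
dedicated = ProtectionGroup({"lsp1": "spare1", "lsp2": "spare2"})
print(dedicated.fail("lsp1"), dedicated.fail("lsp2"))

# Shared spare: one backup for all primaries.
shared = ProtectionGroup({"lsp1": "spare", "lsp2": "spare"})
print(shared.fail("lsp1"), shared.fail("lsp2"))
```

With dedicated spares both failures are covered; with the shared spare the second failure finds the backup already occupied.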
 
Another protection scheme to consider is the use of fault-tolerant devices with built-in redundancy, from power supplies to network cards.
 
Normally, layer-1, layer-2 and layer-3 protocols perform error detection and correction. MPLS nevertheless requires more than this, because its failure recovery specification calls for rapid switching and because it also has to ensure that the newly selected path is able to take the new traffic load while maintaining the original QoS conditions. When traffic load becomes a problem, MPLS implements emergency mechanisms that redirect lower-priority traffic to other links and, in extreme situations, disconnect it entirely.
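As a rough illustration of the emergency mechanism, the sketch below admits flows onto a backup path in priority order until its capacity is used up; whatever does not fit is the traffic that would be redirected elsewhere or disconnected. The function name, the flow tuples and the bandwidth figures are assumptions made for the example, not part of any MPLS specification.

```python
def admit_by_priority(flows, capacity_mbps):
    """Split flows into (admitted, shed) for a backup path of limited capacity.

    flows: list of (name, priority, mbps); a lower priority number is more important.
    """
    admitted, shed = [], []
    remaining = capacity_mbps
    # Serve the most important flows first; sorted() keeps ties in input order.
    for name, _priority, mbps in sorted(flows, key=lambda flow: flow[1]):
        if mbps <= remaining:
            admitted.append(name)
            remaining -= mbps
        else:
            shed.append(name)  # candidate for redirection or disconnection
    return admitted, shed


flows = [("bulk", 2, 60), ("voice", 0, 20), ("video", 1, 50)]
print(admit_by_priority(flows, capacity_mbps=80))
```

Here the backup path has room for the voice and video flows only, so the bulk flow is the one that has to move or be dropped.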
 
When the RSVP-TE protocol is used for signaling, heartbeat error detection comes for free because of the soft-state nature of the protocol. RSVP is a soft-state protocol that requires refreshing: if the state for a link is not refreshed, it is torn down. No error messaging is required, and rapid reroute is possible because the pre-provisioned path is already in place. With RSVP already in use, the additional overhead for error recovery is insignificant.
 
   

 

Thrashing
Thrashing is a phenomenon that occurs when paths are quickly switched back and forth. It is caused by intermittent failures of primary paths combined with pre-programmed switchback timers. To overcome this problem, the MPLS protocol and the switches that implement it must use hold-down timers. For example, one minute is allowed for the first hold-down time, and a trigger is set so that, on the second switchback, operator intervention is required to perform the next switchover, which prevents thrashing.
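The hold-down example can be sketched as a small guard object. It assumes the values given in the text (a one-minute hold-down and a single automatic switchback before an operator has to step in); the class, method and return values are hypothetical.

```python
class SwitchbackGuard:
    """Decides whether traffic may return to a recovered primary path."""

    def __init__(self, hold_down_s=60.0, max_auto_switchbacks=1):
        self.hold_down_s = hold_down_s                    # primary must stay healthy this long
        self.max_auto_switchbacks = max_auto_switchbacks  # automatic switchbacks allowed
        self.auto_switchbacks = 0

    def decide(self, primary_stable_for_s):
        """Return 'wait', 'switch', or 'operator' for a recovered primary path."""
        if self.auto_switchbacks >= self.max_auto_switchbacks:
            return "operator"   # further switchbacks need manual intervention
        if primary_stable_for_s < self.hold_down_s:
            return "wait"       # hold-down timer still running
        self.auto_switchbacks += 1
        return "switch"


guard = SwitchbackGuard()
print(guard.decide(10.0))    # primary healthy for only 10 s
print(guard.decide(60.0))    # hold-down satisfied, first switchback
print(guard.decide(600.0))   # second switchback is left to the operator
```

An intermittently failing primary can therefore trigger at most one automatic switchback, which bounds how often traffic flips between the two paths.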

   

