| The MPLS specifications require recovery
from a failure within 60 ms. They also require that, when a failure
occurs, traffic continue to flow with the same quality it had before
the failure. MPLS networks therefore have to detect a problem and
switch the traffic from the faulty path to a new path of equal
quality within 60 ms. |
| The first problem to be resolved is how to
detect a failure. Two methods are used for this: heartbeat detection
(polling) and error messaging. |
| Heartbeat detection |
| This is a keep-alive method.
Each device in the network reports to a network manager that it
is alive at a prescribed interval, driven by timers. If a
heartbeat is missed, the path, link, device or node is declared
failed and a switchover is performed. This method carries considerable
overhead: to detect a failure within 40-50 ms, the heartbeat or
keep-alive messages have to be sent at least every 10 ms. |
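The timing above can be illustrated with a minimal Python sketch of a heartbeat monitor. The class name `HeartbeatMonitor`, the peer names and the threshold of four missed intervals are assumptions made for illustration; they do not come from any MPLS specification. Time is passed in explicitly so the example runs without real timers.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Declares a peer failed once it has been silent for too long.

    All names and default values here are hypothetical; they only mirror
    the 10 ms interval discussed in the text.
    """
    interval_ms: float = 10.0   # how often each peer is expected to report
    max_missed: int = 4         # silent intervals tolerated before failure
    last_seen_ms: dict = field(default_factory=dict)

    def heartbeat(self, peer: str, now_ms: float) -> None:
        # Record the arrival time of a keep-alive message from `peer`.
        self.last_seen_ms[peer] = now_ms

    def failed_peers(self, now_ms: float) -> list:
        # A peer is failed when its silence exceeds interval * max_missed.
        limit_ms = self.interval_ms * self.max_missed
        return sorted(peer for peer, seen in self.last_seen_ms.items()
                      if now_ms - seen > limit_ms)


monitor = HeartbeatMonitor()
monitor.heartbeat("lsr-a", now_ms=0.0)
monitor.heartbeat("lsr-b", now_ms=0.0)
monitor.heartbeat("lsr-a", now_ms=10.0)        # lsr-b goes silent after t=0
still_ok = monitor.failed_peers(now_ms=40.0)   # exactly 40 ms: not over the limit
detected = monitor.failed_peers(now_ms=45.0)   # lsr-b has now exceeded the limit
```

With these assumed values a silent peer is declared failed a little over 40 ms after its last heartbeat, which is why the interval has to be so much shorter than the detection target.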
| Error messaging |
| In this method, when a network device detects an
error, it sends a message to its neighbor asking it to redirect traffic to
a path or router that is working. The network overhead is low, but it takes
some time for the error-and-redirect message to reach the network
components, and the message might never reach its destination at all.
This method is the preferred option when the switchover time is not
critical; otherwise, the heartbeat method is the better choice. |
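The weakness of error messaging, a lost notification, can be seen in a small Python sketch. The `UpstreamRouter` class, the `notify` helper and the `delivered` flag are hypothetical constructs used only to model message loss; they are not part of any real signaling API.

```python
class UpstreamRouter:
    """A neighbor that forwards on its most preferred path not known to be failed."""

    def __init__(self, paths):
        self.paths = list(paths)   # ordered by preference
        self.failed = set()        # paths reported as failed so far

    def on_error_message(self, failed_path):
        # Handle an error-and-redirect message from a downstream device.
        self.failed.add(failed_path)

    def active_path(self):
        # Pick the first path that has not been reported as failed.
        for path in self.paths:
            if path not in self.failed:
                return path
        return None


def notify(neighbor, failed_path, delivered):
    """Send an error message; `delivered=False` models a lost message."""
    if delivered:
        neighbor.on_error_message(failed_path)


informed = UpstreamRouter(["path-1", "path-2"])
notify(informed, "path-1", delivered=True)
after_delivery = informed.active_path()   # redirected to the working path

uninformed = UpstreamRouter(["path-1", "path-2"])
notify(uninformed, "path-1", delivered=False)
after_loss = uninformed.active_path()     # still forwarding into the failure
```

In the second case the neighbor keeps sending traffic onto the failed path until something else tells it otherwise, which is the risk the text describes.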
| |
| Network protection |
| |
| In a network, there are several possible points
of failure. The two major failures are link failure and node failure.
Minor failures include faults in switching hardware and software, and
link degradation. |
| |
| On MPLS networks you should
pre-provision a spare path with exactly the same QoS and traffic
characteristics as the path it protects. This path should be
spatially diverse from the original one, and should be continuously
exercised and tested to confirm that it works. It should not be
placed online unless there is a failure in the primary protected
path. |
| |
| In this scheme, called the one-to-one
protection scheme, one spare is reserved to protect each primary
path. It yields the most protection and reliability, but its cost can
be prohibitive. The one-to-many redundancy protection scheme, on the
other hand, maintains a single spare path for all the paths to be
protected. If one path fails, the backup path takes over. This scheme can
handle a single path failure, but not two or more simultaneous path
failures. Its implementation cost, however, is affordable. |
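The difference between the two schemes can be sketched in Python as a pool of spare paths shared by a group of primaries. The `ProtectionGroup` class and the path names are illustrative assumptions, not a model of any particular implementation.

```python
class ProtectionGroup:
    """Assigns pre-provisioned spare paths to failed primary paths."""

    def __init__(self, primaries, spares):
        self.primaries = list(primaries)
        self.free_spares = list(spares)   # spares not yet carrying traffic
        self.in_use = {}                  # failed primary -> spare in service

    def fail(self, primary):
        """Return the spare that takes over, or None if none is left."""
        if primary not in self.primaries:
            raise ValueError(f"unknown primary path: {primary}")
        if primary in self.in_use:        # already switched over
            return self.in_use[primary]
        if not self.free_spares:          # protection is exhausted
            return None
        spare = self.free_spares.pop(0)
        self.in_use[primary] = spare
        return spare


# One spare per primary: every failure is covered.
dedicated = ProtectionGroup(["p1", "p2"], ["s1", "s2"])
dedicated_results = [dedicated.fail("p1"), dedicated.fail("p2")]

# One spare shared by all primaries: only the first failure is covered.
shared = ProtectionGroup(["p1", "p2"], ["s"])
shared_results = [shared.fail("p1"), shared.fail("p2")]
```

The shared group runs out of spares on the second failure, which is the trade-off against its lower cost.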
| |
| Another protection scheme to consider is
the use of fault-tolerant devices that feature built-in
redundant components, from power supplies to network cards. |
| |
| Normally, layer-1, layer-2 and
layer-3 protocols perform error detection and correction.
MPLS requires more than this, however, because its failure
recovery specification calls for rapid switching and because it
also has to ensure that the newly selected path is able to take
the new traffic load while maintaining the original QoS conditions.
When traffic load becomes a problem, MPLS implements emergency
mechanisms to redirect lower-priority traffic to other links and, in
extreme situations, to drop it altogether. |
| |
| When the RSVP-TE protocol is used for
signaling, the heartbeat error detection method comes
for free because of the soft-state nature of this protocol.
RSVP is a soft-state protocol that requires refreshing,
i.e., state that is not refreshed is torn down. No
error messaging is required, and rapid reroute is possible because
the backup path is already pre-provisioned. With RSVP already in use,
the additional overhead for error recovery is insignificant. |
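The soft-state idea can be sketched in Python as a table whose entries expire unless refreshed, with traffic moved to a pre-provisioned backup on expiry. The `SoftStateTable` class, the LSP names and the 30 ms lifetime are assumptions chosen for illustration; they are not RSVP-TE message formats or real timer values.

```python
class SoftStateTable:
    """Keeps per-LSP state alive only while refreshes keep arriving."""

    def __init__(self, lifetime_ms):
        self.lifetime_ms = lifetime_ms
        self.expires_at_ms = {}   # LSP name -> time at which its state expires
        self.backup_for = {}      # LSP name -> pre-provisioned backup path

    def install(self, lsp, backup, now_ms):
        self.backup_for[lsp] = backup
        self.refresh(lsp, now_ms)

    def refresh(self, lsp, now_ms):
        # A refresh extends the lifetime; refreshes for torn-down LSPs are ignored.
        if lsp in self.backup_for:
            self.expires_at_ms[lsp] = now_ms + self.lifetime_ms

    def expire(self, now_ms):
        """Tear down stale state and return {lsp: backup} for each reroute."""
        stale = [lsp for lsp, t in self.expires_at_ms.items() if now_ms >= t]
        rerouted = {}
        for lsp in stale:
            del self.expires_at_ms[lsp]
            rerouted[lsp] = self.backup_for.pop(lsp)
        return rerouted


table = SoftStateTable(lifetime_ms=30.0)
table.install("lsp-1", backup="lsp-1-backup", now_ms=0.0)
table.install("lsp-2", backup="lsp-2-backup", now_ms=0.0)
table.refresh("lsp-1", now_ms=20.0)     # lsp-2 is never refreshed
rerouted = table.expire(now_ms=35.0)    # only lsp-2 has timed out
remaining = sorted(table.expires_at_ms)
```

No explicit error message is involved: the absence of refreshes is itself the failure signal.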
|
|
|
|
| Thrashing |
| Thrashing is a phenomenon that occurs
when paths are switched back and forth in quick succession. It is caused
by intermittent failures of primary paths combined with pre-programmed
switchback timers. To overcome this problem, the MPLS protocol and
the switches used to implement it must use hold-down timers. For
example, one minute is allowed for the first hold-down time and a
trigger is set such that, on the second switchback, operator
intervention is required to perform the next switchover, which prevents
thrashing. |
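A hold-down timer with an operator lockout can be sketched in Python as follows. The `SwitchbackController` class, its method names and the reading that only the first switchback is automatic are assumptions made for illustration; the 60-second hold-down mirrors the one-minute example in the text.

```python
class SwitchbackController:
    """Delays switchback to the primary path and limits automatic switchbacks."""

    def __init__(self, hold_down_s=60.0, max_auto_switchbacks=1):
        self.hold_down_s = hold_down_s
        self.max_auto_switchbacks = max_auto_switchbacks
        self.active = "primary"
        self.recovered_at_s = None    # when the primary last came back up
        self.auto_switchbacks = 0

    def primary_failed(self):
        # Switch over at once and cancel any pending switchback.
        self.active = "backup"
        self.recovered_at_s = None

    def primary_recovered(self, now_s):
        # Start the hold-down timer; do not switch back yet.
        if self.recovered_at_s is None:
            self.recovered_at_s = now_s

    @property
    def needs_operator(self):
        # True once the automatic switchback allowance has been used up.
        return (self.active == "backup"
                and self.auto_switchbacks >= self.max_auto_switchbacks)

    def tick(self, now_s):
        """Switch back only after the primary has been stable for the hold-down."""
        if self.active != "backup" or self.recovered_at_s is None:
            return self.active
        if self.needs_operator:
            return self.active
        if now_s - self.recovered_at_s >= self.hold_down_s:
            self.active = "primary"
            self.recovered_at_s = None
            self.auto_switchbacks += 1
        return self.active

    def operator_switchback(self):
        # Manual intervention restores the primary and resets the counter.
        self.active = "primary"
        self.recovered_at_s = None
        self.auto_switchbacks = 0


ctl = SwitchbackController()
ctl.primary_failed()
ctl.primary_recovered(now_s=5.0)
ctl.primary_failed()                  # primary flaps again before the timer ends
ctl.primary_recovered(now_s=12.0)
during_hold_down = ctl.tick(now_s=30.0)   # only 18 s of stability so far
after_hold_down = ctl.tick(now_s=72.0)    # 60 s of stability: switch back

ctl.primary_failed()
ctl.primary_recovered(now_s=100.0)
second_attempt = ctl.tick(now_s=500.0)    # allowance used: stays on backup
locked = ctl.needs_operator
```

Because the flap at the start restarts the timer, the path does not bounce, and the second switchback waits for `operator_switchback()` instead of happening automatically.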