|
Previous
|
Content
|
Next
|
|
|
2.0.-
Congestion Avoidance and Control |
|
 |
|
|
In november, 1988, Jacobson & Karels wrote
Congestion Avoidance and Control, where they described five of seven new
algorithms included into BSD TCP 4.0. Next is a summary of their paper. |
| The seven new algorithms included into BSD TCP
4.0. were: |
- Round-trip time variance estimation.
- Exponential retransmit timer backoff.
- Slow-Start.
- More aggressive receiver ACK policy.
- Dynamic window sizing on congestion.
- Karn's clamped retransmit backoff.
- Fast retransmitt.
|
| The flow on a TCP connection should obey a
conservation of packets principle; if this principle were obey,
congestion collapse would become the exception rather than the rule. By
conservation of packets we mean that for a connection in equilibrium,
i.e., running stably with a full window of data in transit, the packet flow
is what physicist would call conservative, i.e., a new packet
isn't put into the network until an old packet leaves. |
| There are only three ways for packet
conservation to fail: |
- The connection doesn't get to equilibrium.
- A sender injects a new packet before an old one has exited.
- Equilibrium can't be reached because of lack of path resources.
|
| Failure 1 has to be from a connection that is
either starting or restarting after a packet loss. Another way to look at
the conservation property is to say that the sender uses acks as a clock
to strobe new packets into the network. Since the receiver can generate acks
no faster than data packets can get through the network, the protocol is
self-clocking. But the same thing that makes a self-clocked system
stable when it's running, makes it hard to start; i.e., to get data flowing
there must be acks to clock out packets, but to get acks there must be data
flowing. To start the clock the slow-start algorithm was developed to
gradually increase the amount of data in transit: |
- Add a congestion window, cwnd, to the per-connection state.
- When starting or restarting after a loss, set cwnd to one
packet.
- On each ack for new data, increase cwnd by one packet.
- When sending, send the minimum of the receiver's advertised window
and cwnd.
|
| The algorithm guarantee that a connection will
send data at a rate at most twice the maximum possible on the path. |
| Failure 2 has to deal with conservation at
equilibrium; it requires a good round-trip timing. A good
round trip time estimator, the core of the retransmit timer, is the single
most important feature of any protocol implementation that expects to
survive heavy load. One mistake is not estimating the variation, s,
of the round trip time, RTT. From queueing theory we know that RTT
and the variation of RTT increase quickly with load. |
| The TCP protocol specification suggests
estimating mean round trip time via the low-pass filter: |
| |
SRTT =
α * SRTT + (
1-α ) * RTT |
|
| Where SRTT is the average RTT
estimate, RTT is a round trip time measurement from the most recently
acked data packet, and α is a filter
gain constant with a suggested value of 0.9. Once the SRTT estimate
is updated, the retransmit timeout interval, RTO, for the next packet
sent is set to β * SRTT. The
parameter β account for RTT
variation. The suggested β=2 can
adapt to loads of at most 30%. Above this point, a connection will
respond to load increases by retransmitting packets that have only be
delayed in transit. This force the network to do useless work, wasting
bandwidth on duplicates of packets that will eventually be delivered, at a
time when it's known to be having trouble with useful work. |
|
|
|
| Van Jacobson & Karels developed a cheap
method for estimating variation, and the resulting retransmit timer
essentially eliminate spurious retransmissions. A pleasant side effect of
estimating β, rather than using a
fixed value, is that low load as well as high load performance improves,
particulary over high delay paths. The method basically calculates RTO
using a dynamic value of β
instead of a fixed value (β=2).
The algorithm calculates the mean deviation of SRTT (average of
differences) and uses this calculation to estimate a value of RTO
based on this mean deviation. |
| |
RTT = round-trip time.
SRTT = smoothed value of RTT.
RTO = round-trip time timeout.
SRTT(i+1) = (1-α) * SRTT(i) +
α * RTT(i+1)
RTO = β * SRTT |
|
| Now β
is dynamic, and its value depends of mean deviation of SRTT values
respect to real measured values. |
| In equation SRTT(i) = (1-α)
* SRTT(i-1) + α * RTT(i), it's
probably obvious that SRTT will oscillate randomly around the true
average and we can get one standard deviation of SRTT to be sure
we cover most of posibilities. |
The standard deviation of SRTT will be
α * stddev(RTT). Instead of
using standard deviation is better to use mean deviation, because:
- Takes less cpu resources to be calculated; it doesn't require to take
the square root of a square.
- It's more conservative (i.e. larger) estimate of variation than
standard deviation.
There's often a simple relation between meandev and stddev;
if the prediction error are normally distributed then:
| |
meandev = sqrt(π)/2
* stddev ~ 1.25 * stddev. |
|
This way mean deviation is a good aproximation of standard
deviation and is much easier to compute. |
| The new algorithm proposed to calculate
RTO(i) (at time i) is as follows: |
| |
RTO(i) = RTT(i) + 4 * V(i) |
|
Where RTT(i) is round-trip time at time
i and V(i) mean deviation at time i. This algorithm was
designed to prevent spurious retransmission timeout during slow-start. A
spurious retransmission occurs if the RTO(i) computed at the end of a
slow-start round i, is ever less than or equal to the actual
RTT of the next round. In the worst case of all the delay been due
to window, RTT doubles each round (since the window size doubles).
Thus:
| |
RTT(i+1) = 2 * RTT(i)
V(i) = RTT(i) - RTT(i-1) = RTT(i) - ½ * RTT(i) = ½ * RTT(i) |
|
Then if we define RTO(i) as RTT(i) + 4 * V(i) we have: |
| |
RTO(i) = RTT(i) + 4 * V(i)
RTO(i) = RTT(i) + 4 * ½ * RTT(i) = 3 * RTT(i)
RTO(i) = 3 * RTT(i) > 2 * RTT(i) |
|
| But because 2 * RTT(i) = RTT(i+1) then: |
|
|
| So spurious retransmit timeouts cannot occur. |
| Finally talking about the failure 3, if the
timers are in good shape, it's possible to state with some confidence that a
timeout indicates a lost packet and not a broken timer. But in actual
networks 99% of packet loss is due to congestion in the network. A
congestion avoidance strategy will have two component: |
- The network must be able to signal the transport endpoints that
congestion is occurring (or about to occur).
- The endpoints must have a policy that decreases utilization if the
signal is received and increase utilization if the signal isn't received.
|
| Packet loss is almost always due to congestion
and a timeout is almost always due to a lost packet, then a timeout is
almost always a signal that the network is congested. If network
load is measure by average queue length over fixed intervals of time near
the round-trip time, an uncongested network can be modeled by saying that
queue length changes slowly compared to the sampling time; i.e. L(i) = N
(where N is a constant). If the network is subject to congestion, this
zeroth order model breaks down. The average queue length becomes the sum of
two terms, the N above that account for the average arrival rate of
new traffic and intrinsic delay, and a new term that account for the
fraction of traffic left over from the last time interval and the effect of
this left-over traffic: |
|
|
| When the network is congested, g must be
large and the queue length will start increasing exponentially. The
system will stabilize only if the traffic source throttle back at least as
quickly as the queues are growing. Since a source controls load in a
window-based protocol by adjusting the size of the window, W, we end
up with the sender policy: |
| |
On congestion: W(i) = d * W(i-1) where d < 1 |
|
| The network announces, via a dropped packet,
when demand is excessive but says nothing if a connection is using less than
its fair share; thus a connection has to increase its bandwidth utilization
to find out the current limit. What should be the increase policy be?
Without justification we'll state that the best increase policy it to make
small, constant changes to the window size: |
| |
On no congestion: W(i) = W(i-1) +
μ where
μ <= Wmax |
|
| Where Wmax is the pipesize (the
delay-bandwidth product of the path minus protocol overhead - i.e. the
largest sensible window for the unloaded path). For implementing the
congestion avoidance algorithm the authors selected d=0.5 and
m=1.0. The algorithm is then: |
- On any timeout, set cwnd to half the current window size (this
is a multiplicative decrease).
- On each ack for new data, increase cwnd by 1/cwnd (this
is the additive increase).
- When sending, send the minimum of the receiver's advertised window
and cwnd.
|
| Combined slow-start and congestion avoidance
algorithm |
| Sender keeps two state variables: a congestion
window (cwnd) and a threshold size (ssthresh) to switch
between the two algorithms. Sender's output routine always sends the minimum
of cwnd and the receiver's advertised window. On a timeout
ssthresh is set to ½ of the current window size (this is the
multiplicative decrease part of the congestion avoidance algorithm); then
cwnd is set to 1 packet initiating again slow-start. When
new data is acked, the sender does: |
| |
| If
cwnd < ssthresh |
System
is doing slow-start; window must be opened exponentially; then for
each ack -> cwnd += 1.
|
|
else |
System
is doing congestion avoidance; window must be opened additively;
then for each ack -> cwnd += 1/cwnd. |
|
|
| Thus slow-start opens the window quickly
to what congestion avoidance thinks is a safe operating point (half
the window that got us into trouble), the congestion avoidance takes
over and slowly increase the window size to probe more bandwidth becoming
available on the path. |
| Authors also note that some higher level
protocols like SMTP and NNTP generates some spurious
retransmissions because they have a "negotiation" phase were a few packets
are exchanged stop-and-wait, following by a data transfer phase where all of
a mail message or news article is sent. Unfortunately, the "negotiation"
exchanges open the congestion window so the start of the data transfer
phase will dump several packets into the network with no slow-start
and, on a bandwidth dominated path (low bandwidth), faster than RTO
can trace the RTT increase caused by these packets. |
| The root cause of this problem is dumping
too many packets into an empty pipe (the pipe is empty since the
negotiation exchange was conducted stop-and-wait) with no ack "clock". The
fix propose to this problem is going back to the slow-start phase.
Detection is simple: pipe is empty because we haven't sent anything for at
least a round-trip time, so if nothing has been sent for at least one
RTT, the next send should set cwnd to one packet to force
a slow-start. i.e. if the connection state variable lastsnd
holds the time the last packet was sent, the following code should appear
early in the TCP output routine: |
| |
int idle = (snd_max
== snd_una);
if ( idle && now() - lastsnd > RTO)
cwnd = 1; |
|
| Boolean idle is true if there is no data
in transit (all data sent has been acked) so the if says: if there's
nothing in transit and we haven't sent anything for a long time (long
time > RTO), then slow-start. |
| Note: these algorithms were included in what
was called Tahoe TCP implementation. |
|
|
|
|
|
Previous
|
Content
|
Next
|