Previous

Content

Next 


2.0.- Congestion Avoidance and Control

 

In november, 1988, Jacobson & Karels wrote Congestion Avoidance and Control, where they described five of seven new algorithms included into BSD TCP 4.0. Next is a summary of their paper.

The seven new algorithms included into BSD TCP 4.0. were:
  1. Round-trip time variance estimation.
  2. Exponential retransmit timer backoff.
  3. Slow-Start.
  4. More aggressive receiver ACK policy.
  5. Dynamic window sizing on congestion.
  6. Karn's clamped retransmit backoff.
  7. Fast retransmitt.
The flow on a TCP connection should obey a conservation of packets principle; if this principle were obey, congestion collapse would become the exception rather than the rule. By conservation of packets we mean that for a connection in equilibrium, i.e., running stably with a full window of data in transit, the packet flow is what physicist would call conservative, i.e., a new packet isn't put into the network until an old packet leaves.
There are only three ways for packet conservation to fail:
  1. The connection doesn't get to equilibrium.
  2. A sender injects a new packet before an old one has exited.
  3. Equilibrium can't be reached because of lack of path resources.
Failure 1 has to be from a connection that is either starting or restarting after a packet loss. Another way to look at the conservation property is to say that the sender uses acks as a clock to strobe new packets into the network. Since the receiver can generate acks no faster than data packets can get through the network, the protocol is self-clocking. But the same thing that makes a self-clocked system stable when it's running, makes it hard to start; i.e., to get data flowing there must be acks to clock out packets, but to get acks there must be data flowing. To start the clock the slow-start algorithm was developed to gradually increase the amount of data in transit:
  • Add a congestion window, cwnd, to the per-connection state.
  • When starting or restarting after a loss, set cwnd to one packet.
  • On each ack for new data, increase cwnd by one packet.
  • When sending, send the minimum of the receiver's advertised window and cwnd.
The algorithm guarantee that a connection will send data at a rate at most twice the maximum possible on the path.
Failure 2 has to deal with conservation at equilibrium; it requires a good round-trip timing. A good round trip time estimator, the core of the retransmit timer, is the single most important feature of any protocol implementation that expects to survive heavy load. One mistake is not estimating the variation, s, of the round trip time, RTT. From queueing theory we know that RTT and the variation of RTT increase quickly with load.
The TCP protocol specification suggests estimating mean round trip time via the low-pass filter:
  SRTT = α * SRTT + ( 1-α ) * RTT
Where SRTT is the average RTT estimate, RTT is a round trip time measurement from the most recently acked data packet, and α is a filter gain constant with a suggested value of 0.9. Once the SRTT estimate is updated, the retransmit timeout interval, RTO, for the next packet sent is set to β * SRTT. The parameter β account for RTT variation. The suggested β=2 can adapt to loads of at most 30%. Above this point, a connection will respond to load increases by retransmitting packets that have only be delayed in transit. This force the network to do useless work, wasting bandwidth on duplicates of packets that will eventually be delivered, at a time when it's known to be having trouble with useful work.  

Van Jacobson & Karels developed a cheap method for estimating variation, and the resulting retransmit timer essentially eliminate spurious retransmissions. A pleasant side effect of estimating β, rather than using a fixed value, is that low load as well as high load performance improves, particulary over high delay paths. The method basically calculates RTO using a dynamic value of β instead of a fixed value (β=2). The algorithm calculates the mean deviation of SRTT (average of differences) and uses this calculation to estimate a value of RTO based on this mean deviation.
  RTT = round-trip time.
SRTT = smoothed value of RTT.
RTO = round-trip time timeout.

SRTT(i+1) = (1-α) * SRTT(i) +  α * RTT(i+1)
RTO = β * SRTT
Now β is dynamic, and its value depends of mean deviation of SRTT values respect to real measured values.
In equation SRTT(i) = (1-α) * SRTT(i-1) + α * RTT(i), it's probably obvious that SRTT will oscillate randomly around the true average and we can get one standard deviation of SRTT to be sure we cover most of posibilities.
The standard deviation of SRTT will be α * stddev(RTT). Instead of using standard deviation is better to use mean deviation, because:
  1. Takes less cpu resources to be calculated; it doesn't require to take the square root of a square.
  2. It's more conservative (i.e. larger) estimate of variation than standard deviation.

There's often a simple relation between meandev and stddev; if the prediction error are normally distributed then:

   meandev = sqrt(π)/2 * stddev ~ 1.25 * stddev.

This way mean deviation is a good aproximation of standard deviation and is much easier to compute.

The new algorithm proposed to calculate RTO(i) (at time i) is as follows:
  RTO(i) = RTT(i) + 4 * V(i)
Where RTT(i) is round-trip time at time i and V(i) mean deviation at time i. This algorithm was designed to prevent spurious retransmission timeout during slow-start. A spurious retransmission occurs if the RTO(i) computed at the end of a slow-start round i, is ever less than or equal to the actual RTT of the next round. In the worst case of all the delay been due to window, RTT doubles each round (since the window size doubles). Thus:
 
  RTT(i+1) = 2 * RTT(i)
V(i) = RTT(i) - RTT(i-1) = RTT(i) - ½ * RTT(i) = ½ * RTT(i)

Then if we define RTO(i) as RTT(i) + 4 * V(i) we have:

  RTO(i) = RTT(i) + 4 * V(i)
RTO(i) = RTT(i) + 4 * ½ * RTT(i) = 3 * RTT(i)
RTO(i) = 3 * RTT(i) > 2 * RTT(i)
But because 2 * RTT(i) = RTT(i+1) then:
  RTO(i) > RTT(i+1)
So spurious retransmit timeouts cannot occur.
Finally talking about the failure 3, if the timers are in good shape, it's possible to state with some confidence that a timeout indicates a lost packet and not a broken timer. But in actual networks 99% of packet loss is due to congestion in the network. A congestion avoidance strategy will have two component:
  1. The network must be able to signal the transport endpoints that congestion is occurring (or about to occur).
  2. The endpoints must have a policy that decreases utilization if the signal is received and increase utilization if the signal isn't received.
Packet loss is almost always due to congestion and a timeout is almost always due to a lost packet, then a timeout is almost always a signal that the network is congested. If network load is measure by average queue length over fixed intervals of time near the round-trip time, an uncongested network can be modeled by saying that queue length changes slowly compared to the sampling time; i.e. L(i) = N (where N is a constant). If the network is subject to congestion, this zeroth order model breaks down. The average queue length becomes the sum of two terms, the N above that account for the average arrival rate of new traffic and intrinsic delay, and a new term that account for the fraction of traffic left over from the last time interval and the effect of this left-over traffic:
  L(i) = N + g * L(i-1)
When the network is congested, g must be large and the queue length will start increasing exponentially. The system will stabilize only if the traffic source throttle back at least as quickly as the queues are growing. Since a source controls load in a window-based protocol by adjusting the size of the window, W, we end up with the sender policy:
  On congestion: W(i) = d * W(i-1) where d < 1
The network announces, via a dropped packet, when demand is excessive but says nothing if a connection is using less than its fair share; thus a connection has to increase its bandwidth utilization to find out the current limit. What should be the increase policy be? Without justification we'll state that the best increase policy it to make small, constant changes to the window size:
  On no congestion: W(i) = W(i-1) + μ where μ <= Wmax
Where Wmax is the pipesize (the delay-bandwidth product of the path minus protocol overhead - i.e. the largest sensible window for the unloaded path). For implementing the congestion avoidance algorithm the authors selected d=0.5 and m=1.0. The algorithm is then:
  1. On any timeout, set cwnd to half the current window size (this is a multiplicative decrease).
  2. On each ack for new data, increase cwnd by 1/cwnd (this is the additive increase).
  3. When sending, send the minimum of the receiver's advertised window and cwnd.
Combined slow-start and congestion avoidance algorithm
Sender keeps two state variables: a congestion window (cwnd) and a threshold size (ssthresh) to switch between the two algorithms. Sender's output routine always sends the minimum of cwnd and the receiver's advertised window. On a timeout ssthresh is set to ½ of the current window size (this is the multiplicative decrease part of the congestion avoidance algorithm); then cwnd is set to 1 packet initiating again slow-start. When new data is acked, the sender does:
 
If cwnd < ssthresh System is doing slow-start; window must be opened exponentially; then for each ack -> cwnd += 1.
 
else System is doing congestion avoidance; window must be opened additively; then for each ack -> cwnd += 1/cwnd.
Thus slow-start opens the window quickly to what congestion avoidance thinks is a safe operating point (half the window that got us into trouble), the congestion avoidance takes over and slowly increase the window size to probe more bandwidth becoming available on the path.
Authors also note that some higher level protocols like SMTP and NNTP generates some spurious retransmissions because they have a "negotiation" phase were a few packets are exchanged stop-and-wait, following by a data transfer phase where all of a mail message or news article is sent. Unfortunately, the "negotiation" exchanges open the congestion window so the start of the data transfer phase will dump several packets into the network with no slow-start and, on a bandwidth dominated path (low bandwidth), faster than RTO can trace the RTT increase caused by these packets.
The root cause of this problem is dumping too many packets into an empty pipe (the pipe is empty since the negotiation exchange was conducted stop-and-wait) with no ack "clock". The fix propose to this problem is going back to the slow-start phase. Detection is simple: pipe is empty because we haven't sent anything for at least a round-trip time, so if nothing has been sent for at least one RTT, the next send should set cwnd to one packet to force a slow-start. i.e. if the connection state variable lastsnd holds the time the last packet was sent, the following code should appear early in the TCP output routine:
  int idle = (snd_max == snd_una);
if ( idle && now() - lastsnd > RTO)
    cwnd = 1;
Boolean idle is true if there is no data in transit (all data sent has been acked) so the if says: if there's nothing in transit and we haven't sent anything for a long time (long time > RTO), then slow-start.
Note: these algorithms were included in what was called Tahoe TCP implementation.

 

 


Previous

Content

Next