Qian's Blog

从Pingmesh引申到Linux中TCP重传等待时间（RTO）

2017-04-01 #TCP #Linux #RTO

关于Pingmesh

之前读了微软发表在SIGCOMM上的关于网络质量监控的pingmesh的论文，所以现在我们自己正在自己开发一套这玩意，让网络组的同学可以回答“网络到底有没有问题”这个问题从而减少网络组与业务开发之间的撕逼。

其中提到了一个关于丢包率的算法，这个算法并不是交换机的数据，也不是传统的丢包率，因为交换机可能会不可靠，也无法知道是否是应用层的丢包，所以呢采取了TCP ping和HTTP ping的方式，然后根据连接建立成功的时间也就是RTT来判断是否丢包，连接建立成功的RTT也就是请求方发出SYN，接收方收到后回复一个ACK，请求方受到ACK这个时间。丢包率公式如下：

（一次探测失败 + 两次探测失败） ／ （成功次数）

其中呢，一次失败和两次失败在计算丢包率的时候都只算一次，而且不统计探测超时的情况。判断一次探测失败的规则是这个连接的时间超过了3s，而判断两次失败的规则是这个连接的时间超过了9s。

为什么要这样呢，论文中提到，不统超时的探测是因为超时的原因有多种，可能是应用有问题，也有可能网络原因也有可能是其他原因，所以统计了之后所反映的并不是准确的丢包率。只算一次失败和两次失败的丢包是因为微软的机房只重连两次，第二次不通就算连接超时，不再重试了。而第一次重试时间的重传等待时间是3s，第二次是9s，所以判断是否重传是根据包的连接成功所耗费的时间 而并不是传统意义的丢包率。至于为什么丢包两次也只算一次，这是因为，在第一次丢包的情况下，发生第二次丢包的概率非常大，他们很有可能是相同的原因导致，所以算两次丢包会对用户有一些误导。

好了问题来了，因为微软的机房大部分都是Windows的服务器，可以沿用这套公式，但是在Linux下就不适用了，因为Linux的丢包重传算法与Windows不同。我们经过抓包分析后发现Linux的重传的间隔时间是，1s，3s，7s，15s，31s，63s。

但是看到网上有写文章说Linux的间隔为3s，9s，21s，这样与windows是一致，这就疑惑了，为啥会有两套说辞，所以开始查阅资料，

Linux如何判断超时的时间

我们可以先看看超时时间计算的内核源码如下:

#define TCP_RTO_MAX     ((unsigned)(120*HZ))
#define TCP_RTO_MIN     ((unsigned)(HZ/5))
#define TCP_TIMEOUT_INIT ((unsigned)(1*HZ))     /* RFC2988bis initial RTO value */
#define TCP_TIMEOUT_FALLBACK ((unsigned)(3*HZ)) /* RFC 1122 initial RTO value, now
                                                 * used as a fallback RTO for the
                                                 * initial data transmission if no
                                                 * valid RTT sample has been acquired,
                                                 * most likely due to retrans in 3WHS.
                                                 */

其中提到了现在的initial RTO的值为1HZ，Linux下HZ是1000个Tick，也就是1s，这是RFC2988bis规定的，但是之前RFC1122 规定了RTO初始值是3HZ，现在这个是fallback RTO的值。

有了初始值之后，接下来每次的增长就是指数增长了，也就是说接下来每次的间隔是2，4，8，16，可以看下源码中对超时判断的实现

/**
 *  retransmits_timed_out() - returns true if this connection has timed out
 *  @sk:       The current socket
 *  @boundary: max number of retransmissions
 *  @timeout:  A custom timeout value.
 *             If set to 0 the default timeout is calculated and used.
 *             Using TCP_RTO_MIN and the number of unsuccessful retransmits.
 *  @syn_set:  true if the SYN Bit was set.
 *
 * The default "timeout" value this function can calculate and use
 * is equivalent to the timeout of a TCP Connection
 * after "boundary" unsuccessful, exponentially backed-off
 * retransmissions with an initial RTO of TCP_RTO_MIN or TCP_TIMEOUT_INIT if
 * syn_set flag is set.
 */
static bool retransmits_timed_out(struct sock *sk,
                                  unsigned int boundary,
                                  unsigned int timeout,
                                  bool syn_set)
{
        unsigned int linear_backoff_thresh, start_ts;
        unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : TCP_RTO_MIN; //SYN包则rto_base为 TCP_TIMEOUT_INIT也就是1s，否则为200ms

        if (!inet_csk(sk)->icsk_retransmits)
                return false;

        start_ts = tcp_sk(sk)->retrans_stamp;
        if (unlikely(!start_ts))
                start_ts = tcp_skb_timestamp(tcp_write_queue_head(sk));

        if (likely(timeout == 0)) {
                linear_backoff_thresh = ilog2(TCP_RTO_MAX/rto_base); // 指数和线性增长的阈值

                if (boundary <= linear_backoff_thresh)
                        timeout = ((2 << boundary) - 1) * rto_base; // 小于阈值指数增长
                else
                        timeout = ((2 << linear_backoff_thresh) - 1) * rto_base +  //大于阈值线性增长
                                (boundary - linear_backoff_thresh) * TCP_RTO_MAX;
        }
        return (tcp_time_stamp - start_ts) >= timeout; //当前的时间戳减去开始时间
}

阈值为log2(120)=9，所以可以看到如果是SYN包的话以默认重传5次来说，SYN的超时时间为63”；而普通包默认15次，初始值200ms来计算的话超时时间是15’25”，在这期间就会一直重发，重发的间隔RTO则用的RFC6298里面的规则，具体有些微调，可以阅读[Calculating TCP RTO] (http://sgros.blogspot.com/2012/02/calculating-tcp-rto.html)和RTO对tcp超时的影响，了解更多的细节。总的来说，增加的时候正常增加基本不干涉，但是下降的手要保证RTO的平滑。

三次握手

由于pingmesh的丢包率是用三次握手建立连接的时间来计算的，所以下面研究了三次握手的详细过程，首先三次握手内核函数调用流程图如下 three handshakes

首先是掉用 tcp_v4_connect，接着调用ip_route_connect，探测IP层的路由是否可达，然后发送syn包，并且将状态设置为SYN_SENT，然后调用tcp_connect函数，这个函数首先调用tcp_transmit_skb将包发送到IP层，然后调用inet_csk_reset_xmit_timer，来启动一个内核的计时器，使得在超时时间内，可以重复的发送未收到ack的包。

设置重传定时器后，每当定时器结束，则会调用重传的处理函数tcp_retransmit_timer_handler，函数如下

/* Called with bottom-half processing disabled.
   Called by tcp_write_timer() */
void tcp_write_timer_handler(struct sock *sk)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	int event;

  // state 是CLOSE或者未安装定时器则直接go out
	if (((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN)) ||
	    !icsk->icsk_pending)
		goto out;

	if (time_after(icsk->icsk_timeout, jiffies)) {
		sk_reset_timer(sk, &icsk->icsk_retransmit_timer, icsk->icsk_timeout);
		goto out;
	}

	event = icsk->icsk_pending;

	switch (event) {
	case ICSK_TIME_EARLY_RETRANS:
		tcp_resume_early_retransmit(sk);
		break;
	case ICSK_TIME_LOSS_PROBE:
		tcp_send_loss_probe(sk);
		break;
	case ICSK_TIME_RETRANS: //正常情况的重传
		icsk->icsk_pending = 0;
		tcp_retransmit_timer(sk);
		break;
	case ICSK_TIME_PROBE0:
		icsk->icsk_pending = 0;
		tcp_probe_timer(sk);
		break;
	}

out:
	sk_mem_reclaim(sk);
}

可以看到正常的重传逻辑还是在tcp_retransmit_timer函数中

/**
 *  tcp_retransmit_timer() - The TCP retransmit timeout handler
 *  @sk:  Pointer to the current socket.
 *
 *  This function gets called when the kernel timer for a TCP packet
 *  of this socket expires.
 *
 *  It handles retransmission, timer adjustment and other necesarry measures.
 *
 *  Returns: Nothing (void)
 */
void tcp_retransmit_timer(struct sock *sk)
{
    //...
    // 若开启了TFO，则重传SYN／ACK
	if (tp->fastopen_rsk) { 
		WARN_ON_ONCE(sk->sk_state != TCP_SYN_RECV &&
			     sk->sk_state != TCP_FIN_WAIT1);
		tcp_fastopen_synack_timer(sk);
		/* Before we receive ACK to our SYN-ACK don't retransmit
		 * anything else (e.g., data or FIN segments).
		 */
		return;
	}
    // 包是否已经全部确认
	if (!tp->packets_out)
		goto out;

    // ... 

	if (!tp->snd_wnd && !sock_flag(sk, SOCK_DEAD) &&
	    !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))) {
		/* Receiver dastardly shrinks window. Our retransmits
		 * become zero probes, but we should not timeout this
		 * connection. If the socket is an orphan, time it out,
		 * we cannot allow such beasts to hang infinitely.
		 */
		struct inet_sock *inet = inet_sk(sk);
		if (sk->sk_family == AF_INET) {
			net_dbg_ratelimited("Peer %pI4:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
					    &inet->inet_daddr,
					    ntohs(inet->inet_dport),
					    inet->inet_num,
					    tp->snd_una, tp->snd_nxt);
		}

        //...
        // 如果超过了最长时间还未收到对端的确认，则报错，且关闭连接
		if (tcp_time_stamp - tp->rcv_tstamp > TCP_RTO_MAX) {
			tcp_write_err(sk);
			goto out;
		}
		tcp_enter_loss(sk); // 进入拥塞控制的LOSS阶段
		tcp_retransmit_skb(sk, tcp_write_queue_head(sk), 1); // 重传发送队列的手包
		__sk_dst_reset(sk);
		goto out_reset_timer;
	}

	if (tcp_write_timeout(sk)) // 重传等待时间超时或者orphan socket消耗资源过多
		goto out;

    // 第一次重传
	if (icsk->icsk_retransmits == 0) {
		int mib_idx;
        //更新MIB数据库
	}

	tcp_enter_loss(sk); // 进入拥塞控制loss状态

    // 重传发送队列首包失败，不退避，直接重设timer
	if (tcp_retransmit_skb(sk, tcp_write_queue_head(sk), 1) > 0) {
		/* Retransmission failed because of local congestion,
		 * do not backoff.
		 */
		if (!icsk->icsk_retransmits)
			icsk->icsk_retransmits = 1;
		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
					  min(icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
					  TCP_RTO_MAX);
		goto out;
	}

	/* Increase the timeout each time we retransmit.  Note that
	 * we do not increase the rtt estimate.  rto is initialized
	 * from rtt, but increases here.  Jacobson (SIGCOMM 88) suggests
	 * that doubling rto each time is the least we can get away with.
	 * In KA9Q, Karn uses this for the first few times, and then
	 * goes to quadratic.  netBSD doubles, but only goes up to *64,
	 * and clamps at 1 to 64 sec afterwards.  Note that 120 sec is
	 * defined in the protocol as the maximum possible RTT.  I guess
	 * we'll have to use something other than TCP to talk to the
	 * University of Mars.
	 *
	 * PAWS allows us longer timeouts and large windows, so once
	 * implemented ftp to mars will work nicely. We will have to fix
	 * the 120 second clamps though!
	 */
    // 指数退避
	icsk->icsk_backoff++;
	icsk->icsk_retransmits++;

out_reset_timer:
	/* If stream is thin, use linear timeouts. Since 'icsk_backoff' is
	 * used to reset timer, set to 0. Recalculate 'icsk_rto' as this
	 * might be increased if the stream oscillates between thin and thick,
	 * thus the old value might already be too high compared to the value
	 * set by 'tcp_set_rto' in tcp_input.c which resets the rto without
	 * backoff. Limit to TCP_THIN_LINEAR_RETRIES before initiating
	 * exponential backoff behaviour to avoid continue hammering
	 * linear-timeout retransmissions into a black hole
	 */
	if (sk->sk_state == TCP_ESTABLISHED &&
	    (tp->thin_lto || sysctl_tcp_thin_linear_timeouts) &&
	    tcp_stream_is_thin(tp) &&
	    icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
		icsk->icsk_backoff = 0;
		icsk->icsk_rto = min(__tcp_set_rto(tp), TCP_RTO_MAX);
	} else {
		/* Use normal (exponential) backoff */
		icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX); // double一下超时时间
	}
	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
	if (retransmits_timed_out(sk, net->ipv4.sysctl_tcp_retries1 + 1, 0, 0))
		__sk_dst_reset(sk);

out:;
}

所以每次发送的间隔为1s、2s、4s、8s

结论

网上流传的两个版本的重传RTT的时间关系是因为之前的Linux版本中初始值为3s，现在的版本都是初始值为1s，然后每次指数增长，默认次数为5，所以最多是等待63s。如果需要修改初始值必须重新编译内核，但是我们可以通过限制重传的次数，以及RTO对tcp超时的影响提到的修改路由表的rto_min的方法，来修改重传等待的时间由于在重连的计算是以上次RTT为基准计算的，所以，连接还未建立的时候，3WHS时不受此规则影响，会一直指数退避

References

RTO对tcp超时的影响(提到了RTO计算方法，认为修改方法)

Calculating TCP RTO…(详细介绍了RTO计算方法）

TCP内核源码分析笔记（介绍了tcp连接中内核的函数调用，以及关键函数）

重传定时器(介绍了Linux重传定时器的设置原因，时间点和作用)

Under Standing Linux Kernel page236 介绍了Linux的计时器架构

Professional Linux Kernel Architecture 1st Edition page788 Transport Layer