SystemTap Examples, Network - 5: Monitoring Network Packet Drops in the Kernel

Background

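This example comes from the dropwatch.stp script, which can be used to pinpoint exactly where packets are dropped in the network stack. The location is reported as function information, obtained by translating a memory address with symname or symdata. For the translation to work, stap must be run with the --all-modules option so that the symbol tables of all modules are loaded.  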
       --all-modules  
              Equivalent to specifying "-dkernel" and a "-d" for each kernel module that is currently loaded.  
              Caution: this can make the probe modules considerably larger.  
  
Script contents and annotations  
[root@db-172-16-3-150 network]# cd /usr/share/systemtap/testsuite/systemtap.examples/network  
[root@db-172-16-3-150 network]# cat dropwatch.stp  
#!/usr/bin/stap  
  
############################################################  
# Dropwatch.stp  
# Author: Neil Horman <nhorman@redhat.com>  
# An example script to mimic the behavior of the dropwatch utility  
# http://fedorahosted.org/dropwatch  
############################################################  
  
# Array to hold the list of drop points we find  
global locations  
  
# Note when we turn the monitor on and off  
probe begin { printf("Monitoring for dropped packets\n") }  
probe end { printf("Stopping dropped packet monitor\n") }  
  
# increment a drop counter for every location we drop at  
probe kernel.trace("kfree_skb") { locations[$location] <<< 1 }  
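// The locations array is indexed by $location, the location argument recorded each time kfree_skb is called, i.e. a memory address covered by the symbol table of the kernel or of a module;  
// symname() or symdata() can translate this address into symbol information.  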
  
# Every 5 seconds report our drop locations  
probe timer.sec(5)  
{  
  printf("\n")  
  foreach (l in locations-) {  
    printf("%d packets dropped at %s\n",  
           @count(locations[l]), symname(l))  
  }  
  delete locations  
}  
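// A report is printed every 5 seconds.  
// Entries are printed in descending order of @count(locations[l]).  
// Each line shows the symbol name and the number of packets dropped there.  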
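
The counting and reporting logic of the script can be modelled outside the kernel. The following is a minimal Python sketch, not part of dropwatch.stp: `Counter` stands in for the `locations` statistics array, `record_drop` and `report` are hypothetical names chosen for illustration, and the addresses are copied from the sample output in this article.

```python
from collections import Counter

# Stand-in for the SystemTap global "locations" array (a model, not the real thing).
locations = Counter()

def record_drop(location):
    """Model of: locations[$location] <<< 1"""
    locations[location] += 1

def report():
    """Model of the timer probe: list drop points by descending count, then reset."""
    lines = ["%d packets dropped at %#x" % (count, addr)
             for addr, count in locations.most_common()]
    locations.clear()  # same effect as "delete locations"
    return lines

# Replay one 5-second interval using addresses from the sample output.
for addr in [0xffffffff814a104a] * 5 + [0xffffffff8147483b, 0xffffffff814dc92c]:
    record_drop(addr)

interval = report()
print("\n".join(interval))
```

Because the array is cleared after every report, each block of output counts only the drops seen in that interval.
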
Without the module symbols loaded, symname cannot translate the addresses into function names.  
[root@db-172-16-3-150 network]# stap ./dropwatch.stp   
Monitoring for dropped packets  
  
5 packets dropped at 0xffffffff814a104a  
1 packets dropped at 0xffffffff8147483b  
1 packets dropped at 0xffffffff814dc92c  
  
6 packets dropped at 0xffffffff814a104a  
3 packets dropped at 0xffffffff8147483b  
1 packets dropped at 0xffffffff814714c8  
1 packets dropped at 0xffffffffa02078bb  
1 packets dropped at 0xffffffff814dc92c  
  
5 packets dropped at 0xffffffff814a104a  
1 packets dropped at 0xffffffff814dc92c  
Example output after loading the symbols with --all-modules:  
[root@db-172-16-3-150 network]# stap  --all-modules ./dropwatch.stp   
Monitoring for dropped packets  
  
5 packets dropped at tcp_v4_rcv  
1 packets dropped at unix_stream_connect  
  
5 packets dropped at tcp_v4_rcv  
1 packets dropped at nf_hook_slow  
1 packets dropped at unix_stream_connect  
  
5 packets dropped at tcp_v4_rcv  
1 packets dropped at unix_stream_connect  
  
6 packets dropped at tcp_v4_rcv  
1 packets dropped at nf_hook_slow  
1 packets dropped at unix_stream_connect  
^CStopping dropped packet monitor  
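
Since every report covers a single 5-second interval, totals for a whole run have to be summed afterwards. The following is a minimal Python sketch, assuming the stap output was saved as text; `total_drops` is a hypothetical helper, and the sample text is the run shown above.

```python
import re
from collections import Counter

# Matches report lines such as "5 packets dropped at tcp_v4_rcv".
LINE = re.compile(r"^(\d+) packets dropped at (\S+)")

def total_drops(text):
    """Sum the per-interval counts for each drop location."""
    totals = Counter()
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            totals[m.group(2)] += int(m.group(1))
    return totals

sample = """\
Monitoring for dropped packets

5 packets dropped at tcp_v4_rcv
1 packets dropped at unix_stream_connect

5 packets dropped at tcp_v4_rcv
1 packets dropped at nf_hook_slow
1 packets dropped at unix_stream_connect

5 packets dropped at tcp_v4_rcv
1 packets dropped at unix_stream_connect

6 packets dropped at tcp_v4_rcv
1 packets dropped at nf_hook_slow
1 packets dropped at unix_stream_connect
^CStopping dropped packet monitor
"""

totals = total_drops(sample)
for name, count in totals.most_common():
    print(count, name)
```
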
  
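To also print the module name and the offset of the address within the function, replace symname with symdata.  
The modified dropwatch.stp below prints the memory address together with the symbol table information that corresponds to it.  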
[root@db-172-16-3-150 network]# vi dropwatch.stp   
#!/usr/bin/stap  
  
############################################################  
# Dropwatch.stp  
# Author: Neil Horman <nhorman@redhat.com>  
# An example script to mimic the behavior of the dropwatch utility  
# http://fedorahosted.org/dropwatch  
############################################################  
  
# Array to hold the list of drop points we find  
global locations  
  
# Note when we turn the monitor on and off  
probe begin { printf("Monitoring for dropped packets\n") }  
probe end { printf("Stopping dropped packet monitor\n") }  
  
# increment a drop counter for every location we drop at  
probe kernel.trace("kfree_skb") { locations[$location] <<< 1 }  
  
# Every 5 seconds report our drop locations  
probe timer.sec(5)  
{  
  printf("\n")  
  foreach (l in locations-) {  
    printf("%d packets dropped at %p, %s, %s\n",  
           @count(locations[l]), l, symname(l), symdata(l))  
  }  
  delete locations  
}  
The modified dropwatch.stp produces the following output:  
[root@db-172-16-3-150 network]# stap  --all-modules ./dropwatch.stp   
Monitoring for dropped packets  
  
6 packets dropped at 0xffffffff814a104a, tcp_v4_rcv, tcp_v4_rcv+0xaa/0x8d0 [kernel]  
1 packets dropped at 0xffffffff814dc92c, unix_stream_connect, unix_stream_connect+0x1dc/0x4a0 [kernel]  
  
5 packets dropped at 0xffffffff814a104a, tcp_v4_rcv, tcp_v4_rcv+0xaa/0x8d0 [kernel]  
1 packets dropped at 0xffffffff8147483b, nf_hook_slow, nf_hook_slow+0xeb/0x110 [kernel]  
1 packets dropped at 0xffffffff814dc92c, unix_stream_connect, unix_stream_connect+0x1dc/0x4a0 [kernel]  
^CStopping dropped packet monitor  
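
In this output the symdata() string has the form symbol+offset/size [module], so subtracting the offset from the raw address gives the start address of the function. The following is a minimal Python sketch of that arithmetic; `parse_symdata` is a hypothetical helper, and the pattern is inferred only from the output lines shown here, so other symdata forms may not match.

```python
import re

# symbol+offset/size [module], as seen in the output above (inferred format).
SYMDATA = re.compile(r"^([\w.]+)\+(0x[0-9a-f]+)/(0x[0-9a-f]+) \[([^\]]+)\]$")

def parse_symdata(s):
    """Split a symdata string into (symbol, offset, size, module)."""
    m = SYMDATA.match(s.strip())
    if m is None:
        raise ValueError("unrecognised symdata string: %r" % s)
    return m.group(1), int(m.group(2), 16), int(m.group(3), 16), m.group(4)

addr = 0xffffffff814a104a
symbol, offset, size, module = parse_symdata("tcp_v4_rcv+0xaa/0x8d0 [kernel]")
start = addr - offset  # start address of the function containing addr
print("%s starts at %#x in [%s]" % (symbol, start, module))
```
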
  
If symname and symdata cannot be used for the translation, the function can still be found by hand from the file /boot/System.map-2.6.32-358.el6.x86_64.  
This symbol table records the start address of each function alongside the function name.  
[root@db-172-16-3-150 network]# sort -k 1 /boot/System.map-2.6.32-358.el6.x86_64 |less  
ffffffff814a0f50 t tcp_v4_reqsk_destructor  
ffffffff814a0fa0 T tcp_v4_rcv  
ffffffff814a1870 T tcp_v4_conn_request  
Match $location against the symbol table to find the function:  
6 packets dropped at 0xffffffff814a104a, tcp_v4_rcv, tcp_v4_rcv+0xaa/0x8d0 [kernel]  
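The address 0xffffffff814a104a lies between ffffffff814a0fa0 (the start of tcp_v4_rcv) and ffffffff814a1870 (the start of the next symbol, tcp_v4_conn_request), so it also resolves to tcp_v4_rcv. The differences agree with the symdata output: 0x814a104a - 0x814a0fa0 = 0xaa is the offset, and 0x814a1870 - 0x814a0fa0 = 0x8d0 is the function size.  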
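
The manual lookup amounts to finding the last symbol whose start address is not greater than the drop address. The following is a minimal Python sketch using a binary search; `load_symbols` and `resolve` are hypothetical names, and only the three System.map lines quoted above are used as input, whereas a real run would read the whole file.

```python
import bisect

def load_symbols(lines):
    """Parse 'address type name' lines into a list sorted by address."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) == 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols

def resolve(symbols, addr):
    """Return (name, offset) for the last symbol starting at or before addr."""
    starts = [start for start, _ in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None  # addr is below every known symbol
    start, name = symbols[i]
    return name, addr - start

# The three System.map lines quoted above.
system_map = [
    "ffffffff814a0f50 t tcp_v4_reqsk_destructor",
    "ffffffff814a0fa0 T tcp_v4_rcv",
    "ffffffff814a1870 T tcp_v4_conn_request",
]

symbols = load_symbols(system_map)
name, offset = resolve(symbols, 0xffffffff814a104a)
print("%s+%#x" % (name, offset))
```

This sketch has no notion of symbol size, so any address past the last entry is attributed to that entry. System.map also covers only the kernel image, so an address inside a loadable module, such as 0xffffffffa02078bb in the earlier output, cannot be resolved this way.
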
  
Definition of the kernel.trace("kfree_skb") tracepoint used in this article:  
/usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/include/trace/events/skb.h  
/*  
 * Tracepoint for free an sk_buff:  
 */  
TRACE_EVENT(kfree_skb,  
  
        TP_PROTO(struct sk_buff *skb, void *location),  
  
        TP_ARGS(skb, location),  
  
        TP_STRUCT__entry(  
                __field(        void *,         skbaddr         )  
                __field(        unsigned short, protocol        )  
                __field(        void *,         location        )  
        ),  
  
        TP_fast_assign(  
                __entry->skbaddr = skb;  
                if (skb) {  
                        __entry->protocol = ntohs(skb->protocol);  
                }  
                __entry->location = location;  
        ),  
  
        TP_printk("skbaddr=%p protocol=%u location=%p",  
                __entry->skbaddr, __entry->protocol, __entry->location)  
);  
  
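In this article's example, tcp_v4_rcv is at the top of every report. Looking up the definition of this function in the source shows the call to kfree_skb, which is what triggers the tracepoint used in this article.  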
/usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/net/ipv4/tcp_ipv4.c  
/*  
 *      From tcp_input.c  
 */  
  
int tcp_v4_rcv(struct sk_buff *skb)  
{  
        const struct iphdr *iph;  
        struct tcphdr *th;  
        struct sock *sk;  
        int ret;  
        struct net *net = dev_net(skb->dev);  
  
        if (skb->pkt_type != PACKET_HOST)  
                goto discard_it;  
  
        /* Count it even if it's bad */  
        TCP_INC_STATS_BH(net, TCP_MIB_INSEGS);  
  
        if (!pskb_may_pull(skb, sizeof(struct tcphdr)))  
                goto discard_it;  
  
        th = tcp_hdr(skb);  
  
        if (th->doff < sizeof(struct tcphdr) / 4)  
                goto bad_packet;  
        if (!pskb_may_pull(skb, th->doff * 4))  
                goto discard_it;  
  
        /* An explanation is required here, I think.  
         * Packet length and doff are validated by header prediction,  
         * provided case of th->doff==0 is eliminated.  
         * So, we defer the checks. */  
        if (!skb_csum_unnecessary(skb) && tcp_v4_checksum_init(skb))  
                goto bad_packet;  
  
        th = tcp_hdr(skb);  
        iph = ip_hdr(skb);  
        TCP_SKB_CB(skb)->seq = ntohl(th->seq);  
        TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +  
                                    skb->len - th->doff * 4);  
        TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);  
        TCP_SKB_CB(skb)->when    = 0;  
        TCP_SKB_CB(skb)->flags   = iph->tos;  
        TCP_SKB_CB(skb)->sacked  = 0;  
  
        sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);  
        if (!sk)  
                goto no_tcp_socket;  
  
process:  
        if (sk->sk_state == TCP_TIME_WAIT)  
                goto do_time_wait;  
  
        if (unlikely(iph->ttl < sk_get_min_ttl(sk))) {  
                NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);  
                goto discard_and_relse;  
        }  
  
        if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))  
                goto discard_and_relse;  
        nf_reset(skb);  
  
        if (sk_filter(sk, skb))  
                goto discard_and_relse;  
  
        skb->dev = NULL;  
  
        inet_rps_save_rxhash(sk, skb->rxhash);  
  
        bh_lock_sock_nested(sk);  
        ret = 0;  
        if (!sock_owned_by_user(sk)) {  
#ifdef CONFIG_NET_DMA  
                struct tcp_sock *tp = tcp_sk(sk);  
                if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)  
                        tp->ucopy.dma_chan = dma_find_channel(DMA_MEMCPY);  
                if (tp->ucopy.dma_chan)  
                        ret = tcp_v4_do_rcv(sk, skb);  
                else  
#endif  
                {  
                        if (!tcp_prequeue(sk, skb))  
                                ret = tcp_v4_do_rcv(sk, skb);  
                }  
        } else if (unlikely(sk_add_backlog(sk, skb))) {  
                bh_unlock_sock(sk);  
                NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);  
                goto discard_and_relse;  
        }  
        bh_unlock_sock(sk);  
  
        sock_put(sk);  
  
        return ret;  
  
no_tcp_socket:  
        if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))  
                goto discard_it;  
  
        if (skb->len < (th->doff << 2) || tcp_checksum_complete(skb)) {  
bad_packet:  
                TCP_INC_STATS_BH(net, TCP_MIB_INERRS);  
        } else {  
                tcp_v4_send_reset(NULL, skb);  
        }  
  
discard_it:  
        /* Discard frame. */  
        kfree_skb(skb);  
        return 0;  
  
discard_and_relse:  
        sock_put(sk);  
        goto discard_it;  
  
do_time_wait:  
        if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {  
                inet_twsk_put(inet_twsk(sk));  
                goto discard_it;  
        }  
  
        if (skb->len < (th->doff << 2) || tcp_checksum_complete(skb)) {  
                TCP_INC_STATS_BH(net, TCP_MIB_INERRS);  
                inet_twsk_put(inet_twsk(sk));  
                goto discard_it;  
        }  
        switch (tcp_timewait_state_process(inet_twsk(sk), skb, th)) {  
        case TCP_TW_SYN: {  
                struct sock *sk2 = inet_lookup_listener(dev_net(skb->dev),  
                                                        &tcp_hashinfo,  
                                                        iph->daddr, th->dest,  
                                                        inet_iif(skb));  
                if (sk2) {  
                        inet_twsk_deschedule(inet_twsk(sk), &tcp_death_row);  
                        inet_twsk_put(inet_twsk(sk));  
                        sk = sk2;  
                        goto process;  
                }  
                /* Fall through to ACK */  
        }  
        case TCP_TW_ACK:  
                tcp_v4_timewait_ack(sk, skb);  
                break;  
        case TCP_TW_RST:  
                goto no_tcp_socket;  
        case TCP_TW_SUCCESS:;  
        }  
        goto discard_it;  
}  
More drop points can be found as follows:  
[root@db-172-16-3-150 network]# grep -rn kfree_skb /usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/|grep -v "\.h:"|less  
/usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/drivers/infiniband/hw/cxgb3/iwch_ev.c:231:     dev_kfree_skb_irq(skb);  
/usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/drivers/infiniband/hw/cxgb3/iwch_cm.c:146:             kfree_skb(skb);  
/usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/drivers/infiniband/hw/cxgb3/iwch_cm.c:151:             kfree_skb(skb);  
/usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/drivers/infiniband/hw/cxgb3/iwch_cm.c:162:             kfree_skb(skb);  
/usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/drivers/infiniband/hw/cxgb3/iwch_cm.c:167:             kfree_skb(skb);  
... (remaining output omitted)  
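
The same search can be scripted when the results need further processing. The following is a minimal Python sketch of the grep pipeline; `find_call_sites` is a hypothetical helper, and the demo runs on a throwaway directory with made-up files rather than on the kernel source tree.

```python
import os
import tempfile

def find_call_sites(root, needle="kfree_skb"):
    """Yield (path, line_number, text) for lines containing needle, skipping headers."""
    for dirpath, _dirs, files in os.walk(root):
        for fname in sorted(files):
            if fname.endswith(".h"):
                continue  # same intent as: grep -v "\.h:"
            path = os.path.join(dirpath, fname)
            with open(path, errors="replace") as f:
                for lineno, text in enumerate(f, 1):
                    if needle in text:
                        yield path, lineno, text.strip()

# Demo tree with made-up contents (not real kernel files).
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "a.c"), "w") as f:
    f.write("int x;\nkfree_skb(skb);\ndev_kfree_skb_irq(skb);\n")
with open(os.path.join(demo, "b.h"), "w") as f:
    f.write("void kfree_skb(struct sk_buff *skb);\n")

hits = list(find_call_sites(demo))
for path, lineno, text in hits:
    print("%s:%d: %s" % (os.path.basename(path), lineno, text))
```

Like the grep command, this is a plain substring match, so variants such as dev_kfree_skb_irq are reported as well.
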

References

1. /usr/share/systemtap/testsuite/systemtap.examples

2. https://sourceware.org/systemtap/SystemTap_Beginners_Guide/useful-systemtap-scripts.html

3. systemtap-testsuite

4. https://sourceware.org/systemtap/examples/

5. /usr/share/systemtap/testsuite/systemtap.examples/index.txt

6. /usr/share/systemtap/testsuite/systemtap.examples/keyword-index.txt

7. /usr/share/systemtap/tapset

8. http://blog.yufeng.info/archives/2497

9. /usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/net/ipv4/tcp_ipv4.c

10. https://sourceware.org/systemtap/tapsets/API-symdata.html

11. https://sourceware.org/systemtap/tapsets/API-symname.html
