Systemtap examples, Network - 5 Monitoring Network Packets Drops in Kernel
背景
例子来自dropwatch.stp脚本, 可用于分析网络协议栈中丢包的确切位置. 确切的位置是使用symname或者symdata将内存地址翻译出来的函数信息, 翻译必须使用stap --all-modules选项以便加载所有的模块的符号表.
--all-modules
Equivalent to specifying "-dkernel" and a "-d" for each kernel module that is currently loaded. Cau-
tion: this can make the probe modules considerably larger.
脚本内容以及注解
[root@db-172-16-3-150 network]# cd /usr/share/systemtap/testsuite/systemtap.examples/network
[root@db-172-16-3-150 network]# cat dropwatch.stp
#!/usr/bin/stap
############################################################
# Dropwatch.stp
# Author: Neil Horman <nhorman@redhat.com>
# An example script to mimic the behavior of the dropwatch utility
# http://fedorahosted.org/dropwatch
############################################################
# Array to hold the list of drop points we find
global locations
# Note when we turn the monitor on and off
probe begin { printf("Monitoring for dropped packets\n") }
probe end { printf("Stopping dropped packet monitor\n") }
# increment a drop counter for every location we drop at
probe kernel.trace("kfree_skb") { locations[$location] <<< 1 }
// locations数组索引为$location, 记录kfree_skb被调用时的location参数信息, 即各模块符号表中的内存地址;
// 使用symname()或者symdata()可以将地址转换成符号信息.
# Every 5 seconds report our drop locations
probe timer.sec(5)
{
printf("\n")
foreach (l in locations-) {
printf("%d packets dropped at %s\n",
@count(locations[l]), symname(l))
}
delete locations
}
// 每5秒输出一次
// 按@count(locations[i]) 倒序输出
// 输出包含符号名, 以及丢的包个数.
如果不加载模块, symname无法正确的翻译出函数名.
[root@db-172-16-3-150 network]# stap ./dropwatch.stp
Monitoring for dropped packets
5 packets dropped at 0xffffffff814a104a
1 packets dropped at 0xffffffff8147483b
1 packets dropped at 0xffffffff814dc92c
6 packets dropped at 0xffffffff814a104a
3 packets dropped at 0xffffffff8147483b
1 packets dropped at 0xffffffff814714c8
1 packets dropped at 0xffffffffa02078bb
1 packets dropped at 0xffffffff814dc92c
5 packets dropped at 0xffffffff814a104a
1 packets dropped at 0xffffffff814dc92c
加载模块(--all-modules)后的执行输出举例 :
[root@db-172-16-3-150 network]# stap --all-modules ./dropwatch.stp
Monitoring for dropped packets
5 packets dropped at tcp_v4_rcv
1 packets dropped at unix_stream_connect
5 packets dropped at tcp_v4_rcv
1 packets dropped at nf_hook_slow
1 packets dropped at unix_stream_connect
5 packets dropped at tcp_v4_rcv
1 packets dropped at unix_stream_connect
6 packets dropped at tcp_v4_rcv
1 packets dropped at nf_hook_slow
1 packets dropped at unix_stream_connect
^CStopping dropped packet monitor
如果要输出模块信息以及函数在模块中的起始位置偏移量, 可以把symname替换成symdata来输出.
修改dropwatch.stp , 同时输出内存地址, 地址对应的符号表中的信息.
[root@db-172-16-3-150 network]# vi dropwatch.stp
#!/usr/bin/stap
############################################################
# Dropwatch.stp
# Author: Neil Horman <nhorman@redhat.com>
# An example script to mimic the behavior of the dropwatch utility
# http://fedorahosted.org/dropwatch
############################################################
# Array to hold the list of drop points we find
global locations
# Note when we turn the monitor on and off
probe begin { printf("Monitoring for dropped packets\n") }
probe end { printf("Stopping dropped packet monitor\n") }
# increment a drop counter for every location we drop at
probe kernel.trace("kfree_skb") { locations[$location] <<< 1 }
# Every 5 seconds report our drop locations
probe timer.sec(5)
{
printf("\n")
foreach (l in locations-) {
printf("%d packets dropped at %p, %s, %s\n",
@count(locations[l]), l, symname(l), symdata(l))
}
delete locations
}
使用修改后的dropwatch.stp 输出如下 :
[root@db-172-16-3-150 network]# stap --all-modules ./dropwatch.stp
Monitoring for dropped packets
6 packets dropped at 0xffffffff814a104a, tcp_v4_rcv, tcp_v4_rcv+0xaa/0x8d0 [kernel]
1 packets dropped at 0xffffffff814dc92c, unix_stream_connect, unix_stream_connect+0x1dc/0x4a0 [kernel]
5 packets dropped at 0xffffffff814a104a, tcp_v4_rcv, tcp_v4_rcv+0xaa/0x8d0 [kernel]
1 packets dropped at 0xffffffff8147483b, nf_hook_slow, nf_hook_slow+0xeb/0x110 [kernel]
1 packets dropped at 0xffffffff814dc92c, unix_stream_connect, unix_stream_connect+0x1dc/0x4a0 [kernel]
^CStopping dropped packet monitor
如果无法使用symname和symdata转换, 手工从文件/boot/System.map-2.6.32-358.el6.x86_64中解读也是可以得到对应的函数的.
/boot/System.map-2.6.32-358.el6.x86_64这个符号表记录了函数的起始地址和函数的对应关系.
[root@db-172-16-3-150 network]# sort -k 1 /boot/System.map-2.6.32-358.el6.x86_64 |less
ffffffff814a0f50 t tcp_v4_reqsk_destructor
ffffffff814a0fa0 T tcp_v4_rcv
ffffffff814a1870 T tcp_v4_conn_request
从$location以及符号表匹配函数 :
6 packets dropped at 0xffffffff814a104a, tcp_v4_rcv, tcp_v4_rcv+0xaa/0x8d0 [kernel]
0xffffffff814a104a这个地址在ffffffff814a0fa0和ffffffff814a1870之间, 所以也可以得到tcp_v4_rcv.
本文用到的kernel.trace("kfree_skb")原型.
/usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/include/trace/events/skb.h
/*
* Tracepoint for free an sk_buff:
*/
TRACE_EVENT(kfree_skb,
TP_PROTO(struct sk_buff *skb, void *location),
TP_ARGS(skb, location),
TP_STRUCT__entry(
__field( void *, skbaddr )
__field( unsigned short, protocol )
__field( void *, location )
),
TP_fast_assign(
__entry->skbaddr = skb;
if (skb) {
__entry->protocol = ntohs(skb->protocol);
}
__entry->location = location;
),
TP_printk("skbaddr=%p protocol=%u location=%p",
__entry->skbaddr, __entry->protocol, __entry->location)
);
本文例子中每次排第一位的都是tcp_v4_rcv函数, 通过源码, 找到这个函数的定义, 在函数定义中可以找到kfree_skb函数, 也就是本文用到的trace.
/usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/net/ipv4/tcp_ipv4.c
/*
* From tcp_input.c
*/
int tcp_v4_rcv(struct sk_buff *skb)
{
const struct iphdr *iph;
struct tcphdr *th;
struct sock *sk;
int ret;
struct net *net = dev_net(skb->dev);
if (skb->pkt_type != PACKET_HOST)
goto discard_it;
/* Count it even if it's bad */
TCP_INC_STATS_BH(net, TCP_MIB_INSEGS);
if (!pskb_may_pull(skb, sizeof(struct tcphdr)))
goto discard_it;
th = tcp_hdr(skb);
if (th->doff < sizeof(struct tcphdr) / 4)
goto bad_packet;
if (!pskb_may_pull(skb, th->doff * 4))
goto discard_it;
/* An explanation is required here, I think.
* Packet length and doff are validated by header prediction,
* provided case of th->doff==0 is eliminated.
* So, we defer the checks. */
if (!skb_csum_unnecessary(skb) && tcp_v4_checksum_init(skb))
goto bad_packet;
th = tcp_hdr(skb);
iph = ip_hdr(skb);
TCP_SKB_CB(skb)->seq = ntohl(th->seq);
TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
skb->len - th->doff * 4);
TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
TCP_SKB_CB(skb)->when = 0;
TCP_SKB_CB(skb)->flags = iph->tos;
TCP_SKB_CB(skb)->sacked = 0;
sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
if (!sk)
goto no_tcp_socket;
process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;
if (unlikely(iph->ttl < sk_get_min_ttl(sk))) {
NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
goto discard_and_relse;
}
if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))
goto discard_and_relse;
nf_reset(skb);
if (sk_filter(sk, skb))
goto discard_and_relse;
skb->dev = NULL;
inet_rps_save_rxhash(sk, skb->rxhash);
bh_lock_sock_nested(sk);
ret = 0;
if (!sock_owned_by_user(sk)) {
#ifdef CONFIG_NET_DMA
struct tcp_sock *tp = tcp_sk(sk);
if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
tp->ucopy.dma_chan = dma_find_channel(DMA_MEMCPY);
if (tp->ucopy.dma_chan)
ret = tcp_v4_do_rcv(sk, skb);
else
#endif
{
if (!tcp_prequeue(sk, skb))
ret = tcp_v4_do_rcv(sk, skb);
}
} else if (unlikely(sk_add_backlog(sk, skb))) {
bh_unlock_sock(sk);
NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
goto discard_and_relse;
}
bh_unlock_sock(sk);
sock_put(sk);
return ret;
no_tcp_socket:
if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
goto discard_it;
if (skb->len < (th->doff << 2) || tcp_checksum_complete(skb)) {
bad_packet:
TCP_INC_STATS_BH(net, TCP_MIB_INERRS);
} else {
tcp_v4_send_reset(NULL, skb);
}
discard_it:
/* Discard frame. */
kfree_skb(skb);
return 0;
discard_and_relse:
sock_put(sk);
goto discard_it;
do_time_wait:
if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {
inet_twsk_put(inet_twsk(sk));
goto discard_it;
}
if (skb->len < (th->doff << 2) || tcp_checksum_complete(skb)) {
TCP_INC_STATS_BH(net, TCP_MIB_INERRS);
inet_twsk_put(inet_twsk(sk));
goto discard_it;
}
switch (tcp_timewait_state_process(inet_twsk(sk), skb, th)) {
case TCP_TW_SYN: {
struct sock *sk2 = inet_lookup_listener(dev_net(skb->dev),
&tcp_hashinfo,
iph->daddr, th->dest,
inet_iif(skb));
if (sk2) {
inet_twsk_deschedule(inet_twsk(sk), &tcp_death_row);
inet_twsk_put(inet_twsk(sk));
sk = sk2;
goto process;
}
/* Fall through to ACK */
}
case TCP_TW_ACK:
tcp_v4_timewait_ack(sk, skb);
break;
case TCP_TW_RST:
goto no_tcp_socket;
case TCP_TW_SUCCESS:;
}
goto discard_it;
}
查询更多的丢包点如下 :
[root@db-172-16-3-150 network]# grep -rn kfree_skb /usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/|grep -v "\.h:"|less
/usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/drivers/infiniband/hw/cxgb3/iwch_ev.c:231: dev_kfree_skb_irq(skb);
/usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/drivers/infiniband/hw/cxgb3/iwch_cm.c:146: kfree_skb(skb);
/usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/drivers/infiniband/hw/cxgb3/iwch_cm.c:151: kfree_skb(skb);
/usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/drivers/infiniband/hw/cxgb3/iwch_cm.c:162: kfree_skb(skb);
/usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/drivers/infiniband/hw/cxgb3/iwch_cm.c:167: kfree_skb(skb);
... 略.
参考
1. /usr/share/systemtap/testsuite/systemtap.examples
2. https://sourceware.org/systemtap/SystemTap_Beginners_Guide/useful-systemtap-scripts.html
3. systemtap-testsuite
4. https://sourceware.org/systemtap/examples/
5. /usr/share/systemtap/testsuite/systemtap.examples/index.txt
6. /usr/share/systemtap/testsuite/systemtap.examples/keyword-index.txt
7. /usr/share/systemtap/tapset
8. http://blog.yufeng.info/archives/2497
9. /usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/net/ipv4/tcp_ipv4.c
10. https://sourceware.org/systemtap/tapsets/API-symdata.html
11. https://sourceware.org/systemtap/tapsets/API-symname.html