Systemtap kernel Marker probes

4 minute read

背景

This family of probe points connects to static probe markers inserted into the kernel or a module.   
These markers are special macro calls in the kernel that make probing faster and more reliable than with DWARF-based probes.   
DWARF debugging information is not required to use probe markers.  
mark探针是内核中定义的一些特殊的宏, 在编译内核或模块时就有了(类似前面讲到的userspace mark probe, 例如postgresql中的静态探针), 所以使用mark探针时不需要用到DWARF-based debuginfo包.  
mark探针使用起来比DWARF-based探针更快, 更安全.  
  
Marker probe points begin with a kernel prefix which identifies the source of the symbol table used for finding markers.   
The suffix names the marker itself: mark.("MARK").   
The marker name string, which can contain wildcard characters, is matched against the names given to the marker macros when the kernel or module is compiled.   
Optionally, you can specify format("FORMAT").   
Specifying the marker format string allows differentiation between two markers with the same name but different marker format strings.  
mark探针用法, kernel.mark("MARK")[.format("FORMAT")], .format是可选的, 用于区分格式不同但是同名的mark探针.  
  
The handler associated with a marker probe reads any optional parameters specified at the macro call site named $arg1 through $argNN, where NN is the number of parameters supplied by the macro. Number and string parameters are passed in a type-safe manner.  
在mark的宏定义时的参数, 可以在probe的handler中读取, 参数名的命名规则为$arg1 ... $argNN  
当然使用$$vars也可以输出这些参数.  
  
The marker format string associated with a marker is available in $format. The marker name string is available in $name.  
在handler中如果要输出该mark探针的mark name以及格式name, 以$name和$format表述.  
Here are the marker probe constructs:  
  
kernel.mark("MARK")  
kernel.mark("MARK").format("FORMAT")  
For more information about marker probes, see http://sourceware.org/systemtap/wiki/UsingMarkers.  

举例 :

查看系统中有哪些mark探针, 使用stap -l, 如下

[root@db-172-16-3-39 ~]# stap -l 'kernel.mark("**")'  
kernel.mark("ext4_allocate_blocks").format("dev %s block %llu flags %u len %u ino %lu logical %llu goal %llu lleft %llu lright %llu pleft %llu pright %llu ")  
kernel.mark("ext4_allocate_inode").format("dev %s ino %lu dir %lu mode %d")  
kernel.mark("ext4_da_write_begin").format("dev %s ino %lu pos %llu len %u flags %u")  
kernel.mark("ext4_da_write_end").format("dev %s ino %lu pos %llu len %u copied %u")  
kernel.mark("ext4_da_writepage_result").format("dev %s ino %lu ret %d pages_written %d pages_skipped %ld congestion %d no_nrwrite_index_update %d")  
kernel.mark("ext4_da_writepages").format("dev %s ino %lu nr_t_write %ld pages_skipped %ld range_start %llu range_end %llu nonblocking %d for_kupdate %d for_reclaim %d for_writepages %d range_cyclic %d")  
kernel.mark("ext4_discard_preallocations").format("dev %s ino %lu")  
kernel.mark("ext4_free_blocks").format("dev %s block %llu count %lu flags %d ino %lu")  
kernel.mark("ext4_free_inode").format("dev %s ino %lu mode %d uid %lu gid %lu bocks %llu")  
kernel.mark("ext4_journalled_write_end").format("dev %s ino %lu pos %llu len %u copied %u")  
kernel.mark("ext4_mb_discard_preallocations").format("dev %s needed %d")  
kernel.mark("ext4_mb_new_group_pa").format("dev %s pstart %llu len %u lstart %u")  
kernel.mark("ext4_mb_new_inode_pa").format("dev %s ino %lu pstart %llu len %u lstart %u")  
kernel.mark("ext4_mb_release_group_pa").format("dev %s pstart %llu len %d")  
kernel.mark("ext4_mb_release_inode_pa").format("dev %s ino %lu block %llu count %u")  
kernel.mark("ext4_ordered_write_end").format("dev %s ino %lu pos %llu len %u copied %u")  
kernel.mark("ext4_request_blocks").format("dev %s flags %u len %u ino %lu lblk %llu goal %llu lleft %llu lright %llu pleft %llu pright %llu ")  
kernel.mark("ext4_request_inode").format("dev %s dir %lu mode %d")  
kernel.mark("ext4_sync_file").format("dev %s datasync %d ino %ld parent %ld")  
kernel.mark("ext4_sync_fs").format("dev %s wait %d")  
kernel.mark("ext4_write_begin").format("dev %s ino %lu pos %llu len %u flags %u")  
kernel.mark("ext4_writeback_write_end").format("dev %s ino %lu pos %llu len %u copied %u")  
kernel.mark("ext4_writepage").format("dev %s ino %lu page_index %lu")  
kernel.mark("jbd2_checkpoint").format("dev %s need_checkpoint %d")  
kernel.mark("jbd2_end_commit").format("dev %s transaction %d head %d")  
kernel.mark("jbd2_start_commit").format("dev %s transaction %d")  

探针使用举例 :

查看当前的文件系统

pg94@db-172-16-3-39-> mount  
/dev/sda1 on / type ext3 (rw)  
proc on /proc type proc (rw)  
sysfs on /sys type sysfs (rw)  
devpts on /dev/pts type devpts (rw,gid=5,mode=620)  
tmpfs on /dev/shm type tmpfs (rw)  
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)  
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)  
/dev/mapper/vgdata01-lv01 on /pgdata/digoal/1921/data01 type ext4 (rw)  
/dev/mapper/vgdata01-lv02 on /pgdata/digoal/1921/data02 type ext4 (rw)  
/dev/mapper/vgdata01-lv03 on /pgdata/digoal/1921/data03 type ext4 (rw)  
/dev/mapper/vgdata01-lv04 on /pgdata/digoal/1921/data04 type ext4 (rw)  
/dev/mapper/vgdata01-lv05 on /pgdata/digoal/1921/data05 type ext4 (rw)  
/dev/mapper/vgdata01-lv06 on /pgdata/digoal/1921/data06 type ext4 (rw)  

执行以下stap

[root@db-172-16-3-39 ~]# stap -e 'probe kernel.mark("ext4_allocate_blocks") {printf("%s\n%s\n%s\n", $name, $format, $$vars)}'  

然后在ext4文件系统/pgdata/digoal/1921/data01中做读写文件操作, stap输出如下

ext4_allocate_blocks  
dev %s block %llu flags %u len %u ino %lu logical %llu goal %llu lleft %llu lright %llu pleft %llu pright %llu   
$arg1=dm-2 $arg2=0x70a7a2 $arg3=0x420 $arg4=0x1 $arg5=0x1c041e $arg6=0x0 $arg7=0x708000 $arg8=0x0 $arg9=0x0 $arg10=0x0 $arg11=0x0  
ext4_allocate_blocks  
dev %s block %llu flags %u len %u ino %lu logical %llu goal %llu lleft %llu lright %llu pleft %llu pright %llu   
$arg1=dm-2 $arg2=0x70a7c1 $arg3=0x420 $arg4=0x1 $arg5=0x1c041c $arg6=0x0 $arg7=0x708000 $arg8=0x0 $arg9=0x0 $arg10=0x0 $arg11=0x0  
ext4_allocate_blocks  
dev %s block %llu flags %u len %u ino %lu logical %llu goal %llu lleft %llu lright %llu pleft %llu pright %llu   
$arg1=dm-2 $arg2=0x70abfc $arg3=0x420 $arg4=0x2 $arg5=0x1c0421 $arg6=0x0 $arg7=0x708000 $arg8=0x0 $arg9=0x0 $arg10=0x0 $arg11=0x0  
ext4_allocate_blocks  
dev %s block %llu flags %u len %u ino %lu logical %llu goal %llu lleft %llu lright %llu pleft %llu pright %llu   
$arg1=dm-2 $arg2=0x70a8f3 $arg3=0x420 $arg4=0x1 $arg5=0x1c0418 $arg6=0x0 $arg7=0x708000 $arg8=0x0 $arg9=0x0 $arg10=0x0 $arg11=0x0  

因为mark探针不需要debuginfo支持, 所以在未按照debuginfo包的OS中也可以使用, 例如

[root@db-172-16-3-40 ~]# rpm -qa|grep debuginfo  
[root@db-172-16-3-40 ~]# stap -e 'probe kernel.mark("ext4_allocate_blocks") {printf("%s\n%s\n%s\n", $name, $format, $$vars)}'  
ext4_allocate_blocks  
dev %s block %llu flags %u len %u ino %lu logical %llu goal %llu lleft %llu lright %llu pleft %llu pright %llu   
$arg1=dm-0 $arg2=0x408079 $arg3=0x420 $arg4=0x1 $arg5=0x100463 $arg6=0x0 $arg7=0x408000 $arg8=0x0 $arg9=0x0 $arg10=0x0 $arg11=0x0  
ext4_allocate_blocks  
dev %s block %llu flags %u len %u ino %lu logical %llu goal %llu lleft %llu lright %llu pleft %llu pright %llu   
$arg1=dm-0 $arg2=0x40807b $arg3=0x420 $arg4=0x1 $arg5=0x10045e $arg6=0x0 $arg7=0x408000 $arg8=0x0 $arg9=0x0 $arg10=0x0 $arg11=0x0  
ext4_allocate_blocks  
dev %s block %llu flags %u len %u ino %lu logical %llu goal %llu lleft %llu lright %llu pleft %llu pright %llu   
$arg1=dm-0 $arg2=0x4093b1 $arg3=0x420 $arg4=0x3 $arg5=0x10045f $arg6=0x0 $arg7=0x408000 $arg8=0x0 $arg9=0x0 $arg10=0x0 $arg11=0x0  
ext4_allocate_blocks  
dev %s block %llu flags %u len %u ino %lu logical %llu goal %llu lleft %llu lright %llu pleft %llu pright %llu   
$arg1=dm-0 $arg2=0x4080e2 $arg3=0x420 $arg4=0x1 $arg5=0x10044e $arg6=0x0 $arg7=0x408000 $arg8=0x0 $arg9=0x0 $arg10=0x0 $arg11=0x0  

结构类型参数不能使用suffix 加$$的方法打印结构数据.

[root@db-172-16-3-39 linux]# stap -e 'probe kernel.mark("ext4_allocate_blocks") {printf("%s\n%s\n%s\n%s\n", $name, $format, $$vars, $arg2$$)}'  
semantic error: marker variable '$arg2' may not be pretty-printed: identifier '$arg2$$' at <input>:1:95  
        source: probe kernel.mark("ext4_allocate_blocks") {printf("%s\n%s\n%s\n%s\n", $name, $format, $$vars, $arg2$$)}  
                                                                                                              ^  
  
Pass 2: analysis failed.  Try again with another '--vp 01' option.  

参考

1. https://sourceware.org/systemtap/langref/Probe_points.html

2. http://sourceware.org/systemtap/wiki/UsingMarkers

3.

/usr/share/doc/kernel-doc-2.6.18/Documentation/markers.txt  
/usr/share/man/man3/tapset::stap_staticmarkers.3stap.gz  
/usr/share/systemtap/tapset/stap_staticmarkers.stp  
/usr/src/debug/kernel-2.6.18/linux-2.6.18-348.12.1.el5.x86_64/include/linux/marker.h  
/usr/src/debug/kernel-2.6.18/linux-2.6.18-348.12.1.el5.x86_64/kernel/marker.c  
/usr/src/kernels/2.6.18-348.12.1.el5-x86_64/Module.markers  
/usr/src/kernels/2.6.18-348.12.1.el5-x86_64/include/config/markers.h  
/usr/src/kernels/2.6.18-348.12.1.el5-x86_64/include/config/sample/markers.h  
/usr/src/kernels/2.6.18-348.12.1.el5-x86_64/include/linux/marker.h  
/usr/src/kernels/2.6.18-348.12.1.el5-x86_64/samples/markers  
/usr/src/kernels/2.6.18-348.12.1.el5-x86_64/samples/markers/Makefile  

Flag Counter

digoal’s 大量PostgreSQL文章入口