Systemtap examples, DISK IO - 2 Tracking I/O Time For Each File Read or Write

6 minute read

背景

例子来自iotime.stp 脚本, 该脚本用以监控每个进程对文件的读写字节数以及耗费的时间. 注意syscall.read和syscall.write还提供了请求读写字节数的上下文变量($count), 脚本中没有, 我增加了请求读写字节数的输出.  
修改后的脚本内容以及注解 :   
[root@db-172-16-3-150 network]# cd /usr/share/systemtap/testsuite/systemtap.examples/io  
[root@db-172-16-3-150 io]# cat iotime.stp  
#!/usr/bin/stap  
  
/*  
 * Copyright (C) 2006-2007 Red Hat Inc.  
 *   
 * This copyrighted material is made available to anyone wishing to use,  
 * modify, copy, or redistribute it subject to the terms and conditions  
 * of the GNU General Public License v.2.  
 *  
 * You should have received a copy of the GNU General Public License  
 * along with this program.  If not, see <http://www.gnu.org/licenses/>.  
 *  
 * Print out the amount of time spent in the read and write systemcall  
 * when each file opened by the process is closed. Note that the systemtap   
 * script needs to be running before the open operations occur for  
 * the script to record data.  
 *  
 * This script could be used to to find out which files are slow to load  
 * on a machine. e.g.  
 *  
 * stap iotime.stp -c 'firefox'  
 *  
 * Output format is:  
 * timestamp pid (executabable) info_type path ...  
 *  
 * 200283135 2573 (cupsd) access /etc/printcap read: 0 write: 7063  
 * 200283143 2573 (cupsd) iotime /etc/printcap time: 69  
 *  
 */  
  
global start  // stap开始时的时间戳  
global time_io  // 记录单个进程对单个文件的读和写操作的时间开销 (gettimeofday_us() - @entry(gettimeofday_us())) "@entry()为return event独有."  
  
function timestamp:long() { return gettimeofday_us() - start }  
// 返回stap开始后的时间位移量, 在脚本中没有太大的实际意义  
  
function proc:string() { return sprintf("%d (%s)", pid(), execname()) }  
// 返回字符串, 包含进程id和进程的command name.  
  
probe begin { start = gettimeofday_us() }  
// 开始时, 记录当前的时间戳, 用以后面调用timestamp函数时输出位移量.  
  
global filehandles, fileread, filewrite, afileread, afilewrite  
// 定义几个全局变量, 存储数组  
// filehandles数组, 索引为pid和fd(进程id和文件描述符), 存储的值为文件名.  
// fileread, filewrite, afileread, afilewrite数组, 索引为pid和fd, 存储的值为读写的字节数, 请求读写的字节数.  
  
probe syscall.open.return {  
  filename = user_string($filename)  
  if ($return != -1) {  // 返回值为文件描述符id, -1表示打开文件失败.  
    filehandles[pid(), $return] = filename  
  } else {  
    printf("%d %s access %s fail\n", timestamp(), proc(), filename)  
  }  
}  
// syscall.open.return, 打开文件事件.  
// 写filehandles数组, 索引为pid和fd(进程id和文件描述符), 存储的值为文件名.  
  
probe syscall.read.return {  
  time = gettimeofday_us() - @entry(gettimeofday_us())  
  p = pid()  
  fd = $fd  
  bytes = $return  
  if (bytes > 0)   
    fileread[p, fd] += bytes  
  if ($count > 0)   
    afileread[p, fd] += $count  
  time_io[p, fd] <<< time  
}  
// syscall.read.return, 读文件事件.  
// 写fileread, afileread 数组, 索引为pid和fd, 存储的值为读的字节数, 请求读的字节数.  
// 同时写time_io数组, 索引为pid和fd, 存储的值为统计类型, read消耗的时间.   
// time = gettimeofday_us() - @entry(gettimeofday_us()) 放在handler的第一个语句, 减少误差.  
  
probe syscall.write.return {  
  time = gettimeofday_us() - @entry(gettimeofday_us())  
  p = pid()  
  fd = $fd  
  bytes = $return  
  if (bytes > 0)  
    filewrite[p, fd] += bytes  
  if ($count > 0)  
    afilewrite[p, fd] += $count  
  time_io[p, fd] <<< time  
}  
// syscall.write.return, 读文件事件.  
// 写filewrite, afilewrite 数组, 索引为pid和fd, 存储的值为写的字节数, 请求写的字节数.  
// 同时写time_io数组, 索引为pid和fd, 存储的值为统计类型, write消耗的时间.   
// time = gettimeofday_us() - @entry(gettimeofday_us()) 放在handler的第一个语句, 减少误差.  
  
probe syscall.close {  
  if ([pid(), $fd] in filehandles) {  
    printf("%d %s access %s attempts read: %d write: %d, actual read: %d write: %d\n",  
           timestamp(), proc(), filehandles[pid(), $fd],  
           afileread[pid(), $fd], afilewrite[pid(), $fd],  
	   fileread[pid(), $fd], filewrite[pid(), $fd])  
    if (@count(time_io[pid(), $fd]))  
      printf("%d %s iotime %s time: %d\n",  timestamp(), proc(),  
             filehandles[pid(), $fd], @sum(time_io[pid(), $fd]))  
   }  
  delete afileread[pid(), $fd]  
  delete afilewrite[pid(), $fd]  
  delete fileread[pid(), $fd]  
  delete filewrite[pid(), $fd]  
  delete filehandles[pid(), $fd]  
  delete time_io[pid(),$fd]  
  // 文件关闭后清除这些数组的数据, 以便下一次统计.  
}  
// 文件关闭事件, 文件关闭后输出进程对文件的读写操作的统计信息.  
// 请求读写的字节数, 实际的读写字节数, 读写总共耗费的时间.  
// if ([pid(), $fd] in filehandles) 判断stap脚本是否记录到了open动作,   
// 没有记录到的话不输出fileread,filewrite,afileread,afilewrite数组的值,   
// 因为没记录到打开文件的动作, 数据只包含进程对文件的部分读写操作的统计.  
  
执行输出举例 :   
[root@db-172-16-3-150 io]# stap iotime.stp  
28518687 10992 (postgres) access pg_xlog/archive_status/000000010000001100000008.ready attempts read: 0 write: 0, actual read: 0 write: 0  
28518723 10992 (postgres) access pg_xlog/000000010000001100000008 attempts read: 0 write: 22642688, actual read: 0 write: 22642688  
28518731 10992 (postgres) iotime pg_xlog/000000010000001100000008 time: 30728  
28518887 10994 (postgres) access pg_xlog/archive_status attempts read: 0 write: 0, actual read: 0 write: 0  
28520418 4366 (sh) access /lib64/libtinfo.so.5 attempts read: 832 write: 0, actual read: 832 write: 0  
28520426 4366 (sh) iotime /lib64/libtinfo.so.5 time: 3  
28520525 4366 (sh) access /lib64/libdl.so.2 attempts read: 832 write: 0, actual read: 832 write: 0  
28520529 4366 (sh) iotime /lib64/libdl.so.2 time: 3  
28520622 4366 (sh) access /lib64/libc.so.6 attempts read: 832 write: 0, actual read: 832 write: 0  
28520626 4366 (sh) iotime /lib64/libc.so.6 time: 2  
28520929 4366 (sh) access /usr/lib/locale/locale-archive attempts read: 0 write: 0, actual read: 0 write: 0  
28521157 4366 (sh) access /proc/meminfo attempts read: 1024 write: 0, actual read: 1024 write: 0  
28521162 4366 (sh) iotime /proc/meminfo time: 54  
28521341 4366 (sh) access /usr/lib64/gconv/gconv-modules.cache attempts read: 0 write: 0, actual read: 0 write: 0  
28522690 4366 (date) access /lib64/librt.so.1 attempts read: 832 write: 0, actual read: 832 write: 0  
28522697 4366 (date) iotime /lib64/librt.so.1 time: 3  
28522788 4366 (date) access /lib64/libc.so.6 attempts read: 832 write: 0, actual read: 832 write: 0  
28522792 4366 (date) iotime /lib64/libc.so.6 time: 2  
28522888 4366 (date) access /lib64/libpthread.so.0 attempts read: 832 write: 0, actual read: 832 write: 0  
28522892 4366 (date) iotime /lib64/libpthread.so.0 time: 3  
28523146 4366 (date) access /usr/lib/locale/locale-archive attempts read: 0 write: 0, actual read: 0 write: 0  
28523254 4366 (date) access /etc/localtime attempts read: 8192 write: 0, actual read: 645 write: 0  
28523258 4366 (date) iotime /etc/localtime time: 10  
28523571 10994 (postgres) access pg_xlog/archive_status attempts read: 0 write: 0, actual read: 0 write: 0  
30506631 4361 (postgres) access pg_subtrans/0855 attempts read: 0 write: 8192, actual read: 0 write: 8192  
30506642 4361 (postgres) iotime pg_subtrans/0855 time: 21  
30720464 4361 (postgres) access pg_subtrans/0855 attempts read: 0 write: 8192, actual read: 0 write: 8192  
30720475 4361 (postgres) iotime pg_subtrans/0855 time: 33  
30935841 4361 (postgres) access pg_subtrans/0855 attempts read: 0 write: 8192, actual read: 0 write: 8192  
30935852 4361 (postgres) iotime pg_subtrans/0855 time: 33  
31147483 4361 (postgres) access pg_subtrans/0855 attempts read: 0 write: 8192, actual read: 0 write: 8192  
31147494 4361 (postgres) iotime pg_subtrans/0855 time: 26  
31281002 1581 (irqbalance) access /proc/interrupts attempts read: 8192 write: 0, actual read: 8192 write: 0  
31281013 1581 (irqbalance) iotime /proc/interrupts time: 247  
31281359 1581 (irqbalance) access /proc/stat attempts read: 3072 write: 0, actual read: 3072 write: 0  
31281363 1581 (irqbalance) iotime /proc/stat time: 249  
31281441 1581 (irqbalance) access /proc/irq/15/smp_affinity attempts read: 1024 write: 0, actual read: 9 write: 0  
31281445 1581 (irqbalance) iotime /proc/irq/15/smp_affinity time: 8  
// 读写单位字节, 时间单位微秒 (1μs (微秒)  1微秒=0.000001=10-6秒)  
  
本文用到的几个probe alias原型(包含对应的call原型).  
/usr/share/systemtap/tapset/syscalls2.stp  
# open _______________________________________________________  
# long sys_open(const char __user * filename, int flags, int mode)  
# (obsolete) long sys32_open(const char * filename, int flags, int mode)  
#  
probe syscall.open = kernel.function("compat_sys_open").call ?,  
                     kernel.function("sys32_open").call ?,  
                     kernel.function("sys_open").call ?  
{  
        name = "open"  
        filename = user_string($filename)  
        flags = $flags  
        mode = $mode  
        if (flags & 64)  
                argstr = sprintf("%s, %s, %#o", user_string_quoted($filename),  
                        _sys_open_flag_str($flags), $mode)  
        else  
                argstr = sprintf("%s, %s", user_string_quoted($filename),  
                        _sys_open_flag_str($flags))  
}  
probe syscall.open.return = kernel.function("compat_sys_open").return ?,  
                            kernel.function("sys32_open").return ?,  
                            kernel.function("sys_open").return ?  
{  
        name = "open"  
        retstr = return_str(1, $return)  
}  
# read _______________________________________________________  
# ssize_t sys_read(unsigned int fd, char __user * buf, size_t count)  
probe syscall.read = kernel.function("sys_read").call  
{  
        name = "read"  
        fd = $fd  
        buf_uaddr = $buf  
        count = $count  
        argstr = sprintf("%d, %p, %d", $fd, $buf, $count)  
}  
probe syscall.read.return = kernel.function("sys_read").return  
{  
        name = "read"  
        retstr = return_str(1, $return)  
}  
# write ______________________________________________________  
#  
# ssize_t sys_write(unsigned int fd,  
#            const char __user * buf,  
#            size_t count)  
#  
probe syscall.write = kernel.function("sys_write").call  
{  
        name = "write"  
        fd = $fd  
        buf_uaddr = $buf  
        count = $count  
        argstr = sprintf("%d, %s, %d", $fd, text_strn(user_string($buf), syscall_string_trunc, 1), $count)  
}  
probe syscall.write.return = kernel.function("sys_write").return  
{  
        name = "write"  
        retstr = return_str(1, $return)  
}  
/usr/share/systemtap/tapset/syscalls.stp  
# close ______________________________________________________  
# long sys_close(unsigned int fd)  
probe syscall.close = kernel.function("sys_close").call  
{  
        name = "close"  
        fd = $fd  
        argstr = sprint(fd)  
}  
probe syscall.close.return = kernel.function("sys_close").return  
{  
        name = "close"  
        retstr = return_str(1, $return)  
}  
  
这几个函数的源码位置信息: (源码这里就不贴出来了)  
[root@db-172-16-3-150 io]# stap -L 'kernel.function("compat_sys_open").call'  
kernel.function("compat_sys_open@fs/compat.c:1343").call $filename:char const* $flags:int $mode:int  
[root@db-172-16-3-150 io]# stap -L 'kernel.function("sys32_open").call'  
[root@db-172-16-3-150 io]# stap -L 'kernel.function("sys_open").call'  
kernel.function("sys_open@fs/open.c:913").call $filename:char const* $flags:int $mode:int $ret:long int  
[root@db-172-16-3-150 io]# stap -L 'kernel.function("sys_read").call'  
kernel.function("sys_read@fs/read_write.c:389").call $fd:unsigned int $buf:char* $count:size_t $fput_needed:int  
[root@db-172-16-3-150 io]# stap -L 'kernel.function("sys_write").call'  
kernel.function("sys_write@fs/read_write.c:407").call $fd:unsigned int $buf:char const* $count:size_t $fput_needed:int  
[root@db-172-16-3-150 io]# stap -L 'kernel.function("sys_close").call'  
kernel.function("sys_close@fs/open.c:982").call $fd:unsigned int  
  
[root@db-172-16-3-150 io]# stap -L 'kernel.function("compat_sys_open").return'  
kernel.function("compat_sys_open@fs/compat.c:1343").return $return:long int $filename:char const* $flags:int $mode:int  
[root@db-172-16-3-150 io]# stap -L 'kernel.function("sys32_open").return'  
[root@db-172-16-3-150 io]# stap -L 'kernel.function("sys_open").return'  
kernel.function("sys_open@fs/open.c:913").return $return:long int $filename:char const* $flags:int $mode:int $ret:long int  
[root@db-172-16-3-150 io]# stap -L 'kernel.function("sys_read").return'  
kernel.function("sys_read@fs/read_write.c:389").return $return:long int $fd:unsigned int $buf:char* $count:size_t $fput_needed:int  
[root@db-172-16-3-150 io]# stap -L 'kernel.function("sys_write").return'  
kernel.function("sys_write@fs/read_write.c:407").return $return:long int $fd:unsigned int $buf:char const* $count:size_t $fput_needed:int  
[root@db-172-16-3-150 io]# stap -L 'kernel.function("sys_close").return'  
kernel.function("sys_close@fs/open.c:982").return $return:long int $fd:unsigned int  

参考

1. https://sourceware.org/systemtap/SystemTap_Beginners_Guide/mainsect-disk.html

2. https://sourceware.org/systemtap/examples/

3. /usr/share/systemtap/testsuite/systemtap.examples

4. systemtap-testsuite

5. /usr/share/systemtap/testsuite/systemtap.examples/index.txt

6. /usr/share/systemtap/testsuite/systemtap.examples/keyword-index.txt

7. /usr/share/systemtap/tapset

8. https://sourceware.org/systemtap/tapsets/API-ctime.html

9. https://sourceware.org/systemtap/tapsets/API-gettimeofday-s.html

Flag Counter

digoal’s 大量PostgreSQL文章入口