Systemtap EXP: fix process probe global variables output BUG?(PostgreSQL checkpoint__done)

3 minute read

背景

最近在测试PostgreSQL checkpoint__done以及buffer__sync__done动态跟踪时, 输出这个探针的变量出现错误, 详见

http://blog.163.com/digoal@126/blog/static/163877040201391622459221/

已经在pgsql-bug报告中提交了这个bug, 还未得到答复.  
  
本文将分析一下出错的原因, 以及如何解决这个问题.  
checkpoint__done输出变量NBuffers以及CheckpointStats时错误 :   
[root@db-172-16-3-150 postgresql-9.3.1]# stap --vp 10000 -e 'probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {println($$vars)}'  
Pass 1: parsed user script and 96 library script(s) using 152024virt/25212res/2104shr/23932data kb, in 230usr/20sys/252real ms.  
WARNING: Can't parse SDT_V3 operand 'CheckpointStats+40(%rip)': identifier '$$vars' at <input>:1:87  
 source: probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {println($$vars)}  
                                                                                               ^  
WARNING: Can't parse SDT_V3 operand 'NBuffers(%rip)': identifier '$$vars' at :1:87  
 source: probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {println($$vars)}  
                                                                                               ^  
WARNING: Can't parse SDT_V3 operand 'CheckpointStats+44(%rip)': identifier '$$vars' at :1:87  
 source: probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {println($$vars)}  
                                                                                               ^  
WARNING: Can't parse SDT_V3 operand 'CheckpointStats+48(%rip)': identifier '$$vars' at :1:87  
 source: probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {println($$vars)}  
                                                                                               ^  
WARNING: Can't parse SDT_V3 operand 'CheckpointStats+52(%rip)': identifier '$$vars' at :1:87  
 source: probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {println($$vars)}  
                                                                                               ^  
semantic error: unable to find local 'arg1', [man error::dwarf] dieoffset 0xd00f5 in /home/pg93/pgsql9.3.1/bin/postgres, near pc 0x4b9789 in CreateCheckPoint xlog.c (alternatives: $flags $shutdown $checkPoint $recptr $Insert $rdata $freespace $_logSegNo $vxids $nvxids $__func__): identifier '$$vars' at :1:87  
        source: probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {println($$vars)}  
                                                                                                      ^  
  
semantic error: SDT asm not understood, requires debuginfo: identifier '$$vars' at :1:87  
        source: probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {println($$vars)}  
                                                                                                      ^  
  
Pass 2: analysis failed.  [man error::pass2]  
buffer__sync__done探针输出变量NBuffers时错误 :   
[root@db-172-16-3-150 postgresql-9.3.1]# stap --vp 10000 -e 'probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("buffer__sync__done") {println($$vars)}'  
Pass 1: parsed user script and 96 library script(s) using 152024virt/25212res/2104shr/23932data kb, in 230usr/20sys/252real ms.  
WARNING: Can't parse SDT_V3 operand 'NBuffers(%rip)': identifier '$$vars' at <input>:1:89  
 source: probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("buffer__sync__done") {println($$vars)}  
                                                                                                 ^  
semantic error: unable to find local 'arg1', [man error::dwarf] dieoffset 0x43a1d1 in /home/pg93/pgsql9.3.1/bin/postgres, near pc 0x630697 in <unknown> bufmgr.c (alternatives: $buf_id $num_to_scan $num_to_write $num_written $mask): identifier '$$vars' at :1:89  
        source: probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("buffer__sync__done") {println($$vars)}  
                                                                                                        ^  
  
semantic error: SDT asm not understood, requires debuginfo: identifier '$$vars' at :1:89  
        source: probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("buffer__sync__done") {println($$vars)}  
                                                                                                        ^  
  
Pass 2: analysis failed.  [man error::pass2]  
这两个错误都源自输出的不是探针所在函数内的本地变量, 而是全局变量  
例如checkpoint__done :   
        TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,  
                                                                         NBuffers,  
                                                                         CheckpointStats.ckpt_segs_added,  
                                                                         CheckpointStats.ckpt_segs_removed,  
                                                                         CheckpointStats.ckpt_segs_recycled);  
探针所在函数内没有CheckpointStats以及NBuffers变量的定义, CheckpointStats在文件内函数外, NBuffers则在其他文件中.  
src/backend/access/transam/xlog.c  
/*  
 * Statistics for current checkpoint are collected in this global struct.  
 * Because only the background writer or a stand-alone backend can perform  
 * checkpoints, this will be unused in normal backends.  
 */  
CheckpointStatsData CheckpointStats;  
src/backend/utils/init/globals.c  
/*  
 * Primary determinants of sizes of shared-memory structures.  
 *  
 * MaxBackends is computed by PostmasterMain after modules have had a chance to  
 * register background workers.  
 */  
int                     NBuffers = 1000;  
src/include/storage/bufmgr.h  
/* in globals.c ... this duplicates miscadmin.h */  
extern PGDLLIMPORT int NBuffers;  
在buffer__sync__done探针中则是用到了NBuffers :   
src/backend/storage/buffer/bufmgr.c  
        TRACE_POSTGRESQL_BUFFER_SYNC_DONE(NBuffers, num_written, num_to_write);  
[解决办法1 : ]  
新建本地变量用来存储这些全局变量的值, 探针输出这些本地变量即可. 例如 :   
[root@db-172-16-3-150 postgresql-9.3.1]# vi src/backend/access/transam/xlog.c  
        int v1,v2,v3,v4,vnb;  
        vnb = NBuffers;  
        v1 = CheckpointStats.ckpt_bufs_written;  
        v2 = CheckpointStats.ckpt_segs_added;  
        v3 = CheckpointStats.ckpt_segs_removed;  
        v4 = CheckpointStats.ckpt_segs_recycled;  
  
        TRACE_POSTGRESQL_CHECKPOINT_DONE(v1,  
                                                                         vnb,  
                                                                         v2,  
                                                                         v3,  
                                                                         v4);  
[root@db-172-16-3-150 postgresql-9.3.1]# vi src/backend/storage/buffer/bufmgr.c  
        int vnb;  
        vnb = NBuffers;  
        TRACE_POSTGRESQL_BUFFER_SYNC_DONE(vnb, num_written, num_to_write);  
[root@db-172-16-3-150 postgresql-9.3.1]# gmake && gmake install  
[root@db-172-16-3-150 postgresql-9.3.1]# su - pg93  
pg93@db-172-16-3-150-> pg_ctl restart -m fast  
在探针$$locals本地变量中就可以看到这几个刚才定义的变量了.  
[root@db-172-16-3-150 ~]# stap -D MAXSTRINGLEN=100000 --vp 10000 -e 'probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {println($$locals$$)}'  
Pass 1: parsed user script and 96 library script(s) using 152048virt/25252res/2116shr/23956data kb, in 230usr/10sys/252real ms.  
shutdown=? checkPoint={.redo=26648020808, .ThisTimeLineID=1, .PrevTimeLineID=1, .fullPageWrites='\001', .nextXidEpoch=0, .nextXid=130667532, .nextOid=57479, .nextMulti=1, .nextMultiOffset=0, .oldestXid=1800, .oldestXidDB=1, .oldestMulti=1, .oldestMultiDB=1, .time=1382057407, .oldestActiveXid=130667532} recptr=26648020968 Insert={.PrevRecord=26648020864, .curridx=886, .currpage=0x7f6ca01fc000, .currpos="", .RedoRecPtr=26648020808, .forcePageWrites='\000', .fullPageWrites='\001', .exclusiveBackup='\000', .nonExclusiveBackups=0, .lastBackupStart=0} rdata={.data="H?X4", .len=72, .buffer=0, .buffer_std='\001', .next=0x0} freespace=? _logSegNo=? vxids={.backendId=?, .localTransactionId=?} nvxids=0 __func__="CreateCheckPoint" v1=0 v2=0 v3=0 v4=0 vnb=?  
  
[root@db-172-16-3-150 ~]# stap -D MAXSTRINGLEN=1000000 --vp 10000 -e 'probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("buffer__sync__done") {println($$locals)}'  
Pass 1: parsed user script and 96 library script(s) using 152040virt/25244res/2116shr/23948data kb, in 240usr/10sys/251real ms.  
buf_id=0x501 num_to_scan=? num_to_write=0x21e num_written=0x21e mask=? vnb=?  
举例, 输出checkpoint的探针信息 :   
page_size=8KB, xlog_segment_size=16MB  
执行stap 以及输出 :   
[root@db-172-16-3-150 ~]# stap -D MAXSTRINGLEN=100000 --vp 10000 -e '  
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {  
  printf("ckpt_bufs_written:%d, ckpt_segs_added:%d, ckpt_segs_removed:%d, ckpt_segs_recycled:%d\n", $v1, $v2, $v3, $v4)  
}'  
Pass 1: parsed user script and 96 library script(s) using 152048virt/25248res/2116shr/23956data kb, in 240usr/20sys/256real ms.  
ckpt_bufs_written:1, ckpt_segs_added:0, ckpt_segs_removed:0, ckpt_segs_recycled:0  
ckpt_bufs_written:54056, ckpt_segs_added:0, ckpt_segs_removed:0, ckpt_segs_recycled:43  
SQL 如下 :   
经计算和stap输出结果吻合.  
digoal=# truncate t1;  
TRUNCATE TABLE  
digoal=# checkpoint;  
CHECKPOINT  
digoal=# select pg_current_xlog_location();  
 pg_current_xlog_location   
--------------------------  
 6/8E9ABA48  
(1 row)  
  
digoal=# insert into t1 select generate_series(1,10000000),'test';  
INSERT 0 10000000  
digoal=# select pg_current_xlog_location();  
 pg_current_xlog_location   
--------------------------  
 6/B9952F18  
(1 row)  
  
digoal=# checkpoint;  
CHECKPOINT  
digoal=# select pg_current_xlog_location();  
 pg_current_xlog_location   
--------------------------  
 6/B9952FB8  
(1 row)  
  
digoal=# \dt+ t1  
                    List of relations  
 Schema | Name | Type  |  Owner   |  Size  | Description   
--------+------+-------+----------+--------+-------------  
 public | t1   | table | postgres | 422 MB |   
(1 row)  
  
digoal=# select 54056*8/1024;  
 ?column?   
----------  
      422  
(1 row)  
  
digoal=# select pg_xlog_location_diff('6/B9952FB8', '6/8E9ABA48');  
 pg_xlog_location_diff   
-----------------------  
             721057136  
(1 row)  
  
digoal=# select 721057136/16/1024/1024;  
 ?column?   
----------  
       42  
(1 row)  
  
[解决办法2 : ]  
使用@("varname")或者$varname的方式来访问本地或全局变量, 适用所有探针, 如果探针没有定义参数, 也可以使用这种方法得到探针所在函数的本地变量或全局变量值.   
参考, man stapprobes  

http://blog.163.com/digoal@126/blog/static/16387704020138113455697/

把修改过的源码还原重新编译, 重启数据库, 使用以下stap脚本即可.  
[root@db-172-16-3-150 ~]# stap -D MAXSTRINGLEN=100000 --vp 10000 -e '  
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {  
  printf("ckpt_bufs_written:%d, ckpt_segs_added:%d, ckpt_segs_removed:%d, ckpt_segs_recycled:%d, NBuffers:%d\n", @var("CheckpointStats")->ckpt_bufs_written, @var("CheckpointStats")->ckpt_segs_added, @var("CheckpointStats")->ckpt_segs_removed, @var("CheckpointStats")->ckpt_segs_recycled, @var("NBuffers"))  
}'  
Pass 1: parsed user script and 96 library script(s) using 151984virt/25256res/2116shr/23892data kb, in 240usr/20sys/251real ms.  
ckpt_bufs_written:0, ckpt_segs_added:0, ckpt_segs_removed:0, ckpt_segs_recycled:0, NBuffers:262144  
ckpt_bufs_written:0, ckpt_segs_added:0, ckpt_segs_removed:0, ckpt_segs_recycled:0, NBuffers:262144  

参考

1. http://blog.163.com/digoal@126/blog/static/163877040201391622459221/

Flag Counter

digoal’s 大量PostgreSQL文章入口