Systemtap EXP: fix process probe global variables output BUG?(PostgreSQL checkpoint__done)
背景
最近在测试PostgreSQL checkpoint__done以及buffer__sync__done动态跟踪时, 输出这个探针的变量出现错误, 详见
http://blog.163.com/digoal@126/blog/static/163877040201391622459221/
已经在pgsql-bug报告中提交了这个bug, 还未得到答复.
本文将分析一下出错的原因, 以及如何解决这个问题.
checkpoint__done输出变量NBuffers以及CheckpointStats时错误 :
[root@db-172-16-3-150 postgresql-9.3.1]# stap --vp 10000 -e 'probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {println($$vars)}'
Pass 1: parsed user script and 96 library script(s) using 152024virt/25212res/2104shr/23932data kb, in 230usr/20sys/252real ms.
WARNING: Can't parse SDT_V3 operand 'CheckpointStats+40(%rip)': identifier '$$vars' at <input>:1:87
source: probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {println($$vars)}
^
WARNING: Can't parse SDT_V3 operand 'NBuffers(%rip)': identifier '$$vars' at :1:87
source: probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {println($$vars)}
^
WARNING: Can't parse SDT_V3 operand 'CheckpointStats+44(%rip)': identifier '$$vars' at :1:87
source: probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {println($$vars)}
^
WARNING: Can't parse SDT_V3 operand 'CheckpointStats+48(%rip)': identifier '$$vars' at :1:87
source: probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {println($$vars)}
^
WARNING: Can't parse SDT_V3 operand 'CheckpointStats+52(%rip)': identifier '$$vars' at :1:87
source: probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {println($$vars)}
^
semantic error: unable to find local 'arg1', [man error::dwarf] dieoffset 0xd00f5 in /home/pg93/pgsql9.3.1/bin/postgres, near pc 0x4b9789 in CreateCheckPoint xlog.c (alternatives: $flags $shutdown $checkPoint $recptr $Insert $rdata $freespace $_logSegNo $vxids $nvxids $__func__): identifier '$$vars' at :1:87
source: probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {println($$vars)}
^
semantic error: SDT asm not understood, requires debuginfo: identifier '$$vars' at :1:87
source: probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {println($$vars)}
^
Pass 2: analysis failed. [man error::pass2]
buffer__sync__done探针输出变量NBuffers时错误 :
[root@db-172-16-3-150 postgresql-9.3.1]# stap --vp 10000 -e 'probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("buffer__sync__done") {println($$vars)}'
Pass 1: parsed user script and 96 library script(s) using 152024virt/25212res/2104shr/23932data kb, in 230usr/20sys/252real ms.
WARNING: Can't parse SDT_V3 operand 'NBuffers(%rip)': identifier '$$vars' at <input>:1:89
source: probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("buffer__sync__done") {println($$vars)}
^
semantic error: unable to find local 'arg1', [man error::dwarf] dieoffset 0x43a1d1 in /home/pg93/pgsql9.3.1/bin/postgres, near pc 0x630697 in <unknown> bufmgr.c (alternatives: $buf_id $num_to_scan $num_to_write $num_written $mask): identifier '$$vars' at :1:89
source: probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("buffer__sync__done") {println($$vars)}
^
semantic error: SDT asm not understood, requires debuginfo: identifier '$$vars' at :1:89
source: probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("buffer__sync__done") {println($$vars)}
^
Pass 2: analysis failed. [man error::pass2]
这两个错误都源自输出的不是探针所在函数内的本地变量, 而是全局变量
例如checkpoint__done :
TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,
NBuffers,
CheckpointStats.ckpt_segs_added,
CheckpointStats.ckpt_segs_removed,
CheckpointStats.ckpt_segs_recycled);
探针所在函数内没有CheckpointStats以及NBuffers变量的定义, CheckpointStats在文件内函数外, NBuffers则在其他文件中.
src/backend/access/transam/xlog.c
/*
* Statistics for current checkpoint are collected in this global struct.
* Because only the background writer or a stand-alone backend can perform
* checkpoints, this will be unused in normal backends.
*/
CheckpointStatsData CheckpointStats;
src/backend/utils/init/globals.c
/*
* Primary determinants of sizes of shared-memory structures.
*
* MaxBackends is computed by PostmasterMain after modules have had a chance to
* register background workers.
*/
int NBuffers = 1000;
src/include/storage/bufmgr.h
/* in globals.c ... this duplicates miscadmin.h */
extern PGDLLIMPORT int NBuffers;
在buffer__sync__done探针中则是用到了NBuffers :
src/backend/storage/buffer/bufmgr.c
TRACE_POSTGRESQL_BUFFER_SYNC_DONE(NBuffers, num_written, num_to_write);
[解决办法1 : ]
新建本地变量用来存储这些全局变量的值, 探针输出这些本地变量即可. 例如 :
[root@db-172-16-3-150 postgresql-9.3.1]# vi src/backend/access/transam/xlog.c
int v1,v2,v3,v4,vnb;
vnb = NBuffers;
v1 = CheckpointStats.ckpt_bufs_written;
v2 = CheckpointStats.ckpt_segs_added;
v3 = CheckpointStats.ckpt_segs_removed;
v4 = CheckpointStats.ckpt_segs_recycled;
TRACE_POSTGRESQL_CHECKPOINT_DONE(v1,
vnb,
v2,
v3,
v4);
[root@db-172-16-3-150 postgresql-9.3.1]# vi src/backend/storage/buffer/bufmgr.c
int vnb;
vnb = NBuffers;
TRACE_POSTGRESQL_BUFFER_SYNC_DONE(vnb, num_written, num_to_write);
[root@db-172-16-3-150 postgresql-9.3.1]# gmake && gmake install
[root@db-172-16-3-150 postgresql-9.3.1]# su - pg93
pg93@db-172-16-3-150-> pg_ctl restart -m fast
在探针$$locals本地变量中就可以看到这几个刚才定义的变量了.
[root@db-172-16-3-150 ~]# stap -D MAXSTRINGLEN=100000 --vp 10000 -e 'probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {println($$locals$$)}'
Pass 1: parsed user script and 96 library script(s) using 152048virt/25252res/2116shr/23956data kb, in 230usr/10sys/252real ms.
shutdown=? checkPoint={.redo=26648020808, .ThisTimeLineID=1, .PrevTimeLineID=1, .fullPageWrites='\001', .nextXidEpoch=0, .nextXid=130667532, .nextOid=57479, .nextMulti=1, .nextMultiOffset=0, .oldestXid=1800, .oldestXidDB=1, .oldestMulti=1, .oldestMultiDB=1, .time=1382057407, .oldestActiveXid=130667532} recptr=26648020968 Insert={.PrevRecord=26648020864, .curridx=886, .currpage=0x7f6ca01fc000, .currpos="", .RedoRecPtr=26648020808, .forcePageWrites='\000', .fullPageWrites='\001', .exclusiveBackup='\000', .nonExclusiveBackups=0, .lastBackupStart=0} rdata={.data="H?X4", .len=72, .buffer=0, .buffer_std='\001', .next=0x0} freespace=? _logSegNo=? vxids={.backendId=?, .localTransactionId=?} nvxids=0 __func__="CreateCheckPoint" v1=0 v2=0 v3=0 v4=0 vnb=?
[root@db-172-16-3-150 ~]# stap -D MAXSTRINGLEN=1000000 --vp 10000 -e 'probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("buffer__sync__done") {println($$locals)}'
Pass 1: parsed user script and 96 library script(s) using 152040virt/25244res/2116shr/23948data kb, in 240usr/10sys/251real ms.
buf_id=0x501 num_to_scan=? num_to_write=0x21e num_written=0x21e mask=? vnb=?
举例, 输出checkpoint的探针信息 :
page_size=8KB, xlog_segment_size=16MB
执行stap 以及输出 :
[root@db-172-16-3-150 ~]# stap -D MAXSTRINGLEN=100000 --vp 10000 -e '
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {
printf("ckpt_bufs_written:%d, ckpt_segs_added:%d, ckpt_segs_removed:%d, ckpt_segs_recycled:%d\n", $v1, $v2, $v3, $v4)
}'
Pass 1: parsed user script and 96 library script(s) using 152048virt/25248res/2116shr/23956data kb, in 240usr/20sys/256real ms.
ckpt_bufs_written:1, ckpt_segs_added:0, ckpt_segs_removed:0, ckpt_segs_recycled:0
ckpt_bufs_written:54056, ckpt_segs_added:0, ckpt_segs_removed:0, ckpt_segs_recycled:43
SQL 如下 :
经计算和stap输出结果吻合.
digoal=# truncate t1;
TRUNCATE TABLE
digoal=# checkpoint;
CHECKPOINT
digoal=# select pg_current_xlog_location();
pg_current_xlog_location
--------------------------
6/8E9ABA48
(1 row)
digoal=# insert into t1 select generate_series(1,10000000),'test';
INSERT 0 10000000
digoal=# select pg_current_xlog_location();
pg_current_xlog_location
--------------------------
6/B9952F18
(1 row)
digoal=# checkpoint;
CHECKPOINT
digoal=# select pg_current_xlog_location();
pg_current_xlog_location
--------------------------
6/B9952FB8
(1 row)
digoal=# \dt+ t1
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+----------+--------+-------------
public | t1 | table | postgres | 422 MB |
(1 row)
digoal=# select 54056*8/1024;
?column?
----------
422
(1 row)
digoal=# select pg_xlog_location_diff('6/B9952FB8', '6/8E9ABA48');
pg_xlog_location_diff
-----------------------
721057136
(1 row)
digoal=# select 721057136/16/1024/1024;
?column?
----------
42
(1 row)
[解决办法2 : ]
使用@("varname")或者$varname的方式来访问本地或全局变量, 适用所有探针, 如果探针没有定义参数, 也可以使用这种方法得到探针所在函数的本地变量或全局变量值.
参考, man stapprobes
http://blog.163.com/digoal@126/blog/static/16387704020138113455697/
把修改过的源码还原重新编译, 重启数据库, 使用以下stap脚本即可.
[root@db-172-16-3-150 ~]# stap -D MAXSTRINGLEN=100000 --vp 10000 -e '
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {
printf("ckpt_bufs_written:%d, ckpt_segs_added:%d, ckpt_segs_removed:%d, ckpt_segs_recycled:%d, NBuffers:%d\n", @var("CheckpointStats")->ckpt_bufs_written, @var("CheckpointStats")->ckpt_segs_added, @var("CheckpointStats")->ckpt_segs_removed, @var("CheckpointStats")->ckpt_segs_recycled, @var("NBuffers"))
}'
Pass 1: parsed user script and 96 library script(s) using 151984virt/25256res/2116shr/23892data kb, in 240usr/20sys/251real ms.
ckpt_bufs_written:0, ckpt_segs_added:0, ckpt_segs_removed:0, ckpt_segs_recycled:0, NBuffers:262144
ckpt_bufs_written:0, ckpt_segs_added:0, ckpt_segs_removed:0, ckpt_segs_recycled:0, NBuffers:262144
参考
1. http://blog.163.com/digoal@126/blog/static/163877040201391622459221/