Systemtap EXP: PostgreSQL IN-BUILD mark Class 4 - buffer

7 minute read

背景

本文开始讲PostgreSQL内建探针的另一个分类, buffer.  
buffer分类探针中包含:   
1. buffer读开始, 结束;   
2. shared buffer flush开始, 结束(结束点指将数据交付到内核, 并非磁盘, 所以如果要计算flush时间的话, 这个探针不合适)(在FlushBuffer函数中触发)(包含了所有的shared buffer flushed out)  
3. dirty shared buffer flush开始, 结束(在BufferAlloc函数中触发, 所以和shared buffer大小有关, 如果很频繁, 说明shared buffer太小,至于要flush掉dirty的buffer才能分配出空间给其他进程使用. 或者要提高bgwriter的唤醒频率)  
实际上是调用了FlushBuffer函数.  
                                /* OK, do the I/O */  
                                TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,  
                                                                                           smgr->smgr_rnode.node.spcNode,  
                                                                                                smgr->smgr_rnode.node.dbNode,  
                                                                                          smgr->smgr_rnode.node.relNode);  
  
                                FlushBuffer(buf, NULL);  
                                LWLockRelease(buf->content_lock);  
  
                                TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,  
                                                                                           smgr->smgr_rnode.node.spcNode,  
                                                                                                smgr->smgr_rnode.node.dbNode,  
                                                                                          smgr->smgr_rnode.node.relNode);  
4. dirty wal buffer flush开始, 结束. (如果太过频繁, 说明需要加大wal_buffers参数的值)  
  
从这些buffer探针中, 可以得到哪些信息呢?  
1. 读buffer开始, 变量包含 forkNUM() , blockid, 表空间oid, 数据库oid, relation的relfilenode(pg_class.relfilenode); backend_id(表示写入local buffer)或-1(表示写入shared buffer). 用true,false区分扩展文件请求和普通读请求.  
2. 读buffer完成, 变量和读开始一样, 但是多了一个bool变量(found), 判断是否从shared pool中读取(即命中shared buffer).  
found变量:   
                if (found)  
                        pgBufferUsage.local_blks_hit++;  
                else  
                        pgBufferUsage.local_blks_read++;  
  
                if (found)  
                        pgBufferUsage.shared_blks_hit++;  
                else  
                        pgBufferUsage.shared_blks_read++;  
3. shared buffer flush开始, 结束; 探针中可以得到 forkNUM() , blockid, 表空间oid, 数据库oid, relation的relfilenode(pg_class.relfilenode);  
每flush一个块都会触发.  
4. dirty shared buffer flush开始, 结束; 同上(shared buffer flush).  
  
接下来简单的介绍一下buffer相关探针中的变量类型 :  
1. ForkNumber  
枚举类型, 用于表示该物理存储的类别, 例如是主存储, 还是fsm或者vm, 或者init(nologging表特有).  
(nologging init参考  

http://blog.163.com/digoal@126/blog/static/163877040201382341433512/

src/include/storage/relfilenode.h  
/*  
 * The physical storage of a relation consists of one or more forks. The  
 * main fork is always created, but in addition to that there can be  
 * additional forks for storing various metadata. ForkNumber is used when  
 * we need to refer to a specific fork in a relation.  
 */  
typedef enum ForkNumber  
{  
        InvalidForkNumber = -1,  
        MAIN_FORKNUM = 0,  
        FSM_FORKNUM,  
        VISIBILITYMAP_FORKNUM,  
        INIT_FORKNUM  
  
        /*  
         * NOTE: if you add a new fork, change MAX_FORKNUM below and update the  
         * forkNames array in src/common/relpath.c  
         */  
} ForkNumber;  
2. BlockNumber  
数据文件中的block id.  
src/include/storage/block.h  
/*  
 * BlockNumber:  
 *  
 * each data file (heap or index) is divided into postgres disk blocks  
 * (which may be thought of as the unit of i/o -- a postgres buffer  
 * contains exactly one disk block).  the blocks are numbered  
 * sequentially, 0 to 0xFFFFFFFE.  
 *  
 * InvalidBlockNumber is the same thing as P_NEW in buf.h.  
 *  
 * the access methods, the buffer manager and the storage manager are  
 * more or less the only pieces of code that should be accessing disk  
 * blocks directly.  
 */  
typedef uint32 BlockNumber;  
3. 数据库, 表空间, 以及relation 的oid.  
其实就是pg_database.oid, pg_tablespace.oid, pg_class.relfilenode.  

探针详细介绍 :

name parameter desc
buffer-read-start (ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool) Probe that fires when a buffer read is started. arg0 and arg1 contain the fork and block numbers of the page (but arg1 will be -1 if this is a relation extension request). arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs identifying the relation. arg5 is the ID of the backend which created the temporary relation for a local buffer, or InvalidBackendId (-1) for a shared buffer. arg6 is true for a relation extension request, false for normal read.
buffer-read-done (ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool, bool) Probe that fires when a buffer read is complete. arg0 and arg1 contain the fork and block numbers of the page (if this is a relation extension request, arg1 now contains the block number of the newly added block). arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs identifying the relation. arg5 is the ID of the backend which created the temporary relation for a local buffer, or InvalidBackendId (-1) for a shared buffer. arg6 is true for a relation extension request, false for normal read. arg7 is true if the buffer was found in the pool, false if not.
buffer-flush-start (ForkNumber, BlockNumber, Oid, Oid, Oid) Probe that fires before issuing any write request for a shared buffer. arg0 and arg1 contain the fork and block numbers of the page. arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs identifying the relation.
buffer-flush-done (ForkNumber, BlockNumber, Oid, Oid, Oid) Probe that fires when a write request is complete. (Note that this just reflects the time to pass the data to the kernel; it’s typically not actually been written to disk yet.) The arguments are the same as for buffer-flush-start.
buffer-write-dirty-start (ForkNumber, BlockNumber, Oid, Oid, Oid) Probe that fires when a server process begins to write a dirty buffer. (If this happens often, it implies that shared_buffers is too small or the bgwriter control parameters need adjustment.) arg0 and arg1 contain the fork and block numbers of the page. arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs identifying the relation.
buffer-write-dirty-done (ForkNumber, BlockNumber, Oid, Oid, Oid) Probe that fires when a dirty-buffer write is complete. The arguments are the same as for buffer-write-dirty-start.
wal-buffer-write-dirty-start () Probe that fires when a server process begins to write a dirty WAL buffer because no more WAL buffer space is available. (If this happens often, it implies that wal_buffers is too small.)
wal-buffer-write-dirty-done () Probe that fires when a dirty WAL buffer write is complete.
buffer相关的探针一般可以用于收集哪些信息呢?  
1. buffer 读取的时间消耗, 数据量, 命中率等. 可以结合其他probe例如query probe, 得到单个sql的buffer  读的详细信息.  
2. dirty buffer flush在整个buffer flush中的占比. 也可以结合其他probe 例如query probe一起使用.  
3. dirty wal buffer的频度信息.  
例子 :   
统计单SQL的buffer读, 分别统计命中shared buffer和未命中shared buffer.  
stap -e '  
global var;  
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__start") {  
  var[pid(),0]=0  
  var[pid(),1]=0  
}  
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("buffer__read__done") {  
  if ($arg8)  
    var[pid(),1]++  
  if (! $arg8)  
    var[pid(),0]++  
}  
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__done") {  
  printf("query: %s\n", user_string($arg1))  
  printf("shared buffer hit: %d\n", var[pid(),1])  
  printf("shared buffer nonhit: %d\n", var[pid(),0])  
}'  
SQL :   
digoal=# explain (analyze,verbose,costs,buffers,timing) select count(*) from test where id<100;  
                                                             QUERY PLAN                                                               
   
------------------------------------------------------------------------------------------------------------------------------------  
-  
 Aggregate  (cost=92.21..92.22 rows=1 width=0) (actual time=9.951..9.952 rows=1 loops=1)  
   Output: count(*)  
   Buffers: shared hit=102 dirtied=54  
   ->  Index Only Scan using test_pkey on public.test  (cost=0.43..91.99 rows=89 width=0) (actual time=0.065..9.917 rows=96 loops=1)  
         Output: id  
         Index Cond: (test.id < 100)  
         Heap Fetches: 98  
         Buffers: shared hit=102 dirtied=54  
 Total runtime: 10.026 ms  
(9 rows)  
  
digoal=# explain (analyze,verbose,costs,buffers,timing) select count(*) from test where id<1000;  
                                                               QUERY PLAN                                                             
       
------------------------------------------------------------------------------------------------------------------------------------  
-----  
 Aggregate  (cost=910.33..910.34 rows=1 width=0) (actual time=79.756..79.756 rows=1 loops=1)  
   Output: count(*)  
   Buffers: shared hit=980 dirtied=554  
   ->  Index Only Scan using test_pkey on public.test  (cost=0.43..908.09 rows=896 width=0) (actual time=0.032..79.419 rows=944 loop  
s=1)  
         Output: id  
         Index Cond: (test.id < 1000)  
         Heap Fetches: 972  
         Buffers: shared hit=980 dirtied=554  
 Total runtime: 79.809 ms  
(9 rows)  
  
digoal=# explain (analyze,verbose,costs,buffers,timing) select count(*) from test ;  
                                                          QUERY PLAN                                                            
------------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=106075.99..106076.00 rows=1 width=0) (actual time=5286.628..5286.628 rows=1 loops=1)  
   Output: count(*)  
   Buffers: shared hit=47319 dirtied=28389  
   ->  Seq Scan on public.test  (cost=0.00..94324.59 rows=4700559 width=0) (actual time=0.113..4606.995 rows=4676559 loops=1)  
         Output: id, info, crt_time  
         Buffers: shared hit=47319 dirtied=28389  
 Total runtime: 5286.727 ms  
(7 rows)  
stap 输出 :   
query: explain (analyze,verbose,costs,buffers,timing) select count(*) from test where id<100;  
shared buffer hit: 202  
shared buffer nonhit: 1  
query: explain (analyze,verbose,costs,buffers,timing) select count(*) from test where id<1000;  
shared buffer hit: 984  
shared buffer nonhit: 0  
query: explain (analyze,verbose,costs,buffers,timing) select count(*) from test ;  
shared buffer hit: 47321  
shared buffer nonhit: 0  
  
[其他]  
1. 其实很多统计信息在一些系统表中就可以看到, 例如表的buffer hit信息在statio表就可以查看到, 如下 .  
digoal=# select pg_stat_reset();  
 pg_stat_reset   
---------------  
   
(1 row)  
digoal=# select * from pg_statio_all_tables where relid='t1'::regclass;  
-[ RECORD 1 ]---+-------  
relid           | 24726  
schemaname      | public  
relname         | t1  
heap_blks_read  | 0  
heap_blks_hit   | 0  
idx_blks_read   |   
idx_blks_hit    |   
toast_blks_read | 0  
toast_blks_hit  | 0  
tidx_blks_read  | 0  
tidx_blks_hit   | 0  
  
digoal=# explain (analyze,verbose,costs,buffers,timing) select count(*) from t1 ;  
-[ RECORD 1 ]--------------------------------------------------------------------------------------------------------------  
QUERY PLAN | Aggregate  (cost=21.50..21.51 rows=1 width=0) (actual time=0.320..0.320 rows=1 loops=1)  
-[ RECORD 2 ]--------------------------------------------------------------------------------------------------------------  
QUERY PLAN |   Output: count(*)  
-[ RECORD 3 ]--------------------------------------------------------------------------------------------------------------  
QUERY PLAN |   Buffers: shared hit=9  
-[ RECORD 4 ]--------------------------------------------------------------------------------------------------------------  
QUERY PLAN |   ->  Seq Scan on public.t1  (cost=0.00..19.00 rows=1000 width=0) (actual time=0.010..0.175 rows=1000 loops=1)  
-[ RECORD 5 ]--------------------------------------------------------------------------------------------------------------  
QUERY PLAN |         Output: id, info  
-[ RECORD 6 ]--------------------------------------------------------------------------------------------------------------  
QUERY PLAN |         Buffers: shared hit=9  
-[ RECORD 7 ]--------------------------------------------------------------------------------------------------------------  
QUERY PLAN | Total runtime: 0.353 ms  
  
digoal=# select * from pg_statio_all_tables where relid='t1'::regclass;  
-[ RECORD 1 ]---+-------  
relid           | 24726  
schemaname      | public  
relname         | t1  
heap_blks_read  | 0  
heap_blks_hit   | 9  
idx_blks_read   |   
idx_blks_hit    |   
toast_blks_read | 0  
toast_blks_hit  | 0  
tidx_blks_read  | 0  
tidx_blks_hit   | 0  

参考

1. http://www.postgresql.org/docs/9.3/static/dynamic-trace.html

2.

src/backend/storage/buffer/bufmgr.c  
src/backend/access/transam/xlog.c  
src/include/storage/relfilenode.h  
/*  
 * RelFileNode must provide all that we need to know to physically access  
 * a relation, with the exception of the backend ID, which can be provided  
 * separately. Note, however, that a "physical" relation is comprised of  
 * multiple files on the filesystem, as each fork is stored as a separate  
 * file, and each fork can be divided into multiple segments. See md.c.  
 *  
 * spcNode identifies the tablespace of the relation.  It corresponds to  
 * pg_tablespace.oid.  
 *  
 * dbNode identifies the database of the relation.      It is zero for  
 * "shared" relations (those common to all databases of a cluster).  
 * Nonzero dbNode values correspond to pg_database.oid.  
 *  
 * relNode identifies the specific relation.  relNode corresponds to  
 * pg_class.relfilenode (NOT pg_class.oid, because we need to be able  
 * to assign new physical files to relations in some situations).  
 * Notice that relNode is only unique within a particular database.  
 *  
 * Note: spcNode must be GLOBALTABLESPACE_OID if and only if dbNode is  
 * zero.  We support shared relations only in the "global" tablespace.  
 *  
 * Note: in pg_class we allow reltablespace == 0 to denote that the  
 * relation is stored in its database's "default" tablespace (as  
 * identified by pg_database.dattablespace).  However this shorthand  
 * is NOT allowed in RelFileNode structs --- the real tablespace ID  
 * must be supplied when setting spcNode.  
 *  
 * Note: in pg_class, relfilenode can be zero to denote that the relation  
 * is a "mapped" relation, whose current true filenode number is available  
 * from relmapper.c.  Again, this case is NOT allowed in RelFileNodes.  
 *  
 * Note: various places use RelFileNode in hashtable keys.      Therefore,  
 * there *must not* be any unused padding bytes in this struct.  That  
 * should be safe as long as all the fields are of type Oid.  
 */  
typedef struct RelFileNode  
{  
        Oid                     spcNode;                /* tablespace */  
        Oid                     dbNode;                 /* database */  
        Oid                     relNode;                /* relation */  
} RelFileNode;  

Flag Counter

digoal’s 大量PostgreSQL文章入口