PostgreSQL 代码性能诊断之 - OProfile & Systemtap

7 minute read

背景

在跑benchmark时,遇到瓶颈如何分析 问题出在哪里呢?

在编译PostgreSQL 时可以使用–enable-profiling开关,打开GCC的profile输出,用作诊断。

《Greenplum PostgreSQL –enable-profiling 产生gprof性能诊断代码》

另外,也可以使用oprofile来收集整个系统的统计信息,进行分析。

还有一种方法是使用systemtap,注意它的性能开销较大。

本文将介绍一下oprofile的用法。

Oprofile与systemtap介绍

Oprofile性能损耗小,如果CPU支持硬件监控的话(现在大多数CPU已经支持)。

但是Oprofile不能像stap样使用timer来间断输出或累计输出统计,STAP损耗较大。

Oprofile 适合做性能诊断,例如系统中最耗CPU的进程,进程中哪些函数是比较耗CPU的,函数中哪段代码是最耗CPU的。

operf开启监控, opreport, opannotate可以输出调用报告,或函数、汇编指令等统计情况。

而Stap 则更加适合做跟踪。

Oprofile例子

http://oprofile.sourceforge.net/news/

wget http://prdownloads.sourceforge.net/oprofile/oprofile-1.2.0.tar.gz
tar -zxvf oprofile-1.2.0.tar.gz 
cd oprofile-1.2.0
yum install -y popt-devel
yum install -y binutils-devel
yum install -y systemtap-sdt-devel
./configure 
make
make install

跑benchmark

./configure --prefix=/home/digoal/pgsql10.4 --enable-profiling --enable-debug --enable-dtrace 
make world -j 128
make install-world

例如使用pgbench压测PostgreSQL实例,然后在PG数据库主机采集统计信息.

采集统计信息

[root@digoal ~]# cd /data06
[root@digoal data06]#  operf --system-wide --lazy-conversion
operf: Press Ctl-c or 'kill -SIGINT 45366' to stop profiling
operf: Profiler started
^C
Profiling done.
Converting profile data to OProfile format
................

输出报告

[root@digoal data06]# opreport -l -f -w -x -t 1 
Using /data06/oprofile_data/samples/ for samples directory.
CPU: Intel Core/i7, speed 1995.14 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
vma      samples  %        app name                 symbol name
007827a0 2091381  26.6819  /opt/pgsql9.4.1/bin/postgres HeapTupleSatisfiesVacuum
00490300 988600   12.6126  /opt/pgsql9.4.1/bin/postgres heap_page_prune
0078a8c0 698665    8.9136  /opt/pgsql9.4.1/bin/postgres pg_qsort
0058afb0 676022    8.6247  /opt/pgsql9.4.1/bin/postgres vac_cmp_itemptr
0058baf0 385039    4.9123  /opt/pgsql9.4.1/bin/postgres lazy_vacuum_rel
004c4d00 365497    4.6630  /opt/pgsql9.4.1/bin/postgres XLogInsert
00675420 229805    2.9319  /opt/pgsql9.4.1/bin/postgres itemoffcompare
00675d20 184668    2.3560  /opt/pgsql9.4.1/bin/postgres PageRepairFragmentation
0078a7e0 169808    2.1664  /opt/pgsql9.4.1/bin/postgres swapfunc
00655590 147647    1.8837  /opt/pgsql9.4.1/bin/postgres BufferGetBlockNumber
00488940 139389    1.7783  /opt/pgsql9.4.1/bin/postgres heap_prepare_freeze_tuple
007624d0 86239     1.1002  /opt/pgsql9.4.1/bin/postgres hash_search_with_hash_value

指定binary,输出符号表

[root@digoal data06]# opreport -l -f -g -w -x -t 1 /opt/pgsql/bin/postgres
Using /data06/oprofile_data/samples/ for samples directory.
CPU: Intel Core/i7, speed 1995.14 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
vma      samples  %        linenr info                 symbol name
007827a0 2091381  26.7572  /opt/soft_bak/postgresql-9.4.1/src/backend/utils/time/tqual.c:1116 HeapTupleSatisfiesVacuum
00490300 988600   12.6482  /opt/soft_bak/postgresql-9.4.1/src/backend/access/heap/pruneheap.c:174 heap_page_prune
0078a8c0 698665    8.9387  /opt/soft_bak/postgresql-9.4.1/src/port/qsort.c:104 pg_qsort
0058afb0 676022    8.6491  /opt/soft_bak/postgresql-9.4.1/src/backend/commands/vacuumlazy.c:1728 vac_cmp_itemptr
0058baf0 385039    4.9262  /opt/soft_bak/postgresql-9.4.1/src/backend/commands/vacuumlazy.c:172 lazy_vacuum_rel
004c4d00 365497    4.6762  /opt/soft_bak/postgresql-9.4.1/src/backend/access/transam/xlog.c:844 XLogInsert
00675420 229805    2.9401  /opt/soft_bak/postgresql-9.4.1/src/backend/storage/page/bufpage.c:415 itemoffcompare
00675d20 184668    2.3626  /opt/soft_bak/postgresql-9.4.1/src/backend/storage/page/bufpage.c:433 PageRepairFragmentation
0078a7e0 169808    2.1725  /opt/soft_bak/postgresql-9.4.1/src/port/qsort.c:78 swapfunc
00655590 147647    1.8890  /opt/soft_bak/postgresql-9.4.1/src/backend/storage/buffer/bufmgr.c:1898 BufferGetBlockNumber
00488940 139389    1.7833  /opt/soft_bak/postgresql-9.4.1/src/backend/access/heap/heapam.c:5756 heap_prepare_freeze_tuple
007624d0 86239     1.1033  /opt/soft_bak/postgresql-9.4.1/src/backend/utils/hash/dynahash.c:824 hash_search_with_hash_value

opannotate查看对应的调用,可以看到最耗费CPU的调用是哪些。

[root@digoal data06]# opannotate -x -s -t 1 /opt/pgsql/bin/postgres -i HeapTupleSatisfiesVacuum|less
Using /data06/oprofile_data/samples/ for session-dir
/* 
 * Command line: opannotate -x -s -t 1 /opt/pgsql/bin/postgres -i HeapTupleSatisfiesVacuum 
 * 
 * Interpretation of command line:
 * Output annotated source file with samples
 * Output files where samples count reach 1% of the samples
 * 
 * CPU: Intel Core/i7, speed 1995.14 MHz (estimated)
 * Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
 */
/* 
 * Total samples for file : "/opt/soft_bak/postgresql-9.4.1/src/backend/utils/time/tqual.c"
 * 
 * 2091381 100.000
 */


               :/*-------------------------------------------------------------------------
               : *
               : * tqual.c
               : *        POSTGRES "time qualification" code, ie, tuple visibility rules.
               : *
               : * NOTE: all the HeapTupleSatisfies routines will update the tuple's
               : * "hint" status bits if we see that the inserting or deleting transaction
               : * has now committed or aborted (and it is safe to set the hint bits).
               : * If the hint bits are changed, MarkBufferDirtyHint is called on
               : * the passed-in buffer.  The caller must hold not only a pin, but at least
               : * shared buffer content lock on the buffer containing the tuple.
               : *
               : * NOTE: must check TransactionIdIsInProgress (which looks in PGXACT array)
。。。。。。
1879024 89.8461 :       if (!HeapTupleHeaderXminCommitted(tuple))
               :        {
    63  0.0030 :                if (HeapTupleHeaderXminInvalid(tuple))
               :                        return HEAPTUPLE_DEAD;
               :                /* Used by pre-9.0 binary upgrades */
    18 8.6e-04 :                else if (tuple->t_infomask & HEAP_MOVED_OFF)
               :                {
               :                        TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
               :
。。。。。。

最耗费的出现在代码中的这段调用。

if (!HeapTupleHeaderXminCommitted(tuple))

Oprofile支持的事件,使用opcontrol –list-events查看:

[root@digoal data06]# opcontrol --list-events
oprofile: available events for CPU type "Intel Core/i7"

See Intel Architecture Developer's Manual Volume 3B, Appendix A and
Intel Architecture Optimization Reference Manual

For architectures using unit masks, you may be able to specify
unit masks by name.  See 'opcontrol' or 'operf' man page for more details.

CPU_CLK_UNHALTED: (counter: all)
        Clock cycles when not halted (min count: 6000)
UNHALTED_REFERENCE_CYCLES: (counter: all)
        Unhalted reference cycles (min count: 6000)
        Unit masks (default 0x1)
        ----------
        0x01: No unit mask
......

事件配置:

       --events / -e event1[,event2[,...]]
              This option is for passing a comma-separated list of event specifications for profiling. Each event spec
              is of the form:
                 name:count[:unitmask[:kernel[:user]]]
              You can specify unit mask values using either a numerical value (hex values must begin with "0x")  or  a
              symbolic  name  (if  the name=<um_name> field is shown in the ophelp output). For some named unit masks,
              the hex value is not unique; thus, OProfile tools enforce specifying such unit masks value by name.

              Event names for some IBM PowerPC systems include a _GRP<n> (group number) suffix. You  can  pass  either
              the  full event name or the base event name (i.e., without the suffix) to operf.  If the base event name
              is passed, operf will automatically choose an appropriate group number suffix for the event; thus, OPro-
              file post-processing tools will always show real event names that include the group number suffix.

              When  no event specification is given, the default event for the running processor type will be used for
              profiling.  Use ophelp to list the available events for your processor type.

以下摘自redhat admin doc

OProfile is a low overhead, system-wide performance monitoring tool. 

It uses the performance monitoring hardware on the processor to retrieve information about the kernel and executables on the system,

such as when memory is referenced, the number of L2 cache requests, and the number of hardware interrupts received. 

On a Red Hat Enterprise Linux system, the oprofile package must be installed to use this tool.
Many processors include dedicated performance monitoring hardware. This hardware makes it possible to detect when certain events happen (such as the requested data not being in cache). 

The hardware normally takes the form of one or more counters that are incremented each time an event takes place. 

When the counter value increments, an interrupt is generated, making it possible to control the amount of detail (and therefore, overhead) produced by performance monitoring.

OProfile uses this hardware (or a timer-based substitute in cases where performance monitoring hardware is not present) to collect samples of performance-related data each time a counter generates an interrupt. 

These samples are periodically written out to disk; later, the data contained in these samples can then be used to generate reports on system-level and application-level performance.

Be aware of the following limitations when using OProfile:

Use of shared libraries — Samples for code in shared libraries are not attributed to the particular application unless the --separate=library option is used.

Performance monitoring samples are inexact — When a performance monitoring register triggers a sample, the interrupt handling is not precise like a divide by zero exception. 

Due to the out-of-order execution of instructions by the processor, the sample may be recorded on a nearby instruction.

opreport does not associate samples for inline functions properly — opreport uses a simple address range mechanism to determine which function an address is in. 

Inline function samples are not attributed to the inline function but rather to the function the inline function was inserted into.

OProfile accumulates data from multiple runs — OProfile is a system-wide profiler and expects processes to start up and shut down multiple times. 

Thus, samples from multiple runs accumulate. Use the command opcontrol --reset to clear out the samples from previous runs.

Hardware performance counters do not work on guest virtual machines — Because the hardware performance counters are not available on virtual systems, 

you need to use the timer mode. Enter the command opcontrol --deinit, and then execute modprobe oprofile timer=1 to enable the timer mode.

Non-CPU-limited performance problems — OProfile is oriented to finding problems with CPU-limited processes. OProfile does not identify processes that are asleep because they are waiting on locks or for some other event to occur (for example an I/O device to finish an operation).

SystemTap is a tracing and probing tool that allows users to study and monitor the activities of the operating system in fine detail. 

It provides information similar to the output of tools like netstat, ps, top, and iostat; however, 

SystemTap is designed to provide more filtering and analysis options for the collected information.
While using OProfile is suggested in cases of collecting data on where and why the processor spends time in a particular area of code, 

it is less usable when finding out why the processor stays idle.
You might want to use SystemTap when instrumenting specific places in code. Because SystemTap allows you to run the code instrumentation without having to stop and restart the instrumented code, 

it is particularly useful for instrumenting the kernel and daemons.

参考

1. http://oprofile.sourceforge.net/

2. https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/System_Administrators_Guide/ch-OProfile.html

Flag Counter

digoal’s 大量PostgreSQL文章入口