Linux 时钟精度与 PostgreSQL auto_explain (explain timing on Linux时钟开销估算)

6 minute read

背景

我们在诊断SQL的执行计划时，通常会用explain analyze，analyze有几个开关，其中一个是timing，它会帮你记录下SQL每个NODE的执行时间。

但是这部分是有一定的性能开销的，而且这个开销与操作系统的时钟获取接口有关。

有时，你会发现explain analyze的执行时间远大于真实的执行时间，这是为什么呢?

时钟硬件与时间精确度

常见时钟方法精度 tsc > hpet ( 100 纳秒(十亿分之一秒) ) > acpi_pm ( 300 纳秒(十亿分之一秒) )

Clock hardware and timing accuracy

Collecting accurate timing information is normally done on computers using hardware clocks with various levels of accuracy.   
With some hardware the operating systems can pass the system clock time almost directly to programs.   
A system clock can also be derived from a chip that simply provides timing interrupts, periodic ticks at some known time interval.   
In either case, operating system kernels provide a clock source that hides these details.   
But the accuracy of that clock source and how quickly it can return results varies based on the underlying hardware.  
  
Inaccurate time keeping can result in system instability. Test any change to the clock source very carefully.   
Operating system defaults are sometimes made to favor reliability over best accuracy. And if you are using a virtual machine, look into the recommended time sources compatible with it.   
Virtual hardware faces additional difficulties when emulating timers, and there are often per operating system settings suggested by vendors.  
  
The Time Stamp Counter (TSC) clock source is the most accurate one available on current generation CPUs.   
It's the preferred way to track the system time when it's supported by the operating system and the TSC clock is reliable.   
There are several ways that TSC can fail to provide an accurate timing source, making it unreliable.   
Older systems can have a TSC clock that varies based on the CPU temperature, making it unusable for timing.   
  
以前就有遇到过机器时钟跳来跳去的问题，我记得是IBMX3950的堆叠服务器  
Trying to use TSC on some older multicore CPUs can give a reported time that's inconsistent among multiple cores.   
This can result in the time going backwards, a problem this program checks for. And even the newest systems can fail to provide accurate TSC timing with very aggressive power saving configurations.  
  
Newer operating systems may check for the known TSC problems and switch to a slower, more stable clock source when they are seen.   
If your system supports TSC time but doesn't default to that, it may be disabled for a good reason.   
And some operating systems may not detect all the possible problems correctly, or will allow using TSC even in situations where it's known to be inaccurate.  
  
The High Precision Event Timer (HPET) is the preferred timer on systems where it's available and TSC is not accurate.   
The timer chip itself is programmable to allow up to 100 nanosecond resolution, but you may not see that much accuracy in your system clock.  
  
Advanced Configuration and Power Interface (ACPI) provides a Power Management (PM) Timer, which Linux refers to as the acpi_pm.   
The clock derived from acpi_pm will at best provide 300 nanosecond resolution.  
  
Timers used on older PC hardware include the 8254 Programmable Interval Timer (PIT),   
the real-time clock (RTC), the Advanced Programmable Interrupt Controller (APIC) timer,   
and the Cyclone timer.   
These timers aim for millisecond resolution.  

explain analyze代码

当开启了explain analyze timing开关时，会设置instrument_option |= INSTRUMENT_TIMER;

src/backend/commands/explain.c

ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,  
                           const char *queryString, ParamListInfo params,  
                           const instr_time *planduration)  
{  
        DestReceiver *dest;  
        QueryDesc  *queryDesc;  
        instr_time      starttime;  
        double          totaltime = 0;  
        int                     eflags;  
        int                     instrument_option = 0;  
  
        if (es->analyze && es->timing)  
                instrument_option |= INSTRUMENT_TIMER;  
        else if (es->analyze)  
                instrument_option |= INSTRUMENT_ROWS;  
  
        if (es->buffers)  
                instrument_option |= INSTRUMENT_BUFFERS;  

这个设置会直接影响EXECUTE时的时钟统计，这部分还不是很友好，其实没有必要每条TUPLE都统计这部分时间，只要统计进出NODE的时候的时间即可

src/backend/executor/instrument.c

/* Entry to a plan node */  
void  
InstrStartNode(Instrumentation *instr)  
{  
        if (instr->need_timer)  
        {  
                if (INSTR_TIME_IS_ZERO(instr->starttime))  
                        INSTR_TIME_SET_CURRENT(instr->starttime);  
                else  
                        elog(ERROR, "InstrStartNode called twice in a row");  
        }  
  
        /* save buffer usage totals at node entry, if needed */  
        if (instr->need_bufusage)  
                instr->bufusage_start = pgBufferUsage;  
}  
  
/* Exit from a plan node */  
void  
InstrStopNode(Instrumentation *instr, double nTuples)  
{  
        instr_time      endtime;  
  
        /* count the returned tuples */  
        instr->tuplecount += nTuples;  
  
        /* let's update the time only if the timer was requested */  
        if (instr->need_timer)  
        {  
                if (INSTR_TIME_IS_ZERO(instr->starttime))  
                        elog(ERROR, "InstrStopNode called without start");  
  
                INSTR_TIME_SET_CURRENT(endtime);  
                INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);  
  
                INSTR_TIME_SET_ZERO(instr->starttime);  
        }  
  
        /* Add delta of buffer usage since entry to node's totals */  
        if (instr->need_bufusage)  
                BufferUsageAccumDiff(&instr->bufusage,  
                                                         &pgBufferUsage, &instr->bufusage_start);  
  
        /* Is this the first tuple of this cycle? */  
        if (!instr->running)  
        {  
                instr->running = true;  
                instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);  
        }  
}  

src/include/portability/instr_time.h

#define INSTR_TIME_SET_CURRENT(t)       gettimeofday(&(t), NULL)

单条QUERY ， EXECUTE可能被多次调用，所以如果要统计时间。

src/backend/executor/execMain.c

void  
ExecutorRun(QueryDesc *queryDesc,  
                        ScanDirection direction, uint64 count)  
{  
        if (ExecutorRun_hook)  
                (*ExecutorRun_hook) (queryDesc, direction, count);  
        else  
                standard_ExecutorRun(queryDesc, direction, count);  
}  
  
void  
standard_ExecutorRun(QueryDesc *queryDesc,  
                                         ScanDirection direction, uint64 count)  
{  
...  
  
        /* Allow instrumentation of Executor overall runtime */  
        if (queryDesc->totaltime)  
                InstrStartNode(queryDesc->totaltime);  // 时钟   
...  
        if (queryDesc->totaltime)  
                InstrStopNode(queryDesc->totaltime, estate->es_processed);  

如果需要处理的记录数非常多，由于频繁调用gettimeofday，就导致了explain analyze timing时间拉长的问题。

例子

1000万纪录的count(*)操作，我们看看实际的执行时间，以及开启analyze timing后，不同时钟硬件的性能影响

create table tbl_time(id int);  
insert into tbl_time select generate_series(1,10000000);  
  
\timing  
postgres=# SELECT COUNT(*) FROM tbl_time;  
  count     
----------  
 10000000  
(1 row)  
Time: 1171.956 ms  

使用tsc hpet acpi_pm三种时钟硬件测试性能影响

#cat /sys/devices/system/clocksource/clocksource0/available_clocksource  
tsc hpet acpi_pm   

1. tsc

#echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource   
  
postgres=# explain analyze SELECT COUNT(*) FROM tbl_time;  
                                                         QUERY PLAN                                                            
-----------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=169247.71..169247.72 rows=1 width=8) (actual time=2113.432..2113.432 rows=1 loops=1)  
   ->  Seq Scan on tbl_time  (cost=0.00..144247.77 rows=9999977 width=0) (actual time=0.013..1128.860 rows=10000000 loops=1)  
 Planning time: 0.062 ms  
 Execution time: 2113.514 ms  
(4 rows)  

timing off

postgres=# explain (analyze on,verbose on,timing off,costs on,buffers on,summary on) select count(*) from item1;
                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Aggregate  (cost=254.75..254.75 rows=1 width=8) (actual rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=1334
   ->  Seq Scan on public.item1  (cost=0.00..38.00 rows=100000 width=0) (actual rows=100000 loops=1)
         Output: i_id, i_im_id, i_name, i_price, i_data
         Buffers: shared hit=1334
 Planning Time: 0.044 ms
 Execution Time: 10.328 ms
(8 rows)
 
timing on

postgres=# explain (analyze on,verbose on,timing on,costs on,buffers on,summary on) select count(*) from item1;
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=254.75..254.75 rows=1 width=8) (actual time=16.712..16.712 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=1334
   ->  Seq Scan on public.item1  (cost=0.00..38.00 rows=100000 width=0) (actual time=0.007..10.242 rows=100000 loops=1)
         Output: i_id, i_im_id, i_name, i_price, i_data
         Buffers: shared hit=1334
 Planning Time: 0.043 ms
 Execution Time: 16.730 ms
(8 rows)

2. hpet

#echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource   
  
postgres=# explain analyze SELECT COUNT(*) FROM tbl_time;  
                                                         QUERY PLAN                                                            
-----------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=169247.71..169247.72 rows=1 width=8) (actual time=13968.218..13968.218 rows=1 loops=1)  
   ->  Seq Scan on tbl_time  (cost=0.00..144247.77 rows=9999977 width=0) (actual time=0.018..7067.711 rows=10000000 loops=1)  
 Planning time: 0.059 ms  
 Execution time: 13968.271 ms  
(4 rows)  

3. acpi_pm

#echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource   
  
postgres=# explain analyze SELECT COUNT(*) FROM tbl_time;  
                                                         QUERY PLAN                                                            
-----------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=169247.71..169247.72 rows=1 width=8) (actual time=19641.242..19641.243 rows=1 loops=1)  
   ->  Seq Scan on tbl_time  (cost=0.00..144247.77 rows=9999977 width=0) (actual time=0.018..9896.285 rows=10000000 loops=1)  
 Planning time: 0.060 ms  
 Execution time: 19641.296 ms  
(4 rows)  

使用pg_test_timing测试不同的时钟硬件在使用gettimeofday时带来的性能影响

pg_test_timing是用来测试不同硬件时钟的损耗的，使用gettimeofday进行测试

下面是测试

1. tsc

pg_test_timing   
Testing timing overhead for 3 seconds.  
Per loop time including overhead: 47.70 nsec  
Histogram of timing durations:  
< usec   % of total      count  
     1     95.23986   59893249  
     2      4.75540    2990515  
     4      0.00384       2414  
     8      0.00077        485  
    16      0.00013         79  
    32      0.00000          3  
    64      0.00000          1  

2. hpet

pg_test_timing   
Testing timing overhead for 3 seconds.  
Per loop time including overhead: 696.44 nsec  
Histogram of timing durations:  
< usec   % of total      count  
     1     31.81944    1370669  
     2     67.06767    2889038  
     4      1.03890      44752  
     8      0.05959       2567  
    16      0.01418        611  
    32      0.00016          7  
    64      0.00005          2  

3. acpi_pm

pg_test_timing   
Testing timing overhead for 3 seconds.  
Per loop time including overhead: 919.07 nsec  
Histogram of timing durations:  
< usec   % of total      count  
     1     12.25423     399999  
     2     84.17305    2747553  
     4      3.45019     112620  
     8      0.08648       2823  
    16      0.03468       1132  
    32      0.00132         43  
    64      0.00003          1  
   128      0.00003          1  

根据以上测试，可以预估前面EXPLAIN ANALYZE带来的问题

acpi_pm  
<1到底是多少不知道了，<1越多，误差越大  
10000000*(1*0.1225+2*0.8417)/1000000    
  
hpet  
<1到底是多少不知道了，<1越多，误差越大  
10000000*(1*0.3182+2*0.6706)/1000000    
  
tsc   
不太好估算，因为<1到底是多少不知道了  
10000000*(1*0.9524+2*0.0475)/1000000    

auto_explain log timing也有类似问题

当开启auto_explain的timing计数后，由于时钟开销的问题，可能严重的影响性能，如本文所示，即使使用TSC时钟，对于扫描记录数很多的时候，性能下降了一倍。

虽然auto_explain有一个超时阈值，但是当你开启了timing的记录后，就会导致它把所有的NODE执行时间都记录下来，因为执行结束前，并不知道总时间会不会超，所以每条QUERY的ANALYZE TIMING都会被开启。

建议如果不是特殊需求，不要开启auto_explain timing选项。

参考

https://www.ibm.com/developerworks/cn/linux/1308_liuming_linuxtime4/

https://www.postgresql.org/docs/9.6/static/pgtesttiming.html

https://www.postgresql.org/docs/9.6/static/auto-explain.html

https://www.postgresql.org/message-id/flat/31856.1400021891%40sss.pgh.pa.us

digoal’s 大量PostgreSQL文章入口

Twitter Facebook Google+ LinkedIn

Digoal.zhou

Linux 时钟精度与 PostgreSQL auto_explain (explain timing on Linux时钟开销估算)

背景

时钟硬件与时间精确度

explain analyze代码

例子

使用pg_test_timing测试不同的时钟硬件在使用gettimeofday时带来的性能影响

auto_explain log timing也有类似问题

参考

digoal’s 大量PostgreSQL文章入口

You May Also Enjoy

PostgreSQL(PPAS 兼容Oracle) 从零开始入门手册 - 珍藏版

PostgreSQL pipelinedb 流计算插件 - IoT应用 - 实时轨迹聚合

PostgreSQL plpgsql 存储过程、函数 - 状态、异常变量打印、异常捕获… - GET [STACKED] DIAGNOSTICS

PostgreSQL datediff 日期间隔（单位转换）兼容SQL用法